浅析hashCode_前嗅大数据

浅析hashCode

时间: 2020-06-08来源：OSCHINA

前景提要

写此文的起因是看到一个面试题：默认hashCode()返回的值是什么？GC后hashCode会改变吗？
在以前的认知里，默认hashCode()的返回值是java对象的内存地址没错。然而这又是一个经不起推敲的结论。
首先把一个对象作为key存入hashmap，如果hashCode返回内存地址，那么GC后地址必然改变，那么用这个对象就取不到值了，这显然是不合理的。
不知为何这样的谬误会存在脑子里这么久，从来没有怀疑过它。
既然想到这里，就再简单剖析一下hashCode原理吧。
1、预备知识
1.1 safepoint
GC相关。VM线程触发GC时，需要先 stop-the-world ，说人话就是设置一个safepoint。工作线程会时不时检查safepoint状态，一旦发现safepoint被设置就将自己挂起。VM线程发现工作线程都挂起以后就执行GC，完事后再唤醒工作线程。
用现实情况举例就是：leader在群里发了一条"一会儿开会"的消息（safepoint），每个成员时不时查看消息，看到消息第一时间放下手中工作前往会议室（线程挂起），然而有个人沉浸编码无法自拔，十分钟后才看到信息，结果导致会议延迟（GC期间程序暂停时间过长）
至于底层具体如何实现，呃，我没看懂，感兴趣的同学自行了解 https://www.jianshu.com/p/c79c5e02ebe6
1.2 Mark Word
就是Java对象头里面的 Mark Word 转载，下面给出原文地址 |-------------------------------------------------------|--------------------| | Mark Word (32 bits) | State | |-------------------------------------------------------|--------------------| | identity_hashcode:25 | age:4 | biased_lock:1 | lock:2 | Normal | |-------------------------------------------------------|--------------------| | thread:23 | epoch:2 | age:4 | biased_lock:1 | lock:2 | Biased | |-------------------------------------------------------|--------------------| | ptr_to_lock_record:30 | lock:2 | Lightweight Locked | |-------------------------------------------------------|--------------------| | ptr_to_heavyweight_monitor:30 | lock:2 | Heavyweight Locked | |-------------------------------------------------------|--------------------| | | lock:2 | Marked for GC | |-------------------------------------------------------|--------------------|
需要了解的几个点： Mark Word 存储的数据会随着状态（右列）的变化而改变 identity_hashcode 是延迟加载的当对象被锁定时（非Normal状态）， identity_hashcode 会移动到管程 Monitor 中 https://www.jianshu.com/p/3d38cba67f8b
2、源码分析
2.1 FastHashCode
从hashCode源码一路往下找，很容易就找到了 FastHashCode
openjdk11\src\hotspot\share\runtime\synchronizer.cpp intptr_t ObjectSynchronizer::FastHashCode(Thread * Self, oop obj) { if (UseBiasedLocking) { // == safepoint相关，英文不好就不翻译了 == // NOTE: many places throughout the JVM do not expect a safepoint // to be taken here, in particular most operations on perm gen // objects. However, we only ever bias Java instances and all of // the call sites of identity_hash that might revoke biases have // been checked to make sure they can handle a safepoint. The // added check of the bias pattern is to avoid useless calls to // thread-local storage. if (obj->mark()->has_bias_pattern()) { // Handle for oop obj in case of STW safepoint Handle hobj(Self, obj); // Relaxing assertion for bug 6320749. assert(Universe::verify_in_progress() || !SafepointSynchronize::is_at_safepoint(), "biases should not be seen by VM thread here"); BiasedLocking::revoke_and_rebias(hobj, false, JavaThread::current()); obj = hobj(); assert(!obj->mark()->has_bias_pattern(), "biases should be revoked by now"); } } ObjectMonitor* monitor = NULL; markOop temp, test; intptr_t hash; markOop mark = ReadStableMark(obj); // == 读对象头MarkWord == if (mark->is_neutral()) { // == 判断是否normal状态 == hash = mark->hash(); // == 从mark里取hash == if (hash) { // == 已经存在，直接返回 == return hash; } hash = get_next_hash(Self, obj); // == 真正生成hash的方法 == temp = mark->copy_set_hash(hash); // == 复制mark串并将新的hash写入 == // use (machine word version) atomic operation to install the hash // 下面这段是CAS常规操作，Java源码里也随处可见，一定要熟悉 // 结合ConcurrentHashMap可以给某些业务加分段锁，非常好用 test = obj->cas_set_mark(temp, mark); // == 通过CAS将新Mark设置给obj，返回设置前的值 == if (test == mark) { // == 如果设置前的值和刚开始读到的值相等，表明期间没被其他线程修改过 == return hash; } // If atomic operation failed, we must inflate the header // into heavy weight monitor. We could add more code here // for fast path, but it does not worth the complexity. } else if (mark->has_monitor()) { // == 判断加锁（重量级锁） == // == 前文提到加锁的对象`identity_hashcode`会移动到管程`Monitor`中 == monitor = mark->monitor(); // == 从monitor中获取hashcode == temp = monitor->header(); assert(temp->is_neutral(), "invariant"); hash = temp->hash(); if (hash) { return hash; } // Skip to the following code to reduce code size } else if (Self->is_lock_owned((address)mark->locker())) { // == 判断是否锁重入 == temp = mark->displaced_mark_helper(); // this is a lightweight monitor owned assert(temp->is_neutral(), "invariant"); hash = temp->hash(); // by current thread, check if the displaced if (hash) { // header contains hash code return hash; } // WARNING: // The displaced header is strictly immutable. // It can NOT be changed in ANY cases. So we have // to inflate the header into heavyweight monitor // even the current thread owns the lock. The reason // is the BasicLock (stack slot) will be asynchronously // read by other threads during the inflate() function. // Any change to stack may not propagate to other threads // correctly. } // == 无锁或轻量级锁无法获得hash，则升级为重量级锁 == // == 基本逻辑和上面相似，不再赘述 == // Inflate the monitor to set hash code monitor = ObjectSynchronizer::inflate(Self, obj, inflate_cause_hash_code); // Load displaced header and check it has hash code mark = monitor->header(); assert(mark->is_neutral(), "invariant"); hash = mark->hash(); if (hash == 0) { hash = get_next_hash(Self, obj); temp = mark->copy_set_hash(hash); // merge hash code into header assert(temp->is_neutral(), "invariant"); test = Atomic::cmpxchg(temp, monitor->header_addr(), mark); if (test != mark) { // The only update to the header in the monitor (outside GC) // is install the hash code. If someone add new usage of // displaced header, please update this code hash = test->hash(); assert(test->is_neutral(), "invariant"); assert(hash != 0, "Trivial unexpected object/monitor header usage."); } } // We finally get the hash return hash; }
2.2 get_next_hash
核心逻辑。根据全局hashCode的值决定使用哪种算法。可以通过配置 -XX:hashCode=[0-5] 修改。
openjdk11\src\hotspot\share\runtime\synchronizer.cpp static inline intptr_t get_next_hash(Thread * Self, oop obj) { intptr_t value = 0; // == hashCode是一个全局变量，指明要使用的hashCode算法 == if (hashCode == 0) { // This form uses global Park-Miller RNG. // On MP system we'll have lots of RW access to a global, so the // mechanism induces lots of coherency traffic. value = os::random(); // == 0 使用系统随机数 == } else if (hashCode == 1) { // This variation has the property of being stable (idempotent) // between STW operations. This can be useful in some of the 1-0 // synchronization schemes. // == 1 在内存地址基础上进行二次运算 == intptr_t addrBits = cast_from_oop<intptr_t>(obj) >> 3; value = addrBits ^ (addrBits >> 5) ^ GVars.stwRandom; } else if (hashCode == 2) { // == 2 测试用途 == value = 1; // for sensitivity testing } else if (hashCode == 3) { value = ++GVars.hcSequence; // == 3 自增序列 == } else if (hashCode == 4) { value = cast_from_oop<intptr_t>(obj); // == 4 使用内存地址 == } else { // == 5 某个"名字说了你也记不住"的算法，默认使用这个算法 == // Marsaglia's xor-shift scheme with thread-specific state // This is probably the best overall implementation -- we'll // likely make this the default in future releases. unsigned t = Self->_hashStateX; t ^= (t << 11); Self->_hashStateX = Self->_hashStateY; Self->_hashStateY = Self->_hashStateZ; Self->_hashStateZ = Self->_hashStateW; unsigned v = Self->_hashStateW; v = (v ^ (v >> 19)) ^ (t ^ (t >> 8)); Self->_hashStateW = v; value = v; } value &= markOopDesc::hash_mask; if (value == 0) value = 0xBAD; assert(value != markOopDesc::no_hash, "invariant"); TEVENT(hashCode: GENERATE); return value; }
2.3 MarkWord
openjdk11\src\hotspot\share\oops\markOop.cpp // 通过源码可以简单看出32位与64位的区别 // 本次分析以32位为例 class markOopDesc: public oopDesc { private: // Conversion uintptr_t value() const { return (uintptr_t) this; } public: // 字段大小 // Constants enum { age_bits = 4, // age lock_bits = 2, // lock biased_lock_bits = 1, // biased_lock max_hash_bits = BitsPerWord - age_bits - lock_bits - biased_lock_bits, // 32-4-2-1=25 hash_bits = max_hash_bits > 31 ? 31 : max_hash_bits, // identity_hashcode:25 cms_bits = LP64_ONLY(1) NOT_LP64(0), // 64位包含1位unused，32位没有 epoch_bits = 2 // Biased状态下的epoch }; // 字段所在位置偏移 // The biased locking code currently requires that the age bits be // contiguous to the lock bits. enum { lock_shift = 0, biased_lock_shift = lock_bits, // << 2 age_shift = lock_bits + biased_lock_bits, // << 3 cms_shift = age_shift + age_bits, // << 7 hash_shift = cms_shift + cms_bits, // << 7 (32位) epoch_shift = hash_shift }; // 计算掩码 enum { lock_mask = right_n_bits(lock_bits), // b11 lock_mask_in_place = lock_mask << lock_shift, // b11 biased_lock_mask = right_n_bits(lock_bits + biased_lock_bits), // b111 biased_lock_mask_in_place= biased_lock_mask << lock_shift, // b111 biased_lock_bit_in_place = 1 << biased_lock_shift, // b100 age_mask = right_n_bits(age_bits), // b1111 age_mask_in_place = age_mask << age_shift, // b111_1000 epoch_mask = right_n_bits(epoch_bits), // b11 epoch_mask_in_place = epoch_mask << epoch_shift, // b1_1000_0000 cms_mask = right_n_bits(cms_bits), // b1 cms_mask_in_place = cms_mask << cms_shift // b1000_0000 #ifndef _WIN64 // 32位 ,hash_mask = right_n_bits(hash_bits), // b1_11111111_11111111_11111111 hash_mask_in_place = (address_word)hash_mask << hash_shift // b11111111_11111111_11111111_10000000 #endif }; .... } // 没找到right_n_bits源码，随便写了个Java版的 public static int right_n_bits(int n) { int x = 1; for (int i = 1; i < n; i++) { x = (x << 1) + 1; } return x; }
2.4 copy_set_hash
openjdk11\src\hotspot\share\oops\markOop.cpp // 拷贝Mark值的副本，并将hash填入 // (不确定，从逻辑上看应该是拷贝了，但看tmp变量名和value()的定义又像指针操作) markOop copy_set_hash(intptr_t hash) const { // 假设传入hash值为 b01010101 /* hash_mask_in_place b11111111_11111111_11111111_10000000 ~ b00000000_00000000_00000000_01111111 value() b00010010_00010001_00011000_00001001 假定值 & b00000000_00000000_00000000_00001001 => tmp */ intptr_t tmp = value() & (~hash_mask_in_place); /* hash b0_00000000_00000000_01010101 hash_mast b1_11111111_11111111_11111111 & b0_00000000_00000000_01010101 截掉超出的位数 << 7 b00000000_00000000_00101010_10000000 tmp b00000000_00000000_00000000_00001001 |= b00000000_00000000_00101010_10001001 > tmp */ tmp |= ((hash & hash_mask) << hash_shift); return (markOop)tmp; }
2.5 cas_set_mark markOop oopDesc::cas_set_mark(markOop new_mark, markOop old_mark) { // atomic_cmpxchg_at 未找到源码 // 推测返回值为修改前的值 return HeapAccess<>::atomic_cmpxchg_at(new_mark, as_oop(), mark_offset_in_bytes(), old_mark); }
这段代码没什么好说的，和我们常用的 AtomicXXX 很相似 AtomicReference<String> ref = new AtomicReference<>("A"); String old = ref.compareAndExchange("A", "B"); // V compareAndExchange(V expectedValue, V newValue) System.out.println(old); // A System.out.println(ref.get()); // B
3、总结默认情况下，hashCode跟内存地址无关，是通过某种（叫不上名字的）算法计算得来的 hashCode是延迟计算的，一旦设置，不会改变可以通过配置 -XX:hashCode=[0-5] 更改算法
4、Q&A
Q：我了解这个有用吗？
A：初/中级工作中：没啥用高级工作中：不知道，没达到面试中：能把不了解的面试官说懵逼。当然他不懂也不会问其他：把程序启动命令改成 -XX:hashCode=2 来坑队友（误。要用在正当用途比如学习）