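"Redis Design and Implementation", Part 4: Hash Functions in the Dictionary
1 Preface
This post collects notes on the hash functions. I don't understand the algorithms yet, so for now I'm just writing them down; the redis dictionary structure itself is already covered by plenty of other articles. The original article is at http://www.concentric.net/~Ttwang/tech/inthash.htm; I list it here out of respect, although the link no longer opens. Fortunately, someone has mirrored it on GitHub:
https://gist.github.com/badboy/6267743 Highly recommended. Reading it is how I learned that Bob Jenkins proposed several general-purpose hash algorithms for strings, and that Thomas Wang, building on Jenkins' work, designed hash algorithms for fixed-size integer input, which is what redis uses. And of course, for those who enjoy the details of the derivation: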
2 Design Principles
First, the original text:
Hash Function Construction Principles
A good mixing function must be reversible. A hash function has form h(x) -> y. If the input word size and the output word size are identical, and in addition the operations in h() are reversible, then the following properties are true.
- If h(x1) == y1, then there is an inverse function h_inverse(y1) == x1
- Because the inverse function exists, there cannot be a value x2 such that x1 != x2, and h(x2) == y1.
The case of h(x1) == y1, and h(x2) == y1 is called a collision. Using only reversible operations in a hash function makes collisions impossible. There is an one-to-one mapping between the input and the output of the mixing function.
Beside reversibility, the operations must use a chain of computations to achieve avalanche. Avalanche means that a single bit of difference in the input will make about 1/2 of the output bits be different. At a point in the chain, a new result is obtained by a computation involving earlier results.
For example, the operation a = a + b is reversible if we know the value of 'b', and the after value of 'a'. The before value of 'a' is obtained by subtracting the after value of 'a' with the value of 'b'.
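Roughly, this comes down to two principles:
1. A good mixing function should be reversible. That is, when the input and output have the same word size, for an input x and output y with f(x) = y there must be a g such that g(y) = x. Put plainly, the function can turn a value x into a key, and that key can be turned back into x, so two different inputs never collide.
2. A good hash function should produce an avalanche effect. Avalanche is defined at the bit level: changing a single bit of the input should change about half of the output bits.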
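To make the first principle concrete, here is a minimal C sketch (not from the original article) of two reversible steps with their inverses, plus one irreversible step for contrast. All function names are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Forward step from the quoted text: a = a + b (wraps modulo 2^32). */
uint32_t add_step(uint32_t a, uint32_t b) { return a + b; }

/* Inverse: subtract b from the "after" value of a. */
uint32_t add_step_inverse(uint32_t a_after, uint32_t b) { return a_after - b; }

/* Forward step of the kind used in integer mixers: x ^= x >> 16. */
uint32_t xorshift16(uint32_t x) { return x ^ (x >> 16); }

/* Inverse: the top 16 bits pass through unchanged, so shifting the
 * output right by 16 recovers exactly the value that was XORed in. */
uint32_t xorshift16_inverse(uint32_t y) { return y ^ (y >> 16); }

/* An irreversible step for contrast: masking discards bits, so two
 * different inputs can map to the same output (a collision). */
uint32_t mask_step(uint32_t x) { return x & 0xFFu; }

/* Hypothetical self-check: reversible steps round-trip, the mask collides. */
void demo_reversible_steps(void) {
    uint32_t x = 0xDEADBEEFu;
    assert(add_step_inverse(add_step(x, 12345u), 12345u) == x);
    assert(xorshift16_inverse(xorshift16(x)) == x);
    assert(mask_step(0x100u) == mask_step(0x200u)); /* both give 0 */
}
```

Calling demo_reversible_steps() from a main function exercises all three cases. A chain built only from steps like the first two stays one-to-one, which is the property the quoted text relies on.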
3 hash
3.1 Thomas Wang's 32 bit Mix Function
The original article first introduces Knuth's Multiplicative Method and Robert Jenkins' 96 bit Mix Function.
It then presents the author's own algorithm:
public int hash32shift(int key)
{
  key = ~key + (key << 15); // key = (key << 15) - key - 1;
  key = key ^ (key >>> 12);
  key = key + (key << 2);
  key = key ^ (key >>> 4);
  key = key * 2057; // key = (key + (key << 3)) + (key << 11);
  key = key ^ (key >>> 16);
  return key;
}
By taking advantages of the native instructions such as 'add complement', and 'shift & add', the above hash function runs in 11 machine cycles on HP 9000 workstations.
In other words, every step maps onto native CPU instructions, which is what makes the function fast.
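Forgive my weak math background: I did not grasp the key points, such as why the constants are 12, 2, 4 and so on. Going by what the author said at the start, every step above should be reversible, and each step feeds key back into itself so that reversibility carries through the whole chain. As for the second principle, the avalanche effect, the additions, XORs and shifts are what achieve it. That is about as good as saying nothing. I recognize each line on its own, but I cannot say why the combination works so well.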
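The constants remain unexplained here, but the avalanche property itself can at least be observed. The following is a minimal C sketch, assuming a straightforward port of the Java hash32shift above to uint32_t; popcount32, flipped_output_bits, average_flipped_bits and demo_avalanche are hypothetical helper names, not part of the original article.

```c
#include <assert.h>
#include <stdint.h>

/* C port of the Java hash32shift quoted above. Java's >>> on int is a
 * logical shift, which >> on uint32_t reproduces. */
uint32_t hash32shift(uint32_t key) {
    key = ~key + (key << 15);
    key = key ^ (key >> 12);
    key = key + (key << 2);
    key = key ^ (key >> 4);
    key = key * 2057u;
    key = key ^ (key >> 16);
    return key;
}

/* Count set bits (portable, no compiler builtins). */
int popcount32(uint32_t v) {
    int n = 0;
    while (v) { v &= v - 1; n++; }
    return n;
}

/* How many output bits change when input bit `bit` (0..31) is flipped. */
int flipped_output_bits(uint32_t x, int bit) {
    uint32_t y0 = hash32shift(x);
    uint32_t y1 = hash32shift(x ^ (UINT32_C(1) << bit));
    return popcount32(y0 ^ y1);
}

/* Average over all 32 single-bit flips of x. Ideal avalanche would put
 * this near 16, i.e. half of the 32 output bits. */
double average_flipped_bits(uint32_t x) {
    int total = 0;
    for (int bit = 0; bit < 32; bit++)
        total += flipped_output_bits(x, bit);
    return total / 32.0;
}

/* Every step is one-to-one, so a flipped input bit must change
 * at least one output bit. */
void demo_avalanche(void) {
    assert(flipped_output_bits(12345u, 0) >= 1);
    assert(flipped_output_bits(12345u, 31) >= 1);
}
```

Printing average_flipped_bits for a handful of inputs gives a rough feel for how close the function gets to the "about 1/2 of the output bits" target. It is an observation for those inputs, not a proof.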
Now the redis code:
/* Thomas Wang's 32 bit Mix Function */
unsigned int dictIntHashFunction(unsigned int key)
{
    key += ~(key << 15);
    key ^= (key >> 10);
    key += (key << 3);
    key ^= (key >> 6);
    key += ~(key << 11);
    key ^= (key >> 16);
    return key;
}
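The steps and shift amounts differ slightly from the version above. Next, reversibility (input ==> hash ==> inverse_hash ==> input). Note that the inverse below operates on a 64-bit key: its comments undo, in reverse order, the steps of a 64-bit mix function (shifts by 21, 24, 14, 28 and 31), not the 32-bit functions shown above.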
uint64_t inverse_hash(uint64_t key) {
    uint64_t tmp;
    // Invert key = key + (key << 31)
    tmp = key-(key<<31);
    key = key-(tmp<<31);
    // Invert key = key ^ (key >> 28)
    tmp = key^key>>28;
    key = key^tmp>>28;
    // Invert key *= 21
    key *= 14933078535860113213u;
    // Invert key = key ^ (key >> 14)
    tmp = key^key>>14;
    tmp = key^tmp>>14;
    tmp = key^tmp>>14;
    key = key^tmp>>14;
    // Invert key *= 265
    key *= 15244667743933553977u;
    // Invert key = key ^ (key >> 24)
    tmp = key^key>>24;
    key = key^tmp>>24;
    // Invert key = (~key) + (key << 21)
    tmp = ~key;
    tmp = ~(key-(tmp<<21));
    tmp = ~(key-(tmp<<21));
    key = ~(key-(tmp<<21));
    return key;
}
In code, the inverse hash is somewhat more complex than the hash itself. As usual, I can't follow it.
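The inverse_hash listing above works on a 64-bit key. As a hedged sketch (not taken from the original article or from redis), the same ideas can be applied to redis's 32-bit dictIntHashFunction: undo each step in reverse order. The names dict_int_hash, undo_xorshift, undo_add_not_shift, dict_int_hash_inverse and demo_round_trip are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Copy of redis's dictIntHashFunction, using uint32_t in place of
 * unsigned int (assumed to be 32 bits wide in the original). */
uint32_t dict_int_hash(uint32_t key) {
    key += ~(key << 15);
    key ^= (key >> 10);
    key += (key << 3);
    key ^= (key >> 6);
    key += ~(key << 11);
    key ^= (key >> 16);
    return key;
}

/* Undo y = x ^ (x >> s):  x = y ^ (y >> s) ^ (y >> 2s) ^ ...
 * The series telescopes and ends once the shift reaches 32 bits. */
static uint32_t undo_xorshift(uint32_t y, unsigned s) {
    uint32_t x = y;
    for (unsigned i = s; i < 32; i += s)
        x ^= y >> i;
    return x;
}

/* Undo y = x + ~(x << s) = x * (1 - 2^s) - 1  (mod 2^32).
 * With t = 2^s and 3s >= 32: (1 - t)(1 + t + t^2) = 1 - t^3 = 1,
 * so (1 + t + t^2) is the multiplicative inverse of (1 - t). */
static uint32_t undo_add_not_shift(uint32_t y, unsigned s) {
    uint32_t t = UINT32_C(1) << s;
    return (y + 1u) * (1u + t + t * t);
}

/* Apply the inverse of each step, last step first. */
uint32_t dict_int_hash_inverse(uint32_t key) {
    key = undo_xorshift(key, 16);
    key = undo_add_not_shift(key, 11);
    key = undo_xorshift(key, 6);
    key *= UINT32_C(0x38E38E39);  /* inverse of 9 mod 2^32; key += key << 3 is key * 9 */
    key = undo_xorshift(key, 10);
    key = undo_add_not_shift(key, 15);
    return key;
}

/* input ==> hash ==> inverse_hash ==> input */
void demo_round_trip(void) {
    assert(dict_int_hash_inverse(dict_int_hash(0xDEADBEEFu)) == 0xDEADBEEFu);
}
```

Each step is either an XOR with a right-shifted copy or a multiplication by an odd number plus a constant. Both are one-to-one on 32-bit words, which is why the whole chain can be undone.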
3.2 MurmurHash2
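MurmurHash is a well-known non-cryptographic hash function, suited to general hash-based lookup. There are three versions so far (MurmurHash1, MurmurHash2, MurmurHash3). The latest, MurmurHash3, produces 32-bit or 128-bit hash values. redis uses MurmurHash2, which produces 32-bit or 64-bit hash values. Unlike the integer hash described above, MurmurHash hashes a string (a sequence of bytes), and it still distributes well for keys that follow a strongly regular pattern. As before, let's look at the redis source.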
static uint32_t dict_hash_function_seed = 5381;

void dictSetHashFunctionSeed(uint32_t seed) {
    dict_hash_function_seed = seed;
}

uint32_t dictGetHashFunctionSeed(void) {
    return dict_hash_function_seed;
}

/* MurmurHash2, by Austin Appleby
 * Note - This code makes a few assumptions about how your machine behaves -
 * 1. We can read a 4-byte value from any address without crashing
 * 2. sizeof(int) == 4
 *
 * And it has a few limitations -
 *
 * 1. It will not work incrementally.
 * 2. It will not produce the same results on little-endian and big-endian
 *    machines.
 */
unsigned int dictGenHashFunction(const void *key, int len) {
    /* 'm' and 'r' are mixing constants generated offline.
     They're not really 'magic', they just happen to work well. */
    uint32_t seed = dict_hash_function_seed;
    const uint32_t m = 0x5bd1e995;
    const int r = 24;

    /* Initialize the hash to a 'random' value */
    uint32_t h = seed ^ len;

    /* Mix 4 bytes at a time into the hash */
    const unsigned char *data = (const unsigned char *)key;

    while(len >= 4) {
        uint32_t k = *(uint32_t*)data;

        k *= m;
        k ^= k >> r;
        k *= m;

        h *= m;
        h ^= k;

        data += 4;
        len -= 4;
    }

    /* Handle the last few bytes of the input array */
    switch(len) {
    case 3: h ^= data[2] << 16;
    case 2: h ^= data[1] << 8;
    case 1: h ^= data[0]; h *= m;
    };

    /* Do a few final mixes of the hash to ensure the last few
     * bytes are well-incorporated. */
    h ^= h >> 13;
    h *= m;
    h ^= h >> 15;

    return (unsigned int)h;
}
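Murmur computes the hash code of a string. The basic idea is to split the key into n groups of 4 bytes, treat each group as one uint32_t, and mix the groups into h one at a time over n rounds. Any remaining 1 to 3 bytes are folded in afterwards, and h then goes through a few final mixing steps to give a well-spread hash code. The parts with real technical depth are things like the choice of dict_hash_function_seed, m and r; explaining those would take a real expert, and unfortunately I can't follow them.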
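The comment block in the listing names two machine assumptions: that a 4-byte value can be read from any address, and that results differ between little-endian and big-endian machines. As a hedged sketch (not part of redis), here are two hypothetical helpers that could stand in for the `*(uint32_t*)data` read. The first removes only the alignment assumption; the second also fixes the byte order.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical replacement for `*(uint32_t*)data`: memcpy is valid for
 * any alignment. The result still depends on the host byte order. */
uint32_t load_u32_native(const unsigned char *p) {
    uint32_t k;
    memcpy(&k, p, sizeof k);
    return k;
}

/* Hypothetical byte-order-independent variant: always assembles the
 * four bytes as a little-endian value, whatever the host is. */
uint32_t load_u32_le(const unsigned char *p) {
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Reading from an odd offset is fine with either helper. */
void demo_loads(void) {
    const unsigned char bytes[5] = {0x00, 0x78, 0x56, 0x34, 0x12};
    assert(load_u32_le(bytes + 1) == 0x12345678u);
    (void)load_u32_native(bytes + 1);  /* value depends on host byte order */
}
```

On a little-endian host both helpers return the same value. On a big-endian host, using load_u32_le would produce different hash values than the original code does, so it is a deliberate behavior change rather than a drop-in fix.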
redis also builds a case-insensitive hash function on top of the djb hash, dict.c/dictGenCaseHashFunction:
/* And a case insensitive hash function (based on djb hash) */
unsigned int dictGenCaseHashFunction(const unsigned char *buf, int len) {
    unsigned int hash = (unsigned int)dict_hash_function_seed;

    while (len--)
        hash = ((hash << 5) + hash) + (tolower(*buf++)); /* hash * 33 + c */
    return hash;
}
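The idea is to start from a seed and fold in the ASCII value of each character, lowercased, through len rounds of hash * 33 + c to produce the final hash value.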
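A small worked example makes the hash * 33 + c recurrence easier to follow. This is a minimal sketch that assumes the seed stays at the default 5381 shown earlier; case_hash, demo_seed and demo_case_hash are hypothetical names, and the function body mirrors dictGenCaseHashFunction.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>

/* Assumption: the seed is fixed at the default 5381 for illustration;
 * redis reads it from dict_hash_function_seed instead. */
static const uint32_t demo_seed = 5381;

/* Same recurrence as dictGenCaseHashFunction: hash = hash * 33 + lower(c). */
uint32_t case_hash(const unsigned char *buf, int len) {
    uint32_t hash = demo_seed;
    while (len--)
        hash = ((hash << 5) + hash) + (uint32_t)tolower(*buf++);
    return hash;
}

void demo_case_hash(void) {
    /* One round by hand: 5381 * 33 + 'a' (97) = 177573 + 97 = 177670. */
    assert(case_hash((const unsigned char *)"a", 1) == 177670u);
    /* 'A' is lowercased first, so it lands on the same value. */
    assert(case_hash((const unsigned char *)"A", 1) == 177670u);
    /* Keys that differ only in letter case hash identically. */
    assert(case_hash((const unsigned char *)"Key", 3) ==
           case_hash((const unsigned char *)"kEY", 3));
}
```

An empty key returns the seed unchanged, since the loop body never runs.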
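Finally:
Java has hashing too: hash(), hashCode() and so on.
Take the familiar HashMap as an example.
In JDK 1.7, hash() is written so that a change in any one of the 32 bits of an object's hashCode changes the return value, and changes in the high bits carry down into the low bits.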
static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
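JDK 1.8 takes a compromise: it uses far fewer bit operations, simply XORing the high 16 bits into the low 16 bits, and does not try to handle the extreme cases.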
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
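Then there is the modulo step, first = tab[(n - 1) & hash], where n is the table length. This is a bit operation: because the table size is always a power of two (tablesize = 2^k), key_addr = hash_value % tablesize = hash_value & (tablesize - 1). A bitwise AND replaces the remainder operation and gives the same result.
On to reading more code.
Reference: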
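Both points, the mask replacing the modulo and the reason for folding high bits into low bits, can be checked with a few lines. The sketch below is a C port written for illustration, not JDK or redis code; spread, bucket_index and demo_bucket_index are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* JDK 1.8-style spreading ported to C: XOR the high 16 bits into the low 16. */
uint32_t spread(uint32_t h) { return h ^ (h >> 16); }

/* Bucket index for a table whose size is a power of two. */
uint32_t bucket_index(uint32_t hash, uint32_t table_size) {
    return hash & (table_size - 1u);
}

void demo_bucket_index(void) {
    /* Power-of-two size: the mask equals the remainder (37 % 8 == 5). */
    assert(bucket_index(37u, 8u) == 37u % 8u);

    /* Not a power of two: the shortcut breaks (37 & 9 == 1, 37 % 10 == 7). */
    assert((37u & (10u - 1u)) != 37u % 10u);

    /* Hash codes that differ only in the high bits collide under the mask... */
    assert(bucket_index(0x10000u, 16u) == bucket_index(0x20000u, 16u));

    /* ...but land in different buckets once the high bits are folded in. */
    assert(bucket_index(spread(0x10000u), 16u) == 1u);
    assert(bucket_index(spread(0x20000u), 16u) == 2u);
}
```

The mask only ever looks at the low bits, so without some spreading step, keys whose hash codes differ only above the mask would all share one bucket.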
https://blog.csdn.net/jasper_xulei/article/details/18364313