HashMap DEFAULT_LOAD_FACTOR = 0.75 has nothing to do with the Poisson distribution
Many people claim that HashMap's DEFAULT_LOAD_FACTOR = 0.75f was chosen because it "satisfies the Poisson distribution". That is a textbook case of half-understanding something and misleading others with it. In reality, setting the default load factor to 0.75 has nothing whatsoever to do with the Poisson distribution; the way entries land in bins under random hashing follows a Poisson distribution all by itself.
The block of comments at the top of HashMap in Java 8 and later exists to explain why Tree Bins were introduced (that is, why a bin in the backing array is converted from linked-list nodes to red-black tree nodes).
1. The binomial distribution
A binomial distribution describes n repeated, independent Bernoulli trials. Each trial has only two possible outcomes, the two outcomes are mutually exclusive, and the trials are independent of one another.
Two key points:
- Each trial is independent: the nth trial is not affected by trial n-1 and does not affect trial n+1;
- There are exactly two mutually exclusive outcomes: success or failure, and P(success) + P(failure) = 1;
Plotting a binomial distribution means plotting the probability of exactly k successes over n trials. With the binomial formula, all you need is the total number of trials n, the desired number of successes k, and the per-trial success probability p to compute the probability of exactly k successes.
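For reference, the standard binomial probability mass function behind such a plot (a well-known formula, not reproduced from the original figure) is:

P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)

where C(n, k) counts the ways to choose which k of the n trials succeed.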
2. The Poisson distribution
The Poisson distribution is a discrete probability distribution, typically used to estimate the probability that a given number of events occur within a fixed interval of time or space.
The Poisson distribution is the limiting form of the binomial distribution as p approaches zero and n approaches infinity; its probability mass function is given below.
This same probability mass function also appears in the HashMap comment.
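For reference (the standard formula, standing in for the original figure), the Poisson probability mass function with parameter λ, the expected number of events per interval, is:

P(X = k) = (λ^k * e^(-λ)) / k!

This is exactly the form the HashMap comment instantiates with λ = 0.5.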
3. How the comment relates load factor to the Poisson distribution
Continuing from the HashMap comment quoted in part 2: it gives the Poisson probability mass function, and beyond that the underlined passage, translated below, deserves particular attention. The original comment reads as follows.
Ideally, under random hashCodes, the frequency of nodes in bins follows a Poisson distribution (http://en.wikipedia.org/wiki/Poisson_distribution) with a parameter of about 0.5 on average for the default resizing threshold of 0.75, although with a large variance because of resizing granularity.
That part of the comment, translated:
Under ideal random hashCodes, the frequency of nodes in bins follows a Poisson distribution (http://en.wikipedia.org/wiki/Poisson_distribution); for the default resizing threshold of 0.75, the Poisson parameter λ (the average number of nodes per bin) is about 0.5, although the variance is large because of resizing granularity.
So this passage (indeed the whole block of comments at the top of HashMap, which is not about the load factor at all) does not explain why the load factor is set to 0.75; it says that, given the default resizing threshold of 0.75, the parameter λ of the Poisson probability mass function is 0.5. The PMF the comment then gives, (exp(-0.5) * pow(0.5, k) / factorial(k)), uses 0.5 as the value of λ; compare it with the Poisson PMF shown earlier.
Neither this passage nor the whole comment block at the top of HashMap explains why the default load factor is 0.75. What it says is that the load factor affects the parameter λ in the Poisson PMF; for example, with load factor = 0.75f, λ = 0.5. In terms of the Poisson formula: k = 8 is the number of entries we ask a bin to hold, e is the usual mathematical constant, and λ depends on the load factor. (The Poisson distribution estimates the probability of a given number of events in a fixed interval of time or space; here, when 0.75 * length entries are hashed into an array of length length, it gives the probability that any particular slot ends up holding k entries.)
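As a sanity check, the probability table quoted in the comment (reproduced in the appendix below) can be recomputed directly from this PMF with λ = 0.5. A minimal, self-contained sketch; the class name PoissonCheck is just illustrative:

```java
public class PoissonCheck {
    public static void main(String[] args) {
        double lambda = 0.5;    // average bin occupancy at the default load factor of 0.75
        double factorial = 1.0; // k!, built up incrementally
        for (int k = 0; k <= 8; k++) {
            if (k > 0) {
                factorial *= k;
            }
            // Poisson PMF: P(X = k) = exp(-lambda) * lambda^k / k!
            double p = Math.exp(-lambda) * Math.pow(lambda, k) / factorial;
            System.out.printf("%d: %.8f%n", k, p);
        }
    }
}
```

Running it prints 0: 0.60653066 down to 8: 0.00000006, matching the values listed in the JDK comment.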
How to read this comment in the HashMap of Java 8 and later:
- The content and purpose of this passage is to explain why Tree Bins were introduced in the Java 8 HashMap (that is, why a bin in the backing array is converted from linked-list nodes to red-black tree nodes)
- The original comment, underlined in the figure above: Because TreeNodes are about twice the size of regular nodes, we use them only when bins contain enough nodes to warrant use (see TREEIFY_THRESHOLD).
- A TreeNode improves the performance of lookups, insertions and removals within a bin, but each tree node is roughly twice the size of a linked-list node
- Even though TreeNode exists, bins are not converted lightly (frequent conversions would be expensive); by the Poisson distribution the conversion is a rare event, so the trade-off is worthwhile
- The Poisson distribution is the limiting form of the binomial distribution, with two key requirements: the events are independent, and each has exactly two mutually exclusive outcomes
- The Poisson distribution gives the probability of a given number of events occurring within an interval of time or space
- For any single bin of HashMap's table[], storing an entry either puts it into that bin or it does not; this action satisfies the two requirements of the binomial distribution
- For a table of HashMap.table[].length slots holding 0.75 * length entries, the probability that a given bin holds a particular number of nodes is exactly the data listed in the comment (the probabilities of a given slot holding 0 to 8 entries)
- For example, with the default table[].length = 16, putting 12 (0.75 * length) entries into the map, the probability that a particular bin holds 8 nodes is 0.00000006
- After one resize, 16 * 2 = 32: with 24 entries in a table of length 32, the probability that a particular bin holds 8 nodes is still 0.00000006
- After another resize, 32 * 2 = 64: with 48 entries in a table of length 64, the probability that a particular bin holds 8 nodes is still 0.00000006
So when a bin reaches 8 or more nodes, converting it from linked-list nodes to TreeNodes is worth the cost; a simplified sketch of that decision follows.
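Roughly, the decision looks like the sketch below. The constant values match TREEIFY_THRESHOLD (named in the comment) and MIN_TREEIFY_CAPACITY in java.util.HashMap, but the class and method here (TreeifySketch, shouldTreeify) are illustrative only, not the JDK's actual code:

```java
// Illustrative sketch of the treeify decision, not verbatim JDK source.
public class TreeifySketch {
    static final int TREEIFY_THRESHOLD = 8;     // same value as java.util.HashMap.TREEIFY_THRESHOLD
    static final int MIN_TREEIFY_CAPACITY = 64; // same value as java.util.HashMap.MIN_TREEIFY_CAPACITY

    // Should a bin that has just grown to binSize nodes be converted
    // from a linked list to a red-black tree?
    static boolean shouldTreeify(int binSize, int tableLength) {
        // A long bin in a small table is cheaper to fix by resizing the table,
        // so treeification only happens once the table itself is large enough.
        return binSize >= TREEIFY_THRESHOLD && tableLength >= MIN_TREEIFY_CAPACITY;
    }

    public static void main(String[] args) {
        System.out.println(shouldTreeify(8, 16)); // false: table too small, HashMap resizes instead
        System.out.println(shouldTreeify(8, 64)); // true: this bin becomes a tree bin
    }
}
```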
4. The real reason for DEFAULT_LOAD_FACTOR = 0.75f
The real reason the load factor is 0.75 is documented in Java 7, Java 8 and later (that comment sits in the class-level Javadoc before the public class HashMap declaration; the implementation-notes comment discussed in this article sits after the class declaration). As shown in the figure: a load factor that is too small wastes space and triggers resizes more often, while one that is too large increases hash collisions and hurts performance. So 0.75 is simply a compromise between the two, and has nothing whatsoever to do with the Poisson distribution.
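A small illustration of that trade-off: the resize threshold is simply capacity * loadFactor, so with the default load factor the table doubles once it is three-quarters full. A minimal standalone sketch (the class name LoadFactorDemo is just for illustration):

```java
public class LoadFactorDemo {
    public static void main(String[] args) {
        float loadFactor = 0.75f; // HashMap.DEFAULT_LOAD_FACTOR
        // The map resizes (doubles its table) once size exceeds capacity * loadFactor.
        for (int capacity = 16; capacity <= 64; capacity *= 2) {
            int threshold = (int) (capacity * loadFactor);
            System.out.printf("capacity=%d -> resize after %d entries%n", capacity, threshold);
        }
        // A smaller load factor wastes space and resizes (copies) more often;
        // a larger one packs bins more densely and makes collisions more likely.
    }
}
```

The printed thresholds (12, 24, 48) are the entry counts used in the examples in part 3.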
5. Additional notes
6. Appendix: the original code comment
Implementation notes.
This map usually acts as a binned (bucketed) hash table, but when bins get too large, they are transformed into bins of TreeNodes, each structured similarly to those in java.util.TreeMap. Most methods try to use normal bins, but relay to TreeNode methods when applicable (simply by checking instanceof a node). Bins of TreeNodes may be traversed and used like any others, but additionally support faster lookup when overpopulated. However, since the vast majority of bins in normal use are not overpopulated, checking for existence of tree bins may be delayed in the course of table methods.
Tree bins (i.e., bins whose elements are all TreeNodes) are ordered primarily by hashCode, but in the case of ties, if two elements are of the same "class C implements Comparable<C>", type then their compareTo method is used for ordering. (We conservatively check generic types via reflection to validate this -- see method comparableClassFor). The added complexity of tree bins is worthwhile in providing worst-case O(log n) operations when keys either have distinct hashes or are orderable. Thus, performance degrades gracefully under accidental or malicious usages in which hashCode() methods return values that are poorly distributed, as well as those in which many keys share a hashCode, so long as they are also Comparable. (If neither of these apply, we may waste about a factor of two in time and space compared to taking no precautions. But the only known cases stem from poor user programming practices that are already so slow that this makes little difference.)
Because TreeNodes are about twice the size of regular nodes, we use them only when bins contain enough nodes to warrant use (see TREEIFY_THRESHOLD). And when they become too small (due to removal or resizing) they are converted back to plain bins. In usages with well-distributed user hashCodes, tree bins are rarely used. Ideally, under random hashCodes, the frequency of nodes in bins follows a Poisson distribution (http://en.wikipedia.org/wiki/Poisson_distribution) with a parameter of about 0.5 on average for the default resizing threshold of 0.75, although with a large variance because of resizing granularity. Ignoring variance, the expected occurrences of list size k are (exp(-0.5) * pow(0.5, k) / factorial(k)). The first values are:
0: 0.60653066
1: 0.30326533
2: 0.07581633
3: 0.01263606
4: 0.00157952
5: 0.00015795
6: 0.00001316
7: 0.00000094
8: 0.00000006
more: less than 1 in ten million
The root of a tree bin is normally its first node. However, sometimes (currently only upon Iterator.remove), the root might be elsewhere, but can be recovered following parent links (method TreeNode.root()).
All applicable internal methods accept a hash code as an argument (as normally supplied from a public method), allowing them to call each other without recomputing user hashCodes. Most internal methods also accept a "tab" argument, that is normally the current table, but may be a new or old one when resizing or converting.
When bin lists are treeified, split, or untreeified, we keep them in the same relative access/traversal order (i.e., field Node.next) to better preserve locality, and to slightly simplify handling of splits and traversals that invoke iterator.remove. When using comparators on insertion, to keep a total ordering (or as close as is required here) across rebalancings, we compare classes and identityHashCodes as tie-breakers.
The use and transitions among plain vs tree modes is complicated by the existence of subclass LinkedHashMap. See below for hook methods defined to be invoked upon insertion, removal and access that allow LinkedHashMap internals to otherwise remain independent of these mechanics. (This also requires that a map instance be passed to some utility methods that may create new nodes.)
The concurrent-programming-like SSA-based coding style helps avoid aliasing errors amid all of the twisty pointer operations.