
On Choosing the Size of a Hash Table

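This question came up in my database course. Roughly: given a set S, a hash function, and a corresponding hash table of length m into which the elements of S are mapped, how large must m be so that, for a given x, the probability of using the hash table to decide that x is not in S is less than 0.05?
The teacher said that any American undergraduate can prove this, and that this kind of reasoning ability is what Chinese undergraduates and graduate students lack.
I had no idea where to start. I could only wonder what this had to do with m, and after class I could not find suitable material either. Here I collect some English-language material I found on choosing the length of a hash table.
If you only want the key points, skip to the summary at the end.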

UCSD_EDU

http://cseweb.ucsd.edu/~kube/cls/100/Lectures/lec16/lec16-8.html
Hash table size

  • By “size” of the hash table we mean how many slots or buckets it has

  • Choice of hash table size depends in part on choice of hash function, and collision resolution strategy

  • But a good general “rule of thumb” is:
    The hash table should be an array with length about 1.3 times the maximum number of keys that will actually be in the table, and

  • Size of hash table array should be a prime number

  • So, let M = the next prime larger than 1.3 times the number of keys you will want to store in the table, and create the table as an array of length M

  • (If you underestimate the number of keys, you may have to create a larger table and rehash the entries when it gets too full; if you overestimate the number of keys, you will be wasting some space)
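
The rule of thumb above can be sketched in a few lines of Python. This is only an illustration, not code from the lecture notes; the names `is_prime` and `table_size` are hypothetical, and trial division is assumed to be fast enough for realistic table sizes.

```python
import math

def is_prime(n: int) -> bool:
    """Trial-division primality test; adequate for table-sized numbers."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    for d in range(3, math.isqrt(n) + 1, 2):
        if n % d == 0:
            return False
    return True

def table_size(num_keys: int, factor: float = 1.3) -> int:
    """Return the next prime strictly larger than factor * num_keys."""
    candidate = math.floor(num_keys * factor) + 1
    while not is_prime(candidate):
        candidate += 1
    return candidate

print(table_size(100))   # next prime above 130
print(table_size(1000))  # next prime above 1300
```

For 100 expected keys this gives a table of 131 slots, and for 1000 keys a table of 1301 slots.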

How is the size of a hash table determined? How should optimization be done for it to be fast?

https://www.quora.com/How-is-the-size-of-a-hash-table-determined-How-should-optimization-be-done-for-it-to-be-fast

  • HOW

It’s a tuning parameter - it depends what you’re trying to optimize and what resources you have or are willing to commit but thinking performance will be proportional to average collision chain length is the right thing to be managing.

I don’t know what your application is but assuming you optimize collision handling 3-4 chains will be blisteringly fast on any modern laptop (up). If you’re on a phone or smaller device you might find this is more of a size/speed trade-off.

It’s a common myth for this kind of hash-table that you should pick a prime number but because your hash value has a tendency to have a fixed remainder modulo 33 you should pick something co-prime with 33.
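
The remark about a fixed remainder modulo 33 makes sense if the hash being discussed is of the multiply-by-33 kind (h = h * 33 + c). That is an assumption here, because the answer does not show the hash itself. The sketch below uses a hypothetical `hash33` without 32-bit truncation to show the effect: modulo 33 the result depends only on the last character, so a table of 33 buckets would put every word ending in the same letter into one bucket.

```python
def hash33(s: str) -> int:
    """Multiply-by-33 string hash (assumed form), without 32-bit truncation."""
    h = 5381
    for ch in s:
        h = h * 33 + ord(ch)
    return h

# All of these words end in 'e'.
words = ["table", "probe", "resize", "prime", "store"]

# With 33 buckets, only the last character decides the bucket.
buckets_33 = {hash33(w) % 33 for w in words}

# With a size co-prime with 33 (here 64), earlier characters matter too.
buckets_64 = {hash33(w) % 64 for w in words}

print(buckets_33, buckets_64)
```

With 33 buckets all five words collide, while with 64 buckets they spread over more than one bucket.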

A smart choice is a power of 2. That’s because you can obtain the remainder by masking bits with & and avoid a (relatively) costly / implicit in your %.
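
A minimal sketch of the masking trick, assuming non-negative integer hash values; the function name `bucket_index` is hypothetical. For a table size that is a power of 2, `h & (size - 1)` gives the same result as `h % size`.

```python
def bucket_index(h: int, table_size: int) -> int:
    """Map a non-negative hash to a bucket using a bit mask instead of %."""
    if table_size <= 0 or table_size & (table_size - 1):
        raise ValueError("table_size must be a power of 2")
    return h & (table_size - 1)

# The mask and the modulo agree for every hash value tried here.
print(all(bucket_index(h, 64) == h % 64 for h in range(10_000)))
```

In Python the speed difference is negligible; the saving matters in lower-level languages where integer division is noticeably slower than a bitwise AND.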

NB: A side-effect of using powers of 2 is that it’s easy to divide or combine collision chains if you resize the table dynamically. However I get the impression you have a static dictionary and won’t be re-sizing.
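
To illustrate why chains are easy to divide: when a power-of-2 table doubles from `old_size` to `2 * old_size`, every entry in bucket `i` either stays in bucket `i` or moves to bucket `i + old_size`, and a single bit of the stored hash decides which. The sketch below assumes each chain entry is a `(hash, key)` pair; `split_chain` is a hypothetical helper, not something from the answer.

```python
def split_chain(chain, old_size):
    """Split one bucket's chain when a power-of-2 table doubles in size.

    Entries whose hash has the `old_size` bit clear stay in the same bucket;
    the others move to bucket index + old_size.
    """
    stay, move = [], []
    for h, key in chain:
        (move if h & old_size else stay).append((h, key))
    return stay, move

# Bucket 3 of an 8-slot table, about to grow to 16 slots.
chain = [(h, str(h)) for h in range(100) if h & 7 == 3]
stay, move = split_chain(chain, 8)
print([h for h, _ in stay][:3], [h for h, _ in move][:3])
```

Combining works the same way in reverse: when the table halves, buckets `i` and `i + new_size` are simply concatenated.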

  • Optimized?

First, make sure you retain the full hash (an unsigned 32-bit int will be likely suitable).
Second when traversing a collision chain compare hash before value. If the hashes don’t match you don’t need a (relatively) expensive string comparison.
The hash you’ve chosen is known to have good performance with English text you should find few if any collisions at full 32-bit hash comparison and make next to zero failed string comparisons.
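
A minimal sketch of the first two suggestions, assuming separate chaining with Python lists. The class `ChainedTable`, the counter `string_compares`, and the default `hash33` function are all hypothetical stand-ins, since the answer does not show the asker's code.

```python
from typing import Callable

def hash33(s: str) -> int:
    """Multiply-by-33 string hash truncated to 32 bits (an assumed stand-in)."""
    h = 5381
    for ch in s:
        h = (h * 33 + ord(ch)) & 0xFFFFFFFF
    return h

class ChainedTable:
    """Chained hash table that stores the full hash next to each key."""

    def __init__(self, num_buckets: int, hash_fn: Callable[[str], int] = hash33):
        self.hash_fn = hash_fn
        self.buckets = [[] for _ in range(num_buckets)]
        self.string_compares = 0  # counts the expensive key comparisons in get()

    def put(self, key: str, value) -> None:
        h = self.hash_fn(key)
        chain = self.buckets[h % len(self.buckets)]
        for i, (entry_hash, entry_key, _) in enumerate(chain):
            if entry_hash == h and entry_key == key:
                chain[i] = (h, key, value)  # overwrite an existing key
                return
        chain.append((h, key, value))

    def get(self, key: str):
        h = self.hash_fn(key)
        for entry_hash, entry_key, value in self.buckets[h % len(self.buckets)]:
            if entry_hash != h:
                continue  # cheap integer test rules this entry out
            self.string_compares += 1
            if entry_key == key:
                return value
        return None

table = ChainedTable(8)
table.put("the", 1)
print(table.get("the"), table.get("wayzgoose"))
```

The string comparison only runs when the stored hash already matches, so a lookup that misses usually costs a handful of integer comparisons.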

Third consider ordering the collision chain. If access is random order it by hash value.
That way you can dive out of a chain when you realize the look-up value can’t be held.
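
A sketch of a chain kept sorted by hash value, again assuming `(hash, key)` entries; `insert_sorted` and `find` are hypothetical names. The search stops as soon as it sees a stored hash larger than the one being looked up, because nothing later in the chain can match.

```python
import bisect

def insert_sorted(chain, h, key):
    """Insert (hash, key) so the chain stays ordered by hash value."""
    bisect.insort(chain, (h, key))

def find(chain, h, key):
    """Return (found, entries_examined) for a chain ordered by hash."""
    examined = 0
    for entry_hash, entry_key in chain:
        examined += 1
        if entry_hash > h:
            return False, examined  # passed the only possible position
        if entry_hash == h and entry_key == key:
            return True, examined
    return False, examined

chain = []
for h, key in [(7, "d"), (1, "a"), (5, "c"), (3, "b")]:
    insert_sorted(chain, h, key)

print(find(chain, 4, "x"))  # stops at the entry with hash 5
```

On average a failed lookup then scans about half of the chain instead of all of it.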

Alternatively if access isn’t random consider ordering the collision chains by a static or dynamic frequency. Static frequency would be based on occurrence of a word in some text “corpus”. That is you’d want ‘the’ to appear at the front of its collision chain and ‘wayzgoose’ likely towards the end!
Dynamic frequency would involve moving words that are ‘hit’ to the front of their collision chain knowing words recur in a given text.
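
Both orderings can be sketched on a chain that is just a list of words. The function names and the corpus counts below are made up for illustration.

```python
def order_by_frequency(chain, corpus_counts):
    """Static ordering: most frequent words (per a corpus count) come first."""
    return sorted(chain, key=lambda word: -corpus_counts.get(word, 0))

def lookup_move_to_front(chain, word):
    """Dynamic ordering: a word that is hit moves to the front of its chain."""
    for i, entry in enumerate(chain):
        if entry == word:
            if i:
                chain.insert(0, chain.pop(i))
            return True
    return False

counts = {"the": 100, "of": 60, "wayzgoose": 1}  # hypothetical corpus counts
print(order_by_frequency(["wayzgoose", "the", "of"], counts))

chain = ["wayzgoose", "of", "the"]
lookup_move_to_front(chain, "the")
print(chain)
```

Static ordering needs a corpus up front but costs nothing at lookup time; move-to-front needs no corpus but mutates the chain on every hit.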

If you are writing a spell checker (and I’ve somewhat assumed that’s the application) I really do recommend finding a corpus. It doesn’t even have to be very big because (of course) the common words are common and will sort to the front very quickly and even if the ‘uncommon’ words aren’t optimized - they have less impact because they’re uncommon!

PS: I also know a practically perfect (in the formal sense) hash for English words however I think you’ll find your hash is pretty good.

Summary

  1. The choice of hash table size depends in part on the choice of hash function and on the collision resolution strategy.

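  2. What is really being discussed is the balance between performance and average collision chain length.
    In general, a smaller table saves space, but collisions become more likely, so chains get longer and lookups slower (the hash function has an even bigger influence).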

  3. Collision handling is a very important part, far more important than the size of the table.

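  4. Rules of thumb:

  • About 1.3 times the maximum number of keys that will actually be in the table
  • A prime number
  • A power of 2 is another option: the bucket index can be computed with a cheap bit mask, and as a side effect collision chains are easy to split or merge if the table has to be resized dynamically.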
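  5. In the end this still does not answer the original question about the probability and its proof. My feeling is that the teacher was talking about the probability of hash collisions. Maybe I misheard, and of course it is also possible that the teacher misspoke.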
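
One possible reading of the original question, and it is only an assumption because the exact wording is uncertain, is the following: the n elements of S are hashed uniformly at random into m slots, each slot only records whether it is occupied, and an x that is not in S is wrongly reported as present when its slot happens to be occupied. Under those assumptions the error probability is 1 - (1 - 1/m)^n, which is roughly 1 - e^(-n/m), so keeping it below 0.05 needs m of about n / (-ln 0.95), or roughly 19.5 n. The hypothetical helpers below compute the exact smallest m for this model.

```python
def false_positive_rate(n: int, m: int) -> float:
    """Chance that a key outside S lands in an occupied slot (uniform hashing)."""
    return 1.0 - (1.0 - 1.0 / m) ** n

def min_table_size(n: int, p: float = 0.05) -> int:
    """Smallest m with false_positive_rate(n, m) < p, for n >= 1."""
    # Closed-form estimate, stepped back slightly to guard against rounding.
    estimate = int(1.0 / (1.0 - (1.0 - p) ** (1.0 / n)))
    m = max(1, estimate - 1)
    while false_positive_rate(n, m) >= p:
        m += 1
    return m

for n in (1, 100, 10_000):
    m = min_table_size(n)
    print(n, m, round(m / n, 2))
```

This is far larger than the 1.3 n rule of thumb, which is sized for lookup speed with stored keys and not for a bound on false positives.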