Python dict source code analysis

阿新 • Published: 2019-02-04

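Hash tables and hash collisions

Python's dict is a hash table: a data structure that is accessed directly by key. In other words, it locates a record by mapping the key to a position in a table, which speeds up lookup. The mapping function is called the hash function, and the array that stores the records is called the hash table.

Ideally, different objects produce different hash values. As more data is stored, however, different objects may end up with the same hash value, or with hash values that map to the same position in the table. This situation is called a hash collision.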
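A small sketch of how two different keys can land on the same position. It assumes the table size is a power of two and that the position is taken from the low bits of the hash, as the lookdict code later in this post does with hash & mask. The helper name home_slot is made up for this illustration, and the fact that small integers hash to themselves is a CPython detail.

```python
def home_slot(key, table_size=8):
    """Return the first slot probed for `key` (hypothetical helper).

    table_size is assumed to be a power of two, so `table_size - 1`
    works as a bit mask that keeps only the low bits of the hash.
    """
    return hash(key) & (table_size - 1)


# In CPython, hash(1) == 1 and hash(9) == 9: the hashes differ,
# but both keys share the same low three bits.
print(home_slot(1), home_slot(9))  # 1 1
```

The two keys have different hash values, yet they compete for the same slot in an 8-slot table, so the collision has to be resolved by probing.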

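How Python resolves hash collisions: open addressing

Python uses open addressing to resolve hash collisions. When a collision occurs, Python uses a secondary probing function f to compute the next candidate slot. If that slot is free, the data is inserted there; if not, f is called again to get the next candidate. Because the table always keeps some slots free, repeated probing eventually reaches a usable slot.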
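The probing function can be sketched in Python from the two lines at the end of lookdict, shown later in this post. This is only an illustration: PERTURB_SHIFT = 5 is the value I believe CPython uses but it does not appear in the excerpt, the 64-bit mask stands in for the C cast to size_t, and probe_sequence is a made-up name.

```python
from itertools import islice

PERTURB_SHIFT = 5  # assumed value; not shown in the excerpt


def probe_sequence(h, table_size):
    """Yield the slots visited for hash `h` in a power-of-two sized table."""
    mask = table_size - 1
    perturb = h & 0xFFFFFFFFFFFFFFFF  # emulate the unsigned size_t cast
    i = perturb & mask
    while True:
        yield i
        perturb >>= PERTURB_SHIFT
        i = (i * 5 + perturb + 1) & mask


first_eight = list(islice(probe_sequence(hash(1), 8), 8))
print(first_eight)  # [1, 6, 7, 4, 5, 2, 3, 0]
```

Once perturb has been shifted down to zero, the recurrence i = (5*i + 1) & mask cycles through every slot of the table, which is why a free slot is always reached as long as one exists.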

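The problem with open addressing

By applying the probing function f repeatedly, a whole series of slots can be reached from one starting slot. We can think of these slots as forming a "collision probe chain". The problem appears when an element on this chain has to be deleted. Suppose the first element on the chain is a, the last is c, and we delete b in the middle. The chain is now broken: the next search for c starts at a and follows the chain step by step, but stops at the gap where b used to be and never reaches c, so c cannot be found.
For this reason, when open addressing is used to resolve collisions, an element on the chain cannot really be removed. It can only be "pseudo-deleted".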
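The broken chain can be reproduced with a toy table. This sketch deliberately uses linear probing instead of CPython's probing function, to keep the chain easy to follow; ToyTable, EMPTY, DUMMY and demo are names invented for this example and are not CPython APIs.

```python
EMPTY = object()  # slot that has never held a key
DUMMY = object()  # tombstone left behind by a pseudo-deletion


class ToyTable:
    """Toy open-addressing set with linear probing (illustration only)."""

    def __init__(self, size=8):
        self.slots = [EMPTY] * size

    def _probe(self, key):
        size = len(self.slots)
        start = hash(key) % size
        for step in range(size):
            yield (start + step) % size

    def insert(self, key):
        # Assumes the key is not already present.
        for i in self._probe(key):
            if self.slots[i] is EMPTY or self.slots[i] is DUMMY:
                self.slots[i] = key
                return
        raise RuntimeError("table is full")

    def contains(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:      # end of the probe chain
                return False
            if slot is not DUMMY and slot == key:
                return True
        return False

    def delete(self, key, tombstone=True):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:
                return
            if slot is not DUMMY and slot == key:
                # A real delete resets the slot; a pseudo-delete marks it.
                self.slots[i] = DUMMY if tombstone else EMPTY
                return


def demo(tombstone):
    table = ToyTable()
    for key in (1, 9, 17):  # a, b, c: all start at slot 1 in an 8-slot table
        table.insert(key)
    table.delete(9, tombstone=tombstone)  # delete b in the middle
    return table.contains(17)             # can c still be found?


print(demo(tombstone=False))  # False
print(demo(tombstone=True))   # True
```

With a real deletion the search for 17 stops at the emptied slot; with a tombstone the search skips over it and continues along the chain.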

The three slot states of a Python dict: Unused, Active, Dummy

1. Unused.  index == DKIX_EMPTY
   Does not hold an active (key, value) pair now and never did.  Unused can
   transition to Active upon key insertion.  This is each slot's initial state.

2. Active.  index >= 0, me_key != NULL and me_value != NULL
   Holds an active (key, value) pair.  Active can transition to Dummy or
   Pending upon key deletion (for combined and split tables respectively).
   This is the only case in which me_value != NULL.

3. Dummy.  index == DKIX_DUMMY  (combined only)
   Previously held an active (key, value) pair, but that was deleted and an
   active pair has not yet overwritten the slot.  Dummy can transition to
   Active upon key insertion.  Dummy slots cannot be made Unused again
   else the probe sequence in case of collision would have no way to know
   they were once active.

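Unused: the slot has never stored a key or value. Every slot is in this state when the dict is initialized.
Active: once a key and value are stored in the slot, it enters the Active state.
Dummy: after a key and value are deleted, the slot cannot go from Active straight back to Unused, otherwise the probe chain would break as described earlier. When Python deletes a dict element, it instead marks the slot as Dummy. This is Python's "pseudo-deletion" technique.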

dict

The dict as defined in the Python source:

typedef struct {
    PyObject_HEAD

    /* Number of items in the dictionary */
    Py_ssize_t ma_used;

    /* Dictionary version: globally unique, value change each time
       the dictionary is modified */
    uint64_t ma_version_tag;

    PyDictKeysObject *ma_keys;

    /* If ma_values is NULL, the table is "combined": keys and values
       are stored in ma_keys.
       If ma_values is not NULL, the table is splitted:
       keys are stored in ma_keys and values are stored in ma_values */
    PyObject **ma_values;
} PyDictObject;

Creating a dict

A dict is created by PyDict_New(void). The source is:

PyObject *
PyDict_New(void)
{
    PyDictKeysObject *keys = new_keys_object(PyDict_MINSIZE);
    if (keys == NULL)
        return NULL;
    return new_dict(keys, NULL);
}

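new_keys_object mainly checks the requested capacity and allocates memory accordingly. The width of each entry in the index table (es) is 1, 2 or 4 bytes, or sizeof(Py_ssize_t), depending on the table size.

The code of new_keys_object: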

static PyDictKeysObject *new_keys_object(Py_ssize_t size)
{
    PyDictKeysObject *dk;
    Py_ssize_t es, usable;

    assert(size >= PyDict_MINSIZE);
    assert(IS_POWER_OF_2(size));

    usable = USABLE_FRACTION(size);
    if (size <= 0xff) {
        es = 1;
    }
    else if (size <= 0xffff) {
        es = 2;
    }
#if SIZEOF_VOID_P > 4
    else if (size <= 0xffffffff) {
        es = 4;
    }
#endif
    else {
        es = sizeof(Py_ssize_t);
    }

    if (size == PyDict_MINSIZE && numfreekeys > 0) {
        dk = keys_free_list[--numfreekeys];
    }
    else {
        dk = PyObject_MALLOC(sizeof(PyDictKeysObject)
                             - Py_MEMBER_SIZE(PyDictKeysObject, dk_indices)
                             + es * size
                             + sizeof(PyDictKeyEntry) * usable);
        if (dk == NULL) {
            PyErr_NoMemory();
            return NULL;
        }
    }
    DK_DEBUG_INCREF dk->dk_refcnt = 1;
    dk->dk_size = size;
    dk->dk_usable = usable;
    dk->dk_lookup = lookdict_unicode_nodummy;
    dk->dk_nentries = 0;
    memset(&dk->dk_indices.as_1[0], 0xff, es * size);
    memset(DK_ENTRIES(dk), 0, sizeof(PyDictKeyEntry) * usable);
    return dk;
}
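The sizing logic of new_keys_object can be mirrored in a few lines of Python. Treat this as a sketch under assumptions: USABLE_FRACTION is not defined in the excerpt, and I am assuming the (n << 1) / 3 definition; pointer_bytes stands in for both SIZEOF_VOID_P and sizeof(Py_ssize_t) on a 64-bit build. The function names are made up.

```python
def usable_fraction(size):
    """Assumed definition of USABLE_FRACTION: two thirds of the slots."""
    return (size << 1) // 3


def index_entry_size(size, pointer_bytes=8):
    """Bytes per index entry, following the if/else chain in new_keys_object."""
    if size <= 0xff:
        return 1
    if size <= 0xffff:
        return 2
    if pointer_bytes > 4 and size <= 0xffffffff:
        return 4
    return pointer_bytes


for size in (8, 256, 65536):
    es = index_entry_size(size)
    print(size, es, usable_fraction(size), es * size)
# 8 1 5 8
# 256 2 170 512
# 65536 4 43690 262144
```

The last column is the es * size term passed to PyObject_MALLOC and to the first memset. Filling those bytes with 0xff sets every index to -1, which marks each slot as empty before anything is inserted.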

The dict itself is then created by new_dict.
Its code is as follows:

static PyObject *
new_dict(PyDictKeysObject *keys, PyObject **values)
{
    PyDictObject *mp;
    assert(keys != NULL);
    if (numfree) {
        mp = free_list[--numfree];
        assert (mp != NULL);
        assert (Py_TYPE(mp) == &PyDict_Type);
        _Py_NewReference((PyObject *)mp);
    }
    else {
        mp = PyObject_GC_New(PyDictObject, &PyDict_Type);
        if (mp == NULL) {
            DK_DECREF(keys);
            free_values(values);
            return NULL;
        }
    }
    mp->ma_keys = keys;
    mp->ma_values = values;
    mp->ma_used = 0;
    mp->ma_version_tag = DICT_NEXT_VERSION();
    assert(_PyDict_CheckConsistency(mp));
    return (PyObject *)mp;
}

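This function creates the dict object.
It mainly does the following:

Checks whether the free list (object cache) holds a cached dict object. If it does, no new memory is allocated and the cached object is returned directly.
If nothing is cached, creates the dict object with PyObject_GC_New.
Initializes the keys and values (ma_keys and ma_values), sets ma_used to 0 and assigns a new version tag.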
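The free-list idea in the first two steps can be sketched as a tiny object pool. Everything here is hypothetical: Widget stands in for PyDictObject, MAX_FREE is an arbitrary cap chosen for the sketch, and release_widget plays the role of the deallocation path, which is not shown in this post.

```python
class Widget:
    """Hypothetical stand-in for PyDictObject."""

    def __init__(self):
        self.used = 0


MAX_FREE = 4      # arbitrary cap for this sketch
_free_list = []   # released objects waiting to be reused


def new_widget():
    """Reuse a cached object if one exists, otherwise allocate a new one."""
    if _free_list:
        widget = _free_list.pop()
    else:
        widget = Widget()
    widget.used = 0  # reinitialize, as new_dict resets ma_used
    return widget


def release_widget(widget):
    """Return an object to the pool instead of discarding it."""
    if len(_free_list) < MAX_FREE:
        _free_list.append(widget)
```

Calling release_widget on an object and then new_widget hands back the very same object, which is the effect of the numfree branch in new_dict: the allocation is skipped and only the fields are reset.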

Looking up an element by key

Lookup in a Python dict is implemented by lookdict, but Python also provides several specialized variants for particular cases:
lookdict_unicode: specialized version for string-only keys.
lookdict_unicode_nodummy: faster version of lookdict_unicode when it is known that there are no 'dummy' keys.
lookdict_split: version of lookdict for split tables. All split tables and only split tables use this lookup function. Split tables only contain unicode keys and no dummy keys, so the algorithm is the same as lookdict_unicode_nodummy.

Here we only look at lookdict.
Its definition is as follows:

static Py_ssize_t _Py_HOT_FUNCTION
lookdict(PyDictObject *mp, PyObject *key,
         Py_hash_t hash, PyObject **value_addr)
{
    size_t i, mask, perturb;
    PyDictKeysObject *dk;
    PyDictKeyEntry *ep0;

top:
    dk = mp->ma_keys;
    ep0 = DK_ENTRIES(dk);
    mask = DK_MASK(dk);
    perturb = hash;
    i = (size_t)hash & mask;

    for (;;) {
        Py_ssize_t ix = dk_get_index(dk, i);
        if (ix == DKIX_EMPTY) {
            *value_addr = NULL;
            return ix;
        }
        if (ix >= 0) {
            PyDictKeyEntry *ep = &ep0[ix];
            assert(ep->me_key != NULL);
            if (ep->me_key == key) {
                *value_addr = ep->me_value;
                return ix;
            }
            if (ep->me_hash == hash) {
                PyObject *startkey = ep->me_key;
                Py_INCREF(startkey);
                int cmp = PyObject_RichCompareBool(startkey, key, Py_EQ);
                Py_DECREF(startkey);
                if (cmp < 0) {
                    *value_addr = NULL;
                    return DKIX_ERROR;
                }
                if (dk == mp->ma_keys && ep->me_key == startkey) {
                    if (cmp > 0) {
                        *value_addr = ep->me_value;
                        return ix;
                    }
                }
                else {
                    /* The dict was mutated, restart */
                    goto top;
                }
            }
        }
        perturb >>= PERTURB_SHIFT;
        i = (i*5 + perturb + 1) & mask;
    }
    Py_UNREACHABLE();
}

Posted <2018-03-18 23:24>, to be continued

Continued on 2018-3-20

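1 The lookup starts at slot i = (size_t)hash & mask, which is the head of the probe chain for this hash.
2 dk_get_index reads the entry index stored in slot i. If it is DKIX_EMPTY, the key is not in the dict: *value_addr is set to NULL and the function returns.

3 If the slot is active (ix >= 0), the key being looked up is first compared with the stored key by identity (pointer equality). On a match the value is returned:
if (ep->me_key == key) { *value_addr = ep->me_value; return ix; }
4 Otherwise the stored hash is compared with the hash of the key being looked up. Only if the hashes are equal is PyObject_RichCompareBool called. It returns 1 if the keys are equal, 0 if they are not, and -1 on error, so a return value of 1 means the entry has been found. On -1 the function returns DKIX_ERROR, and if the dict was mutated during the comparison the search restarts from the top.

5 If none of the comparisons above match, the search continues along the probe chain:
perturb >>= PERTURB_SHIFT; i = (i*5 + perturb + 1) & mask;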
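The five steps can be put together in a simplified Python model of lookdict. It is a sketch, not the real implementation: the index table is a plain list, entries are (hash, key, value) tuples, the restart-on-mutation and error paths are left out, and the constant values (DKIX_EMPTY = -1, DKIX_DUMMY = -2, PERTURB_SHIFT = 5) are assumptions that the excerpt does not show. The insert helper is hypothetical and exists only to build a table for the demo.

```python
DKIX_EMPTY = -1    # assumed constant values
DKIX_DUMMY = -2
PERTURB_SHIFT = 5


def lookdict(indices, entries, key):
    """Return (entry index, value), or (DKIX_EMPTY, None) if the key is absent."""
    mask = len(indices) - 1
    h = hash(key)
    perturb = h & 0xFFFFFFFFFFFFFFFF   # emulate the size_t cast
    i = perturb & mask                 # step 1: head of the probe chain
    while True:
        ix = indices[i]
        if ix == DKIX_EMPTY:           # step 2: chain ends, key not present
            return DKIX_EMPTY, None
        if ix >= 0:                    # active slot; dummy slots are skipped
            e_hash, e_key, e_value = entries[ix]
            if e_key is key:           # step 3: identity check
                return ix, e_value
            if e_hash == h and e_key == key:   # step 4: hash, then equality
                return ix, e_value
        perturb >>= PERTURB_SHIFT      # step 5: move along the chain
        i = (i * 5 + perturb + 1) & mask


def insert(indices, entries, key, value):
    """Hypothetical helper: assumes the key is absent and a free slot exists."""
    mask = len(indices) - 1
    h = hash(key)
    perturb = h & 0xFFFFFFFFFFFFFFFF
    i = perturb & mask
    while indices[i] >= 0:
        perturb >>= PERTURB_SHIFT
        i = (i * 5 + perturb + 1) & mask
    indices[i] = len(entries)
    entries.append((h, key, value))


indices = [DKIX_EMPTY] * 8
entries = []
for k, v in ((1, "a"), (9, "b"), (17, "c")):   # all three start at slot 1
    insert(indices, entries, k, v)

print(lookdict(indices, entries, 17))  # (2, 'c')
print(lookdict(indices, entries, 25))  # (-1, None)

# Pseudo-delete key 9: mark its index slot as dummy so the chain stays intact.
indices[indices.index(1)] = DKIX_DUMMY
print(lookdict(indices, entries, 17))  # (2, 'c')
```

Key 17 sits behind key 9 on the same probe chain, and it is still found after key 9 is marked as dummy because the loop only stops at DKIX_EMPTY.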
