Data Structures: An Introduction to Roaring Bitmaps
阿新 • Published: 2018-11-04
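Background:
BitMap is a commonly used data structure. Bitmap indexes are widely used in databases and search engines: they can quickly determine whether a given value is present, they store a set of integers compactly, and they can speed up queries considerably. However, a plain BitMap can still take a lot of memory (its size grows linearly with the range of values it covers), so we usually need to compress it as well. Roaring BitMaps (RBM for short) is one such compression scheme.
In short: BitMap is a data structure / compression technique, and RBM is a data structure / compression technique built on the BitMap idea.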
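To make the linear growth concrete, here is a minimal sketch of an uncompressed bitmap. The class name `PlainBitmap` and its methods are made up for illustration and do not come from any library.

```python
class PlainBitmap:
    """Uncompressed bitmap over the integers in [0, size) (illustrative only)."""

    def __init__(self, size: int) -> None:
        self.size = size
        # One bit per possible value, rounded up to whole bytes.
        self.bits = bytearray((size + 7) // 8)

    def add(self, x: int) -> None:
        # Byte index is x // 8, bit position inside the byte is x % 8.
        self.bits[x >> 3] |= 1 << (x & 7)

    def contains(self, x: int) -> bool:
        return bool(self.bits[x >> 3] & (1 << (x & 7)))


bm = PlainBitmap(1_000_000)
bm.add(42)
bm.add(999_999)
print(bm.contains(42), bm.contains(43))  # True False
print(len(bm.bits))  # 125000 bytes, although only two values are stored
```

The memory cost depends on the range covered, not on how many values are actually present, which is the problem RBM tries to address.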
How it works:
Here is an excerpt from the original paper:
- We partition the range of 32-bit indexes ([0; n)) into chunks of 2^16 integers sharing the same 16 most significant digits. We use specialized containers to store their 16 least significant bits.
- When a chunk contains no more than 4096 integers, we use a sorted array of packed 16-bit integers. When there are more than 4096 integers, we use a 2^16-bit bitmap. Thus, we have two types of containers: an array container for sparse chunks and a bitmap container for dense chunks. The 4096 threshold insures that at the level of the containers, each integer uses no more than 16 bits: we either use 2^16 bits for more than 4096 integers, using less than 16 bits/integer, or else we use exactly 16 bits/integer.
- The containers are stored in a dynamic array with the shared 16 most-significant bits: this serves as a first-level index. The array keeps the containers sorted by the 16 most-significant bits. We expect this first-level index to be typically small: when n = 1 000 000, it contains at most 16 entries. Thus it should often remain in the CPU cache. The containers themselves should never use much more than 8 kB.
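The chunking and the "at most 16 entries" figure in the excerpt can be checked with a few lines of arithmetic. This is only a sketch; the helper names `split` and `chunk_count` are made up for illustration.

```python
def split(x: int) -> tuple:
    """Return (high 16 bits, low 16 bits) of a 32-bit unsigned integer."""
    return x >> 16, x & 0xFFFF


def chunk_count(n: int) -> int:
    """Number of 2^16-wide chunks needed to cover the range [0, n)."""
    return (n + 0xFFFF) >> 16  # ceiling of n / 65536


print(split(131_075))          # (2, 3), since 131075 = 2 * 65536 + 3
print(chunk_count(1_000_000))  # 16
```

All integers with the same high 16 bits land in the same chunk, so a range of one million values can touch at most 16 containers.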
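In plain terms:
1. Split each 32-bit integer in [0, n) into two parts: the high 16 bits and the low 16 bits.
2. The high 16 bits are used to locate where the data is stored; the low 16 bits are stored in a container (isn't that essentially a HashMap-like structure?).
A note on containers: the containers are kept in a dynamic array, sorted by their high 16 bits. A container holding no more than 4096 values is stored as a sorted array of 16-bit shorts; one holding more than 4096 values is stored as a 2^16-bit BitMap.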
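The two steps and the container rule can be sketched as a toy structure. This is not the implementation from the paper or from any Roaring library: the name `ToyRoaring` is hypothetical, and the first-level index is a Python dict rather than the sorted dynamic array the paper describes, which keeps the example short.

```python
from bisect import bisect_left

ARRAY_LIMIT = 4096  # threshold quoted from the paper excerpt


class ToyRoaring:
    """Toy two-level set of 32-bit unsigned integers (illustration only)."""

    def __init__(self) -> None:
        # high 16 bits -> container. A container is either a sorted list of
        # low 16-bit values ("array container") or an 8192-byte bytearray
        # ("bitmap container"). A dict stands in for the sorted first-level
        # array of the paper (simplifying assumption).
        self.containers = {}

    def add(self, x: int) -> None:
        high, low = x >> 16, x & 0xFFFF
        c = self.containers.setdefault(high, [])
        if isinstance(c, bytearray):
            c[low >> 3] |= 1 << (low & 7)
            return
        i = bisect_left(c, low)
        if i < len(c) and c[i] == low:
            return  # already present
        c.insert(i, low)
        if len(c) > ARRAY_LIMIT:
            # More than 4096 values: convert to a 2^16-bit bitmap.
            bitmap = bytearray(8192)
            for v in c:
                bitmap[v >> 3] |= 1 << (v & 7)
            self.containers[high] = bitmap

    def contains(self, x: int) -> bool:
        high, low = x >> 16, x & 0xFFFF
        c = self.containers.get(high)
        if c is None:
            return False
        if isinstance(c, bytearray):
            return bool(c[low >> 3] & (1 << (low & 7)))
        i = bisect_left(c, low)
        return i < len(c) and c[i] == low


rb = ToyRoaring()
for v in range(5000):
    rb.add(v)       # chunk 0 grows past 4096 values and becomes a bitmap
rb.add(70_000)      # chunk 1 holds a single value and stays an array
print(type(rb.containers[0]).__name__, type(rb.containers[1]).__name__)
print(rb.contains(4999), rb.contains(70_000), rb.contains(5000))
```

A real implementation would also convert a bitmap container back to an array when removals make it sparse again; that is left out here.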
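Why use two data structures to store the low 16 bits:
short array: 2 bytes * 4096 = 8 KB
BitMap: covering the full 16-bit range takes 65536 / 8 = 8192 bytes = 8 KB, no matter how many values are stored.
So with fewer than 4096 values the short array takes less space; at exactly 4096 values the two are the same size.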
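The comparison above is simple enough to verify directly. A small sketch, with `array_bytes` and `BITMAP_BYTES` as illustrative names only:

```python
def array_bytes(count: int) -> int:
    """Sorted array container: one 2-byte short per stored value."""
    return 2 * count


# Bitmap container: one bit for each of the 65536 possible low values,
# independent of how many of them are actually set.
BITMAP_BYTES = 65536 // 8  # 8192

for count in (100, 4095, 4096, 4097, 60_000):
    if array_bytes(count) < BITMAP_BYTES:
        smaller = "array"
    elif array_bytes(count) == BITMAP_BYTES:
        smaller = "tie"
    else:
        smaller = "bitmap"
    print(count, array_bytes(count), BITMAP_BYTES, smaller)
```

The array cost grows by 2 bytes per value while the bitmap cost stays fixed at 8192 bytes, so the two curves cross at exactly 4096 values, which matches the threshold in the paper.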