Frame of Reference and Roaring Bitmaps
https://www.elastic.co/cn/blog/frame-of-reference-and-roaring-bitmaps
February 18, 2015 · Engineering
By Adrien Grand
Postings lists
While it may surprise you if you are new to search engine internals, one of the most important building blocks of a search engine is the ability to efficiently compress and quickly decode sorted lists of integers. Why is this useful? As you may know, Elasticsearch shards,
Why are these doc IDs useful? An inverted index needs to map each term to the list of documents that contain it, called a postings list, and the doc IDs that we just discussed are a perfect fit since they can be compressed efficiently.
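To make the term-to-postings-list mapping concrete, here is a minimal Python sketch. It is only an illustration: the `build_inverted_index` helper and the whitespace tokenization are assumptions made for this example, not how Lucene analyzes or stores text.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to its postings list: the doc IDs of documents containing it."""
    index = defaultdict(list)
    # Doc IDs are assigned sequentially (0, 1, 2, ...), so every postings
    # list is filled in increasing order.
    for doc_id, text in enumerate(docs):
        # Hypothetical tokenization: lowercase, split on whitespace, dedupe.
        for term in set(text.lower().split()):
            index[term].append(doc_id)
    return dict(index)


docs = [
    "the quick brown fox",
    "the lazy dog",
    "quick brown dogs",
]
index = build_inverted_index(docs)
print(index["quick"])  # [0, 2]
print(index["the"])    # [0, 1]
```

Because doc IDs are handed out in increasing order, each postings list comes out as an increasing sequence of small integers.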
Frame Of Reference
In order to be able to compute intersections and unions efficiently, we require that these postings lists are sorted. A nice side-effect of this decision is that postings lists can be compressed with delta-encoding.
For instance, if your postings list is [73, 300, 302, 332, 343, 372], the list of deltas would be [73, 227, 2, 30, 11, 29]. What is interesting to note here is that all deltas are between 0 and 255, so you only need one byte per value. This is the technique that Lucene uses to encode your inverted index on disk: postings lists are split into blocks of 256 doc IDs, and each block is then compressed separately using delta-encoding and bit packing. Lucene computes the maximum number of bits required to store deltas in a block, adds this information to the block header, and then encodes all deltas of the block using this number of bits. This encoding technique is known as Frame Of Reference (FOR) in the literature and has been used since Lucene 4.1.
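The scheme described above can be sketched in a few lines of Python. This is a simplified illustration, not Lucene's actual on-disk format: the function names, the `(bits, count, payload)` block layout, and the use of a Python big integer for bit packing are all assumptions made for this example. A block size of 3 is used so the output stays readable.

```python
def delta_encode(doc_ids):
    """Replace each doc ID with its gap from the previous one (first gap is from 0)."""
    deltas, prev = [], 0
    for doc_id in doc_ids:
        deltas.append(doc_id - prev)
        prev = doc_id
    return deltas


def for_encode(doc_ids, block_size=3):
    """Encode sorted doc IDs as a list of (bits_per_value, count, payload) blocks."""
    deltas = delta_encode(doc_ids)
    blocks = []
    for start in range(0, len(deltas), block_size):
        block = deltas[start:start + block_size]
        # Number of bits needed for the largest delta in this block.
        bits = max(1, max(block).bit_length())
        # Bit-pack every delta of the block using exactly `bits` bits each.
        packed = 0
        for delta in block:
            packed = (packed << bits) | delta
        n_bytes = (bits * len(block) + 7) // 8
        blocks.append((bits, len(block), packed.to_bytes(n_bytes, "big")))
    return blocks


def for_decode(blocks):
    """Unpack the deltas of each block and rebuild doc IDs with a running sum."""
    doc_ids, prev = [], 0
    for bits, count, payload in blocks:
        packed = int.from_bytes(payload, "big")
        mask = (1 << bits) - 1
        for i in range(count):
            shift = bits * (count - 1 - i)
            prev += (packed >> shift) & mask
            doc_ids.append(prev)
    return doc_ids


postings = [73, 300, 302, 332, 343, 372]
blocks = for_encode(postings, block_size=3)
print(delta_encode(postings))                       # [73, 227, 2, 30, 11, 29]
print([(bits, len(p)) for bits, _, p in blocks])    # [(8, 3), (5, 2)]
print(for_decode(blocks) == postings)               # True
```

In this sketch the first block needs 8 bits per delta (3 bytes of payload), while the second block only needs 5 bits per delta (15 bits, rounded up to 2 bytes). Choosing the bit width per block is what lets a run of small deltas be stored compactly even when another block contains a large one.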
Here is an example with a block size of 3 (instead of 256 in practice):