Bitcoin Source Code Scenario Analysis: A Close Reading of the Bloom Filter
阿新 • Published: 2019-01-02
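The previous article, on UTXO synchronization in SPV wallets, mentioned the Bloom filter. In this chapter we dissect it in depth by reading the source code.

Bloom filter basics

Figure: An example of a Bloom filter, representing the set {x, y, z}. The colored arrows show the positions in the bit array that each set element is mapped to. The element w is not in the set {x, y, z}, because it hashes to one bit-array position containing 0. For this figure, m = 18 and k = 3.

The bit array in the figure is an array of m bits, and the filter's key set is {x, y, z}. When x is added to the filter, the algorithm hashes x three different ways to produce 3 values, uses each value as an index into the bit array, and sets the bit at that index to 1. The three blue arrows are the 3 indexes for x. Checking w works the same way: w is hashed 3 times to get 3 indexes, and the bits at those indexes are read from the bit array. If any of them is not 1, w is definitely not in the set. If all of them are 1, w is treated as present.

A hash always produces the same output (index) for the same input, but different inputs may also produce the same output. So a Bloom filter never misses a key that was added, and a "not present" answer is always correct, but it may wrongly report a key that is not in the set as being in the set (a false positive).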
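The add and check steps described above can be sketched in a few lines of C++. This is only a toy illustration, not Bitcoin's CBloomFilter: the class name ToyBloomFilter, the method names, and the way the i-th index is derived (salting the key with i before calling std::hash) are all assumptions made for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Toy Bloom filter with m bits and k index derivations per key.
// Hypothetical illustration only; not Bitcoin's CBloomFilter.
class ToyBloomFilter {
public:
    ToyBloomFilter(std::size_t m, unsigned k) : bits_(m, false), k_(k) {}

    // Set the k bits that the key maps to.
    void add(const std::string& key) {
        for (unsigned i = 0; i < k_; ++i) bits_[index(i, key)] = true;
    }

    // false: the key was definitely never added.
    // true:  the key was probably added (false positives are possible).
    bool maybeContains(const std::string& key) const {
        for (unsigned i = 0; i < k_; ++i) {
            if (!bits_[index(i, key)]) return false;
        }
        return true;
    }

private:
    // Derive the i-th index by salting the key with i (an assumption of this sketch).
    std::size_t index(unsigned i, const std::string& key) const {
        return std::hash<std::string>{}(std::to_string(i) + ":" + key) % bits_.size();
    }

    std::vector<bool> bits_;
    unsigned k_;
};
```

With m = 18 and k = 3 as in the figure, adding "x", "y" and "z" guarantees that maybeContains returns true for each of them. For a key such as "w" the answer depends on which bits happen to be set, which is exactly the false-positive behavior described above.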
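Lowering the error rate means lowering the probability that a key outside the set lands only on bits that are already set. Let m be the length of the bit array and k the number of hashes. A larger m lowers the probability that a single hash of two different inputs gives the same index, and a larger k lowers the probability that all k indexes collide. Each extra hash also sets more bits per inserted key, though, so k cannot simply be made as large as possible. Choosing suitable m and k is therefore key to keeping the error rate low. There are standard formulas for this: for n inserted elements and a target false-positive rate p, m = -n ln p / (ln 2)^2 and k = (m / n) ln 2.

Bitcoin Bloom filter flow

1) load filter (net_processing.cpp)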
    else if (strCommand == NetMsgType::FILTERLOAD) {
        CBloomFilter filter;
        vRecv >> filter;

        if (!filter.IsWithinSizeConstraints()) {
            // There is no excuse for sending a too-large filter
            LOCK(cs_main);
            Misbehaving(pfrom->GetId(), 100);
        } else {
            LOCK(pfrom->cs_filter);
            pfrom->pfilter.reset(new CBloomFilter(filter));
            pfrom->pfilter->UpdateEmptyFull();
            pfrom->fRelayTxes = true;
        }
    }

Serialization of the filter data:

    template <typename Stream, typename Operation>
    inline void SerializationOp(Stream& s, Operation ser_action) {
        // vData is the filter's bit array, packed into bytes (not the keys themselves)
        READWRITE(vData);
        // number of hash functions
        READWRITE(nHashFuncs);
        READWRITE(nTweak);
        READWRITE(nFlags);
    }

2) add to filter

    else if (strCommand == NetMsgType::FILTERADD) {
        std::vector<unsigned char> vData;
        vRecv >> vData;

        // Nodes must NEVER send a data item > 520 bytes (the max size for a script data object,
        // and thus, the maximum size any matched object can have) in a filteradd message
        bool bad = false;
        if (vData.size() > MAX_SCRIPT_ELEMENT_SIZE) {
            bad = true;
        } else {
            LOCK(pfrom->cs_filter);
            if (pfrom->pfilter) {
                pfrom->pfilter->insert(vData);
            } else {
                bad = true;
            }
        }
        if (bad) {
            LOCK(cs_main);
            Misbehaving(pfrom->GetId(), 100);
        }
    }

This simply follows the Bloom filter algorithm: hash the new key several times, then set the corresponding bits in the bit array.

    void CBloomFilter::insert(const std::vector<unsigned char>& vKey)
    {
        if (isFull)
            return;
        // n different hashes do not require n different hash functions;
        // changing the hash seed according to the index i is enough
        for (unsigned int i = 0; i < nHashFuncs; i++) {
            unsigned int nIndex = Hash(i, vKey);
            // Sets bit nIndex of vData
            vData[nIndex >> 3] |= (1 << (7 & nIndex));
        }
        isEmpty = false;
    }

About the line vData[nIndex >> 3] |= (1 << (7 & nIndex)); above: each hash of the key yields the index of a single bit in the bit array. vData, however, is a vector of unsigned char, and each element holds 8 bits. So nIndex >> 3 (nIndex divided by 8) first finds the byte that contains the bit, and 1 << (7 & nIndex) (nIndex modulo 8) selects which of the 8 bits in that byte to set.

    class CBloomFilter
    {
    private:
        std::vector<unsigned char> vData;
        unsigned int nHashFuncs;
        unsigned int nTweak;
    }

nHashFuncs is just an unsigned int. So where are the different hash functions we were promised?

    inline unsigned int CBloomFilter::Hash(unsigned int nHashNum, const std::vector<unsigned char>& vDataToHash) const
    {
        // 0xFBA4C795 chosen as it guarantees a reasonable bit difference between nHashNum values.
        return MurmurHash3(nHashNum * 0xFBA4C795 + nTweak, vDataToHash) % (vData.size() * 8);
    }

This shows that n different hash functions can indeed be obtained from n different integers. Here, using nHashNum * 0xFBA4C795 + nTweak as the MurmurHash3 seed is enough to get the effect of different hashes.

3) Where the filter is applied

Take MSG_FILTERED_BLOCK as an example. A request with this inventory type asks for the contents of the block with the given block hash that match the Bloom filter.

    else if (inv.type == MSG_FILTERED_BLOCK)
    {
        bool sendMerkleBlock = false;
        CMerkleBlock merkleBlock;
        {
            LOCK(pfrom->cs_filter);
            if (pfrom->pfilter) {
                sendMerkleBlock = true;
                // merkleBlock contains only the block header, the matching tx hashes
                // and a partial Merkle path; it is a filtered view of the block content
                merkleBlock = CMerkleBlock(*pblock, *pfrom->pfilter);
            }
        }
        if (sendMerkleBlock) {
            // send back the merkleBlock
            connman->PushMessage(pfrom, msgMaker.Make(NetMsgType::MERKLEBLOCK, merkleBlock));
            // CMerkleBlock just contains hashes, so also push any transactions in the block the client did not see
            // This avoids hurting performance by pointlessly requiring a round-trip
            // Note that there is currently no way for a node to request any single transactions we didn't send here -
            // they must either disconnect and retry or request the full block.
            // Thus, the protocol spec specified allows for us to provide duplicate txn here,
            // however we MUST always provide at least what the remote peer needs
            typedef std::pair<unsigned int, uint256> PairType;
            for (PairType& pair : merkleBlock.vMatchedTxn)
                // send the transactions that match the filter
                connman->PushMessage(pfrom, msgMaker.Make(SERIALIZE_TRANSACTION_NO_WITNESS, NetMsgType::TX, *pblock->vtx[pair.first]));
        }
        // else
            // no response
    }

The filtering itself:

    CMerkleBlock::CMerkleBlock(const CBlock& block, CBloomFilter* filter, const std::set<uint256>* txids)
    {
        header = block.GetBlockHeader();

        std::vector<bool> vMatch;
        std::vector<uint256> vHashes;

        vMatch.reserve(block.vtx.size());
        vHashes.reserve(block.vtx.size());

        for (unsigned int i = 0; i < block.vtx.size(); i++)
        {
            const uint256& hash = block.vtx[i]->GetHash();
            if (txids && txids->count(hash)) {
                vMatch.push_back(true);
            } else if (filter && filter->IsRelevantAndUpdate(*block.vtx[i])) {
                vMatch.push_back(true);
                vMatchedTxn.emplace_back(i, hash);
            } else {
                vMatch.push_back(false);
            }
            vHashes.push_back(hash);
        }

        txn = CPartialMerkleTree(vHashes, vMatch);
    }

    bool CBloomFilter::IsRelevantAndUpdate(const CTransaction& tx)
    {
        bool fFound = false;
        // Match if the filter contains the hash of tx
        //  for finding tx when they appear in a block
        if (isFull)
            return true;
        if (isEmpty)
            return false;
        // get the tx hash and check whether it is in the Bloom filter
        const uint256& hash = tx.GetHash();
        if (contains(hash))
            fFound = true;

        for (unsigned int i = 0; i < tx.vout.size(); i++)
        {
            const CTxOut& txout = tx.vout[i];
            // Match if the filter contains any arbitrary script data element in any scriptPubKey in tx
            // If this matches, also add the specific output that was matched.
            // This means clients don't have to update the filter themselves when a new relevant tx
            // is discovered in order to find spending transactions, which avoids round-tripping and race conditions.
            CScript::const_iterator pc = txout.scriptPubKey.begin();
            std::vector<unsigned char> data;
            while (pc < txout.scriptPubKey.end())
            {
                opcodetype opcode;
                // extract each data element from the locking script
                if (!txout.scriptPubKey.GetOp(pc, opcode, data))
                    break;
                // check whether the data element is in the Bloom filter
                if (data.size() != 0 && contains(data))
                {
                    fFound = true;
                    break;
                }
            }
        }

        if (fFound)
            return true;

        for (const CTxIn& txin : tx.vin)
        {
            // Match if the filter contains an outpoint tx spends
            // i.e. check whether txin.prevout is in the Bloom filter
            if (contains(txin.prevout))
                return true;

            // Match if the filter contains any arbitrary script data element in any scriptSig in tx
            CScript::const_iterator pc = txin.scriptSig.begin();
            std::vector<unsigned char> data;
            while (pc < txin.scriptSig.end())
            {
                opcodetype opcode;
                // extract each data element from the unlocking script
                if (!txin.scriptSig.GetOp(pc, opcode, data))
                    break;
                // check whether the data element is in the Bloom filter
                if (data.size() != 0 && contains(data))
                    return true;
            }
        }

        return false;
    }

    bool CBloomFilter::contains(const std::vector<unsigned char>& vKey) const
    {
        for (unsigned int i = 0; i < nHashFuncs; i++)
        {
            unsigned int nIndex = Hash(i, vKey);
            // Checks bit nIndex of vData
            if (!(vData[nIndex >> 3] & (1 << (7 & nIndex))))
                return false;
        }
        return true;
    }

To summarize: the data used for filtering can be the tx hash, a data element in txout.scriptPubKey, or a data element in txin.scriptSig. As the code above shows, txin.prevout is also matched.
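For example, to filter transactions by public key, you can work with a transaction's txin and txout. A P2PK locking script carries the pubKey directly, and a P2PKH unlocking script carries the pubKey as well, so the pubKey can be used as the filter element.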
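To make the public-key case concrete, the hashing and bit-packing scheme from CBloomFilter::Hash, insert and contains can be reproduced outside Bitcoin Core. The following is a minimal sketch, not the Bitcoin Core implementation: SketchFilter, murmur3_32 and rotl32 are names invented here, MurmurHash3 (x86, 32-bit) is written out by hand so the block is self-contained, the isFull/isEmpty shortcuts are left out, and the filter size, hash count and the 33-byte "public key" are arbitrary placeholder values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

inline uint32_t rotl32(uint32_t x, int r) { return (x << r) | (x >> (32 - r)); }

// MurmurHash3 (x86, 32-bit variant), written out so the sketch is self-contained.
uint32_t murmur3_32(uint32_t seed, const std::vector<unsigned char>& data) {
    uint32_t h = seed;
    const uint32_t c1 = 0xcc9e2d51, c2 = 0x1b873593;
    const std::size_t nblocks = data.size() / 4;
    for (std::size_t i = 0; i < nblocks; ++i) {
        // Read one 4-byte block in little-endian order.
        uint32_t k = uint32_t(data[4 * i]) | (uint32_t(data[4 * i + 1]) << 8) |
                     (uint32_t(data[4 * i + 2]) << 16) | (uint32_t(data[4 * i + 3]) << 24);
        k *= c1; k = rotl32(k, 15); k *= c2;
        h ^= k; h = rotl32(h, 13); h = h * 5 + 0xe6546b64;
    }
    // Mix in the 1 to 3 trailing bytes, if any.
    const std::size_t tail = nblocks * 4;
    const std::size_t rem = data.size() & 3;
    uint32_t k = 0;
    if (rem >= 3) k ^= uint32_t(data[tail + 2]) << 16;
    if (rem >= 2) k ^= uint32_t(data[tail + 1]) << 8;
    if (rem >= 1) {
        k ^= data[tail];
        k *= c1; k = rotl32(k, 15); k *= c2;
        h ^= k;
    }
    // Finalization mix.
    h ^= static_cast<uint32_t>(data.size());
    h ^= h >> 16; h *= 0x85ebca6b; h ^= h >> 13; h *= 0xc2b2ae35; h ^= h >> 16;
    return h;
}

// Hypothetical stand-in for CBloomFilter that keeps only the three fields shown above.
struct SketchFilter {
    std::vector<unsigned char> vData; // bit array, 8 bits per byte
    unsigned int nHashFuncs;
    unsigned int nTweak;

    SketchFilter(std::size_t nBytes, unsigned int hashes, unsigned int tweak)
        : vData(nBytes, 0), nHashFuncs(hashes), nTweak(tweak) {}

    // One hash function, nHashFuncs different seeds.
    unsigned int Hash(unsigned int nHashNum, const std::vector<unsigned char>& key) const {
        return murmur3_32(nHashNum * 0xFBA4C795 + nTweak, key) % (vData.size() * 8);
    }

    void insert(const std::vector<unsigned char>& key) {
        for (unsigned int i = 0; i < nHashFuncs; i++) {
            unsigned int nIndex = Hash(i, key);
            vData[nIndex >> 3] |= (1 << (7 & nIndex)); // byte nIndex/8, bit nIndex%8
        }
    }

    bool contains(const std::vector<unsigned char>& key) const {
        for (unsigned int i = 0; i < nHashFuncs; i++) {
            unsigned int nIndex = Hash(i, key);
            if (!(vData[nIndex >> 3] & (1 << (7 & nIndex)))) return false;
        }
        return true;
    }
};
```

A wallet-side use would look roughly like this: build a SketchFilter of, say, 36 bytes with 5 hashes and tweak 0, insert the bytes of a public key, and then contains returns true for the same bytes whenever they appear as a data element in a scriptPubKey or scriptSig. That is the comparison IsRelevantAndUpdate performs for each script data element.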
