
Bloom Filter

/**
 * This program is free software: you can redistribute it and/or modify
 * it under the terms of the GNU Lesser General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU Lesser General Public License for more details.
 *
 * You should have received a copy of the GNU Lesser General Public License
 * along with this program.  If not, see <http://www.gnu.org/licenses/>.
 */

package com.skjegstad.utils;

import java.io.Serializable;
import java.nio.charset.Charset;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.BitSet;
import java.util.Collection;

/**
 * Implementation of a Bloom-filter, as described here:
 * http://en.wikipedia.org/wiki/Bloom_filter
 *
 * For updates and bugfixes, see http://github.com/magnuss/java-bloomfilter
 *
 * Inspired by the SimpleBloomFilter-class written by Ian Clarke. This
 * implementation provides a more evenly distributed Hash-function by
 * using a proper digest instead of the Java RNG. Many of the changes
 * were proposed in comments in his blog:
 * http://blog.locut.us/2008/01/12/a-decent-stand-alone-java-bloom-filter-implementation/
 *
 * @param <E> Object type that is to be inserted into the Bloom filter, e.g. String or Integer.
 * @author Magnus Skjegstad <[email protected]>
 */
public class BloomFilter<E> implements Serializable {
    private BitSet bitset;
    private int bitSetSize;
    private double bitsPerElement;
    private int expectedNumberOfFilterElements; // expected (maximum) number of elements to be added
    private int numberOfAddedElements; // number of elements actually added to the Bloom filter
    private int k; // number of hash functions

    static final Charset charset = Charset.forName("UTF-8"); // encoding used for storing hash values as strings

    static final String hashName = "MD5"; // MD5 gives good enough accuracy in most circumstances. Change to SHA1 if it's needed
    static final MessageDigest digestFunction;
    static { // The digest method is reused between instances
        MessageDigest tmp;
        try {
            tmp = java.security.MessageDigest.getInstance(hashName);
        } catch (NoSuchAlgorithmException e) {
            tmp = null;
        }
        digestFunction = tmp;
    }

    /**
     * Constructs an empty Bloom filter. The total length of the Bloom filter will be
     * c*n.
     *
     * @param c is the number of bits used per element.
     * @param n is the expected number of elements the filter will contain.
     * @param k is the number of hash functions used.
     */
    public BloomFilter(double c, int n, int k) {
        this.expectedNumberOfFilterElements = n;
        this.k = k;
        this.bitsPerElement = c;
        this.bitSetSize = (int)Math.ceil(c * n);
        numberOfAddedElements = 0;
        this.bitset = new BitSet(bitSetSize);
    }

    /**
     * Constructs an empty Bloom filter. The optimal number of hash functions (k) is estimated from the total size of the Bloom filter
     * and the number of expected elements.
     *
     * @param bitSetSize defines how many bits should be used in total for the filter.
     * @param expectedNumberOElements defines the maximum number of elements the filter is expected to contain.
     */
    public BloomFilter(int bitSetSize, int expectedNumberOElements) {
        this(bitSetSize / (double)expectedNumberOElements,
             expectedNumberOElements,
             (int) Math.round((bitSetSize / (double)expectedNumberOElements) * Math.log(2.0)));
    }

    /**
     * Constructs an empty Bloom filter with a given false positive probability. The number of bits per
     * element and the number of hash functions are estimated
     * to match the false positive probability.
     *
     * @param falsePositiveProbability is the desired false positive probability.
     * @param expectedNumberOfElements is the expected number of elements in the Bloom filter.
     */
    public BloomFilter(double falsePositiveProbability, int expectedNumberOfElements) {
        this(Math.ceil(-(Math.log(falsePositiveProbability) / Math.log(2))) / Math.log(2), // c = k / ln(2)
             expectedNumberOfElements,
             (int)Math.ceil(-(Math.log(falsePositiveProbability) / Math.log(2)))); // k = ceil(-log_2(false prob.))
    }

    /**
     * Construct a new Bloom filter based on existing Bloom filter data.
     *
     * @param bitSetSize defines how many bits should be used for the filter.
     * @param expectedNumberOfFilterElements defines the maximum number of elements the filter is expected to contain.
     * @param actualNumberOfFilterElements specifies how many elements have been inserted into the <code>filterData</code> BitSet.
     * @param filterData a BitSet representing an existing Bloom filter.
     */
    public BloomFilter(int bitSetSize, int expectedNumberOfFilterElements, int actualNumberOfFilterElements, BitSet filterData) {
        this(bitSetSize, expectedNumberOfFilterElements);
        this.bitset = filterData;
        this.numberOfAddedElements = actualNumberOfFilterElements;
    }

    /**
     * Generates a digest based on the contents of a String.
     *
     * @param val specifies the input data.
     * @param charset specifies the encoding of the input data.
     * @return digest as an int.
     */
    public static int createHash(String val, Charset charset) {
        return createHash(val.getBytes(charset));
    }

    /**
     * Generates a digest based on the contents of a String.
     *
     * @param val specifies the input data. The encoding is expected to be UTF-8.
     * @return digest as an int.
     */
    public static int createHash(String val) {
        return createHash(val, charset);
    }

    /**
     * Generates a digest based on the contents of an array of bytes.
     *
     * @param data specifies input data.
     * @return digest as an int.
     */
    public static int createHash(byte[] data) {
        return createHashes(data, 1)[0];
    }

    /**
     * Generates digests based on the contents of an array of bytes, splits the result into 4-byte ints and stores them in an array. The
     * digest function is called until the required number of ints has been produced. For each call to digest a salt
     * is prepended to the data. The salt is increased by 1 for each call.
     *
     * @param data specifies input data.
     * @param hashes number of hashes/ints to produce.
     * @return array of int-sized hashes
     */
    public static int[] createHashes(byte[] data, int hashes) {
        int[] result = new int[hashes];

        int k = 0;
        byte salt = 0;
        while (k < hashes) {
            byte[] digest;
            synchronized (digestFunction) {
                digestFunction.update(salt);
                salt++;
                digest = digestFunction.digest(data);
            }

            // Pack each group of 4 digest bytes into one int, most significant byte first.
            for (int i = 0; i < digest.length/4 && k < hashes; i++) {
                int h = 0;
                for (int j = (i*4); j < (i*4)+4; j++) {
                    h <<= 8;
                    h |= ((int) digest[j]) & 0xFF;
                }
                result[k] = h;
                k++;
            }
        }
        return result;
    }

    /**
     * Compares the contents of two instances to see if they are equal.
     *
     * @param obj is the object to compare to.
     * @return True if the contents of the objects are equal.
     */
    @Override
    public boolean equals(Object obj) {
        if (obj == null) {
            return false;
        }
        if (getClass() != obj.getClass()) {
            return false;
        }
        final BloomFilter<E> other = (BloomFilter<E>) obj;
        if (this.expectedNumberOfFilterElements != other.expectedNumberOfFilterElements) {
            return false;
        }
        if (this.k != other.k) {
            return false;
        }
        if (this.bitSetSize != other.bitSetSize) {
            return false;
        }
        if (this.bitset != other.bitset && (this.bitset == null || !this.bitset.equals(other.bitset))) {
            return false;
        }
        return true;
    }

    /**
     * Calculates a hash code for this class.
     * @return hash code representing the contents of an instance of this class.
     */
    @Override
    public int hashCode() {
        int hash = 7;
        hash = 61 * hash + (this.bitset != null ? this.bitset.hashCode() : 0);
        hash = 61 * hash + this.expectedNumberOfFilterElements;
        hash = 61 * hash + this.bitSetSize;
        hash = 61 * hash + this.k;
        return hash;
    }

    /**
     * Calculates the expected probability of false positives based on
     * the number of expected filter elements and the size of the Bloom filter.
     * <br /><br />
     * The value returned by this method is the <i>expected</i> rate of false
     * positives, assuming the number of inserted elements equals the number of
     * expected elements. If the number of elements in the Bloom filter is less
     * than the expected value, the true probability of false positives will be lower.
     *
     * @return expected probability of false positives.
     */
    public double expectedFalsePositiveProbability() {
        return getFalsePositiveProbability(expectedNumberOfFilterElements);
    }

    /**
     * Calculate the probability of a false positive given the specified
     * number of inserted elements.
     *
     * @param numberOfElements number of inserted elements.
     * @return probability of a false positive.
     */
    public double getFalsePositiveProbability(double numberOfElements) {
        // (1 - e^(-k * n / m)) ^ k
        return Math.pow((1 - Math.exp(-k * (double) numberOfElements
                        / (double) bitSetSize)), k);
    }

    /**
     * Get the current probability of a false positive.
     * The probability is calculated from
     * the size of the Bloom filter and the current number of elements added to it.
     *
     * @return probability of false positives.
     */
    public double getFalsePositiveProbability() {
        return getFalsePositiveProbability(numberOfAddedElements);
    }

    /**
     * Returns the value chosen for K.<br />
     * <br />
     * K is the optimal number of hash functions based on the size
     * of the Bloom filter and the expected number of inserted elements.
     *
     * @return optimal k.
     */
    public int getK() {
        return k;
    }

    /**
     * Sets all bits to false in the Bloom filter.
     */
    public void clear() {
        bitset.clear();
        numberOfAddedElements = 0;
    }

    /**
     * Adds an object to the Bloom filter. The output from the object's
     * toString() method is used as input to the hash functions.
     *
     * @param element is an element to register in the Bloom filter.
     */
    public void add(E element) {
        add(element.toString().getBytes(charset));
    }

    /**
     * Adds an array of bytes to the Bloom filter.
     *
     * @param bytes array of bytes to add to the Bloom filter.
     */
    public void add(byte[] bytes) {
        int[] hashes = createHashes(bytes, k);
        for (int hash : hashes)
            bitset.set(Math.abs(hash % bitSetSize), true);
        numberOfAddedElements ++;
    }

    /**
     * Adds all elements from a Collection to the Bloom filter.
     * @param c Collection of elements.
     */
    public void addAll(Collection<? extends E> c) {
        for (E element : c)
            add(element);
    }

    /**
     * Returns true if the element could have been inserted into the Bloom filter.
     * Use getFalsePositiveProbability() to calculate the probability of this
     * being correct.
     *
     * @param element element to check.
     * @return true if the element could have been inserted into the Bloom filter.
     */
    public boolean contains(E element) {
        return contains(element.toString().getBytes(charset));
    }

    /**
     * Returns true if the array of bytes could have been inserted into the Bloom filter.
     * Use getFalsePositiveProbability() to calculate the probability of this
     * being correct.
     *
     * @param bytes array of bytes to check.
     * @return true if the array could have been inserted into the Bloom filter.
     */
    public boolean contains(byte[] bytes) {
        int[] hashes = createHashes(bytes, k);
        for (int hash : hashes) {
            if (!bitset.get(Math.abs(hash % bitSetSize))) {
                return false;
            }
        }
        return true;
    }

    /**
     * Returns true if all the elements of a Collection could have been inserted
     * into the Bloom filter. Use getFalsePositiveProbability() to calculate the
     * probability of this being correct.
     * @param c elements to check.
     * @return true if all the elements in c could have been inserted into the Bloom filter.
     */
    public boolean containsAll(Collection<? extends E> c) {
        for (E element : c)
            if (!contains(element))
                return false;
        return true;
    }

    /**
     * Read a single bit from the Bloom filter.
     * @param bit the bit to read.
     * @return true if the bit is set, false if it is not.
     */
    public boolean getBit(int bit) {
        return bitset.get(bit);
    }

    /**
     * Set a single bit in the Bloom filter.
     * @param bit is the bit to set.
     * @param value If true, the bit is set. If false, the bit is cleared.
     */
    public void setBit(int bit, boolean value) {
        bitset.set(bit, value);
    }

    /**
     * Return the bit set used to store the Bloom filter.
     * @return bit set representing the Bloom filter.
     */
    public BitSet getBitSet() {
        return bitset;
    }

    /**
     * Returns the number of bits in the Bloom filter. Use count() to retrieve
     * the number of inserted elements.
     *
     * @return the size of the bitset used by the Bloom filter.
     */
    public int size() {
        return this.bitSetSize;
    }

    /**
     * Returns the number of elements added to the Bloom filter after it
     * was constructed or after clear() was called.
     *
     * @return number of elements added to the Bloom filter.
     */
    public int count() {
        return this.numberOfAddedElements;
    }

    /**
     * Returns the expected number of elements to be inserted into the filter.
     * This value is the same value as the one passed to the constructor.
     *
     * @return expected number of elements.
     */
    public int getExpectedNumberOfElements() {
        return expectedNumberOfFilterElements;
    }

    /**
     * Get expected number of bits per element when the Bloom filter is full. This value is set by the constructor
     * when the Bloom filter is created. See also getBitsPerElement().
     *
     * @return expected number of bits per element.
     */
    public double getExpectedBitsPerElement() {
        return this.bitsPerElement;
    }

    /**
     * Get actual number of bits per element based on the number of elements that have currently been inserted and the length
     * of the Bloom filter. See also getExpectedBitsPerElement().
     *
     * @return number of bits per element.
     */
    public double getBitsPerElement() {
        return this.bitSetSize / (double)numberOfAddedElements;
    }
}
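
The constructors above derive the filter size and the number of hash functions from a few short formulas, which are easier to follow with concrete numbers. The following is a minimal, standalone sketch of that arithmetic only. The class name BloomSizing and its method names are hypothetical and are not part of the BloomFilter class; the formulas are copied from its constructors and from getFalsePositiveProbability().

```java
// Illustrative sketch: BloomSizing and its methods are hypothetical helpers,
// not part of the BloomFilter class. They repeat its sizing arithmetic only.
class BloomSizing {

    // k = ceil(-log2(p)), as in the (falsePositiveProbability, n) constructor.
    static int hashCountFor(double falsePositiveProbability) {
        return (int) Math.ceil(-(Math.log(falsePositiveProbability) / Math.log(2)));
    }

    // m = ceil(c * n), where c = k / ln(2) is the number of bits per element.
    static int bitCountFor(double falsePositiveProbability, int n) {
        double c = hashCountFor(falsePositiveProbability) / Math.log(2);
        return (int) Math.ceil(c * n);
    }

    // k = round((m / n) * ln(2)), as in the (bitSetSize, n) constructor.
    static int hashCountForSize(int m, int n) {
        return (int) Math.round((m / (double) n) * Math.log(2.0));
    }

    // Expected false positive rate: (1 - e^(-k * n / m)) ^ k
    static double falsePositiveRate(int k, int n, int m) {
        return Math.pow(1 - Math.exp(-k * (double) n / (double) m), k);
    }

    public static void main(String[] args) {
        double p = 0.01; // target false positive probability
        int n = 1000;    // expected number of elements

        int k = hashCountFor(p);
        int m = bitCountFor(p, n);

        System.out.println("hash functions k = " + k);
        System.out.println("bits in filter m = " + m);
        System.out.println("k recomputed from m and n = " + hashCountForSize(m, n));
        System.out.println("expected false positive rate = " + falsePositiveRate(k, n, m));
    }
}
```

For a 1% target and 1000 expected elements this gives k = 7 and m = 10099 bits, and an expected false positive rate of roughly 0.0078. The rate comes out below the target because k is rounded up before the bits per element are derived from it.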

