當前位置：首頁 > 编程资源 > 编程问答 >内容正文

编程问答

Hbase 解析（四) 布隆过滤器(BloomFilter)

發布時間：2024/5/14 编程问答 36 豆豆

生活随笔收集整理的這篇文章主要介紹了 Hbase 解析（四) 布隆过滤器(BloomFilter) 小編覺得挺不錯的,現在分享給大家,幫大家做個參考.

1.簡介

1.1 介紹

Bloom filter是1970年引入的一種數據結構，在過去的十年中，由于它為網絡主機之間的組成員信息傳輸提供了帶寬效率，因此被網絡研究界采用。發送者將信息編碼成一個比特向量，即Bloom過濾器，它比傳統的表示形式更緊湊。Bloom本身的計算和空間成本在元素數量上是線性的。接收器使用過濾器來測試各種元素是否是集合的成員。雖然過濾器偶爾會返回假陽性，但它永遠不會返回假陰性。Hbase正是利用這一點，在數據查詢時判斷數據是否在當前Hfile中，可以有效減少數據的讀取，減少IO操作。

1.2 Bloom filter最優的大小計算

Bloom過濾器對插入到其中的元素的數量非常敏感。對于HBase來說，條目的數量取決于存儲在列中的數據的大小。當前默認區域大小為256MB，因此條目計數~=256MB/（列的平均值大小）。盡管有這個經驗法則，但是由于壓縮，我們并沒有有效的方法來計算壓縮后的條目計數。因此，通常使用動態bloom過濾器來添加額外的空間，而不是允許錯誤率增長。

Bloom filter最優的大小計算公示為：bloom size m = -(n * ln(err) / (ln(2)^2) ~= n * ln(err) / ln(0.6185)

m表示Bloom filter中的位數（bitSize）
n表示插入bloomfilter中的元素數（maxKeys）
k表示使用的哈希函數數（nbHash）
e表示Bloom所需的誤報率（err）

但且僅當k=m/n ln（2）時，誤報概率最小。

Hbase中Bloom filter的設置是在創建列族時通過setBloomFilterType方法設定，Hbase支持ROW、ROWCOL、ROWPREFIX_FIXED_LENGTH三種類型的Bloom filter，創建列族時默認設置為ROW，對所有插入數據的rowkey寫入到Bloom filter中。

2.hbase中布隆過濾器的實現

2.1?Bloom filter接口實現

Hbase的布隆過濾器由父接口BloomFilterBase類定義，包含2個子繼承接口BloomFilter和BloomFilterWriter。

BloomFilter負責讀取、判斷，BloomFilterWriter負責將數據寫入布隆過濾器。

2.1.1 read類

BloomFilter負責數據的讀取判斷其中定義了三個方法

//檢查所定義的keyCell是否包含 boolean contains(Cell keyCell, ByteBuff bloom, BloomType type); boolean contains(byte[] buf, int offset, int length, ByteBuff bloom);//是否允許Bloom filter自動load數據，默認實現為true boolean supportsAutoLoading();

BloomFilter最終的實現類是CompoundBloomFilter類，CompoundBloomFilter的核心方法是contains方法。

public boolean contains(Cell keyCell, ByteBuff bloom, BloomType type) {//如果根索引不包含keyCell，返回false，根索引在Hfile創建時構建，不是對所有rowkeyint block = index.rootBlockContainingKey(keyCell);if (block < 0) {return false; // This key is not in the file.}boolean result;//獲得Bloom的BlockHFileBlock bloomBlock = getBloomBlock(block);try {ByteBuff bloomBuf = bloomBlock.getBufferReadOnly();//通過BloomFilterUtil的contains方法判斷result = BloomFilterUtil.contains(keyCell, bloomBuf, bloomBlock.headerSize(),bloomBlock.getUncompressedSizeWithoutHeader(), hash, hashCount, type);} finally {// After the use return back the block if it was served from a cache.reader.returnBlock(bloomBlock);}if (numPositivesPerChunk != null && result) {// Update statistics. Only used in unit tests.++numPositivesPerChunk[block];}return result;}

BloomFilterUtil.contains方法中，通過不同的?BloomType，構建不同的BloomHashKey，然后讀取bloomBuf中的bitvals，計算cell

對應類型的HashKey，判斷在bitvals中是否為1.

public static boolean contains(Cell cell, ByteBuff bloomBuf, int bloomOffset, int bloomSize,Hash hash, int hashCount, BloomType type) {HashKey<Cell> hashKey = type == BloomType.ROWCOL ? new RowColBloomHashKey(cell): new RowBloomHashKey(cell);//最終的判斷方法，實現還是比較簡單return contains(bloomBuf, bloomOffset, bloomSize, hash, hashCount, hashKey);}

2.1.2write類

?BloomFilterWriter類定義了一些寫入方法

//在數據寫入磁盤之前，壓縮Bloom filter void compactBloom();//獲得一個meta data 的Writer，寫入Bloom TYpe 、數據大小等 Writable getMetaWriter();//獲取一個 Bloom bits 的Writer Writable getDataWriter();// previous cell written Cell getPrevCell(); BloomFilterWriter的最終實現類為CompoundBloomFilterWriter。子類CompoundBloomFilterWriter的核心方法是append方法，負責將添加的數據寫入到BloomFilter @Overridepublic void append(Cell cell) throws IOException {if (cell == null)throw new NullPointerException();enqueueReadyChunk(false);if (chunk == null) {if (firstKeyInChunk != null) {throw new IllegalStateException("First key in chunk already set: "+ Bytes.toStringBinary(firstKeyInChunk));}//第一添加時需要allocateNewChunk，Chunk動態添加，完成hash等// This will be done only once per chunkif (bloomType == BloomType.ROWCOL) {firstKeyInChunk =PrivateCellUtil.getCellKeySerializedAsKeyValueKey(PrivateCellUtil.createFirstOnRowCol(cell));} else {firstKeyInChunk = CellUtil.copyRow(cell);}allocateNewChunk();}//Chunk 實現具體的hash計算，和bit置位chunk.add(cell);this.prevCell = cell;++totalKeyCount;}

BloomFilterChunk繼承自BloomFilterBase，實現了BloomFilter的讀寫，其中add方法實現寫入

public void add(Cell cell) {/** For faster hashing, use combinatorial generation* http://www.eecs.harvard.edu/~kirsch/pubs/bbbf/esa06.pdf*/int hash1;int hash2;HashKey<Cell> hashKey;// 計算2次hash 寫入位if (this.bloomType == BloomType.ROWCOL) {hashKey = new RowColBloomHashKey(cell);hash1 = this.hash.hash(hashKey, 0);hash2 = this.hash.hash(hashKey, hash1);} else {hashKey = new RowBloomHashKey(cell);hash1 = this.hash.hash(hashKey, 0);hash2 = this.hash.hash(hashKey, hash1);}setHashLoc(hash1, hash2);}

get方法完成數據查詢，和之前BloomFilterUtil.contains方法一致

static boolean get(int pos, ByteBuffer bloomBuf, int bloomOffset) {//實現位查找int bytePos = pos >> 3; //pos / 8int bitPos = pos & 0x7; //pos % 8// TODO access this via Util API which can do Unsafe access if possible(?)byte curByte = bloomBuf.get(bloomOffset + bytePos);curByte &= BloomFilterUtil.bitvals[bitPos];return (curByte != 0);}

3.布隆過濾器的使用

對于之前創建的布隆過濾器的使用，hbse中體現在兩個地方，一個是構建scannner時，判斷scanner的是否包含所需要的數據列或者列族.

在構建StoreFileScanner時，會通過shouldUseScanner方法判斷，時都用到當前Scanner，其中用到了reader.passesBloomFilter的方法。 public boolean shouldUseScanner(Scan scan, HStore store, long oldestUnexpiredTS) {// if the file has no entries, no need to validate or create a scanner.byte[] cf = store.getColumnFamilyDescriptor().getName();TimeRange timeRange = scan.getColumnFamilyTimeRange().get(cf);if (timeRange == null) {timeRange = scan.getTimeRange();}//從時間范圍、startkey和endkey范圍、bloomfilter判斷return reader.passesTimerangeFilter(timeRange, oldestUnexpiredTS) && reader.passesKeyRangeFilter(scan) && reader.passesBloomFilter(scan, scan.getFamilyMap().get(cf));} StoreFileReader在創建StoreFileScanner的時候創建，主要用來讀取hfile文件。?passesBloomFilter方法當前Hfile的bloomFilter的類型，構建具體的bloomFilter。bloomFilter的類型是創建表時，列族中定義的。 boolean passesBloomFilter(Scan scan, final SortedSet<byte[]> columns) {byte[] row = scan.getStartRow();switch (this.bloomFilterType) {case ROW:if (!scan.isGetScan()) {return true;}return passesGeneralRowBloomFilter(row, 0, row.length);case ROWCOL:if (!scan.isGetScan()) {return true;}if (columns != null && columns.size() == 1) {byte[] column = columns.first();// create the required fake keyCell kvKey = PrivateCellUtil.createFirstOnRow(row, HConstants.EMPTY_BYTE_ARRAY, column);return passesGeneralRowColBloomFilter(kvKey);}// For multi-column queries the Bloom filter is checked from the// seekExact operation.return true;case ROWPREFIX_FIXED_LENGTH:return passesGeneralRowPrefixBloomFilter(scan);default:return true;} }

?passesGeneralRowBloomFilter方法中this.generalBloomFilter是創建reader時構建的BloomFilter。

private boolean passesGeneralRowBloomFilter(byte[] row, int rowOffset, int rowLen) {BloomFilter bloomFilter = this.generalBloomFilter;if (bloomFilter == null) {return true;}// Used in ROW bloombyte[] key = null;if (rowOffset != 0 || rowLen != row.length) {throw new AssertionError("For row-only Bloom filters the row must occupy the whole array");}key = row;//判斷row是否在本hfile中return checkGeneralBloomFilter(key, null, bloomFilter);} checkGeneralBloomFilter方法中調用contains完成最終的判斷。

總結

以上是生活随笔為你收集整理的Hbase 解析（四) 布隆过滤器(BloomFilter)的全部內容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網站內容還不錯，歡迎將生活随笔推薦給好友。

上一篇：如何用sqlplus执行一个sql文件和
下一篇：【虹科】-网络监控协议总结