當前位置：首頁 > 编程资源 > 编程问答 >内容正文

编程问答

Simhash源码学习

發布時間：2024/3/26 编程问答 26 豆豆

生活随笔收集整理的這篇文章主要介紹了 Simhash源码学习小編覺得挺不錯的,現在分享給大家,幫大家做個參考.

一、Simhash

傳入對象為object，根據源碼，支持傳入Simhash對象、String文本、features數組、int數字，分別標識已構建的Simhash對象、計算相似度的context文本、已分好詞的features數組（在simhash中稱為特征）、已計算好hash值的數字

1、初始化

1、init() 根據傳入對象構建Simhash對象，關鍵邏輯如下： if isinstance(value, Simhash):self.value = value.value elif isinstance(value, basestring):self.build_by_text(unicode(value)) elif isinstance(value, Iterable):self.build_by_features(value) elif isinstance(value, numbers.Integral):self.value = value else:raise Exception('Bad parameter with type {}'.format(type(value)))分別展開來講 1、傳入Simhash和Integral（相比int，能夠識別其他類型的數值，如int64、Python2中的Long等，可看做就是數字）就不多講了，都表示已經構建好的Simhash了 2、傳入String文本，會執行到build_by_text()方法 3、傳入Iterable對象，會執行到build_by_features()方法2、build_by_text(self, content) 源碼如下： def build_by_text(self, content):features = self._tokenize(content)features = {k:sum(1 for _ in g) for k, g in groupby(sorted(features))}return self.build_by_features(features)（1）_tokenize()方法表示Simhash自帶的分詞方法，邏輯包括：1）content文本轉為小寫2）使用正則表達式識別其中的合法文本，詳細識別范圍未深究3）默認以4為滑動步長，對文本進行分詞，示例如下： >>> features = a._tokenize('你好嗎，我很好') >>> features ['你好嗎我', '好嗎我很', '嗎我很好'] （2）對features進行詞頻統計 >>> features = {k:sum(1 for _ in g) for k, g in groupby(sorted(features))} >>> features {'你好嗎我': 1, '嗎我很好': 1, '好嗎我很': 1} （3）執行build_by_feature3、build_by_feature(self, features) 支持傳入參數類型為`features` might be a list of unweighted tokens (a weight of 1 will be assumed), a list of (token, weight) tuples or a token -> weight dict. 關鍵代碼： sums = [] batch = [] # 為節省內存，計算完成一個批次后，就累加到sums中 count = 0 w = 1 truncate_mask = 2 ** self.f - 1 if isinstance(features, dict):features = features.items()for f in features:skip_batch = False# 如果feature不是文本，就是(feature, weight)的dict，判斷是否需要累加的sum中；if not isinstance(f, basestring): f, w = f# 權重超過最低權重，或者權重類型不是int，直接添加到sums中skip_batch = w > self.large_weight_cutoff or not isinstance(w, int)count += wif self.hashfunc_returns_int:h = int_to_bytes(self.hashfunc(f.encode('utf-8')) & truncate_mask, self.f_bytes)else:# 執行hashfunc，默認為md5的hash值h = self.hashfunc(f.encode('utf-8'))[-self.f_bytes:]if skip_batch:sums.append(self._bitarray_from_bytes(h) * w)else:batch.append(h * w)if len(batch) >= self.batch_size:# _sum_hashes對batch數組轉換為字符串（拼接），對字符串轉換為二進制數組，# 將二進制數組轉換為M*N維數組，其中N默認為64，然后對多維數組降維為一維數組，# 降維方式是每一行對應位相加；添加到sums中，使sums行數不斷增加，列數默認為64，即init階段的fsums.append(self._sum_hashes(batch))batch = []# 對每個batch的sums進行降維，控制sums行數，降維方式是每一行對應位相加if len(sums) >= self.batch_size:sums = [np.sum(sums, 0)]if batch:sums.append(self._sum_hashes(batch))# 最后對sums進行降維，降維方式是每一行對應位相加 combined_sums = np.sum(sums, 0) # byte數組轉換為int值 self.value = bytes_to_int(np.packbits(combined_sums > count / 2).tobytes())

2、計算文本相似度

distance()方法

def distance(self, another):assert self.f == another.fx = (self.value ^ another.value) & ((1 << self.f) - 1)ans = 0while x:ans += 1x &= x - 1return ans （1）對兩個Simhash對象的value執行異或，并保留f位（默認為64），表示指紋（fingerprint）的精度（2）計算結果中1的個數，個數越少，表示相似度越高

二、SimhashIndex

用于對一組數據進行Simhash判斷

關鍵方法：

# 初始化SimhashIndex，該類維護了N個bucket數組，每個bucket內為obj_id相同的Simhash對象 def __init__(self, objs, f=64, k=2, log=None):"""`objs` is a list of (obj_id, simhash)obj_id is a string, simhash is an instance of Simhash`f` is the same with the one for Simhash`k` is the tolerance""" # 從SimhashIndex中查找傳入simhash相似的對象，并以列表方式返回；默認的海明距離是3 #（可以修改的，不過論文里提到選擇3是一個比較折中的方案），用于去重的話，那就判斷 # get_near_dups()返回結果是否為空，不為空則該文本重復 def get_near_dups(self, simhash):"""`simhash` is an instance of Simhashreturn a list of obj_id, which is in type of str""" # 添加 def add(self, obj_id, simhash):"""`obj_id` is a string`simhash` is an instance of Simhash""" # 刪除 def delete(self, obj_id, simhash):"""`obj_id` is a string`simhash` is an instance of Simhash"""@property def offsets(self):"""You may optimize this method according to <http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/33026.pdf>"""return [self.f // (self.k + 1) * i for i in range(self.k + 1)]# 還不明白獲取了什么 def get_keys(self, simhash):for i, offset in enumerate(self.offsets):if i == (len(self.offsets) - 1):m = 2 ** (self.f - offset) - 1else:m = 2 ** (self.offsets[i + 1] - offset) - 1c = simhash.value >> offset & myield '%x:%x' % (c, i)# bucket個數 def bucket_size(self):return len(self.bucket)

總結

以上是生活随笔為你收集整理的Simhash源码学习的全部內容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網站內容還不錯，歡迎將生活随笔推薦給好友。

3atv精品不卡视频,97人人超碰国产精品最新,中文字幕av一区二区三区人妻少妇,久久久精品波多野结衣,日韩一区二区三区精品

编程问答

Simhash源码学习

一、Simhash

1、初始化

2、計算文本相似度

二、SimhashIndex

總結

一、Simhash

1、初始化

2、計算文本相似度

二、SimhashIndex