Simhash源码学习
生活随笔
收集整理的這篇文章主要介紹了
Simhash源码学习
小編覺得挺不錯的,現在分享給大家,幫大家做個參考.
一、Simhash
傳入對象為object,根據源碼,支持傳入Simhash對象、String文本、features數組、int數字,分別標識已構建的Simhash對象、計算相似度的context文本、已分好詞的features數組(在simhash中稱為特征)、已計算好hash值的數字
1、初始化
1、init() 根據傳入對象構建Simhash對象,關鍵邏輯如下: if isinstance(value, Simhash):self.value = value.value elif isinstance(value, basestring):self.build_by_text(unicode(value)) elif isinstance(value, Iterable):self.build_by_features(value) elif isinstance(value, numbers.Integral):self.value = value else:raise Exception('Bad parameter with type {}'.format(type(value)))分別展開來講 1、傳入Simhash和Integral(相比int,能夠識別其他類型的數值,如int64、Python2中的Long等,可看做就是數字)就不多講了,都表示已經構建好的Simhash了 2、傳入String文本,會執行到build_by_text()方法 3、傳入Iterable對象,會執行到build_by_features()方法2、build_by_text(self, content) 源碼如下: def build_by_text(self, content):features = self._tokenize(content)features = {k:sum(1 for _ in g) for k, g in groupby(sorted(features))}return self.build_by_features(features)(1)_tokenize()方法表示Simhash自帶的分詞方法,邏輯包括:1)content文本轉為小寫2)使用正則表達式識別其中的合法文本,詳細識別范圍未深究3)默認以4為滑動步長,對文本進行分詞,示例如下: >>> features = a._tokenize('你好嗎,我很好') >>> features ['你好嗎我', '好嗎我很', '嗎我很好'] (2)對features進行詞頻統計 >>> features = {k:sum(1 for _ in g) for k, g in groupby(sorted(features))} >>> features {'你好嗎我': 1, '嗎我很好': 1, '好嗎我很': 1} (3)執行build_by_feature3、build_by_feature(self, features) 支持傳入參數類型為`features` might be a list of unweighted tokens (a weight of 1 will be assumed), a list of (token, weight) tuples or a token -> weight dict. 關鍵代碼: sums = [] batch = [] # 為節省內存,計算完成一個批次后,就累加到sums中 count = 0 w = 1 truncate_mask = 2 ** self.f - 1 if isinstance(features, dict):features = features.items()for f in features:skip_batch = False# 如果feature不是文本,就是(feature, weight)的dict,判斷是否需要累加的sum中;if not isinstance(f, basestring): f, w = f# 權重超過最低權重,或者權重類型不是int,直接添加到sums中skip_batch = w > self.large_weight_cutoff or not isinstance(w, int)count += wif self.hashfunc_returns_int:h = int_to_bytes(self.hashfunc(f.encode('utf-8')) & truncate_mask, self.f_bytes)else:# 執行hashfunc,默認為md5的hash值h = self.hashfunc(f.encode('utf-8'))[-self.f_bytes:]if skip_batch:sums.append(self._bitarray_from_bytes(h) * w)else:batch.append(h * w)if len(batch) >= self.batch_size:# _sum_hashes對batch數組轉換為字符串(拼接),對字符串轉換為二進制數組,# 將二進制數組轉換為M*N維數組,其中N默認為64,然后對多維數組降維為一維數組,# 降維方式是每一行對應位相加;添加到sums中,使sums行數不斷增加,列數默認為64,即init階段的fsums.append(self._sum_hashes(batch))batch = []# 對每個batch的sums進行降維,控制sums行數,降維方式是每一行對應位相加if len(sums) >= self.batch_size:sums = [np.sum(sums, 0)]if batch:sums.append(self._sum_hashes(batch))# 最后對sums進行降維,降維方式是每一行對應位相加 combined_sums = np.sum(sums, 0) # byte數組轉換為int值 self.value = bytes_to_int(np.packbits(combined_sums > count / 2).tobytes())2、計算文本相似度
distance()方法
def distance(self, another):assert self.f == another.fx = (self.value ^ another.value) & ((1 << self.f) - 1)ans = 0while x:ans += 1x &= x - 1return ans (1)對兩個Simhash對象的value執行異或,并保留f位(默認為64),表示指紋(fingerprint)的精度 (2)計算結果中1的個數,個數越少,表示相似度越高
二、SimhashIndex
用于對一組數據進行Simhash判斷
關鍵方法:
# 初始化SimhashIndex,該類維護了N個bucket數組,每個bucket內為obj_id相同的Simhash對象 def __init__(self, objs, f=64, k=2, log=None):"""`objs` is a list of (obj_id, simhash)obj_id is a string, simhash is an instance of Simhash`f` is the same with the one for Simhash`k` is the tolerance""" # 從SimhashIndex中查找傳入simhash相似的對象,并以列表方式返回;默認的海明距離是3 #(可以修改的,不過論文里提到選擇3是一個比較折中的方案),用于去重的話,那就判斷 # get_near_dups()返回結果是否為空,不為空則該文本重復 def get_near_dups(self, simhash):"""`simhash` is an instance of Simhashreturn a list of obj_id, which is in type of str""" # 添加 def add(self, obj_id, simhash):"""`obj_id` is a string`simhash` is an instance of Simhash""" # 刪除 def delete(self, obj_id, simhash):"""`obj_id` is a string`simhash` is an instance of Simhash"""@property def offsets(self):"""You may optimize this method according to <http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/33026.pdf>"""return [self.f // (self.k + 1) * i for i in range(self.k + 1)]# 還不明白獲取了什么 def get_keys(self, simhash):for i, offset in enumerate(self.offsets):if i == (len(self.offsets) - 1):m = 2 ** (self.f - offset) - 1else:m = 2 ** (self.offsets[i + 1] - offset) - 1c = simhash.value >> offset & myield '%x:%x' % (c, i)# bucket個數 def bucket_size(self):return len(self.bucket)總結
以上是生活随笔為你收集整理的Simhash源码学习的全部內容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: 文章采集器-免费文章采集器
- 下一篇: 音乐频谱-傅里叶变换