Python Basics: Object-Oriented Programming (Part 2): A Search Engine
實(shí)例 搜索引擎
一個(gè)搜索引擎由搜索器、索引器、檢索器和用戶(hù)接口四個(gè)部分組成
搜索器就是爬蟲(chóng)(scrawler),爬出的內(nèi)容送給索引器生成索引(Index)存儲(chǔ)在內(nèi)部數(shù)據(jù)庫(kù)。用戶(hù)通過(guò)用戶(hù)接口發(fā)出詢(xún)問(wèn)(query),詢(xún)問(wèn)解析后送達(dá)檢索器,檢索器高效檢索后,將結(jié)果返回給用戶(hù)。
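As a rough illustration of that data flow, here is a short sketch with hypothetical component names (crawler, indexer, retriever, user_interface are placeholders of my own, not code from this article; the real classes follow below):

```python
# Hypothetical components, only to make the data flow described above explicit.
def run_search_engine(crawler, indexer, retriever, user_interface):
    # The searcher/crawler fetches documents and hands them to the indexer.
    for document_id, text in crawler.crawl():
        indexer.add(document_id, text)           # index stored in an internal database
    while True:
        query = user_interface.read_query()      # user issues a query
        results = retriever.retrieve(query)      # retriever searches the index efficiently
        user_interface.show(results)             # results returned to the user
```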
以下5個(gè)文件為爬取的搜索樣本。
```
# 1.txt
I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character. I have a dream today.

# 2.txt
I have a dream that one day down in Alabama, with its vicious racists, . . . one day right there in Alabama little black boys and black girls will be able to join hands with little white boys and white girls as sisters and brothers. I have a dream today.

# 3.txt
I have a dream that one day every valley shall be exalted, every hill and mountain shall be made low, the rough places will be made plain, and the crooked places will be made straight, and the glory of the Lord shall be revealed, and all flesh shall see it together.

# 4.txt
This is our hope. . . With this faith we will be able to hew out of the mountain of despair a stone of hope. With this faith we will be able to transform the jangling discords of our nation into a beautiful symphony of brotherhood. With this faith we will be able to work together, to pray together, to struggle together, to go to jail together, to stand up for freedom together, knowing that we will be free one day. . . .

# 5.txt
And when this happens, and when we allow freedom ring, when we let it ring from every village and every hamlet, from every state and every city, we will be able to speed up that day when all of God's children, black men and white men, Jews and Gentiles, Protestants and Catholics, will be able to join hands and sing in the words of the old Negro spiritual: "Free at last! Free at last! Thank God Almighty, we are free at last!"
```
A Simple Search Engine

SearchEngineBase is the base class; the corpus is the collection of documents to be indexed.

```python
class SearchEngineBase(object):
    def __init__(self):
        pass

    # Read a file and send its path (used as the id) together with its content
    # to process_corpus
    def add_corpus(self, file_path):
        with open(file_path, 'r') as fin:
            text = fin.read()
        self.process_corpus(file_path, text)

    # Indexer: store the processed content as an index, keyed by the file path (id)
    def process_corpus(self, id, text):
        raise Exception('process_corpus not implemented.')

    # Retriever
    def search(self, query):
        raise Exception('search not implemented.')


# Polymorphism: main works with any subclass of SearchEngineBase
def main(search_engine):
    for file_path in ['1.txt', '2.txt', '3.txt', '4.txt', '5.txt']:
        search_engine.add_corpus(file_path)

    while True:
        query = input()
        results = search_engine.search(query)
        print('found {} result(s):'.format(len(results)))
        for result in results:
            print(result)


class SimpleEngine(SearchEngineBase):
    def __init__(self):
        super(SimpleEngine, self).__init__()
        self.__id_to_texts = dict()

    def process_corpus(self, id, text):
        self.__id_to_texts[id] = text

    def search(self, query):
        results = []
        for id, text in self.__id_to_texts.items():
            if query in text:
                results.append(id)
        return results


search_engine = SimpleEngine()
main(search_engine)

########## Output ##########
# simple
# found 0 result(s):
# whe
# found 2 result(s):
# 1.txt
# 5.txt
```

Drawback: every document is stored in full, so both indexing and searching take a lot of space, and a query can only be a single word or a contiguous phrase; multiple words scattered at different positions in a document cannot be matched.
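To see that limitation concretely, here is a small usage sketch (my own illustration, assuming the classes above are defined and the five sample files exist on disk; it calls search directly instead of going through the interactive main loop):

```python
engine = SimpleEngine()
for path in ['1.txt', '2.txt', '3.txt', '4.txt', '5.txt']:
    engine.add_corpus(path)

# A contiguous phrase is found by the substring check...
print(engine.search('I have a dream'))   # ['1.txt', '2.txt', '3.txt']

# ...but the same words scattered through the text are not an exact substring.
print(engine.search('dream little'))     # []
```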
運(yùn)用詞袋模型 (bag of words model),NLP領(lǐng)域最簡(jiǎn)單的模型之一。 process_corpus函數(shù)中調(diào)用parse_text_to_words把文檔中的各個(gè)單詞裝進(jìn)set集合中。 search函數(shù)中把包含查詢(xún)關(guān)鍵字也打碎成set,與索引的文檔核對(duì),將匹配的id加入結(jié)果集。 import re class BOWEngine(SearchEngineBase):def __init__(self):super(BOWEngine, self).__init__()self.__id_to_words = dict()def process_corpus(self, id, text):self.__id_to_words[id] = self.parse_text_to_words(text)def search(self, query):query_words = self.parse_text_to_words(query)results = []for id, words in self.__id_to_words.items():if self.query_match(query_words, words):results.append(id)return results@staticmethoddef query_match(query_words, words):for query_word in query_words:if query_word not in words:return Falsereturn True#for query_word in query_words:# return False if query_word not in words else True#result = filter(lambda x:x not in words,query_words)#return False if (len(list(result)) > 0) else True @staticmethoddef parse_text_to_words(text):# 使用正則表達(dá)式去除標(biāo)點(diǎn)符號(hào)和換行符text = re.sub(r'[^\w ]', ' ', text)# 轉(zhuǎn)為小寫(xiě)text = text.lower()# 生成所有單詞的列表word_list = text.split(' ')# 去除空白單詞word_list = filter(None, word_list)# 返回單詞的 setreturn set(word_list)search_engine = BOWEngine() main(search_engine)########## 輸出 ########## # i have a dream # found 3 result(s): # 1.txt # 2.txt # 3.txt # freedom children # found 1 result(s): # 5.txt缺點(diǎn):每次search還是需要遍歷所有文檔
Inverted Index
Inverted index 倒序索引,現(xiàn)在保留的是 word -> id 的字典 import re class BOWInvertedIndexEngine(SearchEngineBase):def __init__(self):super(BOWInvertedIndexEngine, self).__init__()self.inverted_index = dict()#生成索引 word -> iddef process_corpus(self, id, text):words = self.parse_text_to_words(text)for word in words:if word not in self.inverted_index:self.inverted_index[word] = []self.inverted_index[word].append(id) #{'little':['1.txt','2.txt'],...}def search(self, query):query_words = list(self.parse_text_to_words(query))query_words_index = list()for query_word in query_words:query_words_index.append(0)# 如果某一個(gè)查詢(xún)單詞的倒序索引為空,我們就立刻返回for query_word in query_words:if query_word not in self.inverted_index:return []result = []while True:# 首先,獲得當(dāng)前狀態(tài)下所有倒序索引的 indexcurrent_ids = []for idx, query_word in enumerate(query_words):current_index = query_words_index[idx]current_inverted_list = self.inverted_index[query_word] #['1.txt','2.txt']# 已經(jīng)遍歷到了某一個(gè)倒序索引的末尾,結(jié)束 searchif current_index >= len(current_inverted_list):return resultcurrent_ids.append(current_inverted_list[current_index])# 然后,如果 current_ids 的所有元素都一樣,那么表明這個(gè)單詞在這個(gè)元素對(duì)應(yīng)的文檔中都出現(xiàn)了if all(x == current_ids[0] for x in current_ids):result.append(current_ids[0])query_words_index = [x + 1 for x in query_words_index]continue# 如果不是,我們就把最小的元素加一min_val = min(current_ids)min_val_pos = current_ids.index(min_val)query_words_index[min_val_pos] += 1@staticmethoddef parse_text_to_words(text):# 使用正則表達(dá)式去除標(biāo)點(diǎn)符號(hào)和換行符text = re.sub(r'[^\w ]', ' ', text)# 轉(zhuǎn)為小寫(xiě)text = text.lower()# 生成所有單詞的列表word_list = text.split(' ')# 去除空白單詞word_list = filter(None, word_list)# 返回單詞的 setreturn set(word_list)search_engine = BOWInvertedIndexEngine() main(search_engine)########## 輸出 ########### little # found 2 result(s): # 1.txt # 2.txt # little vicious # found 1 result(s): # 2.txtLRUCache
LRUCache

If more than 90% of queries are repeats, adding a cache improves performance. The cache is implemented with the Least Recently Used (LRU) eviction policy, using the pylru package.

```python
import pylru

class LRUCache(object):
    def __init__(self, size=32):
        self.cache = pylru.lrucache(size)

    def has(self, key):
        return key in self.cache

    def get(self, key):
        return self.cache[key]

    def set(self, key, value):
        self.cache[key] = value


class BOWInvertedIndexEngineWithCache(BOWInvertedIndexEngine, LRUCache):
    def __init__(self):
        super(BOWInvertedIndexEngineWithCache, self).__init__()
        LRUCache.__init__(self)

    def search(self, query):
        if self.has(query):
            print('cache hit!')
            return self.get(query)

        result = super(BOWInvertedIndexEngineWithCache, self).search(query)
        self.set(query, result)
        return result


search_engine = BOWInvertedIndexEngineWithCache()
main(search_engine)

########## Output ##########
# little
# found 2 result(s):
# 1.txt
# 2.txt
# little
# cache hit!
# found 2 result(s):
# 1.txt
# 2.txt
```
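Note that pylru is a third-party package (installable with pip install pylru). If you prefer to stay in the standard library, an equivalent cache can be sketched on top of collections.OrderedDict; this is my own illustrative substitute, not code from the article:

```python
from collections import OrderedDict

class OrderedDictLRUCache(object):
    """Illustrative stand-in for the pylru-based LRUCache above."""
    def __init__(self, size=32):
        self.size = size
        self._cache = OrderedDict()

    def has(self, key):
        return key in self._cache

    def get(self, key):
        self._cache.move_to_end(key)             # mark the entry as most recently used
        return self._cache[key]

    def set(self, key, value):
        if key in self._cache:
            self._cache.move_to_end(key)
        self._cache[key] = value
        if len(self._cache) > self.size:
            self._cache.popitem(last=False)      # evict the least recently used entry
```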
在構(gòu)造函數(shù)里直接使用super(BOWInvertedIndexEngineWithCache, self).__init__()來(lái)初始化BOWInvertedIndexEngine父類(lèi) 對(duì)于多重繼承的父類(lèi)就要使用LRUCache.__init__(self)來(lái)初始化 BOWInvertedIndexEngineWithCache里重載了search函數(shù),在函數(shù)里面要調(diào)用父類(lèi)BOWInvertedIndexEngine的search函數(shù),使用: result = super(BOWInvertedIndexEngineWithCache, self).search(query)?
References
極客時(shí)間《Python核心技術(shù)與實(shí)戰(zhàn)》專(zhuān)欄
https://time.geekbang.org/column/intro/176
轉(zhuǎn)載于:https://www.cnblogs.com/xiaoguanqiu/p/10984178.html