Implementing Baidu-Style Search-Engine Search with Elasticsearch (ES 5.5.0)
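Source code: GitHub
Business requirements (background):
I. Search-engine-style prefix search:
Chinese search:
1. Searching "劉" matches "劉德華", "劉斌", and "劉德志".
2. Searching "劉德" matches "劉德華" and "劉德志".
Summary: the query text must match every name in the collection that begins with it.
Full-pinyin search:
1. Searching "li" matches "劉德華", "劉斌", and "劉德志".
2. Searching "liud" matches "劉德華" and "劉德志".
3. Searching "liudeh" matches "劉德華".
Summary: the query must match every name whose full pinyin spelling begins with the query string.
Pinyin-initials search:
1. Searching "w" matches "我是中國人" and "我愛我的祖國".
2. Searching "wszg" matches "我是中國人".
Summary: each name is reduced to the first pinyin letter of every character, and the query must be a prefix of that combined string.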
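The three rules above can be written down as a brute-force reference check, which is useful as a specification to test any index design against. This is only a sketch: the class `PrefixRequirements`, its method names, and the hand-written pinyin table (covering just the demo names) are assumptions made for illustration, not part of the original project or of any pinyin library.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class PrefixRequirements {
    // Hypothetical pinyin table: only the characters used in the demo names.
    private static final Map<Character, String> PINYIN = new HashMap<>();
    static {
        PINYIN.put('劉', "liu");
        PINYIN.put('德', "de");
        PINYIN.put('華', "hua");
        PINYIN.put('斌', "bin");
        PINYIN.put('志', "zhi");
    }

    // Full pinyin of a name, joined without separators ("劉德華" -> "liudehua").
    static String fullPinyin(String name) {
        StringBuilder sb = new StringBuilder();
        for (char c : name.toCharArray()) {
            sb.append(PINYIN.getOrDefault(c, String.valueOf(c)));
        }
        return sb.toString();
    }

    // First pinyin letter of every character ("劉德華" -> "ldh").
    static String initials(String name) {
        StringBuilder sb = new StringBuilder();
        for (char c : name.toCharArray()) {
            sb.append(PINYIN.getOrDefault(c, String.valueOf(c)).charAt(0));
        }
        return sb.toString();
    }

    // A name is a hit if the query is a prefix of any of its three forms.
    static List<String> search(List<String> names, String query) {
        String q = query.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (String name : names) {
            if (name.startsWith(q)
                    || fullPinyin(name).startsWith(q)
                    || initials(name).startsWith(q)) {
                hits.add(name);
            }
        }
        return hits;
    }
}
```

This scans every name on every query, which is exactly the cost the index-based solutions are meant to avoid.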
Solutions:
Solution 1: Store the Chinese text, the pinyin initials, and the full pinyin of the searched field in three separate index fields, none of them analyzed. This is the simplest and most direct layout (the inverted index holds only the values themselves). At query time, run a wildcard (SQL LIKE-style) query against each of the three fields and return the matching records. (Pros: simple storage format, and the smallest amount of data in the inverted index. Cons: LIKE-style matching is expensive at query time; a prefix query costs far more than a term query.)
Solution 2: Use multi-fields to derive three sub-fields from one field, holding the Chinese text, the pinyin initials, and the full pinyin. The analyzer for each applies an edge_ngram token filter to the tokens before they are written to the inverted index, and a search queries all of the fields. (Pros: compact format; queries use exact term matching and the input does not need to be analyzed, so lookups are fast. Cons: it trades space for time, although that is acceptable for an index. The derived fields add storage and query complexity, and because a search across three fields sums their relevance scores, the resulting ranking is easy to misread.)
Solution 3: Store the data in a single field that is not word-segmented (keyword tokenizer). At index time, generate all three kinds of terms for that one field and split them further with edge_ngram to cover the business cases above. At query time, do not analyze the input. (Pros: simple storage; a search needs only an exact term query on one field, so lookups are very fast. Cons: it also trades space for time, but uses less space than Solution 2, which is acceptable for index data.)
Elasticsearch offers many other ways to handle this scenario. These three are the most typical, and Solution 3 is the one used here.
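To make the space-for-time trade in Solution 3 concrete, here is a small in-memory sketch of the idea: every form of a value is expanded into its edge n-grams at index time, so a query becomes a single exact-match lookup. `EdgeNgramIndex` and its methods are hypothetical names, and the three forms are passed in by hand rather than produced by the pinyin plugin. It also counts UTF-16 chars, which is a simplification that only holds for characters in the Basic Multilingual Plane.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class EdgeNgramIndex {
    private final int minGram;
    private final int maxGram;
    // term -> ids of the documents containing that term
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    EdgeNgramIndex(int minGram, int maxGram) {
        this.minGram = minGram;
        this.maxGram = maxGram;
    }

    // All prefixes of `term` whose length lies in [minGram, maxGram].
    static List<String> edgeNgrams(String term, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int limit = Math.min(maxGram, term.length());
        for (int len = minGram; len <= limit; len++) {
            grams.add(term.substring(0, len));
        }
        return grams;
    }

    // Index time: expand every form (text, full pinyin, initials) into prefixes.
    void add(int docId, List<String> forms) {
        for (String form : forms) {
            String lowered = form.toLowerCase(Locale.ROOT);
            for (String gram : edgeNgrams(lowered, minGram, maxGram)) {
                postings.computeIfAbsent(gram, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Query time: one exact lookup, no analysis and no prefix scan.
    Set<Integer> term(String query) {
        return postings.getOrDefault(query, Collections.<Integer>emptySet());
    }
}
```

Indexing "劉德華" with the forms "劉德華", "liudehua", and "ldh" stores 14 distinct terms for one value, which is the extra space being paid. A query longer than `maxGram` characters can never match, because no term that long is ever stored.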
Preparation:
- Installing the pinyin analysis plugin and understanding its parameters
- Using Elasticsearch edge_ngram
- Using Elasticsearch multi-fields
- Familiarity with the various Elasticsearch query types
Code:
baidu_settings.json:
{
  "refresh_interval": "2s",
  "number_of_replicas": 1,
  "number_of_shards": 2,
  "analysis": {
    "filter": {
      "autocomplete_filter": {
        "type": "edge_ngram",
        "min_gram": 1,
        "max_gram": 15
      },
      "pinyin_first_letter_and_full_pinyin_filter": {
        "type": "pinyin",
        "keep_first_letter": true,
        "keep_full_pinyin": false,
        "keep_joined_full_pinyin": true,
        "keep_none_chinese": false,
        "keep_original": false,
        "limit_first_letter_length": 16,
        "lowercase": true,
        "trim_whitespace": true,
        "keep_none_chinese_in_first_letter": true
      },
      "full_pinyin_filter": {
        "type": "pinyin",
        "keep_first_letter": true,
        "keep_full_pinyin": false,
        "keep_joined_full_pinyin": true,
        "keep_none_chinese": false,
        "keep_original": true,
        "limit_first_letter_length": 16,
        "lowercase": true,
        "trim_whitespace": true,
        "keep_none_chinese_in_first_letter": true
      }
    },
    "analyzer": {
      "full_prefix_analyzer": {
        "type": "custom",
        "char_filter": ["html_strip"],
        "tokenizer": "keyword",
        "filter": ["lowercase", "full_pinyin_filter", "autocomplete_filter"]
      },
      "chinese_analyzer": {
        "type": "custom",
        "char_filter": ["html_strip"],
        "tokenizer": "keyword",
        "filter": ["lowercase", "autocomplete_filter"]
      },
      "pinyin_analyzer": {
        "type": "custom",
        "char_filter": ["html_strip"],
        "tokenizer": "keyword",
        "filter": ["pinyin_first_letter_and_full_pinyin_filter", "autocomplete_filter"]
      }
    }
  }
}
baidu_mapping.json:
{
  "baidu_type": {
    "properties": {
      "full_name": {"type": "text", "analyzer": "full_prefix_analyzer"},
      "age": {"type": "integer"}
    }
  }
}
PrefixTest.java:
import java.util.ArrayList;
import java.util.List;

import org.elasticsearch.client.transport.TransportClient;
import org.junit.Test;

// ESConnect, BaseIndex, BulkInsert and BulkOperation are the author's own
// wrapper classes; their source is not shown in this article.
public class PrefixTest {

    @Test
    public void testCreateIndex() throws Exception {
        TransportClient client = ESConnect.getInstance().getTransportClient();
        // Create the index with the settings above
        BaseIndex.createWithSetting(client, "baidu_index", "esjson/baidu_settings.json");
        // Define the type and its field mapping
        BaseIndex.createMapping(client, "baidu_index", "baidu_type", "esjson/baidu_mapping.json");
    }

    @Test
    public void testBulkInsert() throws Exception {
        TransportClient client = ESConnect.getInstance().getTransportClient();
        List<Object> list = new ArrayList<>();
        list.add(new BulkInsert(12L, "我們都有一個家名字叫中國", 12));
        list.add(new BulkInsert(13L, "兄弟姐妹都很多景色也不錯 ", 13));
        list.add(new BulkInsert(14L, "家里盤著兩條龍是長江與黃河", 14));
        list.add(new BulkInsert(15L, "還有珠穆朗瑪峰兒是最高山坡", 15));
        list.add(new BulkInsert(16L, "我們都有一個家名字叫中國", 16));
        list.add(new BulkInsert(17L, "兄弟姐妹都很多景色也不錯", 17));
        list.add(new BulkInsert(18L, "看那一條長城萬里在云中穿梭", 18));
        boolean flag = BulkOperation.batchInsert(client, "baidu_index", "baidu_type", list);
        System.out.println(flag);
    }
}
Apologies that the Java code is hidden behind my own wrapper classes. Creating an index from Java is easy to look up online; the point here is not the Java implementation but the approach described above.
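With this mapping, a lookup under Solution 3 would be an ordinary `term` query on `full_name`. The sketch below only builds the JSON request body as a string. `PrefixQueryBody`, `termQuery`, and `canMatch` are hypothetical names that do not come from the original project, and the escaping is deliberately minimal (quotes and backslashes only). It assumes the client lowercases the input itself, because a `term` query is not analyzed while the `lowercase` filter in `full_prefix_analyzer` runs at index time.

```java
import java.util.Locale;

final class PrefixQueryBody {
    // Mirrors "max_gram" of autocomplete_filter in baidu_settings.json.
    static final int MAX_GRAM = 15;

    // Builds {"query":{"term":{"<field>":"<lowercased input>"}}}.
    static String termQuery(String field, String userInput) {
        String term = userInput.toLowerCase(Locale.ROOT);
        return "{\"query\":{\"term\":{\"" + escape(field) + "\":\""
                + escape(term) + "\"}}}";
    }

    // No indexed term is longer than MAX_GRAM characters, so longer
    // (or empty) input cannot match anything.
    static boolean canMatch(String userInput) {
        int len = userInput.length();
        return len >= 1 && len <= MAX_GRAM;
    }

    // Minimal JSON string escaping: backslashes first, then double quotes.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

For example, `PrefixQueryBody.termQuery("full_name", "LiuD")` yields a body that could be sent with a `_search` request on `baidu_index`. Checking `canMatch` first avoids a round trip for input that the `max_gram` of 15 rules out.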
Next, check what the analyzer defined above produces:
http://192.168.20.114:9200/baidu_index/_analyze?text=劉德華AT2016&analyzer=full_prefix_analyzer
{
  "tokens": [
    {"token": "劉", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
    {"token": "劉德", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
    {"token": "劉德華", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
    {"token": "劉德華a", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
    {"token": "劉德華at", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
    {"token": "劉德華at2", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
    {"token": "劉德華at20", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
    {"token": "劉德華at201", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
    {"token": "劉德華at2016", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
    {"token": "l", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
    {"token": "li", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
    {"token": "liu", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
    {"token": "liud", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
    {"token": "liude", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
    {"token": "liudeh", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
    {"token": "liudehu", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
    {"token": "liudehua", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
    {"token": "l", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
    {"token": "ld", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
    {"token": "ldh", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
    {"token": "ldha", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
    {"token": "ldhat", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
    {"token": "ldhat2", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
    {"token": "ldhat20", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
    {"token": "ldhat201", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0},
    {"token": "ldhat2016", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0}
  ]
}
All done.
References:
http://blog.csdn.net/napoay/article/details/53907921
https://elasticsearch.cn/question/407
http://blog.csdn.net/xifeijian/article/details/51095762
http://www.cnblogs.com/xing901022/p/5910139.html
http://www.cnblogs.com/clonen/p/6674492.html
https://github.com/medcl/elasticsearch-analysis-pinyin
https://github.com/medcl/elasticsearch-analysis-ik
Full-text search will be written up separately when time permits.
Reposted from: https://my.oschina.net/LucasZhu/blog/1543956