(9) Elasticsearch Suggesters Explained

Published: 2024/3/26

Reference: Elasticsearch suggest. Note: this article uses version 7.9.x, so some results differ from the original article.

Overview

(1) Term Suggester

The options array

(2) Phrase Suggester

(3) Completion Suggester

(4) global-suggest

Other

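Overview

Modern search engines generally offer "Suggest As You Type": while the user is typing a query, the engine completes or corrects the input. Helping users enter more precise keywords improves how well documents match in the subsequent full-text search. For example, when you type part of a keyword into Google, or even a misspelled keyword, it can still suggest what you meant to type:

[Screenshot: the input is completed automatically]

[Screenshot: for a misspelled input, similar words are suggested]

In Elasticsearch, this kind of feature is implemented with the suggester API. The basic principle: the input text is split into tokens, then similar terms are looked up in the index dictionary and returned. Elasticsearch provides four kinds of suggester for different scenarios: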

Term Suggester
Phrase Suggester
Completion Suggester
Context Suggester

The experiments below were run on Elasticsearch 7.9.0 in a single-node environment, and all results shown come from that environment.

(1) Term Suggester

Provides spelling correction based on individual terms.
Prepare an index named blogs with a single text field:

PUT /blogs/
{
  "mappings": {
    "properties": {
      "body": { "type": "text" }
    }
  }
}

Write a few documents through the bulk API:

POST _bulk/?refresh=true
{ "index" : { "_index" : "blogs" } }
{ "body": "Lucene is cool" }
{ "index" : { "_index" : "blogs" } }
{ "body": "Elasticsearch builds on top of lucene" }
{ "index" : { "_index" : "blogs" } }
{ "body": "Elasticsearch rocks" }
{ "index" : { "_index" : "blogs" } }
{ "body": "Elastic is the company behind ELK stack" }
{ "index" : { "_index" : "blogs" } }
{ "body": "elk rocks" }
{ "index" : { "_index" : "blogs" } }
{ "body": "elasticsearch is rock solid" }

The blogs index now contains some documents, so we can explore further. To aid understanding, let's first see which terms end up in the dictionary.
Analyze the input text:

POST _analyze
{
  "text": [
    "Lucene is cool",
    "Elasticsearch builds on top of lucene",
    "Elasticsearch rocks",
    "Elastic is the company behind ELK stack",
    "elk rocks",
    "elasticsearch is rock solid"
  ]
}

Result:

{
  "tokens": [
    {"token": "lucene", "start_offset": 0, "end_offset": 6, "type": "<ALPHANUM>", "position": 0},
    {"token": "is", "start_offset": 7, "end_offset": 9, "type": "<ALPHANUM>", "position": 1},
    {"token": "cool", "start_offset": 10, "end_offset": 14, "type": "<ALPHANUM>", "position": 2},
    {"token": "elasticsearch", "start_offset": 15, "end_offset": 28, "type": "<ALPHANUM>", "position": 3},
    {"token": "builds", "start_offset": 29, "end_offset": 35, "type": "<ALPHANUM>", "position": 4},
    {"token": "on", "start_offset": 36, "end_offset": 38, "type": "<ALPHANUM>", "position": 5},
    {"token": "top", "start_offset": 39, "end_offset": 42, "type": "<ALPHANUM>", "position": 6},
    {"token": "of", "start_offset": 43, "end_offset": 45, "type": "<ALPHANUM>", "position": 7},
    {"token": "lucene", "start_offset": 46, "end_offset": 52, "type": "<ALPHANUM>", "position": 8},
    {"token": "elasticsearch", "start_offset": 53, "end_offset": 66, "type": "<ALPHANUM>", "position": 9},
    {"token": "rocks", "start_offset": 67, "end_offset": 72, "type": "<ALPHANUM>", "position": 10},
    {"token": "elastic", "start_offset": 73, "end_offset": 80, "type": "<ALPHANUM>", "position": 11},
    {"token": "is", "start_offset": 81, "end_offset": 83, "type": "<ALPHANUM>", "position": 12},
    {"token": "the", "start_offset": 84, "end_offset": 87, "type": "<ALPHANUM>", "position": 13},
    {"token": "company", "start_offset": 88, "end_offset": 95, "type": "<ALPHANUM>", "position": 14},
    {"token": "behind", "start_offset": 96, "end_offset": 102, "type": "<ALPHANUM>", "position": 15},
    {"token": "elk", "start_offset": 103, "end_offset": 106, "type": "<ALPHANUM>", "position": 16},
    {"token": "stack", "start_offset": 107, "end_offset": 112, "type": "<ALPHANUM>", "position": 17},
    {"token": "elk", "start_offset": 113, "end_offset": 116, "type": "<ALPHANUM>", "position": 18},
    {"token": "rocks", "start_offset": 117, "end_offset": 122, "type": "<ALPHANUM>", "position": 19},
    {"token": "elasticsearch", "start_offset": 123, "end_offset": 136, "type": "<ALPHANUM>", "position": 20},
    {"token": "is", "start_offset": 137, "end_offset": 139, "type": "<ALPHANUM>", "position": 21},
    {"token": "rock", "start_offset": 140, "end_offset": 144, "type": "<ALPHANUM>", "position": 22},
    {"token": "solid", "start_offset": 145, "end_offset": 150, "type": "<ALPHANUM>", "position": 23}
  ]
}

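Each of these tokens becomes a term in the dictionary. Note that some tokens occur several times, so the frequency recorded for them in the inverted index is higher; the index also records each token's offsets and relative position in the source document.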
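To make the dictionary idea concrete, here is a minimal Python sketch that counts how many of the six sample documents contain each term. The `tokenize` function is only a rough stand-in for the standard analyzer (lowercase, split on non-alphanumerics), and all names in the block are illustrative assumptions rather than anything Elasticsearch exposes.

```python
import re
from collections import Counter

# The six sample documents written to the blogs index above.
DOCS = [
    "Lucene is cool",
    "Elasticsearch builds on top of lucene",
    "Elasticsearch rocks",
    "Elastic is the company behind ELK stack",
    "elk rocks",
    "elasticsearch is rock solid",
]


def tokenize(text):
    # Assumption: a crude approximation of the standard analyzer.
    return re.findall(r"[a-z0-9]+", text.lower())


def document_frequency(docs):
    # For each term, count the number of documents that contain it.
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return df


DF = document_frequency(DOCS)
print(DF["lucene"], DF["rocks"], DF["rock"])  # prints: 2 2 1
```

The counts for lucene and rocks agree with the "freq": 2 values that appear in the suggester responses in this article.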
Run a suggester search to see the effect:

POST /blogs/_search
{
  "suggest": {
    "my-suggestion": {
      "text": "lucne rock",
      "term": {
        "suggest_mode": "missing",
        "field": "body"
      }
    }
  }
}

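A suggest request is a special kind of search. The "text" inside the DSL is the text supplied by the API caller, typically whatever the user typed in the UI. Here lucne is a deliberate misspelling that simulates a typing error. "term" says this is a term suggester, "field" names the field the suggester works on, and there is an optional "suggest_mode". The "missing" used in the example is in fact the default value. What it means is easier to explain after looking at the response:

Result: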

{
  "took": 53,
  "timed_out": false,
  "_shards": {"total": 5, "successful": 5, "skipped": 0, "failed": 0},
  "hits": {"total": 0, "max_score": 0, "hits": []},
  "suggest": {
    "my-suggestion": [
      {
        "text": "lucne", "offset": 0, "length": 5,
        "options": [{"text": "lucene", "score": 0.8, "freq": 2}]
      },
      {
        "text": "rock", "offset": 6, "length": 4,
        "options": []
      }
    ]
  }
}

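In the response, the "suggest" -> "my-suggestion" section contains an array. Each entry corresponds to one token from the input text (stored under the "text" key) together with the suggested terms for that token (stored in the options array). The example returns entries for the two tokens "lucne" and "rock". The options for "rock" are empty, meaning there is nothing to suggest. Why? As mentioned above, the suggest mode of the query is "missing": since "rock" already exists in the index dictionary, it is considered accurate enough and no suggestion is made. Similar terms are offered only for tokens that cannot be found in the dictionary.

The result here differs slightly from the original article.

What happens if "suggest_mode" is changed to "popular"?
Try it: rerun the query, and the options for "rock" are no longer empty; "rocks" is suggested.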

{
  "took": 7,
  "timed_out": false,
  "_shards": {"total": 5, "successful": 5, "skipped": 0, "failed": 0},
  "hits": {"total": 0, "max_score": 0, "hits": []},
  "suggest": {
    "my-suggestion": [
      {
        "text": "lucne", "offset": 0, "length": 5,
        "options": [{"text": "lucene", "score": 0.8, "freq": 2}]
      },
      {
        "text": "rock", "offset": 6, "length": 4,
        "options": [{"text": "rocks", "score": 0.75, "freq": 2}]
      }
    ]
  }
}

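Recall that both rock and rocks are in the index dictionary. Evidently, even when the token the user typed is already in the dictionary, a similar term with a higher frequency may be a better fit and is therefore added to options. Finally, there is an "always" mode, which returns similar terms regardless of whether the token exists in the index dictionary.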
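The three modes can be summarized in a few lines of Python. This is only a sketch of the behavior described here, not Elasticsearch's implementation: `DOC_FREQ` and `SIMILAR` are hard-coded assumptions standing in for the index dictionary and for the candidate lookup, which is skipped entirely.

```python
# Assumption: document frequencies of the relevant terms in the sample index.
DOC_FREQ = {"lucene": 2, "rocks": 2, "rock": 1}

# Assumption: similar-term candidates are given; finding them is out of scope here.
SIMILAR = {"lucne": ["lucene"], "rock": ["rocks"]}


def suggest(token, mode="missing"):
    candidates = SIMILAR.get(token, [])
    token_freq = DOC_FREQ.get(token, 0)
    if mode == "missing":
        # Only suggest for tokens that are not in the dictionary.
        return [] if token_freq > 0 else candidates
    if mode == "popular":
        # Only suggest terms that occur in more documents than the token itself.
        return [c for c in candidates if DOC_FREQ.get(c, 0) > token_freq]
    if mode == "always":
        # Suggest regardless of whether the token exists.
        return candidates
    raise ValueError("unknown suggest_mode: " + mode)


for mode in ("missing", "popular", "always"):
    print(mode, suggest("lucne", mode), suggest("rock", mode))
```

With these inputs, "rock" gets no suggestion in "missing" mode and gets "rocks" in "popular" and "always" mode, matching the explanation above.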

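How is the similarity between two terms determined? ES uses an algorithm based on Levenshtein edit distance, whose core idea is how many single-character edits it takes to turn one word into another. The term suggester has many other optional parameters that control how fuzzy this similarity may be; they are not covered one by one here.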
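For readers who have not met edit distance before, the following is a textbook dynamic-programming implementation in Python. It is a sketch of the idea only; the term suggester's own distance measure is selected with the string_distance parameter and may differ from this plain version. The `similarity` helper is a hypothetical normalization added for illustration.

```python
def levenshtein(a, b):
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        prev = curr
    return prev[-1]


def similarity(a, b):
    # Hypothetical normalization: 1 - distance / length of the shorter word.
    return 1 - levenshtein(a, b) / min(len(a), len(b))


print(levenshtein("lucne", "lucene"))  # 1: insert one "e"
print(levenshtein("rock", "rocks"))    # 1: append one "s"
print(similarity("lucne", "lucene"), similarity("rock", "rocks"))
```

With this normalization the two pairs come out at 0.8 and 0.75, the same numbers as the "score" values in the responses above. That is an observation about this example, not a statement of how Elasticsearch computes scores.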

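As its name says, the term suggester offers suggestions only for individual analyzed terms and does not consider the relationship between terms. The API caller just picks a word from the options of each token and combines them before returning the result to the front end. Is there a more direct way, where the API itself returns text similar to what the user typed? Yes: that is what the Phrase Suggester is for.

The options array

The options array contains the suggestions for a given word. If Elasticsearch finds no suggestion, the options array is empty. Each item contains a suggested word and the following information describing it:

text: the word suggested by Elasticsearch
score: the score of the suggestion; the higher the score, the better the suggestion
freq: the document frequency of the suggestion, that is, the number of documents in the queried index that contain the word. The higher the document frequency, the more documents contain this word and the more likely it is to match the intent of the query.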
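As an illustration of how a caller might use these fields, the Python sketch below takes the entries of a term suggester response and replaces each token with its highest-scoring option. The function name and the choice of "highest score wins" are assumptions made for this example; a real application might also weigh freq or show several choices.

```python
def apply_top_options(original_text, entries):
    # entries is the list found under "suggest" -> "<suggestion name>".
    result = original_text
    # Work from right to left so that earlier offsets stay valid after a replacement.
    for entry in sorted(entries, key=lambda e: e["offset"], reverse=True):
        if not entry["options"]:
            continue  # nothing to suggest for this token
        best = max(entry["options"], key=lambda o: o["score"])
        start = entry["offset"]
        end = start + entry["length"]
        result = result[:start] + best["text"] + result[end:]
    return result


# Entries copied from the "popular" response shown earlier.
popular_entries = [
    {"text": "lucne", "offset": 0, "length": 5,
     "options": [{"text": "lucene", "score": 0.8, "freq": 2}]},
    {"text": "rock", "offset": 6, "length": 4,
     "options": [{"text": "rocks", "score": 0.75, "freq": 2}]},
]

print(apply_top_options("lucne rock", popular_entries))  # prints: lucene rocks
```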

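(2) Phrase Suggester

Returns suggestions for a whole phrase rather than for individual terms.
Building on the term suggester, the phrase suggester takes the relationship between terms into account, such as whether they occur together in the indexed source text, how close they are to each other, and their frequencies. An example makes this clearer: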

POST /blogs/_search
{
  "suggest": {
    "my-suggestion": {
      "text": "lucne and elasticsear rock",
      "phrase": {
        "field": "body",
        "highlight": {
          "pre_tag": "",
          "post_tag": ""
        }
      }
    }
  }
}

Result:

{
  "took": 18,
  "timed_out": false,
  "_shards": {"total": 5, "successful": 5, "skipped": 0, "failed": 0},
  "hits": {"total": 0, "max_score": 0, "hits": []},
  "suggest": {
    "my-suggestion": [
      {
        "text": "lucne and elasticsear rock", "offset": 0, "length": 26,
        "options": [
          {"text": "lucne and elasticsearch rocks", "highlighted": "lucne and elasticsearch rocks", "score": 0.12709484},
          {"text": "lucne and elasticsearch rock", "highlighted": "lucne and elasticsearch rock", "score": 0.10422645},
          {"text": "lucne and elasticsear rocks", "highlighted": "lucne and elasticsear rocks", "score": 0.10036137}
        ]
      }
    ]
  }
}

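options directly contains a list of phrases. Because the highlight option is set, replaced terms are wrapped in pre_tag and post_tag; both are empty strings in this request, so "highlighted" is identical to "text". The reasoning in the original article is that when the corrected terms have appeared together in the same source document, replacing them jointly is more credible, so that option scores higher and is returned first. In the result above, the top option replaces two terms at once (elasticsear -> elasticsearch, rock -> rocks), and "Elasticsearch rocks" does occur in an indexed document; lucne is left unchanged in all three options. The phrase suggester has a fair number of parameters that control how fuzzy the match may be; choose and tune them for the actual application.

The result here differs slightly from the original article.

(3) Completion Suggester

Finally, the Completion Suggester. Its main use case is auto completion. In this scenario a query is sent to the backend on every keystroke, so when the user types fast the backend must respond very quickly. It therefore uses a different data structure from the previous two suggesters: instead of an inverted index, the analyzed data is encoded into an FST (finite state transducer) stored alongside the index. For an open index, ES loads the whole FST into memory, which makes prefix lookups extremely fast. However, an FST can only serve prefix lookups, which is also the limitation of the Completion Suggester.

Create an index whose body field has the completion type: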

PUT /blogs_completion/
{
  "mappings": {
    "properties": {
      "body": { "type": "completion" }
    }
  }
}

Index some data with the bulk API:

POST _bulk/?refresh=true
{ "index" : { "_index" : "blogs_completion" } }
{ "body": "Lucene is cool" }
{ "index" : { "_index" : "blogs_completion" } }
{ "body": "Elasticsearch builds on top of lucene" }
{ "index" : { "_index" : "blogs_completion" } }
{ "body": "Elasticsearch rocks" }
{ "index" : { "_index" : "blogs_completion" } }
{ "body": "Elastic is the company behind ELK stack" }
{ "index" : { "_index" : "blogs_completion" } }
{ "body": "the elk stack rocks" }
{ "index" : { "_index" : "blogs_completion" } }
{ "body": "elasticsearch is rock solid" }

Search:

POST blogs_completion/_search?pretty
{
  "size": 0,
  "suggest": {
    "blog-suggest": {
      "prefix": "elastic i",
      "completion": {
        "field": "body"
      }
    }
  }
}

Result:

{
  "took": 44,
  "timed_out": false,
  "_shards": {"total": 5, "successful": 5, "skipped": 0, "failed": 0},
  "hits": {"total": 0, "max_score": 0, "hits": []},
  "suggest": {
    "blog-suggest": [
      {
        "text": "elastic i", "offset": 0, "length": 9,
        "options": [
          {
            "text": "Elastic is the company behind ELK stack",
            "_index": "blogs_completion",
            "_type": "tech",
            "_id": "WpgeMGoBguJ9vUco0qbN",
            "_score": 1,
            "_source": {"body": "Elastic is the company behind ELK stack"}
          }
        ]
      }
    ]
  }
}

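Note that the Completion Suggester also runs an analyze phase when indexing the source data. Depending on the analyzer chosen, some words may be transformed and some removed; this affects what is encoded into the FST and therefore what a lookup can match.

For example, create a new index blogs_completion_new and change the analyzer to "english":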

PUT /blogs_completion_new/
{
  "mappings": {
    "properties": {
      "body": {
        "type": "completion",
        "analyzer": "english"
      }
    }
  }
}

Index some data with the bulk API:

POST _bulk/?refresh=true
{ "index" : { "_index" : "blogs_completion_new" } }
{ "body": "Lucene is cool" }
{ "index" : { "_index" : "blogs_completion_new" } }
{ "body": "Elasticsearch builds on top of lucene" }
{ "index" : { "_index" : "blogs_completion_new" } }
{ "body": "Elasticsearch rocks" }
{ "index" : { "_index" : "blogs_completion_new" } }
{ "body": "Elastic is the company behind ELK stack" }
{ "index" : { "_index" : "blogs_completion_new" } }
{ "body": "the elk stack rocks" }
{ "index" : { "_index" : "blogs_completion_new" } }
{ "body": "elasticsearch is rock solid" }

Run the following query:

POST blogs_completion_new/_search?pretty
{
  "size": 0,
  "suggest": {
    "blog-suggest": {
      "prefix": "elastic i",
      "completion": {
        "field": "body"
      }
    }
  }
}

Result:

{
  "took": 2,
  "timed_out": false,
  "_shards": {"total": 5, "successful": 5, "skipped": 0, "failed": 0},
  "hits": {"total": 0, "max_score": 0, "hits": []},
  "suggest": {
    "blog-suggest": [
      {"text": "elastic i", "offset": 0, "length": 9, "options": []}
    ]
  }
}

Surprisingly, nothing matches any more. The reason is that the english analyzer strips stop words, and "is" is one of them.
Test it with the analyze API:

POST _analyze
{
  "analyzer": "english",
  "text": "elasticsearch is rock solid"
}

In Elasticsearch 6.3.1, writing this request the way the original author did produces the error "type": "illegal_argument_exception", "reason": "request [/_analyze] contains unrecognized parameter: [analyzer]".

The result is:

{
  "tokens": [
    {"token": "elasticsearch", "start_offset": 0, "end_offset": 13, "type": "<ALPHANUM>", "position": 0},
    {"token": "rock", "start_offset": 17, "end_offset": 21, "type": "<ALPHANUM>", "position": 2},
    {"token": "solid", "start_offset": 22, "end_offset": 27, "type": "<ALPHANUM>", "position": 3}
  ]
}

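The FST encodes only these 3 tokens and, by default, also records their positions in the document and the separators. When the user searches for "elastic i", the input is split into "elastic" and "i"; the FST has no encoding for "i", so the match fails.

Now try searching for "elastic is": there are results again. Why? Because this time the input text also passes through the english analyzer, which strips "is", so only the prefix "elastic" is looked up in the FST, and that matches.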
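The interplay between analysis and prefix matching can be imitated with a toy model in Python. This is a deliberately simplified sketch, not how the FST works: it ignores stemming, separators and position increments, uses a tiny hand-picked stop word set, and treats a query as matching when all of its tokens but the last equal the leading tokens of a document and the last one is a prefix of the next token. Every name in it is an assumption for illustration.

```python
# Assumption: a tiny subset of the stop words removed by the english analyzer.
STOP_WORDS = {"is", "the", "of", "on"}

DOCS = [
    "Lucene is cool",
    "Elasticsearch builds on top of lucene",
    "Elasticsearch rocks",
    "Elastic is the company behind ELK stack",
    "the elk stack rocks",
    "elasticsearch is rock solid",
]


def analyze(text, remove_stop_words):
    tokens = text.lower().split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


def matches(query_tokens, doc_tokens):
    if not query_tokens or len(query_tokens) > len(doc_tokens):
        return False
    *whole, last = query_tokens
    n = len(whole)
    # Complete tokens must match exactly; the last one only needs to be a prefix.
    return doc_tokens[:n] == whole and doc_tokens[n].startswith(last)


def complete(prefix, docs, remove_stop_words):
    # The same analysis is applied to the query and to the indexed text.
    query_tokens = analyze(prefix, remove_stop_words)
    return [d for d in docs if matches(query_tokens, analyze(d, remove_stop_words))]


print(complete("elastic i", DOCS, remove_stop_words=False))
print(complete("elastic i", DOCS, remove_stop_words=True))
print(len(complete("elastic is", DOCS, remove_stop_words=True)))
```

Without stop word removal, "elastic i" finds the "Elastic is the company behind ELK stack" document; with stop word removal it finds nothing, while "elastic is" shrinks to the single prefix "elastic" and matches again.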

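Other things that affect completion suggester results include mapping parameters such as "preserve_separators" and "preserve_position_increments", which control how loosely matching works. At search time you can also use fuzzy queries, so that "elastic i" in the example above can still match with the english analyzer.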
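As a rough illustration of the fuzzy option, the Python sketch below builds the body of a completion request with a "fuzzy" section, which the completion suggester accepts next to "field". The helper name and the fuzziness value of 1 are assumptions chosen for this example; whether a given prefix then matches depends on the analyzer and the data, so no response is shown.

```python
import json


def fuzzy_completion_body(prefix, field, fuzziness=1):
    # Build the request body for POST <index>/_search.
    return {
        "size": 0,
        "suggest": {
            "blog-suggest": {
                "prefix": prefix,
                "completion": {
                    "field": field,
                    # Allow a limited number of edits when matching the prefix.
                    "fuzzy": {"fuzziness": fuzziness},
                },
            }
        },
    }


body = fuzzy_completion_body("elastic i", "body")
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted as the body of a request such as POST blogs_completion_new/_search.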

Using the Completion Suggester well is therefore not easy. In real applications you need to combine analyzers and mapping parameters according to the data and business requirements, and tune repeatedly to get good completion results.

Back to the Google search box from the beginning: how could its completion and correction be implemented with ES? One approach I can think of:
While the user is starting to type, use the Completion Suggester for keyword prefix matching. There are many matches at first, and fewer as the user types more characters. If the input is accurate, the Completion Suggester results may already be good enough and the user sees suitable candidates.
If the Completion Suggester reaches zero matches, the user may have made a typing error, so try the Phrase Suggester.
If the Phrase Suggester finds no option, fall back to the Term Suggester.
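The fallback order above can be sketched in Python as follows. The three `fake_*` functions are hypothetical stand-ins with hard-coded data; in a real application each would send the corresponding suggest request to Elasticsearch and return its options.

```python
def suggest_with_fallback(text, stages):
    # Try each (name, suggest_fn) stage in order and return the first non-empty result.
    for name, suggest_fn in stages:
        options = suggest_fn(text)
        if options:
            return name, options
    return None, []


# Hypothetical stand-ins for the three suggesters.
def fake_completion(text):
    titles = ["elasticsearch rocks", "elastic is the company behind elk stack"]
    return [t for t in titles if t.startswith(text.lower())]


def fake_phrase(text):
    known = {"elasticsear rock": ["elasticsearch rocks"]}
    return known.get(text.lower(), [])


def fake_term(text):
    known = {"lucne": ["lucene"]}
    return known.get(text.lower(), [])


STAGES = [
    ("completion", fake_completion),
    ("phrase", fake_phrase),
    ("term", fake_term),
]

print(suggest_with_fallback("elastic", STAGES))
print(suggest_with_fallback("elasticsear rock", STAGES))
print(suggest_with_fallback("lucne", STAGES))
```

Passing the stages as a list keeps the order explicit and makes it easy to drop a stage, for example when only prefix completion is needed.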

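In terms of precision: Completion > Phrase > Term; recall is the reverse. In terms of performance, the Completion Suggester is the fastest; if it satisfies the requirements, using only the Completion Suggester for prefix matching is ideal. The Phrase and Term suggesters search the inverted index, so they should be considerably slower; keep the amount of data in the indices used by suggesters as small as possible. Ideally, after some warm-up, the index can be fully mapped into memory.

A test example from my own project: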

DELETE /steven_suggest/

PUT /steven_suggest/
{
  "mappings": {
    "properties": {
      "lable": { "type": "completion" }
    }
  }
}

GET /steven_suggest/_search
{
  "query": { "match_all": {} }
}

POST /steven_suggest/_search
{
  "suggest": {
    "my-suggestion": {
      "prefix": "原告涉",
      "completion": {
        "field": "lable"
      }
    }
  }
}

(4) global-suggest

POST _search
{
  "suggest": {
    "my-suggest-1" : {
      "text" : "tring out Elasticsearch",
      "term" : { "field" : "message" }
    },
    "my-suggest-2" : {
      "text" : "kmichy",
      "term" : { "field" : "user" }
    }
  }
}

To avoid repetition of the suggest text, it is possible to define a global text. In the example below the suggest text is defined globally and applies to the my-suggest-1 and my-suggest-2 suggestions.

POST _search
{
  "suggest": {
    "text" : "tring out Elasticsearch",
    "my-suggest-1" : {
      "term" : { "field" : "message" }
    },
    "my-suggest-2" : {
      "term" : { "field" : "user" }
    }
  }
}

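"field": "_all"

For the term suggester, "field": "_all" can be used on versions that still have the _all field. The _all field was deprecated in Elasticsearch 6.0 and removed in 7.0, so on 7.9 the request below has no _all field to draw suggestions from; a field populated through copy_to can be used instead.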

POST /blogs/_search
{
  "suggest": {
    "my-suggestion": {
      "text": "lucne rock",
      "term": {
        "suggest_mode": "missing",
        "field": "_all"
      }
    }
  }
}

Other

analyzer

The analyzer to analyse the suggest text with. Defaults to the search analyzer of the suggest field. In other words, the analyzer splits the supplied text into terms; if this option is not set, Elasticsearch uses the analyzer of the field named by the field parameter.

The phrase suggester has a smoothing model, which balances the weight between rare n-grams that do not exist in the index and frequent n-grams that do.
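To give a feel for what smoothing does, here is a small Python sketch of a "stupid backoff" style score over bigrams. Elasticsearch documents stupid_backoff as the default smoothing model with a discount of 0.4; the code below is a textbook simplification under that assumption and is not the scoring formula Elasticsearch actually uses. All function names are illustrative.

```python
from collections import Counter

DOCS = [
    "Lucene is cool",
    "Elasticsearch builds on top of lucene",
    "Elasticsearch rocks",
    "Elastic is the company behind ELK stack",
    "elk rocks",
    "elasticsearch is rock solid",
]


def ngram_counts(sentences):
    # Count single tokens and adjacent token pairs.
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def stupid_backoff(prev, word, unigrams, bigrams, discount=0.4):
    if bigrams[(prev, word)] > 0:
        # The bigram was seen: use its relative frequency.
        return bigrams[(prev, word)] / unigrams[prev]
    # Unseen bigram: back off to the discounted unigram frequency.
    return discount * unigrams[word] / sum(unigrams.values())


UNI, BI = ngram_counts(DOCS)
seen = stupid_backoff("elasticsearch", "rocks", UNI, BI)
unseen = stupid_backoff("elasticsearch", "rock", UNI, BI)
print(seen > unseen)  # prints: True
```

The seen pair "elasticsearch rocks" gets a much higher score than the unseen pair "elasticsearch rock", yet the unseen pair still receives a small non-zero score instead of being ruled out.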
