(9) Elasticsearch-suggest in Detail
Reference: Elasticsearch suggest. Note: this article uses version 7.9.X, so some details differ from the original article.
Contents
Overview
(1)Term Suggester
The options array
(2)Phrase Suggester
(3)Completion Suggester
(4)global-suggest
Other
Overview
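Modern search engines generally offer a "Suggest As You Type" feature: while the user is typing a query, the engine auto-completes or corrects the input. Helping the user enter more precise keywords improves how well documents match in the subsequent full-text search. For example, when you type part of a keyword into Google, or even a misspelled keyword, it can still suggest what you meant to type:
[Figure: the input is auto-completed]
[Figure: when the input is misspelled, similar words are suggested]
In Elasticsearch, features like these are implemented with the suggester API. The basic principle is that the input text is broken into tokens, and similar terms are then looked up in the index dictionary and returned. Elasticsearch provides four kinds of suggester for different use cases: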
- Term Suggester
- Phrase Suggester
- Completion Suggester
- Context Suggester
The experiments below were carried out on a single-node Elasticsearch 7.9.0 setup, and all results shown come from that environment.
(1)Term Suggester
Provides spelling correction based on individual terms.
Prepare an index called blogs with a single text field.
Write a few documents through the bulk API:
POST _bulk/?refresh=true
{ "index" : { "_index" : "blogs" } }
{ "body": "Lucene is cool" }
{ "index" : { "_index" : "blogs" } }
{ "body": "Elasticsearch builds on top of lucene" }
{ "index" : { "_index" : "blogs" } }
{ "body": "Elasticsearch rocks" }
{ "index" : { "_index" : "blogs" } }
{ "body": "Elastic is the company behind ELK stack" }
{ "index" : { "_index" : "blogs" } }
{ "body": "elk rocks" }
{ "index" : { "_index" : "blogs" } }
{ "body": "elasticsearch is rock solid" }
The blogs index now holds some documents, so we can explore further. To aid understanding, let's first look at which terms end up in the dictionary.
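As a rough illustration of what "terms in the dictionary" means, the sketch below tokenizes the six documents and counts in how many documents each term occurs. The `tokenize` helper (lowercase, split on non-alphanumeric characters) is a hypothetical stand-in for the default analyzer, not Elasticsearch code.

```python
import re
from collections import Counter

DOCS = [
    "Lucene is cool",
    "Elasticsearch builds on top of lucene",
    "Elasticsearch rocks",
    "Elastic is the company behind ELK stack",
    "elk rocks",
    "elasticsearch is rock solid",
]

def tokenize(text):
    # Approximation of the default analyzer: lowercase, split on non-alphanumerics.
    return re.findall(r"[a-z0-9]+", text.lower())

def document_frequency(docs):
    # Count each term once per document, however often it repeats inside it.
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return df

df = document_frequency(DOCS)
for term in ("lucene", "elasticsearch", "rocks", "rock", "lucne"):
    print(term, df[term])
```

Under these assumptions, lucene and rocks each occur in 2 documents, rock in 1, and the misspelling lucne in none.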
Analyze the input text:
Result:
{"tokens": [
  {"token": "lucene","start_offset": 0,"end_offset": 6,"type": "<ALPHANUM>","position": 0},
  {"token": "is","start_offset": 7,"end_offset": 9,"type": "<ALPHANUM>","position": 1},
  {"token": "cool","start_offset": 10,"end_offset": 14,"type": "<ALPHANUM>","position": 2},
  {"token": "elasticsearch","start_offset": 15,"end_offset": 28,"type": "<ALPHANUM>","position": 3},
  {"token": "builds","start_offset": 29,"end_offset": 35,"type": "<ALPHANUM>","position": 4},
  {"token": "on","start_offset": 36,"end_offset": 38,"type": "<ALPHANUM>","position": 5},
  {"token": "top","start_offset": 39,"end_offset": 42,"type": "<ALPHANUM>","position": 6},
  {"token": "of","start_offset": 43,"end_offset": 45,"type": "<ALPHANUM>","position": 7},
  {"token": "lucene","start_offset": 46,"end_offset": 52,"type": "<ALPHANUM>","position": 8},
  {"token": "elasticsearch","start_offset": 53,"end_offset": 66,"type": "<ALPHANUM>","position": 9},
  {"token": "rocks","start_offset": 67,"end_offset": 72,"type": "<ALPHANUM>","position": 10},
  {"token": "elastic","start_offset": 73,"end_offset": 80,"type": "<ALPHANUM>","position": 11},
  {"token": "is","start_offset": 81,"end_offset": 83,"type": "<ALPHANUM>","position": 12},
  {"token": "the","start_offset": 84,"end_offset": 87,"type": "<ALPHANUM>","position": 13},
  {"token": "company","start_offset": 88,"end_offset": 95,"type": "<ALPHANUM>","position": 14},
  {"token": "behind","start_offset": 96,"end_offset": 102,"type": "<ALPHANUM>","position": 15},
  {"token": "elk","start_offset": 103,"end_offset": 106,"type": "<ALPHANUM>","position": 16},
  {"token": "stack","start_offset": 107,"end_offset": 112,"type": "<ALPHANUM>","position": 17},
  {"token": "elk","start_offset": 113,"end_offset": 116,"type": "<ALPHANUM>","position": 18},
  {"token": "rocks","start_offset": 117,"end_offset": 122,"type": "<ALPHANUM>","position": 19},
  {"token": "elasticsearch","start_offset": 123,"end_offset": 136,"type": "<ALPHANUM>","position": 20},
  {"token": "is","start_offset": 137,"end_offset": 139,"type": "<ALPHANUM>","position": 21},
  {"token": "rock","start_offset": 140,"end_offset": 144,"type": "<ALPHANUM>","position": 22},
  {"token": "solid","start_offset": 145,"end_offset": 150,"type": "<ALPHANUM>","position": 23}
]}
Each of these tokens becomes a term in the dictionary. Note that some tokens occur more than once, so the frequency recorded for them in the inverted index is higher. The index also records each token's offsets and relative position in the source document.
Run a suggest search to see the effect:
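The request body can be pieced together from the description that follows and from the `_all` variant shown later in this article. The sketch below builds it as a plain dictionary; the helper name `term_suggest_body` is hypothetical, and the output is meant to be sent as the body of POST /blogs/_search.

```python
import json

def term_suggest_body(text, field, suggest_mode="missing"):
    # "my-suggestion" is only the label under which the results come back.
    return {
        "suggest": {
            "my-suggestion": {
                "text": text,
                "term": {"suggest_mode": suggest_mode, "field": field},
            }
        }
    }

# "lucne" is the deliberate misspelling used in this walkthrough.
body = term_suggest_body("lucne rock", "body")
print(json.dumps(body, indent=2))
```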
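A suggest request is a special kind of search. The "text" inside the DSL is the text supplied by the API caller, typically whatever the user typed into the UI. Here lucne is a deliberate misspelling that simulates a user typo. "term" marks this as a term suggester. "field" names the field the suggester works on, and there is also an optional "suggest_mode". The "missing" used in the example is in fact the default value. What does it mean? Let's look at the response first:
Result:
{
  "took": 53,
  "timed_out": false,
  "_shards": {"total": 5, "successful": 5, "skipped": 0, "failed": 0},
  "hits": {"total": 0, "max_score": 0, "hits": []},
  "suggest": {
    "my-suggestion": [
      {"text": "lucne", "offset": 0, "length": 5, "options": [{"text": "lucene", "score": 0.8, "freq": 2}]},
      {"text": "rock", "offset": 6, "length": 4, "options": [{"text": "rocks", "score": 0.75, "freq": 2}]}
    ]
  }
}
In the response, the "suggest" -> "my-suggestion" section contains an array. Each entry corresponds to one token from the input text (stored under the "text" key) together with the terms suggested for that token (stored in the options array). The example returns entries for the two words "lucne" and "rock". In the original article's run, the options for "rock" were empty, meaning nothing was suggested. Why? As mentioned above, the suggest mode was "missing": since "rock" already exists in the index dictionary, it is considered precise enough and no suggestion is made. Only words that cannot be found in the dictionary get similar options.
The result here differs slightly from the original article's: in the output above, "rock" still receives the option "rocks".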
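What happens if "suggest_mode" is changed to "popular"?
Try it: rerun the query, and the options for "rock" in the response are no longer empty; rocks is suggested.
Recall that both rock and rocks are in the index dictionary. Evidently, even when the token the user typed already exists in the dictionary, a similar term with a higher frequency may be the better fit, so it is picked for the options. Finally there is the "always" mode, which returns similar terms regardless of whether the token exists in the index dictionary.
You may wonder how the similarity of two terms is judged. ES uses an algorithm called Levenshtein edit distance, whose core idea is counting how many characters of one word must change to turn it into the other. The term suggester has many other optional parameters for controlling how fuzzy this similarity is; they are not covered one by one here.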
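The three modes and the edit-distance idea can be sketched in a few lines. This is a simplified model, not Elasticsearch's implementation: the `suggest` function, the `max_edits` cutoff, and the `similarity` normalization (1 - distance / shorter length) are assumptions. The normalization was chosen because it happens to reproduce the scores 0.8 and 0.75 seen in the output above.

```python
def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def similarity(a, b):
    # Assumed normalization: 1 - distance / length of the shorter word.
    return 1 - edit_distance(a, b) / min(len(a), len(b))

def suggest(token, doc_freq, mode="missing", max_edits=2):
    # doc_freq maps each dictionary term to the number of documents containing it.
    own_freq = doc_freq.get(token, 0)
    if mode == "missing" and own_freq > 0:
        return []  # the token is in the dictionary: nothing to suggest
    candidates = [t for t in doc_freq
                  if t != token and edit_distance(token, t) <= max_edits]
    if mode == "popular":
        # Keep only candidates found in more documents than the token itself.
        candidates = [t for t in candidates if doc_freq[t] > own_freq]
    return sorted(candidates, key=lambda t: (edit_distance(token, t), -doc_freq[t]))

DOC_FREQ = {"lucene": 2, "elasticsearch": 3, "rocks": 2, "rock": 1, "elk": 2}
for mode in ("missing", "popular", "always"):
    print(mode, suggest("lucne", DOC_FREQ, mode), suggest("rock", DOC_FREQ, mode))
print(similarity("lucne", "lucene"), similarity("rock", "rocks"))
```

In this model "rock" gets no suggestion under "missing" but gets "rocks" under "popular" and "always", while "lucne" is corrected to "lucene" in every mode.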
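As its name says, the term suggester only makes suggestions for individual analyzed terms and does not consider the relationship between terms. The API caller simply picks a word from the options of each token and combines them before returning the result to the front end. Is there a more direct way, where the API itself returns content similar to the text the user typed? There is: the Phrase Suggester.
The options array
The options array contains the suggestions for a given word. If Elasticsearch finds no suggestion, the array is empty. Each item in the array holds one suggested word plus the following information describing it:
- text: the word suggested by Elasticsearch
- score: the score of the suggestion; the higher the score, the better the suggestion
- freq: the document frequency of the suggestion, that is, the number of documents in the queried index that contain it. A higher document frequency means more documents contain the suggested word, and the word is more likely to match what we intended to search for.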
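To show how a caller might use these fields, here is a minimal sketch that picks one option per token and joins the picks into a corrected query. The response fragment is copied from the term suggester output earlier in this article; the function name `best_correction` and the tie-breaking rule (score first, then freq) are assumptions.

```python
RESPONSE = {
    "suggest": {
        "my-suggestion": [
            {"text": "lucne", "offset": 0, "length": 5,
             "options": [{"text": "lucene", "score": 0.8, "freq": 2}]},
            {"text": "rock", "offset": 6, "length": 4,
             "options": [{"text": "rocks", "score": 0.75, "freq": 2}]},
        ]
    }
}

def best_correction(response, name):
    words = []
    for entry in response["suggest"][name]:
        options = entry["options"]
        if options:
            # Highest score wins; document frequency breaks ties.
            best = max(options, key=lambda o: (o["score"], o["freq"]))
            words.append(best["text"])
        else:
            # Empty options: keep the token as the user typed it.
            words.append(entry["text"])
    return " ".join(words)

corrected = best_correction(RESPONSE, "my-suggestion")
print(corrected)  # prints: lucene rocks
```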
(2)Phrase Suggester
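Returns suggestions for a complete phrase rather than for single terms.
Building on the term suggester, the phrase suggester takes the relationship between terms into account, such as whether they occur together in the indexed source text, how close together they are, and their frequencies. An example makes this easier to see: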
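A request of roughly the following shape would produce output like the result shown next. It is a reconstruction, assuming the suggestion is named my-suggestion, runs against the body field, and uses `<em>` tags for highlighting (all three are visible in the output); the original request may have set further parameters. The helper name `phrase_suggest_body` is hypothetical.

```python
import json

def phrase_suggest_body(text, field):
    return {
        "suggest": {
            "my-suggestion": {
                "text": text,
                "phrase": {
                    "field": field,
                    # Wrap each changed term in <em> tags in the "highlighted" key.
                    "highlight": {"pre_tag": "<em>", "post_tag": "</em>"},
                },
            }
        }
    }

# Send as the body of: POST /blogs/_search
phrase_body = phrase_suggest_body("lucne and elasticsear rock", "body")
print(json.dumps(phrase_body, indent=2))
```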
Result:
{
  "took": 18,
  "timed_out": false,
  "_shards": {"total": 5, "successful": 5, "skipped": 0, "failed": 0},
  "hits": {"total": 0, "max_score": 0, "hits": []},
  "suggest": {
    "my-suggestion": [
      {
        "text": "lucne and elasticsear rock",
        "offset": 0,
        "length": 26,
        "options": [
          {"text": "lucne and elasticsearch rocks", "highlighted": "lucne and <em>elasticsearch rocks</em>", "score": 0.12709484},
          {"text": "lucne and elasticsearch rock", "highlighted": "lucne and <em>elasticsearch</em> rock", "score": 0.10422645},
          {"text": "lucne and elasticsear rocks", "highlighted": "lucne and elasticsear <em>rocks</em>", "score": 0.10036137}
        ]
      }
    ]
  }
}
The options directly return a list of phrases, and because the highlight option was set, the replaced terms are highlighted. Since lucene and elasticsearch appear together in the same source text, replacing two terms at once is more credible, so that phrase scores higher and is returned first. The phrase suggester has quite a few parameters that control how fuzzy the matching is; they need to be chosen and tuned for the actual application.
The result here differs slightly from the original article's: in the output above, lucne is left uncorrected in every option.
(3)Completion Suggester
Finally, the Completion Suggester, whose main use case is "Auto Completion". In this scenario a query has to be sent to the backend for every character the user types, so when the user types quickly the demands on backend response time are strict. It therefore uses a different data structure from the two suggesters above: instead of an inverted index, the analyzed data is encoded into an FST (finite state transducer) stored alongside the index. For an open index, ES loads the whole FST into memory, so prefix lookups are extremely fast. But an FST supports only prefix lookups, and that is the limitation of the Completion Suggester.
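The prefix-only behaviour can be illustrated with a much simpler structure. The sketch below uses a sorted list with binary search as a stand-in; it is not an FST and says nothing about how Lucene stores one. The class name `PrefixIndex` is hypothetical.

```python
import bisect

class PrefixIndex:
    """Sorted-list stand-in for the in-memory structure; answers prefix queries only."""

    def __init__(self, entries):
        # Store (lowercased key, original text) pairs in sorted order.
        self._items = sorted((e.lower(), e) for e in entries)
        self._keys = [key for key, _ in self._items]

    def complete(self, prefix, size=5):
        prefix = prefix.lower()
        start = bisect.bisect_left(self._keys, prefix)
        results = []
        for key, original in self._items[start:]:
            if not key.startswith(prefix):
                break  # sorted order: no later key can match either
            results.append(original)
            if len(results) == size:
                break
        return results

index = PrefixIndex([
    "Lucene is cool",
    "Elasticsearch builds on top of lucene",
    "Elasticsearch rocks",
    "Elastic is the company behind ELK stack",
    "the elk stack rocks",
    "elasticsearch is rock solid",
])
print(index.complete("elastic i"))
print(index.complete("rock"))
```

The first lookup returns the "Elastic is the company behind ELK stack" entry. The second returns nothing, because "rock" only occurs in the middle of entries and never at the start.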
PUT /blogs_completion/
{
  "mappings": {
    "properties": {
      "body": {"type": "completion"}
    }
  }
}
Index some data with the bulk API:
POST _bulk/?refresh=true
{ "index" : { "_index" : "blogs_completion" } }
{ "body": "Lucene is cool" }
{ "index" : { "_index" : "blogs_completion" } }
{ "body": "Elasticsearch builds on top of lucene" }
{ "index" : { "_index" : "blogs_completion" } }
{ "body": "Elasticsearch rocks" }
{ "index" : { "_index" : "blogs_completion" } }
{ "body": "Elastic is the company behind ELK stack" }
{ "index" : { "_index" : "blogs_completion" } }
{ "body": "the elk stack rocks" }
{ "index" : { "_index" : "blogs_completion" } }
{ "body": "elasticsearch is rock solid" }
Search:
POST blogs_completion/_search?pretty
{
  "size": 0,
  "suggest": {
    "blog-suggest": {
      "prefix": "elastic i",
      "completion": {"field": "body"}
    }
  }
}
Result:
{
  "took": 44,
  "timed_out": false,
  "_shards": {"total": 5, "successful": 5, "skipped": 0, "failed": 0},
  "hits": {"total": 0, "max_score": 0, "hits": []},
  "suggest": {
    "blog-suggest": [
      {
        "text": "elastic i",
        "offset": 0,
        "length": 9,
        "options": [
          {
            "text": "Elastic is the company behind ELK stack",
            "_index": "blogs_completion",
            "_type": "tech",
            "_id": "WpgeMGoBguJ9vUco0qbN",
            "_score": 1,
            "_source": {"body": "Elastic is the company behind ELK stack"}
          }
        ]
      }
    ]
  }
}
One point worth noting: the Completion Suggester also runs the source data through the analyze phase at index time. Depending on the analyzer chosen, some words may be transformed and some removed, which affects what gets encoded in the FST and therefore what matches.
For example, create a new index blogs_completion_new and change the analyzer to "english":
PUT /blogs_completion_new/
{
  "mappings": {
    "properties": {
      "body": {"type": "completion", "analyzer": "english"}
    }
  }
}
Index some data with the bulk API:
POST _bulk/?refresh=true
{ "index" : { "_index" : "blogs_completion_new" } }
{ "body": "Lucene is cool" }
{ "index" : { "_index" : "blogs_completion_new" } }
{ "body": "Elasticsearch builds on top of lucene" }
{ "index" : { "_index" : "blogs_completion_new" } }
{ "body": "Elasticsearch rocks" }
{ "index" : { "_index" : "blogs_completion_new" } }
{ "body": "Elastic is the company behind ELK stack" }
{ "index" : { "_index" : "blogs_completion_new" } }
{ "body": "the elk stack rocks" }
{ "index" : { "_index" : "blogs_completion_new" } }
{ "body": "elasticsearch is rock solid" }
Run the following query:
POST blogs_completion_new/_search?pretty
{
  "size": 0,
  "suggest": {
    "blog-suggest": {
      "prefix": "elastic i",
      "completion": {"field": "body"}
    }
  }
}
Result:
{
  "took": 2,
  "timed_out": false,
  "_shards": {"total": 5, "successful": 5, "skipped": 0, "failed": 0},
  "hits": {"total": 0, "max_score": 0, "hits": []},
  "suggest": {
    "blog-suggest": [
      {"text": "elastic i", "offset": 0, "length": 9, "options": []}
    ]
  }
}
No match at all, how puzzling! It turns out that the english analyzer strips stop words, and "is" is one of them, so it was removed.
Test it with the analyze API:
In Elasticsearch 6.3.1, writing the request the way the original author did fails here with "type": "illegal_argument_exception", "reason": "request [/_analyze] contains unrecognized parameter: [analyzer]".
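The error message suggests the analyzer was passed as a URL parameter. A form that avoids it puts both analyzer and text in the JSON body, sketched below. The text value is an assumption, inferred from the token offsets in the result that follows.

```python
import json

# Body form of the _analyze request: analyzer and text go in the JSON body, not the URL.
analyze_body = {"analyzer": "english", "text": "elasticsearch is rock solid"}

# Send as the body of: POST /_analyze
print(json.dumps(analyze_body, indent=2))
```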
The result is:
{"tokens": [
  {"token": "elasticsearch","start_offset": 0,"end_offset": 13,"type": "<ALPHANUM>","position": 0},
  {"token": "rock","start_offset": 17,"end_offset": 21,"type": "<ALPHANUM>","position": 2},
  {"token": "solid","start_offset": 22,"end_offset": 27,"type": "<ALPHANUM>","position": 3}
]}
The FST encodes only these 3 tokens and, by default, also records their positions in the document and the separators. When the user searches for "elastic i", the input is split into "elastic" and "i"; the FST has no encoding for this "i", so the match fails.
If you are still with me, try searching for "elastic is": there are results again. Why? This time, when the input text passes through the english analyzer, "is" is stripped as well, so only the prefix "elastic" is looked up in the FST, and that naturally matches.
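The reasoning above can be replayed with a toy model. The sketch assumes a tiny stop-word list, no stemming, and a made-up separator character joining the surviving tokens into one key; the real english analyzer and the real FST encoding are more involved, and position increments are ignored here.

```python
STOP_WORDS = {"a", "an", "and", "is", "of", "on", "the"}  # small illustrative subset
SEP = "\x1f"  # stand-in for the separator kept between tokens

def analyze(text):
    # Toy analyzer: lowercase, split on whitespace, drop stop words (no stemming).
    return [t for t in text.lower().split() if t not in STOP_WORDS]

def key(text):
    # Join the surviving tokens into one string, as a stand-in for the encoded input.
    return SEP.join(analyze(text))

DOCS = [
    "Lucene is cool",
    "Elasticsearch builds on top of lucene",
    "Elasticsearch rocks",
    "Elastic is the company behind ELK stack",
    "the elk stack rocks",
    "elasticsearch is rock solid",
]

def hits(query):
    # Prefix match between the analyzed query and each analyzed document.
    q = key(query)
    return [d for d in DOCS if q and key(d).startswith(q)]

print(hits("elastic i"))   # "i" survives analysis and matches nothing
print(hits("elastic is"))  # "is" is dropped; only the prefix "elastic" is looked up
```

In this model the first query returns an empty list, while the second matches every entry whose first surviving token starts with "elastic".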
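Other things that affect completion suggester results include mapping parameters such as "preserve_separators" and "preserve_position_increments", which control how loosely matching works. At search time you can also use fuzzy queries, so that "elastic i" from the example above still finds a match even with the english analyzer.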
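A fuzzy completion request might look like the sketch below. The `fuzzy` object with a `fuzziness` setting is part of the completion suggester's request syntax; the helper name `fuzzy_completion_body` and the value 1 are assumptions, and whether that value is enough for "elastic i" to match in this index has not been verified here.

```python
import json

def fuzzy_completion_body(prefix, field, fuzziness=1):
    return {
        "size": 0,
        "suggest": {
            "blog-suggest": {
                "prefix": prefix,
                # Allow the prefix to differ from indexed input by a few edits.
                "completion": {"field": field, "fuzzy": {"fuzziness": fuzziness}},
            }
        },
    }

# Send as the body of: POST /blogs_completion_new/_search
fuzzy_body = fuzzy_completion_body("elastic i", "body")
print(json.dumps(fuzzy_body, indent=2))
```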
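So using the Completion Suggester well is not easy. In real application development you need to combine analyzers and mapping parameters flexibly, according to the characteristics of the data and the business requirements, and tune repeatedly before the completion results are satisfactory.
Back to the completion and correction features of the Google search box mentioned at the start: how could they be implemented with ES? One approach I can think of:
- While the user is starting to type, use the Completion Suggester for keyword prefix matching. There will be many matches at first, and fewer as the user types more characters. If the input is accurate, the Completion Suggester results may already be good enough and the user sees the desired candidates.
- If the Completion Suggester reaches zero matches, the user may have made a typo, so try the Phrase Suggester.
- If the Phrase Suggester finds no option either, fall back to the Term Suggester.
In terms of precision: Completion > Phrase > Term, while for recall the order is reversed. In terms of performance, the Completion Suggester is the fastest; if it meets the business requirements, using only the Completion Suggester for prefix matching is ideal. Phrase and Term search the inverted index, so their performance should be considerably lower by comparison. Keep the amount of data in the index used by the suggesters as small as possible; ideally, after some warm-up time, the whole index can be mapped into memory.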
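The fallback order described above can be sketched as a small function. The three lookups are passed in as plain callables so the example runs without a cluster; `no_hits` and `fake_phrase` are stubs standing in for real suggester calls, and every name here is hypothetical.

```python
def suggest_with_fallback(text, completion, phrase, term):
    # Try the most precise suggester first and fall back only on zero matches.
    for name, lookup in (("completion", completion), ("phrase", phrase), ("term", term)):
        options = lookup(text)
        if options:
            return name, options
    return None, []

def no_hits(text):
    # Stub: a suggester that finds nothing.
    return []

def fake_phrase(text):
    # Stub: pretends the phrase suggester corrected the input.
    return ["lucene and elasticsearch rock"] if "lucne" in text else []

result = suggest_with_fallback("lucne and elasticsear rock", no_hits, fake_phrase, no_hits)
print(result)
```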
A test example from my own project:
DELETE /steven_suggest/
PUT /steven_suggest/
{
  "mappings": {
    "properties": {
      "lable": {"type": "completion"}
    }
  }
}
GET /steven_suggest/_search
{
  "query": {"match_all": {}}
}
POST /steven_suggest/_search
{
  "suggest": {
    "my-suggestion": {
      "prefix": "原告涉",
      "completion": {"field": "lable"}
    }
  }
}
(4)global-suggest
POST _search
{
  "suggest": {
    "my-suggest-1": {
      "text": "tring out Elasticsearch",
      "term": {"field": "message"}
    },
    "my-suggest-2": {
      "text": "kmichy",
      "term": {"field": "user"}
    }
  }
}
To avoid repetition of the suggest text, it is possible to define a global text. In the example below the suggest text is defined globally and applies to the my-suggest-1 and my-suggest-2 suggestions.
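POST _search
{
  "suggest": {
    "text": "tring out Elasticsearch",
    "my-suggest-1": {
      "term": {"field": "message"}
    },
    "my-suggest-2": {
      "term": {"field": "user"}
    }
  }
}
"field": "_all"
For the term suggester, you can use "field": "_all". Note that the _all meta field was deprecated in Elasticsearch 6.0 and is no longer available in 7.x, so this request is only meaningful on older clusters that still have it.
POST /blogs/_search
{
  "suggest": {
    "my-suggestion": {
      "text": "lucne rock",
      "term": {"suggest_mode": "missing", "field": "_all"}
    }
  }
}
Other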
analyzer
The analyzer to analyse the suggest text with. Defaults to the search analyzer of the suggest field. In other words, the analyzer splits the text we supply into terms; if this option is not set, Elasticsearch uses the analyzer of the field named by the field parameter.
The phrase suggester has a smoothing model, which balances the weight between rare n-grams that do not exist in the index and frequent n-grams that do.
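To give a feel for what smoothing does, the sketch below computes an additively smoothed (Laplace-style) bigram probability, so that a bigram never seen in the data still gets a small non-zero weight. This is the textbook formula, not Elasticsearch's scoring code; the function name and alpha = 0.5 are illustrative assumptions.

```python
from collections import Counter

def laplace_bigram_prob(tokens, w1, w2, alpha=0.5):
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab = len(unigrams)
    # Additive smoothing: add alpha to every count so unseen bigrams stay above zero.
    return (bigrams[(w1, w2)] + alpha) / (unigrams[w1] + alpha * vocab)

tokens = "elasticsearch rocks elk rocks elasticsearch is rock solid".split()
p_seen = laplace_bigram_prob(tokens, "elasticsearch", "rocks")
p_unseen = laplace_bigram_prob(tokens, "elasticsearch", "solid")
print(p_seen, p_unseen)
```

The bigram that occurs in the data scores higher, but the unseen one is not ruled out entirely, which is the balance the smoothing model is meant to strike.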
