Elasticsearch suggest
Reposted from https://elasticsearch.cn/article/142
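Modern search engines usually offer a "Suggest As You Type" feature: while the user is typing a query, the engine auto-completes or corrects the input. Helping the user enter more precise keywords improves how well documents match in the subsequent full-text search. Google, for example, can still suggest what the user meant when only part of a keyword has been typed, or even when the keyword is misspelled.
While the user types, the input is auto-completed.
When the input contains a mistake, similar words are suggested instead.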
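In Elasticsearch, features like these are implemented with the suggester API. The basic principle of a suggester is to break the input text into tokens, look up similar terms in the index dictionary, and return them. Elasticsearch provides four kinds of suggester for different use cases: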
- Term Suggester
- Phrase Suggester
- Completion Suggester
- Context Suggester
The experiments below were run on a single-node Elasticsearch 6.3.1 setup, and every result shown comes from that environment.
Term Suggester
Provides spelling correction based on individual terms.
Prepare an index named blogs with a single text field, body.
Write a few documents through the bulk API:
POST _bulk/?refresh=true
{ "index" : { "_index" : "blogs", "_type" : "tech" } }
{ "body": "Lucene is cool"}
{ "index" : { "_index" : "blogs", "_type" : "tech" } }
{ "body": "Elasticsearch builds on top of lucene"}
{ "index" : { "_index" : "blogs", "_type" : "tech" } }
{ "body": "Elasticsearch rocks"}
{ "index" : { "_index" : "blogs", "_type" : "tech" } }
{ "body": "Elastic is the company behind ELK stack"}
{ "index" : { "_index" : "blogs", "_type" : "tech" } }
{ "body": "elk rocks"}
{ "index" : { "_index" : "blogs", "_type" : "tech" } }
{ "body": "elasticsearch is rock solid"}
The blogs index now contains some documents, so we can start exploring. To aid understanding, let's first look at which terms end up in the dictionary.
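As a rough preview of that dictionary, the sketch below counts how many of the six sample documents contain each term. It is plain Python, not anything Elasticsearch runs: the `tokenize` helper is a simplifying assumption (lowercase, split on whitespace) standing in for the standard analyzer.

```python
from collections import Counter

# The six sample documents written to the blogs index.
DOCS = [
    "Lucene is cool",
    "Elasticsearch builds on top of lucene",
    "Elasticsearch rocks",
    "Elastic is the company behind ELK stack",
    "elk rocks",
    "elasticsearch is rock solid",
]


def tokenize(text):
    # Assumption: a crude stand-in for the standard analyzer.
    return text.lower().split()


def document_frequency(docs):
    # For each term, count the number of documents that contain it.
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return df


df = document_frequency(DOCS)
for term in ("lucene", "rocks", "rock", "elasticsearch"):
    print(term, df[term])
```

Terms such as lucene and rocks occur in two documents each while rock occurs in only one, which matters for the suggest modes discussed later.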
Analyze the input text:
Result:
{"tokens": [
  {"token": "lucene", "start_offset": 0, "end_offset": 6, "type": "<ALPHANUM>", "position": 0},
  {"token": "is", "start_offset": 7, "end_offset": 9, "type": "<ALPHANUM>", "position": 1},
  {"token": "cool", "start_offset": 10, "end_offset": 14, "type": "<ALPHANUM>", "position": 2},
  {"token": "elasticsearch", "start_offset": 15, "end_offset": 28, "type": "<ALPHANUM>", "position": 3},
  {"token": "builds", "start_offset": 29, "end_offset": 35, "type": "<ALPHANUM>", "position": 4},
  {"token": "on", "start_offset": 36, "end_offset": 38, "type": "<ALPHANUM>", "position": 5},
  {"token": "top", "start_offset": 39, "end_offset": 42, "type": "<ALPHANUM>", "position": 6},
  {"token": "of", "start_offset": 43, "end_offset": 45, "type": "<ALPHANUM>", "position": 7},
  {"token": "lucene", "start_offset": 46, "end_offset": 52, "type": "<ALPHANUM>", "position": 8},
  {"token": "elasticsearch", "start_offset": 53, "end_offset": 66, "type": "<ALPHANUM>", "position": 9},
  {"token": "rocks", "start_offset": 67, "end_offset": 72, "type": "<ALPHANUM>", "position": 10},
  {"token": "elastic", "start_offset": 73, "end_offset": 80, "type": "<ALPHANUM>", "position": 11},
  {"token": "is", "start_offset": 81, "end_offset": 83, "type": "<ALPHANUM>", "position": 12},
  {"token": "the", "start_offset": 84, "end_offset": 87, "type": "<ALPHANUM>", "position": 13},
  {"token": "company", "start_offset": 88, "end_offset": 95, "type": "<ALPHANUM>", "position": 14},
  {"token": "behind", "start_offset": 96, "end_offset": 102, "type": "<ALPHANUM>", "position": 15},
  {"token": "elk", "start_offset": 103, "end_offset": 106, "type": "<ALPHANUM>", "position": 16},
  {"token": "stack", "start_offset": 107, "end_offset": 112, "type": "<ALPHANUM>", "position": 17},
  {"token": "elk", "start_offset": 113, "end_offset": 116, "type": "<ALPHANUM>", "position": 18},
  {"token": "rocks", "start_offset": 117, "end_offset": 122, "type": "<ALPHANUM>", "position": 19},
  {"token": "elasticsearch", "start_offset": 123, "end_offset": 136, "type": "<ALPHANUM>", "position": 20},
  {"token": "is", "start_offset": 137, "end_offset": 139, "type": "<ALPHANUM>", "position": 21},
  {"token": "rock", "start_offset": 140, "end_offset": 144, "type": "<ALPHANUM>", "position": 22},
  {"token": "solid", "start_offset": 145, "end_offset": 150, "type": "<ALPHANUM>", "position": 23}
] }
Each of these tokens becomes a term in the dictionary. Note that some tokens appear more than once, so the inverted index records a higher frequency for them; it also records each token's offsets and relative position in the original document.
Run a suggester search to see the effect:
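The request body itself is not reproduced at this point in the article. A minimal sketch of what it plausibly looks like, assuming the same shape as the `_all` example near the end of the article with the field set to `body`, written as a Python dict so that it can be serialized and sent with any HTTP client (the variable name is illustrative):

```python
import json

# Assumed reconstruction of the term suggester request body.
# Intended target: POST /blogs/_search
term_suggest_body = {
    "suggest": {
        "my-suggestion": {
            "text": "lucne rock",  # deliberately misspelled user input
            "term": {
                "suggest_mode": "missing",
                "field": "body",
            },
        }
    }
}

print(json.dumps(term_suggest_body, indent=2))
```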
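A suggest request is just a special kind of search. In the DSL, "text" is the text supplied by the API caller, typically whatever the user typed into the UI. Here lucne is a deliberate misspelling that simulates a user typo. "term" indicates that this is a term suggester. "field" names the field the suggester works on, and "suggest_mode" is optional. The "missing" used in this example is in fact the default. What does it mean? It is a little puzzling at first, so let's look at the response:
Result: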
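{
  "took": 53, "timed_out": false,
  "_shards": {"total": 5, "successful": 5, "skipped": 0, "failed": 0},
  "hits": {"total": 0, "max_score": 0, "hits": []},
  "suggest": {
    "my-suggestion": [
      {"text": "lucne", "offset": 0, "length": 5, "options": [{"text": "lucene", "score": 0.8, "freq": 2}]},
      {"text": "rock", "offset": 6, "length": 4, "options": [{"text": "rocks", "score": 0.75, "freq": 2}]}
    ]
  }
}
In the response, the "suggest" -> "my-suggestion" section contains an array. Each item corresponds to one token from the input text (stored under the "text" key) together with the suggested terms for that token (stored in the options array). The example returns entries for the two words "lucne" and "rock". In the original author's run, the options for "rock" were empty, meaning there was nothing to suggest. Why? As mentioned above, the suggest mode supplied with the query is "missing", and since "rock" already exists in the index dictionary it is considered precise enough and no suggestion is made. Similar options are offered only for words that cannot be found in the dictionary.
My result differs from the original author's: for rock I still get an option, the suggestion rocks. One possible explanation, not verified here: the index has 5 shards, and if the suggest mode is evaluated against each shard's own dictionary, a shard that does not contain "rock" would still propose "rocks".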
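What happens if "suggest_mode" is changed to "popular"?
Try it and rerun the query: according to the original author, the options for "rock" are then no longer empty, and rocks is suggested.
Here too, my result is not consistent with the original author's description.
Recall that both rock and rocks are in the index dictionary. So even when the token the user typed already exists in the dictionary, a similar term with a higher frequency may be a better fit, and it is therefore added to the options. Finally, there is an "always" mode, which returns similar terms regardless of whether the token exists in the index dictionary.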
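The three modes can be summarized in a toy model. The sketch below is not how Elasticsearch is implemented: the `DOC_FREQ` dictionary, its frequencies, and the hard-coded `SIMILAR` table are assumptions taken from the sample data, and it models one single dictionary only.

```python
# Toy term dictionary: term -> document frequency (from the sample data).
DOC_FREQ = {"lucene": 2, "rock": 1, "rocks": 2}

# Assumption: similar terms are hard-coded here; a real suggester
# derives them from an edit-distance search over the dictionary.
SIMILAR = {"lucne": ["lucene"], "rock": ["rocks"], "rocks": ["rock"]}


def suggest(token, mode):
    freq = DOC_FREQ.get(token, 0)
    candidates = SIMILAR.get(token, [])
    if mode == "missing":
        # Only suggest for tokens that are not in the dictionary.
        return [] if freq > 0 else candidates
    if mode == "popular":
        # Only suggest terms that occur in more documents than the token.
        return [c for c in candidates if DOC_FREQ[c] > freq]
    if mode == "always":
        # Suggest whatever is similar, whether or not the token exists.
        return candidates
    raise ValueError(f"unknown suggest_mode: {mode}")


for mode in ("missing", "popular", "always"):
    print(mode, suggest("lucne", mode), suggest("rock", mode), suggest("rocks", mode))
```

In this model "rock" gets no suggestion under "missing" but gets "rocks" under "popular", while "rocks" only gets "rock" under "always", because rock is the less frequent term.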
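How is the similarity of two terms determined? ES uses the Levenshtein edit distance, whose core idea is counting how many characters of one word must be changed to turn it into another. The term suggester has many other optional parameters that control how fuzzy this similarity is; they are not covered one by one here.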
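To make the idea concrete, here is a minimal textbook implementation of the Levenshtein distance in Python. It illustrates the metric only and is not taken from Elasticsearch or Lucene; the function name is illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    # prev[j] is the distance between the already processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        prev = cur
    return prev[-1]


print(levenshtein("lucne", "lucene"))  # 1
print(levenshtein("rock", "rocks"))    # 1
```

Both corrections from the earlier response are a single edit away. The scores in that response (0.8 and 0.75) happen to equal 1 - distance / length of the shorter word for these two pairs; that is an observation from two data points, not a statement of the exact scoring formula.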
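As its name implies, the term suggester bases its suggestions only on individual analyzed terms and does not consider relationships between terms. The API caller simply picks a word from each token's options and combines them before returning the result to the front end. Is there a more direct way, where the API itself returns text similar to what the user typed? There is: the Phrase Suggester.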
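The options array
The options array contains the suggestions for a given word. If Elasticsearch finds no suggestion, the array is empty. Each item contains a suggested word plus the following information describing it:
- text: the word suggested by Elasticsearch
- score: the score of the suggestion; the higher the score, the better the suggestion
- freq: the document frequency of the suggestion, that is, the number of documents in the queried index that contain it. The higher the document frequency, the more documents contain the suggested word, and the more likely it is to match the intent of the query.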
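On the client side, these fields are enough to assemble a corrected query. The sketch below shows one possible policy (highest score, then highest freq), applied to a trimmed copy of the term suggester response shown earlier. The helper name and the tie-breaking rule are assumptions, not part of the Elasticsearch API.

```python
# Trimmed copy of the "suggest" section from the term suggester response.
response = {
    "suggest": {
        "my-suggestion": [
            {"text": "lucne", "offset": 0, "length": 5,
             "options": [{"text": "lucene", "score": 0.8, "freq": 2}]},
            {"text": "rock", "offset": 6, "length": 4,
             "options": [{"text": "rocks", "score": 0.75, "freq": 2}]},
        ]
    }
}


def best_correction(entries):
    words = []
    for entry in entries:
        options = entry["options"]
        if options:
            # Prefer the highest score; use document frequency as tie-breaker.
            best = max(options, key=lambda o: (o["score"], o["freq"]))
            words.append(best["text"])
        else:
            # No suggestion for this token: keep what the user typed.
            words.append(entry["text"])
    return " ".join(words)


corrected = best_correction(response["suggest"]["my-suggestion"])
print(corrected)  # lucene rocks
```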
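Phrase Suggester
Returns suggestions for a complete phrase rather than for individual terms.
Building on the term suggester, the phrase suggester takes the relationship between terms into account, such as whether they appear together in the indexed text, how close they are to each other, and their frequencies. An example makes this easier to understand: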
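The request for this example is not reproduced here. A plausible sketch, assuming the input text and `<em>` tags visible in the response, the `body` field of the blogs index, and the phrase suggester's `highlight` option with `pre_tag` and `post_tag`, written as a Python dict (the variable name is illustrative):

```python
import json

# Assumed reconstruction of the phrase suggester request body.
# Intended target: POST /blogs/_search
phrase_suggest_body = {
    "suggest": {
        "my-suggestion": {
            "text": "lucne and elasticsear rock",
            "phrase": {
                "field": "body",
                # Wrap changed terms so the UI can highlight them.
                "highlight": {"pre_tag": "<em>", "post_tag": "</em>"},
            },
        }
    }
}

print(json.dumps(phrase_suggest_body, indent=2))
```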
Result:
{
  "took": 18, "timed_out": false,
  "_shards": {"total": 5, "successful": 5, "skipped": 0, "failed": 0},
  "hits": {"total": 0, "max_score": 0, "hits": []},
  "suggest": {
    "my-suggestion": [
      {"text": "lucne and elasticsear rock", "offset": 0, "length": 26, "options": [
        {"text": "lucne and elasticsearch rocks", "highlighted": "lucne and <em>elasticsearch rocks</em>", "score": 0.12709484},
        {"text": "lucne and elasticsearch rock", "highlighted": "lucne and <em>elasticsearch</em> rock", "score": 0.10422645},
        {"text": "lucne and elasticsear rocks", "highlighted": "lucne and elasticsear <em>rocks</em>", "score": 0.10036137}
      ]}
    ]
  }
}
options directly returns a list of phrases, and because the highlight option was set, the replaced terms are highlighted. In the original author's run, lucene and elasticsearch had appeared together in the same source text, so replacing both terms at once was considered more credible, scored higher, and was returned first. The phrase suggester has quite a few parameters for controlling how fuzzy the matching is; they need to be chosen and tuned for the actual application.
My result differs from the original author's: not only are the entries in options different, the scores are different too.
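Completion Suggester
Finally, the Completion Suggester, whose main use case is auto-completion. In this scenario a query has to be sent to the back end for every character the user types, so when the user types fast the back end must respond very quickly. It therefore uses a different data structure from the previous two suggesters: lookups do not go through the inverted index; instead, the analyzed data is encoded into an FST (finite state transducer) that is stored alongside the index. For an open index, ES loads the whole FST into memory, so prefix lookups are extremely fast. An FST can only be used for prefix lookups, however, and that is the limitation of the Completion Suggester.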
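The prefix-only behaviour can be illustrated without an FST. The sketch below swaps in a much simpler structure, a sorted list searched with `bisect`, purely to show what "prefix lookup" means. It is not how Elasticsearch stores completions, and all names are illustrative.

```python
import bisect

# Lowercased sample inputs, kept sorted so matching keys are contiguous.
KEYS = sorted(s.lower() for s in [
    "Lucene is cool",
    "Elasticsearch builds on top of lucene",
    "Elasticsearch rocks",
    "Elastic is the company behind ELK stack",
    "the elk stack rocks",
    "elasticsearch is rock solid",
])


def prefix_lookup(sorted_keys, prefix):
    # Binary search for the first key >= prefix, then scan while keys match.
    start = bisect.bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break
        matches.append(key)
    return matches


print(prefix_lookup(KEYS, "elastic i"))
print(prefix_lookup(KEYS, "elasticsearch"))
print(prefix_lookup(KEYS, "elk"))
```

The last lookup returns nothing even though one input contains "elk", because that word is not at the start of the input. This is the same limitation the paragraph above describes.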
Create an index whose body field is mapped with the completion type:
PUT /blogs_completion/
{"mappings": {"tech": {"properties": {"body": {"type": "completion"}}}} }
Index some data with the bulk API:
POST _bulk/?refresh=true
{ "index" : { "_index" : "blogs_completion", "_type" : "tech" } }
{ "body": "Lucene is cool"}
{ "index" : { "_index" : "blogs_completion", "_type" : "tech" } }
{ "body": "Elasticsearch builds on top of lucene"}
{ "index" : { "_index" : "blogs_completion", "_type" : "tech" } }
{ "body": "Elasticsearch rocks"}
{ "index" : { "_index" : "blogs_completion", "_type" : "tech" } }
{ "body": "Elastic is the company behind ELK stack"}
{ "index" : { "_index" : "blogs_completion", "_type" : "tech" } }
{ "body": "the elk stack rocks"}
{ "index" : { "_index" : "blogs_completion", "_type" : "tech" } }
{ "body": "elasticsearch is rock solid"}
Search:
POST blogs_completion/_search?pretty
{ "size": 0, "suggest": {"blog-suggest": {"prefix": "elastic i", "completion": {"field": "body"}}} }
Result:
{
  "took": 44, "timed_out": false,
  "_shards": {"total": 5, "successful": 5, "skipped": 0, "failed": 0},
  "hits": {"total": 0, "max_score": 0, "hits": []},
  "suggest": {
    "blog-suggest": [
      {"text": "elastic i", "offset": 0, "length": 9, "options": [
        {"text": "Elastic is the company behind ELK stack", "_index": "blogs_completion", "_type": "tech", "_id": "WpgeMGoBguJ9vUco0qbN", "_score": 1, "_source": {"body": "Elastic is the company behind ELK stack"}}
      ]}
    ]
  }
}
Note that the Completion Suggester also runs the original data through an analyze phase at index time. Depending on the analyzer chosen, some words may be transformed and some removed, which affects how the FST is encoded and therefore how lookups match.
For example, create a new index blogs_completion_new with the analyzer changed to "english":
PUT /blogs_completion_new/
{"mappings": {"tech": {"properties": {"body": {"type": "completion", "analyzer": "english"}}}} }
Index some data with the bulk API:
POST _bulk/?refresh=true
{ "index" : { "_index" : "blogs_completion_new", "_type" : "tech" } }
{ "body": "Lucene is cool"}
{ "index" : { "_index" : "blogs_completion_new", "_type" : "tech" } }
{ "body": "Elasticsearch builds on top of lucene"}
{ "index" : { "_index" : "blogs_completion_new", "_type" : "tech" } }
{ "body": "Elasticsearch rocks"}
{ "index" : { "_index" : "blogs_completion_new", "_type" : "tech" } }
{ "body": "Elastic is the company behind ELK stack"}
{ "index" : { "_index" : "blogs_completion_new", "_type" : "tech" } }
{ "body": "the elk stack rocks"}
{ "index" : { "_index" : "blogs_completion_new", "_type" : "tech" } }
{ "body": "elasticsearch is rock solid"}
Run the following query:
POST blogs_completion_new/_search?pretty
{ "size": 0, "suggest": {"blog-suggest": {"prefix": "elastic i", "completion": {"field": "body"}}} }
Result:
{
  "took": 2, "timed_out": false,
  "_shards": {"total": 5, "successful": 5, "skipped": 0, "failed": 0},
  "hits": {"total": 0, "max_score": 0, "hits": []},
  "suggest": {
    "blog-suggest": [
      {"text": "elastic i", "offset": 0, "length": 9, "options": []}
    ]
  }
}
Surprisingly, there is no match any more. It turns out that the english analyzer strips stop words, and is is one of them, so it was removed.
Test it with the analyze API:
In Elasticsearch 6.3.1, writing the request the way the original author did fails with "type": "illegal_argument_exception", "reason": "request [/_analyze] contains unrecognized parameter: [analyzer]". The analyzer has to be passed in the JSON request body rather than as a URL parameter.
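A minimal sketch of a request body that avoids this error, assuming the analyzed text is the sample sentence whose offsets appear in the result, written as a Python dict (the variable name is illustrative):

```python
import json

# Both the analyzer and the text go into the JSON body.
# Intended target: POST _analyze
analyze_body = {
    "analyzer": "english",
    "text": "elasticsearch is rock solid",
}

print(json.dumps(analyze_body))
```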
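The result is:
{"tokens": [
  {"token": "elasticsearch", "start_offset": 0, "end_offset": 13, "type": "<ALPHANUM>", "position": 0},
  {"token": "rock", "start_offset": 17, "end_offset": 21, "type": "<ALPHANUM>", "position": 2},
  {"token": "solid", "start_offset": 22, "end_offset": 27, "type": "<ALPHANUM>", "position": 3}
] }
The FST encodes only these 3 tokens and, by default, also records their positions in the document and the separators. When the user searches for "elastic i", the input is split into "elastic" and "i"; the FST has no encoding for this "i", so the match fails.
If you are still awake enough, try searching for "elastic is": there are results again. Why? Because this time is is also stripped when the input text goes through the english analyzer, so only the prefix "elastic" is looked up in the FST, and that naturally matches.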
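The interaction between stop-word removal and prefix matching can be reproduced with a small simulation. It is a deliberately simplified model: the stop-word set is a small subset, there is no stemming, and a plain string prefix check stands in for the FST. All names are illustrative.

```python
# Assumption: a small subset of English stop words, enough for the sample data.
STOP_WORDS = {"is", "the", "of", "on", "and", "a", "an"}
SEP = "\x1f"  # arbitrary marker for token boundaries

DOCS = [
    "Lucene is cool",
    "Elasticsearch builds on top of lucene",
    "Elasticsearch rocks",
    "Elastic is the company behind ELK stack",
    "the elk stack rocks",
    "elasticsearch is rock solid",
]


def analyze(text, remove_stop_words):
    tokens = text.lower().split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


def complete(docs, prefix, remove_stop_words):
    # The same analysis is applied to the indexed text and to the query.
    query = SEP.join(analyze(prefix, remove_stop_words))
    return [d for d in docs
            if SEP.join(analyze(d, remove_stop_words)).startswith(query)]


print(complete(DOCS, "elastic i", remove_stop_words=False))
print(complete(DOCS, "elastic i", remove_stop_words=True))
print(complete(DOCS, "elastic is", remove_stop_words=True))
```

Without stop-word removal "elastic i" finds the Elastic document; with it, the indexed form no longer contains "is" and the lookup fails, while "elastic is" collapses to the prefix "elastic" and matches again.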
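Other things that influence completion suggester results include mapping parameters such as "preserve_separators" and "preserve_position_increments", which control how loosely matching works. At search time a fuzzy query can also be used, so that "elastic i" from the example above still matches even with the english analyzer.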
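A hedged sketch of such a fuzzy completion request, as a Python dict. The `fuzzy` object with a `fuzziness` setting is part of the completion suggester; the value 1 is only an illustrative choice, and whether it is enough to recover a match for this particular input was not verified here.

```python
import json

# Intended target: POST blogs_completion_new/_search
fuzzy_completion_body = {
    "size": 0,
    "suggest": {
        "blog-suggest": {
            "prefix": "elastic i",
            "completion": {
                "field": "body",
                # Allow a small edit distance between prefix and indexed input.
                "fuzzy": {"fuzziness": 1},
            },
        }
    },
}

print(json.dumps(fuzzy_completion_body, indent=2))
```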
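Using the Completion Suggester well is therefore not easy. In real application development, the analyzer and mapping parameters have to be combined flexibly according to the characteristics of the data and the business requirements, and tuned repeatedly before completion works as desired.
Back to the completion and correction features of the Google search box mentioned at the beginning: how could they be implemented with ES? One approach I can think of:
- While the user is starting to type, use the Completion Suggester for keyword prefix matching. At first there will be many matches, and they become fewer as the user types more characters. If the input is fairly precise, the Completion Suggester results may already be good enough and the user can see the desired candidates.
- If the Completion Suggester reaches zero matches, the user may have made a typo, so try the Phrase Suggester.
- If the Phrase Suggester finds no option, fall back to the Term Suggester.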
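The fallback order described above can be sketched as a small driver function. The three suggesters are passed in as plain callables, so the stubs below are hypothetical stand-ins for real Elasticsearch calls and return canned data.

```python
from typing import Callable, List, Tuple

Suggester = Callable[[str], List[str]]


def suggest_with_fallback(text: str, stages: List[Tuple[str, Suggester]]) -> Tuple[str, List[str]]:
    """Try each suggester in order and return the first non-empty result."""
    for name, suggester in stages:
        options = suggester(text)
        if options:
            return name, options
    return "none", []


# Hypothetical stubs; a real application would query Elasticsearch here.
def completion_stub(text: str) -> List[str]:
    titles = ["elasticsearch rocks", "elastic is the company behind elk stack"]
    return [t for t in titles if t.startswith(text.lower())]


def phrase_stub(text: str) -> List[str]:
    return ["elasticsearch rocks"] if text == "elasticsear rock" else []


def term_stub(text: str) -> List[str]:
    return ["lucene"] if text == "lucne" else []


STAGES = [("completion", completion_stub), ("phrase", phrase_stub), ("term", term_stub)]

for query in ("elastic", "elasticsear rock", "lucne", "xyz"):
    print(query, "->", suggest_with_fallback(query, STAGES))
```

Keeping the stages as an ordered list makes it easy to reorder them or drop the slower ones if the Completion Suggester alone turns out to be sufficient.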
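In terms of precision: Completion > Phrase > Term, while recall is the reverse. In terms of performance, the Completion Suggester is the fastest; if it meets the business need, using only the Completion Suggester for prefix matching is ideal. The Phrase and Term suggesters search the inverted index, so their performance should be considerably lower by comparison. Keep the amount of data in the indices used by suggesters as small as possible; ideally, after some warm-up, the index can be mapped entirely into memory.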
global-suggest
POST _search
{"suggest": {
  "my-suggest-1" : {"text" : "tring out Elasticsearch", "term" : {"field" : "message"}},
  "my-suggest-2" : {"text" : "kmichy", "term" : {"field" : "user"}}
} }
To avoid repetition of the suggest text, it is possible to define a global text. In the example below the suggest text is defined globally and applies to the my-suggest-1 and my-suggest-2 suggestions.
POST _search
{"suggest": {
  "text" : "tring out Elasticsearch",
  "my-suggest-1" : {"term" : {"field" : "message"}},
  "my-suggest-2" : {"term" : {"field" : "user"}}
} }
"field": "_all"
For the term suggester, "field": "_all" can be used. Note that the _all field is deprecated as of 6.0 and cannot be enabled for indices created in 6.x; on such indices a dedicated field filled through copy_to is the usual substitute.
POST /blogs/_search
{ "suggest": {"my-suggestion": {"text": "lucne rock", "term": {"suggest_mode": "missing", "field": "_all"}}} }
Other
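analyzer
The analyzer to analyse the suggest text with. Defaults to the search analyzer of the suggest field. In other words, this option specifies the analyzer that splits the supplied text into terms; if it is not set, Elasticsearch uses the analyzer of the field named by the field parameter.
The phrase suggester also has a smoothing model, which balances the weight between rare n-grams that do not exist in the index and frequent n-grams that do.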
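To show what smoothing does, the sketch below applies textbook additive (Laplace) smoothing to bigram counts from the sample documents. It illustrates the general idea only and is not Elasticsearch's scoring code; the function names and the alpha value of 0.5 are choices made for this example.

```python
from collections import Counter

DOCS = [
    "Lucene is cool",
    "Elasticsearch builds on top of lucene",
    "Elasticsearch rocks",
    "Elastic is the company behind ELK stack",
    "elk rocks",
    "elasticsearch is rock solid",
]


def bigram_counts(sentences):
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def laplace_probability(w1, w2, unigrams, bigrams, alpha=0.5):
    # Additive smoothing: every possible bigram gets alpha pseudo-counts,
    # so unseen bigrams get a small non-zero probability.
    vocab_size = len(unigrams)
    return (bigrams[(w1, w2)] + alpha) / (unigrams[w1] + alpha * vocab_size)


unigrams, bigrams = bigram_counts(DOCS)
p_seen = laplace_probability("elasticsearch", "rocks", unigrams, bigrams)
p_unseen = laplace_probability("elasticsearch", "rock", unigrams, bigrams)
print(p_seen, p_unseen)
```

The bigram "elasticsearch rocks" occurs in the data and gets the higher probability, while the unseen "elasticsearch rock" still receives a small positive one instead of zero.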
