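Multi Match Query in Elasticsearch
In Elasticsearch full-text search, Multi Match Query is one of the queries we use most often; it matches a search string against multiple fields. Elasticsearch supports five types of Multi Match, so let's take a closer look at how they differ.
The five types of Multi Match Query
Quoting directly from the official documentation: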
- best_fields: (default) Finds documents which match any field, but uses the _score from the best field.
- most_fields: Finds documents which match any field and combines the _score from each field.
- cross_fields: Treats fields with the same analyzer as though they were one big field. Looks for each word in any field.
- phrase: Runs a match_phrase query on each field and combines the _score from each field.
- phrase_prefix: Runs a match_phrase_prefix query on each field and combines the _score from each field.
Here we consider only the first three types; the last two can be studied separately and are left out for now.
Create a test index and load test data
Create the gino_product index
PUT /gino_product
{
  "mappings": {
    "product": {
      "properties": {
        "productName": {
          "type": "string",
          "analyzer": "fulltext_analyzer",
          "copy_to": ["bigSearchField"]
        },
        "brandName": {
          "type": "string",
          "analyzer": "fulltext_analyzer",
          "copy_to": ["bigSearchField"],
          "fields": {
            "brandName_pinyin": {
              "type": "string",
              "analyzer": "pinyin_analyzer",
              "search_analyzer": "standard"
            },
            "brandName_keyword": {
              "type": "string",
              "analyzer": "keyword",
              "search_analyzer": "standard"
            }
          }
        },
        "sortName": {
          "type": "string",
          "analyzer": "fulltext_analyzer",
          "copy_to": ["bigSearchField"],
          "fields": {
            "sortName_pinyin": {
              "type": "string",
              "analyzer": "pinyin_analyzer",
              "search_analyzer": "standard"
            }
          }
        },
        "productKeyword": {
          "type": "string",
          "analyzer": "fulltext_analyzer",
          "copy_to": ["bigSearchField"]
        },
        "bigSearchField": {
          "type": "string",
          "analyzer": "fulltext_analyzer"
        }
      }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    },
    "analysis": {
      "tokenizer": {
        "simple_pinyin": {
          "type": "pinyin",
          "first_letter": "none"
        }
      },
      "analyzer": {
        "fulltext_analyzer": {
          "type": "ik",
          "use_smart": true
        },
        "pinyin_analyzer": {
          "type": "custom",
          "tokenizer": "simple_pinyin",
          "filter": ["word_delimiter", "lowercase"]
        }
      }
    }
  }
}
Insert some test data
POST /gino_product/product/1
{
  "productName": "耐克女生運動輕跑鞋",
  "brandName": "耐克",
  "sortName": "鞋子",
  "productKeyword": "耐克,潮流,運動,輕跑鞋"
}
POST /gino_product/product/2
{
  "productName": "耐克女生休閑運動服",
  "brandName": "耐克",
  "sortName": "上衣",
  "productKeyword": "耐克,休閑,運動"
}
POST /gino_product/product/3
{
  "productName": "阿迪達斯女生冬季運動板鞋",
  "brandName": "阿迪達斯",
  "sortName": "鞋子",
  "productKeyword": "阿迪達斯,冬季,運動,板鞋"
}
POST /gino_product/product/4
{
  "productName": "阿迪達斯女生冬季運動夾克外套",
  "brandName": "阿迪達斯",
  "sortName": "上衣",
  "productKeyword": "阿迪達斯,冬季,運動,夾克,外套"
}
[Figure: overview of the test data]
Search for 【運動】 (sports) with each type
POST /gino_product/_search
{
  "query": {
    "multi_match": {
      "query": "運動",
      "fields": [
        "brandName^100",
        "brandName.brandName_pinyin^100",
        "brandName.brandName_keyword^100",
        "sortName^80",
        "sortName.sortName_pinyin^80",
        "productName^60",
        "productKeyword^20"
      ],
      "type": <multi-match-type>,
      "operator": "AND"
    }
  }
}
All three types return all 4 products, and the ordering is identical.
Search for 【運動 上衣】 (sports, top) with each type
POST /gino_product/_search
{
  "query": {
    "multi_match": {
      "query": "運動 上衣",
      "fields": [
        "brandName^100",
        "brandName.brandName_pinyin^100",
        "brandName.brandName_keyword^100",
        "sortName^80",
        "sortName.sortName_pinyin^80",
        "productName^60",
        "productKeyword^20"
      ],
      "type": <multi-match-type>,
      "operator": "AND"
    }
  }
}
This time only cross_fields returns any results; best_fields and most_fields return nothing. Why?
Use the validate API to compare the types
POST /gino_product/_validate/query?rewrite=true
{
  "query": {
    "multi_match": {
      "query": "運動 上衣",
      "fields": [
        "brandName^100",
        "brandName.brandName_pinyin^100",
        "brandName.brandName_keyword^100",
        "sortName^80",
        "sortName.sortName_pinyin^80",
        "productName^60",
        "productKeyword^20"
      ],
      "type": <multi-match-type>,
      "operator": "AND"
    }
  }
}
best_fields: all input tokens must match within a single field.
Each field is matched using the analyzer and search_analyzer defined for it in the mapping.
(+brandName:運動 +brandName:上衣)^100.0
| (+brandName.brandName_pinyin:運 +brandName.brandName_pinyin:動 +brandName.brandName_pinyin:上 +brandName.brandName_pinyin:衣)^100.0
| (+brandName.brandName_keyword:運 +brandName.brandName_keyword:動 +brandName.brandName_keyword:上 +brandName.brandName_keyword:衣)^100.0
| (+sortName:運動 +sortName:上衣)^80.0
| (+sortName.sortName_pinyin:運 +sortName.sortName_pinyin:動 +sortName.sortName_pinyin:上 +sortName.sortName_pinyin:衣)^80.0
| (+productName:運動 +productName:上衣)^60.0
| (+productKeyword:運動 +productKeyword:上衣)^20.0
most_fields: all input tokens must match within a single field.
The difference from best_fields lies in relevance scoring: best_fields takes the highest score among the matching fields (max), whereas most_fields adds up the scores of all matching fields (sum).
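The max-versus-sum difference can be illustrated with a minimal Python sketch. The per-field scores below are hypothetical numbers chosen for illustration, not values returned by Elasticsearch, and both helper functions are simplifications that ignore any other factor the real scoring may apply.

```python
# Hypothetical per-field scores for one document (0.0 means the field did not match).
FIELD_SCORES = {
    "brandName": 0.0,
    "productName": 1.5,
    "productKeyword": 0.5,
}


def best_fields_score(field_scores):
    """best_fields: keep only the score of the best matching field (max)."""
    matched = [s for s in field_scores.values() if s > 0]
    return max(matched) if matched else 0.0


def most_fields_score(field_scores):
    """most_fields: add up the scores of every matching field (sum)."""
    return sum(s for s in field_scores.values() if s > 0)


print(best_fields_score(FIELD_SCORES))
print(most_fields_score(FIELD_SCORES))
```

With these assumed numbers, a document that matches in two fields is ranked by its single best field under best_fields, but is rewarded for both matches under most_fields.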
(
  (+brandName:運動 +brandName:上衣)^100.0
  (+brandName.brandName_pinyin:運 +brandName.brandName_pinyin:動 +brandName.brandName_pinyin:上 +brandName.brandName_pinyin:衣)^100.0
  (+brandName.brandName_keyword:運 +brandName.brandName_keyword:動 +brandName.brandName_keyword:上 +brandName.brandName_keyword:衣)^100.0
  (+sortName:運動 +sortName:上衣)^80.0
  (+sortName.sortName_pinyin:運 +sortName.sortName_pinyin:動 +sortName.sortName_pinyin:上 +sortName.sortName_pinyin:衣)^80.0
  (+productName:運動 +productName:上衣)^60.0
  (+productKeyword:運動 +productKeyword:上衣)^20.0
)
cross_fields: all input tokens must match within the same group of fields.
ES first rewrites a cross_fields query by grouping the fields, and the grouping key is the search_analyzer. In our example, the three fields brandName.brandName_pinyin, brandName.brandName_keyword, and sortName.sortName_pinyin use standard as their search_analyzer, while the remaining fields use fulltext_analyzer, so the fields end up in two groups.
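The grouping step described above can be sketched in Python as follows. The field names, boosts, and search analyzers are taken from the mapping and query in this article; the function name group_by_search_analyzer is hypothetical, and the sketch only mimics the grouping idea rather than the actual Elasticsearch implementation.

```python
from collections import defaultdict

# field name -> (boost, search analyzer), as defined in the mapping and query above
FIELDS = {
    "brandName": (100, "fulltext_analyzer"),
    "brandName.brandName_pinyin": (100, "standard"),
    "brandName.brandName_keyword": (100, "standard"),
    "sortName": (80, "fulltext_analyzer"),
    "sortName.sortName_pinyin": (80, "standard"),
    "productName": (60, "fulltext_analyzer"),
    "productKeyword": (20, "fulltext_analyzer"),
}


def group_by_search_analyzer(fields):
    """Put every field into the group of its search analyzer."""
    groups = defaultdict(list)
    for name, (_boost, analyzer) in fields.items():
        groups[analyzer].append(name)
    return dict(groups)


for analyzer, names in group_by_search_analyzer(FIELDS).items():
    print(analyzer, names)
```

Each resulting group is then queried as if it were one big field, which is why the rewritten query has one sub-query per group.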
(
  (+(brandName.brandName_pinyin:運^100.0 | sortName.sortName_pinyin:運^80.0 | brandName.brandName_keyword:運^100.0)
   +(brandName.brandName_pinyin:動^100.0 | sortName.sortName_pinyin:動^80.0 | brandName.brandName_keyword:動^100.0)
   +(brandName.brandName_pinyin:上^100.0 | sortName.sortName_pinyin:上^80.0 | brandName.brandName_keyword:上^100.0)
   +(brandName.brandName_pinyin:衣^100.0 | sortName.sortName_pinyin:衣^80.0 | brandName.brandName_keyword:衣^100.0))
  (+(productKeyword:運動^20.0 | brandName:運動^100.0 | sortName:運動^80.0 | productName:運動^60.0)
   +(productKeyword:上衣^20.0 | brandName:上衣^100.0 | sortName:上衣^80.0 | productName:上衣^60.0))
)
Further exploration and questions
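How can best_fields and most_fields be made to match the products as well?
The most common approach is to search a combined field, either the _all field or a copy_to target such as the bigSearchField field in our mapping.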
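Why a combined field helps can be shown with a small Python simulation. The token sets below are an assumption about how product 2 might be tokenized at index time (the real tokens depend on the ik analyzer), and copy_to is modeled simply as the union of the source fields' tokens; all function names are hypothetical.

```python
# Assumed index-time tokens per field for product 2 (illustrative only).
DOC = {
    "brandName": {"耐克"},
    "sortName": {"上衣"},
    "productName": {"耐克", "女生", "休閑", "運動", "服"},
    "productKeyword": {"耐克", "休閑", "運動"},
}
QUERY_TOKENS = ["運動", "上衣"]


def matches_per_field(doc, tokens):
    """AND in best_fields/most_fields: one single field must contain every token."""
    return any(all(t in field for t in tokens) for field in doc.values())


def matches_cross_fields(doc, tokens):
    """AND in cross_fields (one group): every token must appear in at least one field."""
    return all(any(t in field for field in doc.values()) for t in tokens)


def with_copy_to(doc, target="bigSearchField"):
    """Model copy_to as a field holding the union of all source tokens."""
    return {**doc, target: set().union(*doc.values())}


print(matches_per_field(DOC, QUERY_TOKENS))
print(matches_cross_fields(DOC, QUERY_TOKENS))
print(matches_per_field(with_copy_to(DOC), QUERY_TOKENS))
```

Under these assumptions no single original field holds both tokens, so the per-field check fails, while the combined field holds both and satisfies it.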
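How can cross_fields search results be improved?
Because cross_fields groups fields by search_analyzer, an input such as 【運動 shangyi】 cannot match any product: the Chinese token and the pinyin token would have to match in different groups, while all tokens must match within one group. We should therefore keep the number of groups as small as possible, that is, use the same search_analyzer wherever we can, or explicitly specify an analyzer at search time to override the search_analyzer defined in the mapping.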
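As a sketch of the second option, the request body below sets the multi_match analyzer parameter, which the Elasticsearch documentation describes as a way to force all fields into the same cross_fields group. The helper name cross_fields_body is hypothetical, and the choice of fulltext_analyzer is an assumption; whether one analyzer suits the pinyin and keyword sub-fields depends on how those fields were indexed.

```python
import json

FIELDS = [
    "brandName^100",
    "brandName.brandName_pinyin^100",
    "brandName.brandName_keyword^100",
    "sortName^80",
    "sortName.sortName_pinyin^80",
    "productName^60",
    "productKeyword^20",
]


def cross_fields_body(query, analyzer=None):
    """Build a cross_fields request body, optionally forcing one search-time analyzer."""
    multi_match = {
        "query": query,
        "fields": FIELDS,
        "type": "cross_fields",
        "operator": "AND",
    }
    if analyzer is not None:
        # Overrides the per-field search_analyzer from the mapping (assumed choice).
        multi_match["analyzer"] = analyzer
    return {"query": {"multi_match": multi_match}}


body = cross_fields_body("運動 上衣", analyzer="fulltext_analyzer")
print(json.dumps(body, ensure_ascii=False, indent=2))
```

The printed JSON can be sent as the body of POST /gino_product/_search; running it through the validate API first shows whether the fields now fall into a single group.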
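What happens if operator is changed to OR?
In the examples above, operator was always AND, which means every token in the search must match. What happens with OR, and in which scenarios should OR be used?
Take particular care with OR: a product is returned as soon as any one token matches. For example, the search for 【運動 上衣】 above would also return the shoe products, which greatly reduces search precision.
In some special searches, such as 【耐克 阿迪達斯 上衣】 (Nike, Adidas, top), operator AND matches no product no matter which multi-match-type is used (think about why). In that case we can set operator to OR and set minimum_should_match to 60%, so that the Nike and Adidas tops can be found. This acts as a kind of smart search fallback.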
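How a percentage turns into a required number of matching clauses can be sketched as follows. According to the Elasticsearch minimum_should_match documentation, the number computed from a percentage is rounded down (and for a negative percentage it is rounded down before being subtracted from the total). The function name required_clauses is hypothetical, and the sketch assumes the threshold is applied to the per-token clauses of each group.

```python
def required_clauses(optional_clauses, percent):
    """Number of optional clauses that must match for a percentage setting."""
    # Round the computed share down, as the documentation describes.
    truncated = abs(optional_clauses * percent) // 100
    if percent < 0:
        # Negative percentage: this share of the clauses may be missing.
        return optional_clauses - truncated
    return truncated


# Three tokens (assumed tokenization: 耐克, 阿迪達斯, 上衣) at 60% and at 67%.
print(required_clauses(3, 60))
print(required_clauses(3, 67))
```

Under this rounding rule, 60% of three tokens requires only one matching token, so products that match a brand alone are also returned and the tops rely on scoring to rank first; a value such as 67% would require two of the three tokens.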
POST /gino_product/_search
{
  "query": {
    "multi_match": {
      "query": "耐克 阿迪達斯 上衣",
      "fields": [
        "brandName^100",
        "brandName.brandName_pinyin^100",
        "brandName.brandName_keyword^100",
        "sortName^80",
        "sortName.sortName_pinyin^80",
        "productName^60",
        "productKeyword^20"
      ],
      "type": "cross_fields",
      "operator": "OR",
      "minimum_should_match": "60%"
    }
  }
}
Relevance scoring revisited
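In the earlier article on learning the Elasticsearch relevance scoring mechanism we discussed how best_fields and cross_fields compute relevance scores, but the examples there all used the same search_analyzer. How is the cross_fields score computed when the fields are split into groups?
Let's reuse the example above and add the explain parameter to find out.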
POST /gino_product/_search
{
  "explain": true,
  "query": {
    "multi_match": {
      "query": "運動 上衣",
      "fields": [
        "brandName^100",
        "brandName.brandName_pinyin^100",
        "brandName.brandName_keyword^100",
        "sortName^80",
        "sortName.sortName_pinyin^80",
        "productName^60",
        "productKeyword^20"
      ],
      "type": "cross_fields",
      "operator": "AND"
    }
  }
}
Full ES response: cross_fields_scoring.json
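From the grouping information returned by the validate API and the scoring details returned by explain, we can summarize a scoring formula for cross_fields:
score(q, d) = coord(q, d) * ∑(∑(max(score(t, f))))
- coord(q, d): the group match factor. In the example above only one group matched, so coord is 0.5 (one of two groups matched);
- score(t, f): the relevance score of one search token against one specific field, computed with TF-IDF;
- max: for each search token, the highest of its scores across all fields;
- inner sum (within a group): the per-token maximum scores of all search tokens in a group are added up;
- outer sum (across groups): the scores of all groups are added up to give the final result.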
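The formula above can be worked through with a short Python sketch. The per-field scores are hypothetical numbers rather than values from the explain output, the function name cross_fields_score is an assumption, and a group is treated as matched only when every token scores in at least one field, mirroring operator AND.

```python
def cross_fields_score(groups):
    """groups: one {token: {field: score}} dict per analyzer group."""
    group_scores = []
    for token_scores in groups:
        # max: best field score for each token in this group
        maxima = [max(fields.values(), default=0.0) for fields in token_scores.values()]
        # operator AND: the group matches only if every token matched some field
        if maxima and all(m > 0 for m in maxima):
            group_scores.append(sum(maxima))  # sum within the group
    if not group_scores:
        return 0.0
    coord = len(group_scores) / len(groups)  # matched groups / all groups
    return coord * sum(group_scores)  # sum across groups


GROUPS = [
    # standard group: no single-character token matches any field
    {"運": {}, "動": {}, "上": {}, "衣": {}},
    # fulltext_analyzer group: both tokens match (scores are made up)
    {
        "運動": {"productName": 1.5, "productKeyword": 0.5},
        "上衣": {"sortName": 2.0},
    },
]

print(cross_fields_score(GROUPS))
```

With these assumed scores, the matching group contributes 1.5 + 2.0, and coord halves it because only one of the two groups matched.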
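Summary
- best_fields works better when the search consists of a single token. For example, when searching for 【耐克】, a product whose brand is 耐克 scores higher than one whose product keywords merely contain 耐克. However, when multiple tokens need to match across fields, the only option is to introduce a combined field, and then the per-field boosts lose their meaning;
- most_fields is similar to best_fields; its advantage is that it matches as many fields as possible, which makes its relevance scoring more reasonable;
- The biggest advantage of cross_fields is that it can match across fields while still making full use of each field's boost. Note, however, that fields are grouped by search_analyzer for matching, and tokens cannot match across fields that belong to different groups.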
References
- Elasticsearch Reference > Multi Match Query
Original article: http://ginobefunny.com/post/elasticsearch_multi_match_query/