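Plain-Language Elasticsearch 14 - Deep Dive into Search: Drawbacks of Using multi_match with the most_fields Strategy for cross-fields Search
Contents
- Overview
- Official documentation
- Example
Overview
Part 14 of my notes from following instructor 中華石杉 (Zhonghua Shishan)'s Elasticsearch course.
Course URL: https://www.roncoo.com/view/55
Official documentation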
https://www.elastic.co/guide/en/elasticsearch/reference/7.2/query-dsl-multi-match-query.html
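A cross-fields search looks for a single identifier whose value is spread across multiple fields.
For example, a person is identified by a name, and a building is identified by its address.
A name can be spread across several fields, such as first_name and last_name; an address can be spread across country, province, and city.
Searching across multiple fields for one such identifier, for example a person's name or an address, is a cross-fields search.
At first glance, most_fields looks like the better fit for this. best_fields favors the documents in which a single field matches best, whereas a cross-fields search is by definition not about a single field.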
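To make the most_fields idea concrete: the official multi_match documentation describes a most_fields query as behaving like a bool query with one match clause per field. The Python sketch below only builds that request body as a dictionary; the helper name expand_most_fields is hypothetical, and nothing here talks to a cluster.

```python
import json


def expand_most_fields(query, fields):
    """Build the bool/should body that a most_fields multi_match roughly corresponds to.

    Hypothetical helper for illustration: one match clause per field, all as "should".
    """
    return {
        "query": {
            "bool": {
                "should": [{"match": {field: query}} for field in fields]
            }
        }
    }


# Field names taken from the first_name / last_name example above.
body = expand_most_fields("Peter Smith", ["first_name", "last_name"])
print(json.dumps(body, indent=2))
```

Each field gets its own independent match clause, which is why this strategy is called field-centric.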
Example
Build the data
POST /forum/article/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"author_first_name" : "Peter", "author_last_name" : "Smith"} }
{ "update": { "_id": "2"} }
{ "doc" : {"author_first_name" : "Smith", "author_last_name" : "Williams"} }
{ "update": { "_id": "3"} }
{ "doc" : {"author_first_name" : "Jack", "author_last_name" : "Ma"} }
{ "update": { "_id": "4"} }
{ "doc" : {"author_first_name" : "Robbin", "author_last_name" : "Li"} }
{ "update": { "_id": "5"} }
{ "doc" : {"author_first_name" : "Tonny", "author_last_name" : "Peter Smith"} }
Run the query
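GET /forum/article/_search
{
  "query": {
    "multi_match": {
      "query": "Peter Smith",
      "type": "cross_fields",
      "fields": ["author_first_name", "author_last_name"]
    }
  }
}
For comparison, the same search with most_fields, the strategy examined here. The two types are not interchangeable in general: most_fields scores each field separately, while cross_fields treats the listed fields as one combined field.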
GET /forum/article/_search
{
  "query": {
    "multi_match": {
      "query": "Peter Smith",
      "type": "most_fields",
      "fields": ["author_first_name", "author_last_name"]
    }
  }
}
Response
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 2.3258216,
    "hits": [
      {
        "_index": "forum",
        "_type": "article",
        "_id": "1",
        "_score": 2.3258216,
        "_source": {
          "articleID": "XHDK-A-1293-#fJ3",
          "userID": 1,
          "hidden": false,
          "postDate": "2017-01-01",
          "tag": ["java", "hadoop"],
          "tag_cnt": 2,
          "view_cnt": 30,
          "title": "this is java and elasticsearch blog",
          "content": "i like to write best elasticsearch article",
          "sub_title": "learning more courses",
          "author_first_name": "Peter",
          "author_last_name": "Smith"
        }
      },
      {
        "_index": "forum",
        "_type": "article",
        "_id": "5",
        "_score": 1.7770995,
        "_source": {
          "articleID": "DHJK-B-1395-#Ky5",
          "userID": 3,
          "hidden": false,
          "postDate": "2019-05-01",
          "tag": ["elasticsearch"],
          "tag_cnt": 1,
          "view_cnt": 10,
          "title": "this is spark blog",
          "content": "spark is best big data solution based on scala ,an programming language similar to java",
          "sub_title": "haha, hello world",
          "author_first_name": "Tonny",
          "author_last_name": "Peter Smith"
        }
      },
      {
        "_index": "forum",
        "_type": "article",
        "_id": "2",
        "_score": 0.5389965,
        "_source": {
          "articleID": "KDKE-B-9947-#kL5",
          "userID": 1,
          "hidden": false,
          "postDate": "2017-01-02",
          "tag": ["java"],
          "tag_cnt": 1,
          "view_cnt": 50,
          "title": "this is java blog",
          "content": "i think java is the best programming language",
          "sub_title": "learned a lot of course",
          "author_first_name": "Smith",
          "author_last_name": "Williams"
        }
      }
    ]
  }
}
On 5.x versions you may see a different ranking: for the query Peter Smith, the document whose author_first_name matches Smith (doc 2, Smith Williams) can receive a very high score. Why?
Because its IDF is high. A high IDF means the matched term (Smith) occurs in few docs, and in the author_first_name field Smith appears only once.
The real Peter Smith is doc 1, where Smith sits in author_last_name. But Smith appears twice in the author_last_name field (docs 1 and 5), so the IDF behind doc 1's match is lower.
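The IDF gap can be worked out by hand. The sketch below assumes the BM25 idf formula used by Lucene (the default similarity since Elasticsearch 5.0), and assumes all 5 sample docs live on one shard and carry both fields. The function name bm25_idf is mine, and a real _score also depends on term frequency and field length, so this shows only the IDF part.

```python
import math


def bm25_idf(doc_count, doc_freq):
    """BM25 inverse document frequency: ln(1 + (N - df + 0.5) / (df + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


DOC_COUNT = 5  # assumption: the five sample docs, all on a single shard

# "smith" appears in author_first_name of one doc (doc 2)
idf_first = bm25_idf(DOC_COUNT, 1)
# "smith" appears in author_last_name of two docs (docs 1 and 5)
idf_last = bm25_idf(DOC_COUNT, 2)

print(f"idf(smith, author_first_name) = {idf_first:.4f}")
print(f"idf(smith, author_last_name)  = {idf_last:.4f}")
```

Under these assumptions the first value is ln(4) and the second is ln(2.4), so a Smith match in author_first_name is weighted more heavily than one in author_last_name.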
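Drawbacks of most_fields for cross-fields search
- Problem 1: it only finds the docs in which as many fields as possible match, not the docs in which some field matches completely.
- Problem 2: with most_fields, minimum_should_match cannot trim the long tail, that is, the results that match only a small part of the query, because it (like operator) is applied to each field separately.
- Problem 3: TF/IDF. Take Peter Smith and Smith Williams. When you search for Peter Smith, Smith is rare in first_name, so the term has a low document frequency in that field and scores highly; Smith Williams may end up ranked ahead of Peter Smith.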
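Problems 1 and 2 can be illustrated with a toy model over the five sample docs. This is not Elasticsearch code: the tokenizer is a plain lowercase whitespace split, scoring is ignored, and the three function names are hypothetical. It only contrasts which docs match when a condition is checked per field (field-centric, as in most_fields) versus per term across all fields (term-centric).

```python
DOCS = {
    "1": {"author_first_name": "Peter", "author_last_name": "Smith"},
    "2": {"author_first_name": "Smith", "author_last_name": "Williams"},
    "3": {"author_first_name": "Jack", "author_last_name": "Ma"},
    "4": {"author_first_name": "Robbin", "author_last_name": "Li"},
    "5": {"author_first_name": "Tonny", "author_last_name": "Peter Smith"},
}


def tokens(text):
    """Toy analyzer (assumption): lowercase and split on whitespace."""
    return set(text.lower().split())


def field_centric_or(query, doc):
    """Default behaviour: a doc matches if any field contains any query term."""
    terms = tokens(query)
    return any(terms & tokens(value) for value in doc.values())


def field_centric_and(query, doc):
    """'and' applied per field: a single field must contain every query term."""
    terms = tokens(query)
    return any(terms <= tokens(value) for value in doc.values())


def term_centric_and(query, doc):
    """'and' applied per term: every term must appear in at least one field."""
    all_tokens = set().union(*(tokens(value) for value in doc.values()))
    return tokens(query) <= all_tokens


query = "Peter Smith"
or_hits = [doc_id for doc_id, doc in DOCS.items() if field_centric_or(query, doc)]
field_hits = [doc_id for doc_id, doc in DOCS.items() if field_centric_and(query, doc)]
term_hits = [doc_id for doc_id, doc in DOCS.items() if term_centric_and(query, doc)]

print(or_hits)     # ['1', '2', '5']
print(field_hits)  # ['5']
print(term_hits)   # ['1', '5']
```

In this model the default match keeps Smith Williams (doc 2) in the long tail, while tightening the condition per field throws out the real Peter Smith (doc 1) along with it. Only the per-term check keeps both docs that actually contain the full name.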