Elasticsearch in Plain Language 17 - A Deep Dive into Search Techniques: the match_phrase Query (Phrase Matching)

Published: 2025/3/21

Contents

Overview
Official documentation
Proximity matching
Example
- match query
- match phrase query
- term position
How match_phrase works

Overview

This is part 17 of my notes from studying ES with the instructor 中華石杉 (Zhonghua Shishan).

Course URL: https://www.roncoo.com/view/55

Official documentation

https://www.elastic.co/guide/en/elasticsearch/reference/current/full-text-queries.html

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase.html

Proximity matching

Suppose the content field holds these two sentences, one per document:

java is my favourite programming language, and I also think spark is a very good big data system.
java spark are very related, because scala is spark's programming language and scala is also based on jvm like java.

Searching for java spark with a match query, the DSL looks roughly like this:

{"match": {"content": "java spark"} }

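The query string is split into the two terms java and spark, and each term is matched separately, so both docs above are returned.

A match query only tells us which documents contain the terms. With the default OR operator, any doc that contains java, or spark, or both is returned. It says nothing about whether java and spark are close to each other, so we cannot tell in which doc the two terms sit nearer together.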
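The behaviour described above can be imitated in a few lines of Python. This is only a rough sketch: `tokenize` is a hypothetical, crude stand-in for the standard analyzer (lowercase, split on non-alphanumerics), `match_any` is an illustrative helper rather than anything Elasticsearch exposes, and scoring is ignored entirely.

```python
import re

def tokenize(text):
    # Crude stand-in for the standard analyzer (assumption):
    # lowercase the text and split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())

def match_any(query, doc):
    # OR semantics: the doc matches if it shares at least one term with the query.
    return bool(set(tokenize(query)) & set(tokenize(doc)))

docs = [
    "java is my favourite programming language, and I also think spark is a very good big data system.",
    "java spark are very related, because scala is spark's programming language and scala is also based on jvm like java.",
    "i think java is the best programming language",
]

# Every doc contains at least one of the two terms, so every doc is a hit,
# regardless of how far apart java and spark are.
hits = [match_any("java spark", d) for d in docs]
print(hits)
```

Nothing in `match_any` looks at where the terms occur, which is exactly why a match query cannot answer the "how close are they" question.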

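If we want to find java spark with nothing allowed between the two words, match cannot help.

Likewise, if we want documents in which java and spark sit close together to be returned first, with a higher relevance score, we need proximity match.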

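Example

Suppose we have two requirements:

java spark must appear side by side, with nothing in between; only such docs should be returned

java spark may appear anywhere, but the closer the two words are, the higher the doc's score and the higher it ranks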

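Full-text search with match cannot satisfy either requirement. They call for two related techniques:

phrase match: the terms must appear next to each other as a phrase
proximity match: the closer the terms are, the higher the score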
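To make the idea of "closeness" concrete, here is a small Python sketch that measures the smallest gap between two terms in a document. It is an illustration only: `tokenize` and `min_distance` are hypothetical helpers, the tokenizer is a crude approximation of the standard analyzer, and this is not how Elasticsearch computes relevance scores.

```python
import re

def tokenize(text):
    # Crude stand-in for the standard analyzer (assumption).
    return re.findall(r"[a-z0-9]+", text.lower())

def min_distance(doc, a, b):
    # Smallest gap, in token positions, between any occurrence of a
    # and any occurrence of b. Returns None if either term is missing.
    tokens = tokenize(doc)
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    if not pos_a or not pos_b:
        return None
    return min(abs(i - j) for i in pos_a for j in pos_b)

doc_a = "java is my favourite programming language, and I also think spark is a very good big data system."
doc_b = "java spark are very related, because scala is spark's programming language and scala is also based on jvm like java."

print(min_distance(doc_a, "java", "spark"))  # 10: the terms are far apart
print(min_distance(doc_b, "java", "spark"))  # 1: the terms are adjacent
```

A proximity-aware query would want to rank the second sentence above the first, because its terms are adjacent.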

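This article covers phrase match, which returns only the docs in which java and spark sit next to each other. A doc such as java use’d spark does not match. A doc such as java spark are very good friends does.

A match phrase query treats several terms as one phrase and searches for them together; only docs that contain the whole phrase are returned.

This differs from a match query for java spark, which also returns docs that contain only java or only spark.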
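The contrast can be sketched in Python. The example below is a simplified model, not Elasticsearch's implementation: `tokenize` is a hypothetical approximation of the standard analyzer, and `contains_phrase` is an illustrative helper that checks whether the query tokens appear consecutively and in order.

```python
import re

def tokenize(text):
    # Crude stand-in for the standard analyzer (assumption).
    return re.findall(r"[a-z0-9]+", text.lower())

def contains_phrase(doc, phrase):
    # True only if the phrase tokens occur in the doc consecutively, in order.
    d, p = tokenize(doc), tokenize(phrase)
    if not p:
        return False
    return any(d[i:i + len(p)] == p for i in range(len(d) - len(p) + 1))

print(contains_phrase("java spark are very good friends", "java spark"))  # True
print(contains_phrase("java use'd spark", "java spark"))                  # False
print(contains_phrase("i think java is the best programming language", "java spark"))  # False
```

Scanning every document like this would be far too slow for a real index; the sections on term position and on how match_phrase works explain how positions stored in the inverted index make the check cheap.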

match query

For comparison, first look at what a match query returns:

GET /forum/article/_search
{
  "query": {
    "match": {
      "content": "java spark"
    }
  }
}

Response

{
  "took": 40,
  "timed_out": false,
  "_shards": {"total": 1, "successful": 1, "skipped": 0, "failed": 0},
  "hits": {
    "total": 2,
    "max_score": 1.8166281,
    "hits": [
      {
        "_index": "forum",
        "_type": "article",
        "_id": "5",
        "_score": 1.8166281,
        "_source": {
          "articleID": "DHJK-B-1395-#Ky5",
          "userID": 3,
          "hidden": false,
          "postDate": "2019-05-01",
          "tag": ["elasticsearch"],
          "tag_cnt": 1,
          "view_cnt": 10,
          "title": "this is spark blog",
          "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
          "sub_title": "haha, hello world",
          "author_first_name": "Tonny",
          "author_last_name": "Peter Smith",
          "new_author_last_name": "Peter Smith",
          "new_author_first_name": "Tonny"
        }
      },
      {
        "_index": "forum",
        "_type": "article",
        "_id": "2",
        "_score": 0.7721133,
        "_source": {
          "articleID": "KDKE-B-9947-#kL5",
          "userID": 1,
          "hidden": false,
          "postDate": "2017-01-02",
          "tag": ["java"],
          "tag_cnt": 1,
          "view_cnt": 50,
          "title": "this is java blog",
          "content": "i think java is the best programming language",
          "sub_title": "learned a lot of course",
          "author_first_name": "Smith",
          "author_last_name": "Williams",
          "new_author_last_name": "Williams",
          "new_author_first_name": "Smith"
        }
      }
    ]
  }
}

The doc that contains only java is returned as well, which is not what we want.

match phrase query

To demonstrate the match phrase query, first adjust the test data:

POST /forum/article/5/_update
{
  "doc": {
    "content": "spark is best big data solution based on scala ,an programming language similar to java spark"
  }
}

This sets the content of the doc with id=5 so that it contains the phrase java spark.

GET /forum/article/_search
{
  "query": {
    "match_phrase": {
      "content": "java spark"
    }
  }
}

Response

{
  "took": 47,
  "timed_out": false,
  "_shards": {"total": 1, "successful": 1, "skipped": 0, "failed": 0},
  "hits": {
    "total": 1,
    "max_score": 1.4302213,
    "hits": [
      {
        "_index": "forum",
        "_type": "article",
        "_id": "5",
        "_score": 1.4302213,
        "_source": {
          "articleID": "DHJK-B-1395-#Ky5",
          "userID": 3,
          "hidden": false,
          "postDate": "2019-05-01",
          "tag": ["elasticsearch"],
          "tag_cnt": 1,
          "view_cnt": 10,
          "title": "this is spark blog",
          "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
          "sub_title": "haha, hello world",
          "author_first_name": "Tonny",
          "author_last_name": "Peter Smith",
          "new_author_last_name": "Peter Smith",
          "new_author_first_name": "Tonny"
        }
      }
    ]
  }
}

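Only the doc that contains the phrase java spark is returned; the doc that contains only java is not.

term position

After analysis, each word becomes a term.

During analysis, ES also records the position of each term within the field.

For example, take these two docs: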

doc1: hello world, java spark
doc2: hi, spark java

After the inverted index is built:

term	doc (position)	doc (position)
hello	doc1(0)	-
world	doc1(1)	-
java	doc1(2)	doc2(2)
spark	doc1(3)	doc2(1)
hi	-	doc2(0)

You can inspect the positions with the following API:

GET _analyze
{
  "text": "hello world, java spark",
  "analyzer": "standard"
}

{
  "tokens": [
    {"token": "hello", "start_offset": 0, "end_offset": 5, "type": "<ALPHANUM>", "position": 0},
    {"token": "world", "start_offset": 6, "end_offset": 11, "type": "<ALPHANUM>", "position": 1},
    {"token": "java", "start_offset": 13, "end_offset": 17, "type": "<ALPHANUM>", "position": 2},
    {"token": "spark", "start_offset": 18, "end_offset": 23, "type": "<ALPHANUM>", "position": 3}
  ]
}

The position field shows where each term sits in the text.
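A positional inverted index like the one described in this section can be modelled in Python as a mapping from term to the docs and positions where it occurs. This is a minimal sketch under stated assumptions: `tokenize` is a hypothetical approximation of the standard analyzer, `build_index` is an illustrative helper, and the real on-disk structures in Lucene are considerably more involved.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Crude stand-in for the standard analyzer (assumption).
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    # Map each term to {doc_id: [positions]}, with positions counted from 0.
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(pos)
    return index

docs = {
    "doc1": "hello world, java spark",
    "doc2": "hi, spark java",
}
index = build_index(docs)

# Print each term with the docs and positions where it occurs.
for term, postings in index.items():
    print(term, postings)
```

For these two docs the positions agree with the `_analyze` output: hello is at position 0 in doc1, java is at position 2 in both docs, and spark is at position 3 in doc1 and position 1 in doc2.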

How match_phrase works

To understand match_phrase, start with the positions stored in the index.

Take the same two docs:

doc1: hello world, java spark
doc2: hi, spark java

term	doc (position)	doc (position)
hello	doc1(0)	-
world	doc1(1)	-
java	doc1(2)	doc2(2)
spark	doc1(3)	doc2(1)
hi	-	doc2(0)

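Now query java spark with match phrase.

First, java spark is split into java and spark, and each term is looked up in the index:

java appears in doc1(2) and doc2(2)
spark appears in doc1(3) and doc2(1)

Next, find the docs that contain every term. A doc must contain all of the terms before it is considered any further.

doc1 --> contains java and spark --> java is at position 2 and spark at position 3, so spark's position is exactly 1 greater than java's --> the condition holds

doc1 matches.

doc2 --> contains java and spark --> java is at position 2 and spark at position 1, so spark's position is 1 less than java's rather than 1 greater --> the positions alone rule it out, so doc2 does not match.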
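The walk-through above can be expressed as a short Python sketch. It assumes a simplified in-memory index of the form term -> {doc_id: [positions]} and a hypothetical helper `match_phrase`; it mirrors the reasoning in this section, not the actual Lucene implementation.

```python
def match_phrase(index, terms):
    # index: term -> {doc_id: [positions]} (simplified model, an assumption)
    postings = [index.get(t, {}) for t in terms]
    if not postings:
        return []

    # Step 1: keep only the docs that contain every term.
    candidates = set(postings[0])
    for p in postings[1:]:
        candidates &= set(p)

    # Step 2: in each candidate, the k-th term must sit exactly k positions
    # after some occurrence of the first term.
    hits = []
    for doc_id in sorted(candidates):
        for start in postings[0][doc_id]:
            if all(start + k in postings[k][doc_id] for k in range(1, len(terms))):
                hits.append(doc_id)
                break
    return hits

index = {
    "hello": {"doc1": [0]},
    "world": {"doc1": [1]},
    "java": {"doc1": [2], "doc2": [2]},
    "spark": {"doc1": [3], "doc2": [1]},
    "hi": {"doc2": [0]},
}

print(match_phrase(index, ["java", "spark"]))  # ['doc1']
print(match_phrase(index, ["spark", "java"]))  # ['doc2']
```

Both docs survive step 1 because both contain java and spark. Only the position check in step 2 separates them, which is why the order of the query terms changes the result.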
