Introduction to Elasticsearch Scoring
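1. The Elasticsearch scoring formula
The formula described here is Lucene's classic TF/IDF scoring formula, which was the Elasticsearch default before version 5.0 (newer versions default to BM25, which is what the explain output later in this article shows). The computation has two parts: one scores the query side, the other scores the field side. The formula given on the ES website is: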
score(q,d) = queryNorm(q) · coord(q,d) · ∑ ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) ) (t in q)
queryNorm(q):
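This normalizes the query. It does not affect ranking, because for a given query the value is the same for every document. In ES, however, that holds only when the index has a single shard. With more shards there are small differences, because each shard computes its own queryNorm value.
queryNorm(q) = 1 / √sumOfSquaredWeights
The formula above is the one from the ES website. It applies when the query boost and every term boost are left at the default of 1, where
sumOfSquaredWeights = idf(t1)*idf(t1) + idf(t2)*idf(t2) + ... + idf(tn)*idf(tn)
and n is the number of terms the query is split into. All of this assumes boosts of 1. With boosts included, the formula in Lucene's TFIDFSimilarity documentation is:
sumOfSquaredWeights = q.getBoost()² · ∑ ( idf(t) · t.getBoost() )² (t in q)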
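As a rough illustration of the query normalization, here is a minimal Python sketch. The function name `query_norm` and the sample idf values are made up for this example, and the boost handling assumes the boosted form of sumOfSquaredWeights from Lucene's TFIDFSimilarity documentation, q.getBoost()² · ∑ (idf(t) · t.getBoost())².

```python
import math

def query_norm(idfs, term_boosts=None, query_boost=1.0):
    """Classic TF/IDF query normalization: 1 / sqrt(sumOfSquaredWeights)."""
    if term_boosts is None:
        # Default case from the article: every term boost is 1.
        term_boosts = [1.0] * len(idfs)
    sum_of_squared_weights = query_boost ** 2 * sum(
        (idf * boost) ** 2 for idf, boost in zip(idfs, term_boosts)
    )
    return 1.0 / math.sqrt(sum_of_squared_weights)

# Hypothetical idf values for a two-term query, all boosts left at 1.
print(query_norm([3.0, 4.0]))
```

Because every document on the same shard is multiplied by the same value, this factor changes the absolute scores but not their order within that shard.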
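coord(q,d):
coord(q,d) is a coordination factor with the value:
coord(q,d) = overlap / maxoverlap
where overlap is the number of query terms that match the document and maxoverlap is the total number of terms in the query. For example, with the default analyzer the query "無線通信" is split into four terms. If the document is "通知他們開會", only "通" matches, so the value is 1/4 = 0.25.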
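The coord example can be reproduced with a few lines of Python. This is only a sketch: the helper name `coord` is hypothetical, and splitting each string into single characters stands in for the default analyzer's handling of Chinese text.

```python
def coord(query_terms, doc_terms):
    """Fraction of query terms that occur in the document."""
    overlap = sum(1 for term in query_terms if term in doc_terms)
    max_overlap = len(query_terms)
    return overlap / max_overlap

# One term per character, mimicking the default analyzer on Chinese text.
query_terms = list("無線通信")
doc_terms = set("通知他們開會")
print(coord(query_terms, doc_terms))
```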
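tf(t in d):
This is based on the number of times term t occurs in the document. The formula given on the website is:
tf(t in d) = √frequency
That is, the square root of the number of occurrences. There is not much more to say here; the actual scoring does exactly this.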
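idf(t):
This is the inverse document frequency: a measure of how rare the term is, based on how many documents contain it relative to the total number of documents. ES differs slightly from Lucene here. The two agree only when the index has a single shard; with more than one shard, each shard has its own idf value. It is computed as follows:
idf(t) = 1 + log ( numDocs / (docFreq + 1))
Note that log is the natural logarithm (base e), not base 10 or base 2. numDocs is the total number of documents; with multiple shards it is the number of documents in the current shard. docFreq is the number of documents that contain the term, again counted within the current shard. This is where the computation differs from plain Lucene. To verify the formula, set the number of shards to 1.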
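The tf and idf formulas above translate directly into code. The sketch below uses made-up function names and sample numbers; the only point it illustrates is that `math.log` is the natural logarithm the text calls for.

```python
import math

def classic_tf(frequency):
    """tf(t in d) = sqrt(frequency)."""
    return math.sqrt(frequency)

def classic_idf(num_docs, doc_freq):
    """idf(t) = 1 + ln(numDocs / (docFreq + 1)), using per-shard counts."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical shard: 1000 documents, 9 of which contain the term,
# and a document in which the term occurs 4 times.
print(classic_tf(4))
print(classic_idf(1000, 9))
```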
t.getBoost():
The weight of each term. I have not studied this factor closely. My understanding is that if a boost is set on a field, every term matched in that field takes the field's boost.
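norm(t,d):
The field normalization factor. The official explanation is that the shorter the field, the higher the weight of a match. For example, when searching for 無線通信, one document may be very long and contain all of these characters without being what we want, while another is short and contains 無線通信 in full. We should not give the short one a lower weight just because the phrase occurs only once; on the contrary, its weight should be higher. The formula is:
norm(t,d) = d.getBoost() · lengthNorm(f) · f.getBoost()
d.getBoost: the larger the document's boost, the more important the document.
f.getBoost: the larger the field's boost, the more important the field.
lengthNorm: the longer the field, the less important the match; the shorter the field, the more important. In the formula given in the official documentation all boosts default to 1, which leaves:
norm(d) = 1 / √numTerms
That is the theory. Now let's look at a real example.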
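Before turning to real output, the pieces of the classic formula can be put together in one place. The following is a minimal sketch, not Lucene's implementation: all function names and sample numbers are hypothetical, document and field boosts are assumed to be 1, and the norm is used at full precision, whereas Lucene stores it in a lossy compressed form, so real scores will differ slightly.

```python
import math

def length_norm(num_terms):
    """norm(d) with all boosts at 1: 1 / sqrt(number of terms in the field)."""
    return 1.0 / math.sqrt(num_terms)

def classic_score(query_terms, num_docs, field_length):
    """Classic TF/IDF score of one document for one query.

    query_terms: list of (freq_in_doc, doc_freq, term_boost), one entry per
    query term; freq_in_doc == 0 means the term did not match the document.
    """
    idfs = [1.0 + math.log(num_docs / (doc_freq + 1)) for _, doc_freq, _ in query_terms]
    boosts = [boost for _, _, boost in query_terms]
    # queryNorm = 1 / sqrt(sum of squared term weights), query boost assumed 1.
    query_norm = 1.0 / math.sqrt(sum((i * b) ** 2 for i, b in zip(idfs, boosts)))
    # coord = matched query terms / all query terms.
    matched = sum(1 for freq, _, _ in query_terms if freq > 0)
    coord = matched / len(query_terms)
    norm = length_norm(field_length)
    # Sum tf * idf^2 * boost * norm over the terms that matched.
    total = sum(
        math.sqrt(freq) * idf ** 2 * boost * norm
        for (freq, _, boost), idf in zip(query_terms, idfs)
        if freq > 0
    )
    return query_norm * coord * total

# Hypothetical numbers: a one-term query, 1000 documents in the shard,
# 9 of them containing the term, and a matched field that is 4 terms long.
print(classic_score([(1, 9, 1.0)], num_docs=1000, field_length=4))
```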
GET act_shop-2018.01.12/shop/_search
{"size": 1, "query": {"term": {"name.keyword": "星巴克"}}, "explain": true }
The result is:
{"took": 25,"timed_out": false,"_shards": {"total": 150,"successful": 150,"failed": 0},"hits": {"total": 127667,"max_score": 15.511484,"hits": [{"_shard": "[act_shop-2018.01.12][80]","_node": "6vfIeV95QOK1vAcLdx6CEA","_index": "act_shop-2018.01.12","_type": "shop","_id": "187672","_score": 15.511484,"_routing": "36341","_parent": "36341","_source": {"status": 1,"city": {"id": 2084,"name": "虹口區"},"update_time": "2017-10-23 15:23:00.329000","tel": ["021-65200108"],"name": "星巴克(涼城店)","tags": ["餐飲服務","咖啡廳","咖啡廳"],"tags_enrich": {"name": "美食","id": 10},"id": 187672,"label": "have_act","create_time": "2017-01-11 14:59:43.950000","city_enrich": {"region": "華東地區","name": "上海","level": 1},"address": "車站南路330弄2號、6號第一、二層的4839F01059","coordinate": {"lat": 31.29496,"lon": 121.475442},"brand": {"id": 490,"name": "星巴克"}},"_explanation": {"value": 15.511484,"description": "sum of:","details": [{"value": 15.511484,"description": "sum of:","details": [{"value": 4.7601295,"description": "weight(name:星 in 6914) [PerFieldSimilarity], result of:","details": [{"value": 4.7601295,"description": "score(doc=6914,freq=1.0 = termFreq=1.0\n), product of:","details": [{"value": 4.314013,"description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:","details": [{"value": 159,"description": "docFreq","details": []},{"value": 11920,"description": "docCount","details": []}]},{"value": 1.103411,"description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:","details": [{"value": 1,"description": "termFreq=1.0","details": []},{"value": 1.2,"description": "parameter k1","details": []},{"value": 0.75,"description": "parameter b","details": []},{"value": 9.224329,"description": "avgFieldLength","details": []},{"value": 7.111111,"description": "fieldLength","details": []}]}]}]},{"value": 5.0423846,"description": "weight(name:巴 in 6914) [PerFieldSimilarity], result of:","details": [{"value": 
5.0423846,"description": "score(doc=6914,freq=1.0 = termFreq=1.0\n), product of:","details": [{"value": 4.5698156,"description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:","details": [{"value": 123,"description": "docFreq","details": []},{"value": 11920,"description": "docCount","details": []}]},{"value": 1.103411,"description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:","details": [{"value": 1,"description": "termFreq=1.0","details": []},{"value": 1.2,"description": "parameter k1","details": []},{"value": 0.75,"description": "parameter b","details": []},{"value": 9.224329,"description": "avgFieldLength","details": []},{"value": 7.111111,"description": "fieldLength","details": []}]}]}]},{"value": 5.70897,"description": "weight(name:克 in 6914) [PerFieldSimilarity], result of:","details": [{"value": 5.70897,"description": "score(doc=6914,freq=1.0 = termFreq=1.0\n), product of:","details": [{"value": 5.173929,"description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:","details": [{"value": 67,"description": "docFreq","details": []},{"value": 11920,"description": "docCount","details": []}]},{"value": 1.103411,"description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:","details": [{"value": 1,"description": "termFreq=1.0","details": []},{"value": 1.2,"description": "parameter k1","details": []},{"value": 0.75,"description": "parameter b","details": []},{"value": 9.224329,"description": "avgFieldLength","details": []},{"value": 7.111111,"description": "fieldLength","details": []}]}]}]}]},{"value": 0,"description": "match on required clause, product of:","details": [{"value": 0,"description": "# clause","details": []},{"value": 1,"description": "_type:shop, product of:","details": [{"value": 1,"description": "boost","details": []},{"value": 1,"description": "queryNorm","details": 
[]}]}]}]}}]} }
In detail:
1. In the shard "_shard": "[act_shop-2018.01.12][80]", the ES standard analyzer splits the match text '星巴克' into three terms: '星', '巴', and '克'. Each term scores:
'星':4.7601295
'巴':5.0423846
'克':5.70897
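Total score: 4.7601295 + 5.0423846 + 5.70897 = 15.511484
2. Next, how each term is scored, taking '星' as the example:
score('星') = idf · tfNorm (that is, inverse document frequency × normalized term frequency)
idf is computed as follows: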
{"value": 4.7601295,"description": "score(doc=6914,freq=1.0 = termFreq=1.0\n), product of:","details": [{"value": 4.314013,"description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:","details": [{"value": 159,"description": "docFreq","details": []},{"value": 11920,"description": "docCount","details": []}]}
docFreq: the number of documents in this shard that contain '星': 159
docCount: the total number of documents in this shard: 11920
Formula: log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) = 4.314013
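The idf formula from the explain output can be checked with a short Python sketch. The function name `bm25_idf` is made up for this example; the docFreq and docCount values are the ones reported in the explain output above. Expect the results to land close to the reported idf values rather than match them digit for digit.

```python
import math

def bm25_idf(doc_count, doc_freq):
    """idf as printed by explain: ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

# docFreq per term and docCount for the shard, taken from the explain output.
for term, doc_freq in [("星", 159), ("巴", 123), ("克", 67)]:
    print(term, bm25_idf(11920, doc_freq))
```

The rarer the term in the shard (smaller docFreq), the larger its idf, which is why '克' contributes the most to the total.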
tfNorm is computed as follows.
tf can be understood as a measure of how often '星' occurs in a given document, adjusted for the length of the field.
{"value": 1.103411,"description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:","details": [{"value": 1,"description": "termFreq=1.0","details": []},{"value": 1.2,"description": "parameter k1","details": []},{"value": 0.75,"description": "parameter b","details": []},{"value": 9.224329,"description": "avgFieldLength","details": []},{"value": 7.111111,"description": "fieldLength","details": []}]}
tfNorm = (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) = 1.103411
So score('星') = idf · tfNorm = 4.314013 * 1.103411 = 4.7601295
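To tie the walkthrough together, the sketch below recomputes tfNorm and the per-term scores from the numbers in the explain output and sums them. The function names are hypothetical, and k1 = 1.2 and b = 0.75 are simply the parameter values the explain output reports. The results should come out close to the explain values; small differences in the last digits are possible.

```python
import math

K1 = 1.2   # "parameter k1" from the explain output
B = 0.75   # "parameter b" from the explain output

def bm25_idf(doc_count, doc_freq):
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_tf_norm(freq, field_length, avg_field_length, k1=K1, b=B):
    """tfNorm as printed by explain."""
    return (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * field_length / avg_field_length))

def bm25_term_score(doc_count, doc_freq, freq, field_length, avg_field_length):
    """Score of one term in one document: idf * tfNorm."""
    return bm25_idf(doc_count, doc_freq) * bm25_tf_norm(freq, field_length, avg_field_length)

# docFreq per term, plus shard-level and field-level values from the explain output.
doc_freqs = {"星": 159, "巴": 123, "克": 67}
scores = {
    term: bm25_term_score(11920, doc_freq, 1.0, 7.111111, 9.224329)
    for term, doc_freq in doc_freqs.items()
}
total = sum(scores.values())
for term, score in scores.items():
    print(term, score)
print("total", total)
```

All three terms share the same tfNorm here because they occur once each in the same field, so the differences between their scores come entirely from idf.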
Reposted from: https://my.oschina.net/u/3455048/blog/1606033