Java TF value search: implementing TF-IDF in Java for search engine optimization
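Before the implementation, a few points need to be stated up front:
We use Redis to persist the data, storing two kinds of map:
The key is a term, and the value is the set of URLs that contain that term.
The key is a URL, and the value is a map recording each term and the number of times it occurs in the article.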
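The two structures described above can be pictured with plain Java collections. This is only a minimal in-memory sketch: the class name `IndexLayoutSketch` and its `add` method are hypothetical, and the article itself keeps this data in Redis rather than in a `HashMap`.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in for the two Redis structures.
class IndexLayoutSketch {
    // term -> URLs that contain the term
    final Map<String, Set<String>> termToUrls = new HashMap<>();
    // url -> (term -> number of occurrences in that article)
    final Map<String, Map<String, Integer>> urlToTermCounts = new HashMap<>();

    // Record one occurrence of a term at a URL in both structures.
    void add(String url, String term) {
        termToUrls.computeIfAbsent(term, k -> new HashSet<>()).add(url);
        urlToTermCounts.computeIfAbsent(url, k -> new HashMap<>())
                       .merge(term, 1, Integer::sum);
    }

    public static void main(String[] args) {
        IndexLayoutSketch index = new IndexLayoutSketch();
        for (String word : "java is a language and java runs on the jvm".split(" ")) {
            index.add("url-a", word);
        }
        index.add("url-b", "java");
        System.out.println(index.termToUrls.get("java").size());            // number of URLs containing "java"
        System.out.println(index.urlToTermCounts.get("url-a").get("java")); // occurrences of "java" at url-a
    }
}
```

The first map answers "which articles mention this term", and the second answers "how often does each term occur in this article"; the TF and IDF steps each need one of them.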
The overall calculation proceeds as follows:
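Reading the formulas off the code in the sections below, the calculation can be summarized as follows. The symbols $n_{t,d}$, $N$ and $\mathrm{df}(t)$ are notation introduced here for illustration and do not come from the original article.

```latex
\mathrm{TF}(t, d) = \frac{n_{t,d}}{\sum_{k} n_{k,d}}, \qquad
\mathrm{IDF}(t) = \log_{10} \frac{N}{\mathrm{df}(t)}, \qquad
\mathrm{TFIDF}(t, d) = \mathrm{TF}(t, d) \cdot \mathrm{IDF}(t)
```

Here $n_{t,d}$ is the number of times term $t$ occurs at URL $d$, $N$ is the total number of indexed URLs, and $\mathrm{df}(t)$ is the number of indexed URLs that contain $t$.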
1. Computing the term frequency (TF)
Given a URL, we look up how many times the search term occurs at that URL and compute the TF from it.
Get the total number of words at a URL
/**
 * @Author Ragty
 * @Description Get the total number of words at a URL
 * @Date 11:18 2019/6/4
 **/
public Integer getWordCount(String url) {
    String redisKey = urlSetKey(url);
    // hgetAll returns the whole hash: term -> occurrence count (as a string)
    Map<String, String> map = jedis.hgetAll(redisKey);
    Integer count = 0;
    for (Map.Entry<String, String> entry : map.entrySet()) {
        count += Integer.valueOf(entry.getValue());
    }
    return count;
}
Return the number of times the search term occurs at a URL
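/**
 * @Author Ragty
 * @Description Return the number of times the search term occurs at a URL
 * @Date 22:12 2019/5/14
 **/
public Integer getTermCount(String url, String term) {
    String redisKey = urlSetKey(url);
    String count = jedis.hget(redisKey, term);
    // hget returns null when the term is not in the hash, so treat that as zero
    return count == null ? 0 : Integer.valueOf(count);
}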
Get the term frequency of the search term
/**
 * @Author Ragty
 * @Description Get the term frequency (TF) of the search term
 * @Date 11:25 2019/6/4
 **/
public BigDecimal getTermFrequency(String url, String term) {
    if (!isIndexed(url)) {
        System.out.println("URL is not indexed.");
        return null;
    }
    Integer documentCount = getWordCount(url);
    Integer termCount = getTermCount(url, term);
    // RoundingMode (java.math.RoundingMode) replaces the deprecated BigDecimal.ROUND_HALF_UP constant
    return documentCount == 0
            ? BigDecimal.ZERO
            : new BigDecimal(termCount).divide(new BigDecimal(documentCount), 6, RoundingMode.HALF_UP);
}
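To see the arithmetic in isolation, here is a small self-contained sketch of the same term-frequency calculation on an in-memory map instead of Redis. The class `TermFrequencySketch` and its sample data are made up for illustration.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-alone version of the TF step; the map plays the role of one URL's hash.
class TermFrequencySketch {
    static BigDecimal termFrequency(Map<String, Integer> termCounts, String term) {
        int total = 0;
        for (int c : termCounts.values()) {
            total += c;                                   // total number of words in the article
        }
        if (total == 0) {
            return BigDecimal.ZERO;                       // empty article: avoid dividing by zero
        }
        int termCount = termCounts.getOrDefault(term, 0); // a missing term counts as 0
        return new BigDecimal(termCount)
                .divide(new BigDecimal(total), 6, RoundingMode.HALF_UP);
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("java", 3);
        counts.put("language", 1);
        counts.put("jvm", 1);
        System.out.println(termFrequency(counts, "java"));   // 3 / 5, prints 0.600000
        System.out.println(termFrequency(counts, "python")); // term absent, prints 0.000000
    }
}
```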
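2. Computing the inverse document frequency
Computing the inverse document frequency requires the total number of documents and the number of articles that contain the search term.
Get the total number of articles indexed in Redis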
/**
 * @Author Ragty
 * @Description Get the total number of articles indexed in Redis
 * @Date 19:46 2019/6/5
 **/
public Integer getUrlCount() {
    return urlSetKeys().size();
}
Get the number of articles that contain the search term
/**
 * @Author Ragty
 * @Description Get the number of articles that contain the search term
 * @Date 22:42 2019/6/5
 **/
public Integer getUrlTermCount(String term) {
    return getUrls(term).size();
}
Compute the inverse document frequency, IDF (Inverse Document Frequency)
/**
 * @Author Ragty
 * @Description Compute the inverse document frequency, IDF (Inverse Document Frequency)
 * @Date 23:32 2019/6/5
 **/
public BigDecimal getInverseDocumentFrequency(String term) {
    Integer totalUrl = getUrlCount();
    Integer urlTermCount = getUrlTermCount(term);
    // Guard against division by zero when no indexed article contains the term
    if (urlTermCount == 0) {
        return BigDecimal.ZERO;
    }
    double ratio = new BigDecimal(totalUrl)
            .divide(new BigDecimal(urlTermCount), 6, RoundingMode.HALF_UP)
            .doubleValue();
    // idf = log10(total articles / articles containing the term)
    return BigDecimal.valueOf(Math.log10(ratio));
}
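A few concrete numbers make the behaviour of the logarithm easier to see. The sketch below is a hypothetical helper (`IdfSketch.idf` is not part of the article's `JedisIndex`) that applies the same base-10 formula to plain integers; returning 0 for a term that appears nowhere is an assumption made here to keep the function total.

```java
// Hypothetical stand-alone version of the IDF step, using plain doubles.
class IdfSketch {
    static double idf(int totalDocs, int docsWithTerm) {
        if (docsWithTerm <= 0) {
            return 0.0; // assumption: a term found in no document scores 0 instead of dividing by zero
        }
        return Math.log10((double) totalDocs / docsWithTerm);
    }

    public static void main(String[] args) {
        System.out.println(idf(100, 10));  // term in 10 of 100 documents: log10(10) = 1.0
        System.out.println(idf(100, 100)); // term in every document: log10(1) = 0.0
        System.out.println(idf(1000, 1));  // rare term: log10(1000) = 3.0
    }
}
```

The rarer the term, the larger the IDF; a term that occurs in every indexed article gets an IDF of zero and therefore contributes nothing to the ranking.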
3. Getting the TF-IDF
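/**
 * @Author Ragty
 * @Description Get the TF-IDF value
 * @Date 23:34 2019/6/5
 **/
public BigDecimal getTFIDF(String url, String term) {
    BigDecimal tf = getTermFrequency(url, term);
    // getTermFrequency returns null for a URL that has not been indexed;
    // treat that case as zero relevance instead of failing with a NullPointerException
    if (tf == null) {
        return BigDecimal.ZERO;
    }
    BigDecimal idf = getInverseDocumentFrequency(term);
    return tf.multiply(idf);
}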
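The three steps can also be tried end to end without a Redis server. The following is a rough sketch under stated assumptions: `TfIdfSketch`, its `count` helper, and the three toy documents are invented for illustration, and it uses `double` rather than the `BigDecimal` arithmetic of the methods above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory version of the whole TF-IDF calculation; no Redis involved.
class TfIdfSketch {
    // docs maps url -> (term -> occurrence count)
    static double tfIdf(Map<String, Map<String, Integer>> docs, String url, String term) {
        Map<String, Integer> counts = docs.get(url);
        if (counts == null) {
            return 0.0;                                  // URL not indexed
        }
        int total = 0;
        for (int c : counts.values()) {
            total += c;                                  // words in this article
        }
        int docsWithTerm = 0;
        for (Map<String, Integer> m : docs.values()) {
            if (m.containsKey(term)) {
                docsWithTerm++;                          // articles containing the term
            }
        }
        if (total == 0 || docsWithTerm == 0) {
            return 0.0;                                  // nothing to divide by
        }
        double tf = (double) counts.getOrDefault(term, 0) / total;
        double idf = Math.log10((double) docs.size() / docsWithTerm);
        return tf * idf;
    }

    // Split a whitespace-separated text into term counts.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.split(" ")) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> docs = new HashMap<>();
        docs.put("url-a", count("java java code code"));
        docs.put("url-b", count("java tea tea tea"));
        docs.put("url-c", count("tea"));
        for (String url : docs.keySet()) {
            System.out.println(url + " " + tfIdf(docs, url, "java"));
        }
    }
}
```

In this toy corpus "java" makes up half of url-a but only a quarter of url-b, so url-a scores higher, while url-c does not contain the term and scores zero.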
4. Testing with data
Here I run a simple test on some data I crawled myself (some results may be inaccurate because of the data set).
Methods of the test class:
/**
 * @Author Ragty
 * @Description Get the relevance under TF-IDF
 * @Date 8:47 2019/6/6
 **/
private static BigDecimal getRelevance(String url, String term, JedisIndex index) {
    return index.getTFIDF(url, term);
}

/**
 * @Author Ragty
 * @Description Run the search
 * @Date 23:49 2019/5/30
 **/
public static WikiSearch search(String term, JedisIndex index) {
    Map<String, BigDecimal> map = new HashMap<String, BigDecimal>();
    Set<String> urls = index.getUrls(term);
    for (String url : urls) {
        BigDecimal tfidf = getRelevance(url, term, index).setScale(6, RoundingMode.HALF_UP);
        map.put(url, tfidf);
    }
    return new WikiSearch(map);
}

/**
 * @Author Ragty
 * @Description Print the results in order of search-term relevance
 * @Date 13:46 2019/5/30
 **/
private void print() {
    List<Entry<String, BigDecimal>> entries = sort();
    for (Entry<String, BigDecimal> entry : entries) {
        System.out.println(entry.getKey() + " " + entry.getValue());
    }
}

/**
 * @Author Ragty
 * @Description Sort the data by relevance
 * @Date 13:54 2019/5/30
 **/
public List<Entry<String, BigDecimal>> sort() {
    List<Entry<String, BigDecimal>> entries = new LinkedList<Entry<String, BigDecimal>>(map.entrySet());
    Comparator<Entry<String, BigDecimal>> comparator = new Comparator<Entry<String, BigDecimal>>() {
        @Override
        public int compare(Entry<String, BigDecimal> o1, Entry<String, BigDecimal> o2) {
            // Descending order: the highest TF-IDF comes first
            return o2.getValue().compareTo(o1.getValue());
        }
    };
    Collections.sort(entries, comparator);
    return entries;
}
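The descending sort can also be written more compactly with the stream API. This is an alternative sketch rather than the article's code: `RankingSketch.rank` is a hypothetical helper, and it assumes Java 8 or later.

```java
import java.math.BigDecimal;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper: order url -> score entries from most to least relevant.
class RankingSketch {
    static List<Map.Entry<String, BigDecimal>> rank(Map<String, BigDecimal> scores) {
        return scores.entrySet().stream()
                // comparingByValue sorts ascending; reversed() puts the highest score first
                .sorted(Map.Entry.<String, BigDecimal>comparingByValue().reversed())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, BigDecimal> scores = new HashMap<>();
        scores.put("url-a", new BigDecimal("0.002212"));
        scores.put("url-b", new BigDecimal("0.029956"));
        scores.put("url-c", new BigDecimal("0.019986"));
        for (Map.Entry<String, BigDecimal> entry : rank(scores)) {
            System.out.println(entry.getKey() + " " + entry.getValue());
        }
    }
}
```

With the sample scores above, the entries come out in the order url-b, url-c, url-a.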
Test code:
public static void main(String[] args) throws IOException {
    Jedis jedis = JedisMaker.make();
    JedisIndex index = new JedisIndex(jedis);

    // search for the first term
    String term1 = "java";
    System.out.println("Query: " + term1);
    WikiSearch search1 = search(term1, index);
    search1.print();

    // search for the second term
    String term2 = "programming";
    System.out.println("Query: " + term2);
    WikiSearch search2 = search(term2, index);
    search2.print();
}
Test results:
Query: java
https://baike.baidu.com/item/LiveScript 0.029956
https://baike.baidu.com/item/Java/85979 0.019986
https://baike.baidu.com/item/Brendan%20Eich 0.017188
https://baike.baidu.com/item/%E7%94%B2%E9%AA%A8%E6%96%87/471435 0.013163
https://baike.baidu.com/item/Sun/69463 0.005504
https://baike.baidu.com/item/Rhino 0.004401
https://baike.baidu.com/item/%E6%8E%92%E7%89%88%E5%BC%95%E6%93%8E 0.003452
https://baike.baidu.com/item/javascript 0.002212
https://baike.baidu.com/item/js/10687961 0.002212
https://baike.baidu.com/item/%E6%BA%90%E7%A0%81 0.002205
https://baike.baidu.com/item/%E6%BA%90%E7%A0%81/344212 0.002205
https://baike.baidu.com/item/%E8%84%9A%E6%9C%AC%E8%AF%AD%E8%A8%80 0.001989
https://baike.baidu.com/item/SQL 0.001779
https://baike.baidu.com/item/PHP/9337 0.001503
https://baike.baidu.com/item/iOS/45705 0.001499
https://baike.baidu.com/item/Netscape 0.000863
https://baike.baidu.com/item/%E6%93%8D%E4%BD%9C%E7%B3%BB%E7%BB%9F 0.000835
https://baike.baidu.com/item/Mac%20OS%20X 0.000521
https://baike.baidu.com/item/C%E8%AF%AD%E8%A8%80 0.000318
Query: programming
https://baike.baidu.com/item/C%E8%AF%AD%E8%A8%80 0.004854
https://baike.baidu.com/item/%E8%84%9A%E6%9C%AC%E8%AF%AD%E8%A8%80 0.002529
---------------------
Original article: https://www.cnblogs.com/ly570/p/11106215.html