Building a Synonym Analyzer with Lucene

Published: 2023/11/27

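Since Lucene 4.0, an analyzer is built around TokenStreamComponents rather than by returning a bare TokenStream; a TokenStreamComponents object bundles the tokenizer together with its filters. Note that the code in this article overrides tokenStream(String, Reader) directly, which is the pre-4.0 style.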

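In more complex Lucene search scenarios, simply downloading an analyzer from the web and dropping it into the project is not enough. How do you judge whether a Chinese analyzer is good or bad? Generally there are two criteria: the dictionary, and the search efficiency, which comes down to the segmentation algorithm.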

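In Lucene's inverted index, every token carries a PositionIncrementAttribute, which records its distance from the previous token. If a token's position increment is 0, it occupies the same position as the token before it, and the two are treated as synonyms. For example, if I define 美国 (the United States) and 中国 (China) to sit at the same position in the index, that is, with an increment of 0 between them, then a search for 美国 also returns documents containing 中国.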

That is the principle behind synonym search.
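To make the position arithmetic concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene code: the class name PositionDemo and its Token type are made up for illustration, and the ASCII terms usa and china stand in for the two words in the example. It assumes that a token's absolute position is the running sum of the increments, starting from -1.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: it mimics how position increments add up,
// and does not use any Lucene classes.
class PositionDemo {

    // A token as the indexer sees it: term text plus position increment.
    static final class Token {
        final String term;
        final int positionIncrement;

        public Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    // Absolute position = running sum of increments, starting from -1
    // so that a first token with increment 1 lands on position 0.
    public static Map<Integer, List<String>> positions(List<Token> tokens) {
        Map<Integer, List<String>> byPosition = new LinkedHashMap<>();
        int position = -1;
        for (Token t : tokens) {
            position += t.positionIncrement;
            byPosition.computeIfAbsent(position, k -> new ArrayList<>()).add(t.term);
        }
        return byPosition;
    }

    public static void main(String[] args) {
        List<Token> tokens = new ArrayList<>();
        tokens.add(new Token("i", 1));
        tokens.add(new Token("visit", 1));
        tokens.add(new Token("usa", 1));
        tokens.add(new Token("china", 0)); // injected synonym: increment 0
        System.out.println(positions(tokens));
    }
}
```

In this model usa and china end up in the same position bucket, which is why a query for either term can match at that position.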

The code below uses the mmseg Tokenizer to segment the text first, and then adds the synonyms:

First, define a custom analyzer:

package hhc;

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;

import com.chenlb.mmseg4j.Dictionary;
import com.chenlb.mmseg4j.MaxWordSeg;
import com.chenlb.mmseg4j.analysis.MMSegTokenizer;

/**
 * A custom analyzer. When writing one, it usually helps to follow
 * how an existing analyzer is written.
 *
 * @author hhc
 */
public class MySameAnalyzer extends Analyzer {

    // Source of synonyms
    private SamewordContext samewordContext = null;

    public MySameAnalyzer(SamewordContext samewordContext) {
        this.samewordContext = samewordContext;
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Load the mmseg4j dictionary
        Dictionary dic = Dictionary.getInstance();
        // Segment with mmseg4j first, then let the filter add synonyms
        return new MySameTokenFilter(
                new MMSegTokenizer(new MaxWordSeg(dic), reader), samewordContext);
    }
}
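The analyzer takes a SamewordContext, which the article never shows. Judging only from the call samewordContext.getSameword(word) in the filter, it appears to return a String[] of synonyms, or null when there are none. The following is a hypothetical in-memory version under that assumption; the interface declaration is reconstructed from the call site, and SimpleSamewordContext and addGroup are invented names, with ASCII stand-in words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Assumed shape of the interface: the article only shows the call
// getSameword(word) returning String[] (null if there are no synonyms).
interface SamewordContext {
    String[] getSameword(String word);
}

// Hypothetical in-memory implementation backed by a map.
class SimpleSamewordContext implements SamewordContext {

    private final Map<String, String[]> synonyms = new HashMap<>();

    // Registers a group of words as synonyms of one another.
    public void addGroup(String... words) {
        for (String word : words) {
            List<String> others = new ArrayList<>();
            for (String other : words) {
                if (!other.equals(word)) {
                    others.add(other);
                }
            }
            if (!others.isEmpty()) {
                synonyms.put(word, others.toArray(new String[0]));
            }
        }
    }

    @Override
    public String[] getSameword(String word) {
        return synonyms.get(word); // null when the word has no synonyms
    }

    public static void main(String[] args) {
        SimpleSamewordContext ctx = new SimpleSamewordContext();
        ctx.addGroup("usa", "china");
        System.out.println(Arrays.toString(ctx.getSameword("usa")));
    }
}
```

A real project would more likely load the groups from a file or a database; the map keeps the sketch short. Registering the same word in a second group overwrites its earlier entry in this version.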

Then add the synonyms to the TokenStream with a filter:

package hhc;

import java.io.IOException;
import java.util.Stack;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;

public class MySameTokenFilter extends TokenFilter {

    // Term text of the current token
    private CharTermAttribute cta = null;
    // Position information of the current token
    private PositionIncrementAttribute pia = null;
    // Captured state of the last token that has synonyms
    private AttributeSource.State current;
    // Synonyms still waiting to be emitted
    private Stack<String> sames = null;
    private SamewordContext samewordContext = null;

    protected MySameTokenFilter(TokenStream input, SamewordContext samewordContext) {
        super(input);
        cta = addAttribute(CharTermAttribute.class);
        pia = addAttribute(PositionIncrementAttribute.class);
        sames = new Stack<String>();
        this.samewordContext = samewordContext;
    }

    @Override
    public boolean incrementToken() throws IOException {
        // No try/catch here: swallowing an exception and returning true
        // would hand a bogus token to the caller.
        if (!sames.isEmpty()) {
            // pop() removes the synonym on top of the stack and returns it
            String str = sames.pop();
            // Restore the state of the original token
            restoreState(current);
            cta.setEmpty();
            cta.append(str);
            // Increment 0 puts the synonym at the same position as the original
            pia.setPositionIncrement(0);
            return true;
        }
        // The upstream has no more tokens
        if (!input.incrementToken()) {
            return false;
        }
        // There is a token: queue its synonyms, if it has any
        if (existAddSameword(cta.toString())) {
            // Save the current state so each synonym can restore it
            current = captureState();
        }
        return true;
    }

    /**
     * Checks whether the token has synonyms and, if so, pushes them onto the stack.
     *
     * @param word the term text of the current token
     * @return true if synonyms were found
     */
    private boolean existAddSameword(String word) {
        String[] words = samewordContext.getSameword(word);
        if (words != null) {
            for (String s : words) {
                sames.push(s);
            }
            return true;
        }
        return false;
    }
}
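The control flow of incrementToken() can be followed without Lucene in a small plain-Java simulation. This is a sketch under stated assumptions, not the filter itself: SynonymExpansionDemo and expand are hypothetical names, every original token is assumed to have increment 1, and the words are ASCII stand-ins. It emits each term as "term/increment" in the order the filter would produce it.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation of the filter's control flow; no Lucene classes involved.
class SynonymExpansionDemo {

    // Returns "term/increment" strings in the order the filter would emit them.
    public static List<String> expand(List<String> terms, Map<String, String[]> synonyms) {
        List<String> out = new ArrayList<>();
        Deque<String> pending = new ArrayDeque<>(); // plays the role of the sames stack
        int next = 0;
        while (true) {
            if (!pending.isEmpty()) {
                // A synonym is waiting: emit it at the same position (increment 0).
                out.add(pending.pop() + "/0");
                continue;
            }
            if (next >= terms.size()) {
                return out; // upstream exhausted
            }
            String term = terms.get(next++);
            out.add(term + "/1"); // assumption: original tokens advance by 1
            String[] same = synonyms.get(term);
            if (same != null) {
                for (String s : same) {
                    pending.push(s);
                }
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String[]> synonyms = new HashMap<>();
        synonyms.put("usa", new String[] {"china", "america"});
        System.out.println(expand(Arrays.asList("visit", "usa"), synonyms));
    }
}
```

Because the pending synonyms sit on a stack, they come out in reverse order of insertion. All of them share the position of the original token, so the order does not affect matching.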

Reposted from: https://www.cnblogs.com/zsychanpin/p/6789050.html
