Lucene from Beginner to Advanced (version 6.6.0)
Lucene Study Notes
Preface
This guide is based on the latest Lucene 6.6.0. Many methods from older tutorials are deprecated and no longer apply, so this article tries to cover the basics in the simplest way possible.
The examples in Chapter 2 are the official examples. They are well written and thorough, but they contain no comments at all; every comment in them was added by me, so some interpretations may be inaccurate. Please bear with me and feel free to report anything that looks wrong.
Chapter 3 onward contains examples I wrote myself. They are simple and easy to follow, so I recommend starting directly from Chapter 3.
1 Resources
1.1 Getting-Started Documentation
Official documentation: http://lucene.apache.org/core/6_6_0/index.html
You can follow the official examples from this documentation.
1.2 Development Documentation
Lucene core API documentation: http://lucene.apache.org/core/6_6_0/core/index.html
1.3 Importing Maven Dependencies
Import the jar packages required to use Lucene.
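A minimal pom.xml fragment that covers the examples in this guide might look like the following; the exact module selection is my assumption, so keep only what you need:

```xml
<!-- Lucene 6.6.0 modules used by the examples in this guide -->
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>6.6.0</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-common</artifactId>
    <version>6.6.0</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>6.6.0</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-facet</artifactId>
    <version>6.6.0</version>
</dependency>
```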
1.4 Luke
Luke is a tool dedicated to inspecting Lucene indexes.
GitHub repository: https://github.com/DmitryKey/luke
Installation:
(Alternatively, for older versions of Luke you can directly download the jar file from the releases page and run it with the command `java -jar luke-with-deps.jar`.)
2 Getting Started
2.1 IndexFiles
The official example IndexFiles.java creates a Lucene index.
To run this class you must pass arguments to the main method. There are three ways to supply them; here I show just one: in IDEA, add program arguments such as `-docs <path to your files>` to the run configuration.
2.1.1 The contents of Test.txt are as follows:
```
numberA numberB number 范德萨 jklj test 你好 不错啊
```
2.1.2 Code
```java
package com.bingo.backstage;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.OpenOption;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.Date;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.store.FSDirectory;

/**
 * Created by MoSon on 2017/6/30.
 */
public class IndexFiles {
    private IndexFiles() {}

    public static void main(String[] args) {
        // When running, supply program arguments such as: -docs <path to your files>
        String usage = "java com.bingo.backstage.IndexFiles [-index INDEX_PATH] [-docs DOCS_PATH] [-update]\n\n"
                + "This indexes the documents in DOCS_PATH, creating a Lucene index in INDEX_PATH that can be searched with SearchFiles";
        String indexPath = "index";
        String docsPath = null;
        boolean create = true;
        for (int docDir = 0; docDir < args.length; ++docDir) {
            if ("-index".equals(args[docDir])) {
                indexPath = args[docDir + 1];
                ++docDir;
            } else if ("-docs".equals(args[docDir])) {
                docsPath = args[docDir + 1];
                ++docDir;
            } else if ("-update".equals(args[docDir])) {
                create = false;
            }
        }
        if (docsPath == null) {
            System.err.println("Usage: " + usage);
            System.exit(1);
        }
        Path var13 = Paths.get(docsPath, new String[0]);
        if (!Files.isReadable(var13)) {
            System.out.println("Document directory '" + var13.toAbsolutePath() + "' does not exist or is not readable, please check the path");
            System.exit(1);
        }
        Date start = new Date();
        try {
            System.out.println("Indexing to directory '" + indexPath + "'...");
            // Open the directory that will hold the index
            FSDirectory e = FSDirectory.open(Paths.get(indexPath, new String[0]));
            StandardAnalyzer analyzer = new StandardAnalyzer();
            IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
            if (create) {
                iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
            } else {
                iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
            }
            IndexWriter writer = new IndexWriter(e, iwc);
            indexDocs(writer, var13);
            writer.close();
            Date end = new Date();
            System.out.println(end.getTime() - start.getTime() + " total milliseconds");
        } catch (IOException var12) {
            System.out.println(" caught a " + var12.getClass() + "\n with message: " + var12.getMessage());
        }
    }

    static void indexDocs(final IndexWriter writer, Path path) throws IOException {
        if (Files.isDirectory(path, new LinkOption[0])) {
            Files.walkFileTree(path, new SimpleFileVisitor<Path>() {
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                    try {
                        IndexFiles.indexDoc(writer, file, attrs.lastModifiedTime().toMillis());
                    } catch (IOException ignore) {
                        // skip files that cannot be read
                    }
                    return FileVisitResult.CONTINUE;
                }
            });
        } else {
            indexDoc(writer, path, Files.getLastModifiedTime(path, new LinkOption[0]).toMillis());
        }
    }

    static void indexDoc(IndexWriter writer, Path file, long lastModified) throws IOException {
        InputStream stream = Files.newInputStream(file, new OpenOption[0]);
        Throwable var5 = null;
        try {
            Document doc = new Document();
            StringField pathField = new StringField("path", file.toString(), Field.Store.YES);
            doc.add(pathField);
            doc.add(new LongPoint("modified", new long[]{lastModified}));
            doc.add(new TextField("contents", new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8))));
            if (writer.getConfig().getOpenMode() == IndexWriterConfig.OpenMode.CREATE) {
                System.out.println("adding " + file);
                writer.addDocument(doc);
            } else {
                System.out.println("updating " + file);
                writer.updateDocument(new Term("path", file.toString()), doc);
            }
        } catch (Throwable var15) {
            var5 = var15;
            try {
                throw var15;
            } catch (Throwable throwable) {
                throwable.printStackTrace();
            }
        } finally {
            if (stream != null) {
                if (var5 != null) {
                    try {
                        stream.close();
                    } catch (Throwable var14) {
                        var5.addSuppressed(var14);
                    }
                } else {
                    stream.close();
                }
            }
        }
    }
}
```
2.1.3 Running the Example
A folder holding the index will be generated automatically under the project root directory.
Viewing the result with Luke:
Notice that the Chinese text was not added to the index.
2.1.4 Analysis
The IndexFiles class creates a Lucene index.
The main() method parses the command-line arguments and then, in preparation for instantiating an IndexWriter, opens a Directory and instantiates a StandardAnalyzer and an IndexWriterConfig.
The value of the -index command-line parameter is the name of the filesystem directory where all index information should be stored. If IndexFiles is invoked with a relative path for -index, or if -index is omitted so that the default relative path "index" is used, the index path will be created as a subdirectory of the current working directory (if it does not already exist). On some platforms the index path may instead be created in a different directory, such as the user's home directory.
The value of the -docs command-line parameter is the location of the directory containing the files to be indexed.
The -update command-line parameter tells IndexFiles not to delete the index if it already exists. When -update is not given, IndexFiles wipes the slate clean before indexing any documents.
Lucene's IndexWriter uses a Directory to store the information in the index. Besides the FSDirectory implementation we use here, there are several other Directory subclasses that can write to RAM, to a database, and so on.
A Lucene Analyzer is a processing pipeline that breaks text into indexed tokens, also called terms, and optionally performs other operations on those tokens, such as downcasing, synonym insertion, and filtering out unwanted tokens. The Analyzer we use is StandardAnalyzer, which applies the Word Break rules from the Unicode Text Segmentation algorithm specified in Unicode Standard Annex #29, converts tokens to lowercase, and then filters out stopwords. Stopwords are common language words such as articles (a, an, the, and so on) and other tokens that may have little search value. Note that every language has different rules, and you should use the appropriate analyzer for each language you index.
The IndexWriterConfig instance holds all the configuration for the IndexWriter. For example, we set the OpenMode based on the value of the -update command-line parameter.
Looking further down in the file, after the IndexWriter is instantiated you will see the indexDocs() code. This recursive function crawls the directories and creates Document objects. A Document is simply a data object representing the text content of a file together with its creation time and location. These instances are added to the IndexWriter. If the -update command-line parameter is given, the IndexWriterConfig OpenMode is set to OpenMode.CREATE_OR_APPEND, and rather than simply adding documents to the index, the IndexWriter updates them: it tries to find an already-indexed document with the same identifier (in our case the file path serves as the identifier); if one exists, it deletes it from the index and then adds the new document.
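To make the add-versus-update decision concrete, here is a minimal sketch of what the demo does for every file; the class and method names are mine, but the field names and calls follow the listing above:

```java
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;

import java.io.IOException;
import java.io.Reader;
import java.nio.file.Path;

// Minimal sketch of the add-or-update decision described above.
// The "path" field acts as the document identifier, exactly as in the IndexFiles demo.
class AddOrUpdate {
    static void indexOne(IndexWriter writer, Path file, Reader contents) throws IOException {
        Document doc = new Document();
        doc.add(new StringField("path", file.toString(), Field.Store.YES));
        doc.add(new TextField("contents", contents));
        if (writer.getConfig().getOpenMode() == IndexWriterConfig.OpenMode.CREATE) {
            writer.addDocument(doc); // a fresh index: simply append the document
        } else {
            // CREATE_OR_APPEND: replace any previously indexed document with the same path
            writer.updateDocument(new Term("path", file.toString()), doc);
        }
    }
}
```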
2.2 SearchFiles
Searching the indexed files.
2.2.1 Code
```java
package com.bingo.backstage;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Date;

/**
 * Created by MoSon on 2017/6/30.
 */
public class SearchFiles {
    private SearchFiles() {}

    public static void main(String[] args) throws Exception {
        String usage = "Usage:\tjava com.bingo.backstage.SearchFiles [-index dir] [-field f] [-repeat n] [-queries file] [-query string] [-raw] [-paging hitsPerPage]\n\nSee http://lucene.apache.org/core/4_1_0/demo/ for details.";
        if (args.length > 0 && ("-h".equals(args[0]) || "-help".equals(args[0]))) {
            System.out.println(usage);
            System.exit(0);
        }
        String index = "index";
        String field = "contents";
        String queries = null;
        int repeat = 0;
        boolean raw = false;
        String queryString = null;
        int hitsPerPage = 10;
        for (int reader = 0; reader < args.length; ++reader) {
            if ("-index".equals(args[reader])) {
                index = args[reader + 1];
                ++reader;
            } else if ("-field".equals(args[reader])) {
                field = args[reader + 1];
                ++reader;
            } else if ("-queries".equals(args[reader])) {
                queries = args[reader + 1];
                ++reader;
            } else if ("-query".equals(args[reader])) {
                queryString = args[reader + 1];
                ++reader;
            } else if ("-repeat".equals(args[reader])) {
                repeat = Integer.parseInt(args[reader + 1]);
                ++reader;
            } else if ("-raw".equals(args[reader])) {
                raw = true;
            } else if ("-paging".equals(args[reader])) {
                hitsPerPage = Integer.parseInt(args[reader + 1]);
                if (hitsPerPage <= 0) {
                    System.err.println("There must be at least 1 hit per page.");
                    System.exit(1);
                }
                ++reader;
            }
        }
        // Open the index directory
        DirectoryReader var18 = DirectoryReader.open(FSDirectory.open(Paths.get(index, new String[0])));
        IndexSearcher searcher = new IndexSearcher(var18);
        StandardAnalyzer analyzer = new StandardAnalyzer();
        BufferedReader in = null;
        if (queries != null) {
            in = Files.newBufferedReader(Paths.get(queries, new String[0]), StandardCharsets.UTF_8);
        } else {
            in = new BufferedReader(new InputStreamReader(System.in, StandardCharsets.UTF_8));
        }
        QueryParser parser = new QueryParser(field, analyzer);
        do {
            if (queries == null && queryString == null) {
                System.out.println("Enter query: ");
            }
            String line = queryString != null ? queryString : in.readLine();
            if (line == null || line.length() == -1) {
                break;
            }
            line = line.trim();
            if (line.length() == 0) {
                break;
            }
            Query query = parser.parse(line);
            System.out.println("Searching for: " + query.toString(field));
            if (repeat > 0) {
                Date start = new Date();
                for (int end = 0; end < repeat; ++end) {
                    searcher.search(query, 100);
                }
                Date var19 = new Date();
                System.out.println("Time: " + (var19.getTime() - start.getTime()) + "ms");
            }
            doPagingSearch(in, searcher, query, hitsPerPage, raw, queries == null && queryString == null);
        } while (queryString == null);
        var18.close();
    }

    public static void doPagingSearch(BufferedReader in, IndexSearcher searcher, Query query, int hitsPerPage, boolean raw, boolean interactive) throws IOException {
        TopDocs results = searcher.search(query, 5 * hitsPerPage);
        ScoreDoc[] hits = results.scoreDocs;
        int numTotalHits = results.totalHits;
        System.out.println(numTotalHits + " total matching documents");
        int start = 0;
        int end = Math.min(numTotalHits, hitsPerPage);
        while (true) {
            if (end > hits.length) {
                System.out.println("Only results 1 - " + hits.length + " of " + numTotalHits + " total matching documents collected.");
                System.out.println("Collect more (y/n) ?");
                String quit = in.readLine();
                if (quit.length() == 0 || quit.charAt(0) == 110) {
                    break;
                }
                hits = searcher.search(query, numTotalHits).scoreDocs;
            }
            end = Math.min(hits.length, start + hitsPerPage);
            for (int var15 = start; var15 < end; ++var15) {
                if (raw) {
                    System.out.println("doc=" + hits[var15].doc + " score=" + hits[var15].score);
                } else {
                    Document line = searcher.doc(hits[var15].doc);
                    String page = line.get("path");
                    if (page != null) {
                        System.out.println(var15 + 1 + ". " + page);
                        String title = line.get("title");
                        if (title != null) {
                            System.out.println("   Title: " + line.get("title"));
                        }
                    } else {
                        System.out.println(var15 + 1 + ". No path for this document");
                    }
                }
            }
            if (!interactive || end == 0) {
                break;
            }
            if (numTotalHits >= end) {
                boolean var16 = false;
                while (true) {
                    System.out.print("Press ");
                    if (start - hitsPerPage >= 0) {
                        System.out.print("(p)revious page, ");
                    }
                    if (start + hitsPerPage < numTotalHits) {
                        System.out.print("(n)ext page, ");
                    }
                    System.out.println("(q)uit or enter number to jump to a page.");
                    String var17 = in.readLine();
                    if (var17.length() == 0 || var17.charAt(0) == 113) {
                        var16 = true;
                        break;
                    }
                    if (var17.charAt(0) == 112) {
                        start = Math.max(0, start - hitsPerPage);
                        break;
                    }
                    if (var17.charAt(0) == 110) {
                        if (start + hitsPerPage < numTotalHits) {
                            start += hitsPerPage;
                        }
                        break;
                    }
                    int var18 = Integer.parseInt(var17);
                    if ((var18 - 1) * hitsPerPage < numTotalHits) {
                        start = (var18 - 1) * hitsPerPage;
                        break;
                    }
                    System.out.println("No such page");
                }
                if (var16) {
                    break;
                }
                end = Math.min(numTotalHits, start + hitsPerPage);
            }
        }
    }
}
```
2.2.2 Running Result
As you can see, the result matches what the Luke tool showed above; a query only returns a hit when the term is exactly right.
2.2.3 Analysis
This class mainly collaborates with an IndexSearcher, a StandardAnalyzer (the same one used in the IndexFiles class), and a QueryParser. The query parser is constructed with an analyzer used to interpret your query text in the same way the documents were interpreted: finding word boundaries, downcasing, and removing useless words like "a", "an" and "the". The Query object contains the result produced by the QueryParser, which is then passed to the searcher. Note that it is also possible to programmatically construct a rich Query object without using the query parser; the query parser simply decodes the Lucene query syntax into the corresponding Query object.
SearchFiles uses the IndexSearcher.search(query, n) method, which returns at most the top n hits. The results are printed in pages, sorted by score (that is, by relevance).
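As a small illustration of the two ways of obtaining a Query mentioned above, the sketch below parses a query string with QueryParser and also builds a single-term query programmatically; the field name "contents" and the sample terms follow the earlier demo, while the class name is mine:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// Sketch: the same kind of search expressed through the query parser and programmatically.
class QueryConstruction {
    public static void main(String[] args) throws Exception {
        // 1) Let QueryParser interpret the text with the same analyzer used at index time
        QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
        Query parsed = parser.parse("numberA test");
        System.out.println(parsed); // e.g. contents:numbera contents:test

        // 2) Build an equivalent single-term query by hand, with no parsing involved
        Query manual = new TermQuery(new Term("contents", "test"));
        System.out.println(manual);
    }
}
```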
2.3 SimpleSortedSetFacetsExample
A simpler example, easier to understand than the previous two demos.
This example shows simple faceted indexing and search using SortedSetDocValuesFacetField and SortedSetDocValuesFacetCounts.
The code below is commented; reading the comments alongside the code makes it easier to follow.
2.3.1 Code
```java
package com.bingo.backstage.facet;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.facet.DrillDownQuery;
import org.apache.lucene.facet.FacetResult;
import org.apache.lucene.facet.FacetsCollector;
import org.apache.lucene.facet.FacetsConfig;
import org.apache.lucene.facet.sortedset.DefaultSortedSetDocValuesReaderState;
import org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts;
import org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

/**
 * Created by MoSon on 2017/6/30.
 */
public class SimpleSortedSetFacetsExample {
    // RAMDirectory: a memory-resident Directory implementation.
    // By default its locking implementation is SingleInstanceLockFactory.
    private final Directory indexDir = new RAMDirectory();
    private final FacetsConfig config = new FacetsConfig();

    public SimpleSortedSetFacetsExample() {}

    private void index() throws IOException {
        // Initialize the index writer.
        // WhitespaceAnalyzer only splits on whitespace: it does not lowercase, does not
        // support Chinese, and performs no further normalization of the tokens.
        // OpenMode: CREATE overwrites an existing index; APPEND appends to it.
        // IndexWriter creates and maintains the index.
        IndexWriter indexWriter = new IndexWriter(this.indexDir,
                (new IndexWriterConfig(new WhitespaceAnalyzer())).setOpenMode(OpenMode.CREATE));
        // Build a document
        Document doc = new Document();
        // Create Field objects and add them to the document
        doc.add(new SortedSetDocValuesFacetField("Author", "Bob"));
        doc.add(new SortedSetDocValuesFacetField("Publish Year", "2010"));
        // Write it through the IndexWriter
        indexWriter.addDocument(this.config.build(doc));
        doc = new Document();
        doc.add(new SortedSetDocValuesFacetField("Author", "Lisa"));
        doc.add(new SortedSetDocValuesFacetField("Publish Year", "2010"));
        indexWriter.addDocument(this.config.build(doc));
        doc = new Document();
        doc.add(new SortedSetDocValuesFacetField("Author", "Lisa"));
        doc.add(new SortedSetDocValuesFacetField("Publish Year", "2012"));
        indexWriter.addDocument(this.config.build(doc));
        doc = new Document();
        doc.add(new SortedSetDocValuesFacetField("Author", "Susan"));
        doc.add(new SortedSetDocValuesFacetField("Publish Year", "2012"));
        indexWriter.addDocument(this.config.build(doc));
        doc = new Document();
        doc.add(new SortedSetDocValuesFacetField("Author", "Frank"));
        doc.add(new SortedSetDocValuesFacetField("Publish Year", "1999"));
        indexWriter.addDocument(this.config.build(doc));
        indexWriter.close();
    }

    // Search and compute facet counts over the documents
    private List<FacetResult> search() throws IOException {
        // Most of these objects simply wrap one another.
        // DirectoryReader is the CompositeReader implementation that reads an index from a Directory.
        DirectoryReader indexReader = DirectoryReader.open(this.indexDir);
        // Searching goes through an IndexReader.
        IndexSearcher searcher = new IndexSearcher(indexReader);
        DefaultSortedSetDocValuesReaderState state = new DefaultSortedSetDocValuesReaderState(indexReader);
        // Collect hits for subsequent faceting. Once a search has run and hits are collected,
        // a Facets subclass is instantiated to do the facet counting. The search utility method
        // performs an ordinary search while also collecting into the FacetsCollector.
        FacetsCollector fc = new FacetsCollector();
        // Utility method: search and collect all hits into the provided Collector.
        FacetsCollector.search(searcher, new MatchAllDocsQuery(), 10, fc);
        // Count facets across all of the collected hits.
        SortedSetDocValuesFacetCounts facets = new SortedSetDocValuesFacetCounts(state, fc);
        ArrayList results = new ArrayList();
        // getTopChildren: return the top child labels under the given path.
        results.add(facets.getTopChildren(10, "Author", new String[0]));
        results.add(facets.getTopChildren(10, "Publish Year", new String[0]));
        indexReader.close();
        return results;
    }

    private FacetResult drillDown() throws IOException {
        DirectoryReader indexReader = DirectoryReader.open(this.indexDir);
        IndexSearcher searcher = new IndexSearcher(indexReader);
        DefaultSortedSetDocValuesReaderState state = new DefaultSortedSetDocValuesReaderState(indexReader);
        DrillDownQuery q = new DrillDownQuery(this.config);
        // Add a drill-down condition
        q.add("Publish Year", new String[]{"2012"});
        FacetsCollector fc = new FacetsCollector();
        FacetsCollector.search(searcher, q, 10, fc);
        SortedSetDocValuesFacetCounts facets = new SortedSetDocValuesFacetCounts(state, fc);
        // Get the matching authors
        FacetResult result = facets.getTopChildren(10, "Author", new String[0]);
        indexReader.close();
        return result;
    }

    public List<FacetResult> runSearch() throws IOException {
        this.index();
        return this.search();
    }

    public FacetResult runDrillDown() throws IOException {
        this.index();
        return this.drillDown();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Facet counting example:");
        System.out.println("-----------------------");
        SimpleSortedSetFacetsExample example = new SimpleSortedSetFacetsExample();
        List results = example.runSearch();
        System.out.println("Author: " + results.get(0));
        System.out.println("Publish Year: " + results.get(1));
        System.out.println("\n");
        System.out.println("Facet drill-down example (Publish Year/2010):");
        System.out.println("---------------------------------------------");
        System.out.println("Author: " + example.runDrillDown());
    }
}
```
2.3.2 Running Result
3 A Simple Start
3.1 Creating an Index
This is an example I wrote myself; it is easy to understand.
It simply adds some content to the index.
3.1.1 Code
```java
package com.bingo.backstage;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LegacyLongField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Path;

import static org.apache.lucene.document.TextField.TYPE_STORED;

/**
 * Created by MoSon on 2017/6/30.
 */
public class CreateIndex {
    public static void main(String[] args) throws IOException {
        // Set up the IndexWriter.
        // "index" is a relative path under the current project.
        Path path = FileSystems.getDefault().getPath("", "index");
        Directory directory = FSDirectory.open(path);
        // Define the analyzer
        Analyzer analyzer = new StandardAnalyzer();
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer)
                .setOpenMode(IndexWriterConfig.OpenMode.CREATE);
        IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);
        // Define a document
        Document document = new Document();
        // Define the document's fields
        document.add(new LegacyLongField("id", 5499, Field.Store.YES));
        document.add(new Field("title", "小米6", TYPE_STORED));
        document.add(new Field("sellPoint", "骁龙835,6G内存,双摄!", TYPE_STORED));
        // Write the document
        indexWriter.addDocument(document);
        // Add another document
        document = new Document();
        document.add(new LegacyLongField("id", 8324, Field.Store.YES));
        document.add(new Field("title", "OnePlus5", TYPE_STORED));
        document.add(new Field("sellPoint", "8核,8G运行内存", TYPE_STORED));
        indexWriter.addDocument(document);
        // Commit
        indexWriter.commit();
        // Close
        indexWriter.close();
    }
}
```
3.1.2 Result
Below is the result viewed with Luke.
3.2 Searching by Term
Query for the content that matches a condition.
3.2.1 Code
```java
package com.bingo.backstage;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Path;

/**
 * Created by MoSon on 2017/7/1.
 */
public class Search {
    public static void main(String[] args) throws IOException {
        // Define the index directory
        Path path = FileSystems.getDefault().getPath("index");
        Directory directory = FSDirectory.open(path);
        // Define the index reader
        IndexReader indexReader = DirectoryReader.open(directory);
        // Define the searcher
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);
        // Search content: define the query term
        Term term = new Term("sellPoint", "存");
        Query query = new TermQuery(term);
        // Retrieve the top 10 hits
        TopDocs topDocs = indexSearcher.search(query, 10);
        // Print the hit count
        System.out.println("hits: " + topDocs.totalHits);
        // Take out the documents
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        // Iterate over the hits
        for (ScoreDoc scoreDoc : scoreDocs) {
            // Retrieve the document through indexSearcher.doc()
            Document doc = indexSearcher.doc(scoreDoc.doc);
            System.out.println("id: " + doc.get("id"));
            System.out.println("sellPoint: " + doc.get("sellPoint"));
        }
        // Close the index reader
        indexReader.close();
    }
}
```
3.2.2 Running Result
The matching results are retrieved and printed.
4 Core APIs for Building a Lucene Index
Directory: the directory the index lives in and is operated on
Analyzer: the analyzer/tokenizer
Document: a document object inside the index
IndexableField: a piece of data (a field) inside a document
IndexWriterConfig: the configuration used when building the index
IndexWriter: the object that builds and maintains the index
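A minimal sketch of how these six pieces fit together in one write path (the paths and field names here are arbitrary examples):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

// Sketch: Directory -> Analyzer -> IndexWriterConfig -> IndexWriter -> Document/IndexableField
class CoreApiFlow {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("index"));         // where the index lives
        Analyzer analyzer = new StandardAnalyzer();                    // how text is tokenized
        IndexWriterConfig cfg = new IndexWriterConfig(analyzer);       // writer configuration
        try (IndexWriter writer = new IndexWriter(dir, cfg)) {         // creates/maintains the index
            Document doc = new Document();                              // one document in the index
            doc.add(new TextField("title", "hello lucene", Field.Store.YES)); // an IndexableField
            writer.addDocument(doc);
            writer.commit();
        }
    }
}
```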
5 The IK Analyzer
5.1 Download
Download a build of IKAnalyzer that is compatible with this Lucene version.
Link: http://download.csdn.net/detail/fanpei_moukoy/9796612
5.2 Basic Usage
Use the IK analyzer to segment Chinese text into meaningful words.
Usage: replace the Analyzer used so far with IKAnalyzer.
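A minimal sketch of the swap, assuming the IKAnalyzer jar from the download link above is on the classpath (it provides org.wltea.analyzer.lucene.IKAnalyzer, the same class used in chapter 8):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.wltea.analyzer.lucene.IKAnalyzer;

import java.nio.file.Paths;

// Sketch: the only change from CreateIndex in section 3.1 is the Analyzer implementation.
class CreateIndexWithIk {
    public static void main(String[] args) throws Exception {
        Directory directory = FSDirectory.open(Paths.get("index"));
        Analyzer analyzer = new IKAnalyzer(); // was: new StandardAnalyzer()
        IndexWriterConfig config = new IndexWriterConfig(analyzer)
                .setOpenMode(IndexWriterConfig.OpenMode.CREATE);
        try (IndexWriter writer = new IndexWriter(directory, config)) {
            // add documents exactly as in section 3.1
        }
    }
}
```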
Result:
It recognizes and segments common words, but it is still not sufficient; for example, terms such as "双摄像头" (dual camera) and "骁龙" (Snapdragon) are not picked out as single words.
5.3 Customizing the Analyzer Dictionary
Create the configuration file.
Create a custom extension dictionary.
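The extension dictionary is wired in through the standard IKAnalyzer.cfg.xml on the classpath. A typical configuration might look like the following; the file names are placeholders and the entry keys are those used by common IKAnalyzer builds, so check the copy you downloaded:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- custom extension dictionary, one word per line, e.g. 骁龙 and 双摄像头 -->
    <entry key="ext_dict">ext.dic</entry>
    <!-- optional custom stopword dictionary -->
    <entry key="ext_stopwords">stopword.dic</entry>
</properties>
```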
Tokenization result:
5.4 Paged Queries
Code:
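A minimal sketch of one way to page through results (ten hits per page, over the index built in section 3.1; IndexSearcher.searchAfter would be an alternative approach):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.*;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

// Sketch: fetch enough hits to cover the requested page, then print only that page.
class PagedSearch {
    public static void main(String[] args) throws Exception {
        int page = 2;          // 1-based page number
        int hitsPerPage = 10;
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new MatchAllDocsQuery();
            TopDocs topDocs = searcher.search(query, page * hitsPerPage);
            ScoreDoc[] hits = topDocs.scoreDocs;
            int start = (page - 1) * hitsPerPage;
            int end = Math.min(hits.length, start + hitsPerPage);
            for (int i = start; i < end; i++) {
                Document doc = searcher.doc(hits[i].doc);
                System.out.println((i + 1) + ". " + doc.get("title"));
            }
        }
    }
}
```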
Result:
6 Building and Searching a File Index
Import one million records and build an index from them.
6.1 Creating the Index
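A sketch of how such a bulk load can be written; the data here is synthetic, but the "id" and "address" field names match what the later searches in this guide query:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.wltea.analyzer.lucene.IKAnalyzer;

import java.nio.file.Paths;

// Sketch: bulk-index one million synthetic address records into "id" / "address" fields.
class BulkCreateIndex {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new IKAnalyzer();
        IndexWriterConfig config = new IndexWriterConfig(analyzer)
                .setOpenMode(IndexWriterConfig.OpenMode.CREATE);
        long start = System.currentTimeMillis();
        try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("index")), config)) {
            for (int i = 0; i < 1_000_000; i++) {
                Document doc = new Document();
                doc.add(new StringField("id", String.valueOf(i), Field.Store.YES));
                doc.add(new TextField("address", "茂名市 demo address " + i, Field.Store.YES));
                writer.addDocument(doc); // no per-document commit needed
            }
            writer.commit();
        }
        System.out.println("indexed in " + (System.currentTimeMillis() - start) + " ms");
    }
}
```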
6.2 Result
The first run took about 100 seconds to index the one million records. Later runs became gradually faster, probably because only two applications were running, and eventually indexing finished in under a minute.
6.3 Fuzzy Search
Searching for "茂名" hits all of the documents, more than a million of them, in just over one second.
Result:
7 Getting an Analyzer's Tokenization Result
7.1 Using the IK Analyzer
Like Baidu does: first tokenize the sentence we want to search for, then search by the resulting keywords.
Code:
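A sketch of the idea; the class name AnalyzerResult, its package, and the getAnalyseResult method match how the chapter 8 listing calls this helper, where the same token-collection logic appears again in full:

```java
package com.bingo.backstage;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.wltea.analyzer.lucene.IKAnalyzer;

import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Sketch: run a sentence through an analyzer and collect the tokens it produces.
public class AnalyzerResult {
    public List<String> getAnalyseResult(String analyzeStr, Analyzer analyzer) {
        List<String> tokens = new ArrayList<String>();
        try (TokenStream stream = analyzer.tokenStream("address", new StringReader(analyzeStr))) {
            CharTermAttribute attr = stream.addAttribute(CharTermAttribute.class);
            stream.reset();                   // must be called before incrementToken()
            while (stream.incrementToken()) { // advance to the next token
                tokens.add(attr.toString());
            }
            stream.end();
        } catch (Exception e) {
            e.printStackTrace();
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(new AnalyzerResult().getAnalyseResult("骁龙835 6G内存", new IKAnalyzer()));
    }
}
```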
Tokenization result:
7.2 Using the Built-in CJK Analyzer
Simply replace IKAnalyzer with CJKAnalyzer in the class.
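For example, reusing the AnalyzerResult helper from 7.1 (CJKAnalyzer ships with lucene-analyzers-common, so no extra jar is needed):

```java
import com.bingo.backstage.AnalyzerResult;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;

// Sketch: the built-in CJK analyzer splits CJK text into two-character tokens (bigrams).
class CjkTokenDemo {
    public static void main(String[] args) {
        Analyzer analyzer = new CJKAnalyzer(); // was: new IKAnalyzer()
        System.out.println(new AnalyzerResult().getAnalyseResult("骁龙835 6G内存", analyzer));
    }
}
```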
Tokenization result:
It basically splits the text two characters at a time (bigrams), which is not as good as the IK analyzer.
8 Going Further
Combining everything so far: first tokenize the query sentence, then search by the resulting keywords, and output the most similar (highest-scoring) documents first.
A Boolean query is used.
Code:
```java
package com.bingo.backstage;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.search.*;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.wltea.analyzer.lucene.IKAnalyzer;

import java.io.IOException;
import java.io.StringReader;
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/**
 * Created by MoSon on 2017/7/5.
 */
public class BooleanSearchQuery {
    public static void main(String[] args) throws IOException, ParseException {
        long start = System.currentTimeMillis();
        System.out.println("start time: " + start);
        // Define the index directory
        Path path = FileSystems.getDefault().getPath("index");
        Directory directory = FSDirectory.open(path);
        // Define the index reader
        IndexReader indexReader = DirectoryReader.open(directory);
        // Define the searcher
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);
        // Search content: define the query terms.
        // A Boolean search over two fixed terms would look like this:
        /*
        TermQuery termQuery1 = new TermQuery(term1);
        TermQuery termQuery2 = new TermQuery(term2);
        BooleanClause booleanClause1 = new BooleanClause(termQuery1, BooleanClause.Occur.MUST);
        BooleanClause booleanClause2 = new BooleanClause(termQuery2, BooleanClause.Occur.SHOULD);
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        builder.add(booleanClause1);
        builder.add(booleanClause2);
        BooleanQuery query = builder.build();
        */

        /*
         * Going further: a Boolean search over many keywords.
         */
        // Define the Term list
        List<Term> termList = new ArrayList<Term>();
        // Get the tokenization result of the query sentence
        List<String> analyseResult = new AnalyzerResult().getAnalyseResult(
                "信宜市1234ewrq13asd丁堡镇丁堡镇片区丁堡街道181号301", new IKAnalyzer());
        for (String result : analyseResult) {
            termList.add(new Term("address", result));
            // System.out.println(result);
        }
        // Build a TermQuery for each term
        List<TermQuery> termQueries = new ArrayList<TermQuery>();
        for (Term term : termList) {
            termQueries.add(new TermQuery(term));
        }
        // Wrap each TermQuery in a SHOULD clause
        List<BooleanClause> booleanClauses = new ArrayList<BooleanClause>();
        for (TermQuery termQuery : termQueries) {
            booleanClauses.add(new BooleanClause(termQuery, BooleanClause.Occur.SHOULD));
        }
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        for (BooleanClause booleanClause : booleanClauses) {
            builder.add(booleanClause);
        }
        // Run the search
        BooleanQuery query = builder.build();
        // Retrieve the top 20 hits
        TopDocs topDocs = indexSearcher.search(query, 20);
        // Print the hit count
        System.out.println("hits: " + topDocs.totalHits);
        // Take out the documents
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        // Iterate over the hits
        for (ScoreDoc scoreDoc : scoreDocs) {
            float score = scoreDoc.score; // similarity score
            System.out.println("score: " + score);
            // Retrieve the document through indexSearcher.doc()
            Document doc = indexSearcher.doc(scoreDoc.doc);
            System.out.println("id: " + doc.get("id"));
            System.out.println("address: " + doc.get("address"));
        }
        // Close the index reader
        indexReader.close();
        long end = System.currentTimeMillis();
        System.out.println("end time: " + end);
        long time = end - start;
        System.out.println("elapsed: " + time + " ms");
    }

    /**
     * Get the tokenization result from the given analyzer.
     *
     * @param analyzeStr the string to tokenize
     * @param analyzer   the analyzer to use
     * @return the list of tokens
     */
    public List<String> getAnalyseResult(String analyzeStr, Analyzer analyzer) {
        List<String> response = new ArrayList<String>();
        TokenStream tokenStream = null;
        try {
            // Return a TokenStream for the field name, tokenizing the reader's content.
            tokenStream = analyzer.tokenStream("address", new StringReader(analyzeStr));
            // The text of each token
            CharTermAttribute attr = tokenStream.addAttribute(CharTermAttribute.class);
            // The consumer calls reset() before consuming with incrementToken().
            // It resets the stream to a clean state; stateful implementations must support it
            // so they can be reused as if freshly created.
            tokenStream.reset();
            // Consumers (e.g. the IndexWriter) use this method to advance the stream to the next token.
            while (tokenStream.incrementToken()) {
                response.add(attr.toString());
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (tokenStream != null) {
                try {
                    tokenStream.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        return response;
    }
}
```
Result:
The input sentence is the address string used in the code above.
Search results:
This concludes the introduction. If you are interested in going further, see the advanced articles listed under "my other articles" at the bottom.