Lucene from Beginner to Advanced (version 6.6.0)
Lucene Study Notes
Preface
This guide is based on the latest Lucene 6.6.0. Many methods from older tutorials are deprecated and no longer apply, so this article tries to cover the basics in the simplest way possible.
The examples in Chapter 2 are the official examples. They are well written and thorough, but they contain no comments at all; every comment in them was added by me, so some interpretations may be inaccurate. Please bear with me and feel free to report anything that looks wrong.
Chapter 3 onward contains examples I wrote myself. They are simple and easy to follow, so I recommend starting directly from Chapter 3.
1 Resources
1.1 Getting-Started Documentation
Official documentation: http://lucene.apache.org/core/6_6_0/index.html
You can follow the official examples from this documentation.
1.2 Development Documentation
Lucene core API documentation: http://lucene.apache.org/core/6_6_0/core/index.html
1.3 Importing Maven Dependencies
Import the jar packages required to use Lucene.
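A minimal pom.xml fragment that covers the examples in this guide might look like the following; the exact module selection is my assumption, so keep only what you need:

```xml
<!-- Lucene 6.6.0 modules used by the examples in this guide -->
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>6.6.0</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-common</artifactId>
    <version>6.6.0</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>6.6.0</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-facet</artifactId>
    <version>6.6.0</version>
</dependency>
```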
1.4 Luke
Luke is a tool dedicated to inspecting Lucene indexes.
GitHub repository: https://github.com/DmitryKey/luke
Installation:
(Alternatively, for older versions of Luke you can directly download the jar file from the releases page and run it with the command `java -jar luke-with-deps.jar`.)
2 Getting Started
2.1 IndexFiles
The official example IndexFiles.java creates a Lucene index.
To run this class you must pass arguments to the main method. There are three ways to supply them; here I show just one: in IDEA, add program arguments such as `-docs <path to your files>` to the run configuration.
2.1.1 The contents of Test.txt are as follows:
```
numberA numberB number 范德萨 jklj test 你好 不错啊
```
2.1.2 Code
```java
package com.bingo.backstage;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.OpenOption;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.Date;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.store.FSDirectory;

/**
 * Created by MoSon on 2017/6/30.
 */
public class IndexFiles {
    private IndexFiles() {}

    public static void main(String[] args) {
        // When running, supply program arguments such as: -docs <path to your files>
        String usage = "java com.bingo.backstage.IndexFiles [-index INDEX_PATH] [-docs DOCS_PATH] [-update]\n\n"
                + "This indexes the documents in DOCS_PATH, creating a Lucene index in INDEX_PATH that can be searched with SearchFiles";
        String indexPath = "index";
        String docsPath = null;
        boolean create = true;
        for (int docDir = 0; docDir < args.length; ++docDir) {
            if ("-index".equals(args[docDir])) {
                indexPath = args[docDir + 1];
                ++docDir;
            } else if ("-docs".equals(args[docDir])) {
                docsPath = args[docDir + 1];
                ++docDir;
            } else if ("-update".equals(args[docDir])) {
                create = false;
            }
        }
        if (docsPath == null) {
            System.err.println("Usage: " + usage);
            System.exit(1);
        }
        Path var13 = Paths.get(docsPath, new String[0]);
        if (!Files.isReadable(var13)) {
            System.out.println("Document directory '" + var13.toAbsolutePath() + "' does not exist or is not readable, please check the path");
            System.exit(1);
        }
        Date start = new Date();
        try {
            System.out.println("Indexing to directory '" + indexPath + "'...");
            // Open the directory that will hold the index
            FSDirectory e = FSDirectory.open(Paths.get(indexPath, new String[0]));
            StandardAnalyzer analyzer = new StandardAnalyzer();
            IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
            if (create) {
                iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
            } else {
                iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
            }
            IndexWriter writer = new IndexWriter(e, iwc);
            indexDocs(writer, var13);
            writer.close();
            Date end = new Date();
            System.out.println(end.getTime() - start.getTime() + " total milliseconds");
        } catch (IOException var12) {
            System.out.println(" caught a " + var12.getClass() + "\n with message: " + var12.getMessage());
        }
    }

    static void indexDocs(final IndexWriter writer, Path path) throws IOException {
        if (Files.isDirectory(path, new LinkOption[0])) {
            Files.walkFileTree(path, new SimpleFileVisitor<Path>() {
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                    try {
                        IndexFiles.indexDoc(writer, file, attrs.lastModifiedTime().toMillis());
                    } catch (IOException ignore) {
                        // skip files that cannot be read
                    }
                    return FileVisitResult.CONTINUE;
                }
            });
        } else {
            indexDoc(writer, path, Files.getLastModifiedTime(path, new LinkOption[0]).toMillis());
        }
    }

    static void indexDoc(IndexWriter writer, Path file, long lastModified) throws IOException {
        InputStream stream = Files.newInputStream(file, new OpenOption[0]);
        Throwable var5 = null;
        try {
            Document doc = new Document();
            StringField pathField = new StringField("path", file.toString(), Field.Store.YES);
            doc.add(pathField);
            doc.add(new LongPoint("modified", new long[]{lastModified}));
            doc.add(new TextField("contents", new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8))));
            if (writer.getConfig().getOpenMode() == IndexWriterConfig.OpenMode.CREATE) {
                System.out.println("adding " + file);
                writer.addDocument(doc);
            } else {
                System.out.println("updating " + file);
                writer.updateDocument(new Term("path", file.toString()), doc);
            }
        } catch (Throwable var15) {
            var5 = var15;
            try {
                throw var15;
            } catch (Throwable throwable) {
                throwable.printStackTrace();
            }
        } finally {
            if (stream != null) {
                if (var5 != null) {
                    try {
                        stream.close();
                    } catch (Throwable var14) {
                        var5.addSuppressed(var14);
                    }
                } else {
                    stream.close();
                }
            }
        }
    }
}
```
2.1.3 Running the Example
A folder holding the index will be generated automatically under the project root directory.
Viewing the result with Luke:
Notice that the Chinese text was not added to the index.
2.1.4 Analysis
The IndexFiles class creates a Lucene index.
The main() method parses the command-line arguments and then, in preparation for instantiating an IndexWriter, opens a Directory and instantiates a StandardAnalyzer and an IndexWriterConfig.
The value of the -index command-line parameter is the name of the filesystem directory where all index information should be stored. If IndexFiles is invoked with a relative path for -index, or if -index is omitted so that the default relative path "index" is used, the index path will be created as a subdirectory of the current working directory (if it does not already exist). On some platforms the index path may instead be created in a different directory, such as the user's home directory.
The value of the -docs command-line parameter is the location of the directory containing the files to be indexed.
The -update command-line parameter tells IndexFiles not to delete the index if it already exists. When -update is not given, IndexFiles wipes the slate clean before indexing any documents.
Lucene's IndexWriter uses a Directory to store the information in the index. Besides the FSDirectory implementation we use here, there are several other Directory subclasses that can write to RAM, to a database, and so on.
A Lucene Analyzer is a processing pipeline that breaks text into indexed tokens, also called terms, and optionally performs other operations on those tokens, such as downcasing, synonym insertion, and filtering out unwanted tokens. The Analyzer we use is StandardAnalyzer, which applies the Word Break rules from the Unicode Text Segmentation algorithm specified in Unicode Standard Annex #29, converts tokens to lowercase, and then filters out stopwords. Stopwords are common language words such as articles (a, an, the, and so on) and other tokens that may have little search value. Note that every language has different rules, and you should use the appropriate analyzer for each language you index.
The IndexWriterConfig instance holds all the configuration for the IndexWriter. For example, we set the OpenMode based on the value of the -update command-line parameter.
Looking further down in the file, after the IndexWriter is instantiated you will see the indexDocs() code. This recursive function crawls the directories and creates Document objects. A Document is simply a data object representing the text content of a file together with its creation time and location. These instances are added to the IndexWriter. If the -update command-line parameter is given, the IndexWriterConfig OpenMode is set to OpenMode.CREATE_OR_APPEND, and rather than simply adding documents to the index, the IndexWriter updates them: it tries to find an already-indexed document with the same identifier (in our case the file path serves as the identifier); if one exists, it deletes it from the index and then adds the new document.
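To make the add-versus-update decision concrete, here is a minimal sketch of what the demo does for every file; the class and method names are mine, but the field names and calls follow the listing above:

```java
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;

import java.io.IOException;
import java.io.Reader;
import java.nio.file.Path;

// Minimal sketch of the add-or-update decision described above.
// The "path" field acts as the document identifier, exactly as in the IndexFiles demo.
class AddOrUpdate {
    static void indexOne(IndexWriter writer, Path file, Reader contents) throws IOException {
        Document doc = new Document();
        doc.add(new StringField("path", file.toString(), Field.Store.YES));
        doc.add(new TextField("contents", contents));
        if (writer.getConfig().getOpenMode() == IndexWriterConfig.OpenMode.CREATE) {
            writer.addDocument(doc); // a fresh index: simply append the document
        } else {
            // CREATE_OR_APPEND: replace any previously indexed document with the same path
            writer.updateDocument(new Term("path", file.toString()), doc);
        }
    }
}
```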
2.2 SearchFiles
Searching the indexed files.
2.2.1 Code
```java
package com.bingo.backstage;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Date;

/**
 * Created by MoSon on 2017/6/30.
 */
public class SearchFiles {
    private SearchFiles() {}

    public static void main(String[] args) throws Exception {
        String usage = "Usage:\tjava com.bingo.backstage.SearchFiles [-index dir] [-field f] [-repeat n] [-queries file] [-query string] [-raw] [-paging hitsPerPage]\n\nSee http://lucene.apache.org/core/4_1_0/demo/ for details.";
        if (args.length > 0 && ("-h".equals(args[0]) || "-help".equals(args[0]))) {
            System.out.println(usage);
            System.exit(0);
        }
        String index = "index";
        String field = "contents";
        String queries = null;
        int repeat = 0;
        boolean raw = false;
        String queryString = null;
        int hitsPerPage = 10;
        for (int reader = 0; reader < args.length; ++reader) {
            if ("-index".equals(args[reader])) {
                index = args[reader + 1];
                ++reader;
            } else if ("-field".equals(args[reader])) {
                field = args[reader + 1];
                ++reader;
            } else if ("-queries".equals(args[reader])) {
                queries = args[reader + 1];
                ++reader;
            } else if ("-query".equals(args[reader])) {
                queryString = args[reader + 1];
                ++reader;
            } else if ("-repeat".equals(args[reader])) {
                repeat = Integer.parseInt(args[reader + 1]);
                ++reader;
            } else if ("-raw".equals(args[reader])) {
                raw = true;
            } else if ("-paging".equals(args[reader])) {
                hitsPerPage = Integer.parseInt(args[reader + 1]);
                if (hitsPerPage <= 0) {
                    System.err.println("There must be at least 1 hit per page.");
                    System.exit(1);
                }
                ++reader;
            }
        }
        // Open the index directory
        DirectoryReader var18 = DirectoryReader.open(FSDirectory.open(Paths.get(index, new String[0])));
        IndexSearcher searcher = new IndexSearcher(var18);
        StandardAnalyzer analyzer = new StandardAnalyzer();
        BufferedReader in = null;
        if (queries != null) {
            in = Files.newBufferedReader(Paths.get(queries, new String[0]), StandardCharsets.UTF_8);
        } else {
            in = new BufferedReader(new InputStreamReader(System.in, StandardCharsets.UTF_8));
        }
        QueryParser parser = new QueryParser(field, analyzer);
        do {
            if (queries == null && queryString == null) {
                System.out.println("Enter query: ");
            }
            String line = queryString != null ? queryString : in.readLine();
            if (line == null || line.length() == -1) {
                break;
            }
            line = line.trim();
            if (line.length() == 0) {
                break;
            }
            Query query = parser.parse(line);
            System.out.println("Searching for: " + query.toString(field));
            if (repeat > 0) {
                Date start = new Date();
                for (int end = 0; end < repeat; ++end) {
                    searcher.search(query, 100);
                }
                Date var19 = new Date();
                System.out.println("Time: " + (var19.getTime() - start.getTime()) + "ms");
            }
            doPagingSearch(in, searcher, query, hitsPerPage, raw, queries == null && queryString == null);
        } while (queryString == null);
        var18.close();
    }

    public static void doPagingSearch(BufferedReader in, IndexSearcher searcher, Query query, int hitsPerPage, boolean raw, boolean interactive) throws IOException {
        TopDocs results = searcher.search(query, 5 * hitsPerPage);
        ScoreDoc[] hits = results.scoreDocs;
        int numTotalHits = results.totalHits;
        System.out.println(numTotalHits + " total matching documents");
        int start = 0;
        int end = Math.min(numTotalHits, hitsPerPage);
        while (true) {
            if (end > hits.length) {
                System.out.println("Only results 1 - " + hits.length + " of " + numTotalHits + " total matching documents collected.");
                System.out.println("Collect more (y/n) ?");
                String quit = in.readLine();
                if (quit.length() == 0 || quit.charAt(0) == 110) {
                    break;
                }
                hits = searcher.search(query, numTotalHits).scoreDocs;
            }
            end = Math.min(hits.length, start + hitsPerPage);
            for (int var15 = start; var15 < end; ++var15) {
                if (raw) {
                    System.out.println("doc=" + hits[var15].doc + " score=" + hits[var15].score);
                } else {
                    Document line = searcher.doc(hits[var15].doc);
                    String page = line.get("path");
                    if (page != null) {
                        System.out.println(var15 + 1 + ". " + page);
                        String title = line.get("title");
                        if (title != null) {
                            System.out.println("   Title: " + line.get("title"));
                        }
                    } else {
                        System.out.println(var15 + 1 + ". No path for this document");
                    }
                }
            }
            if (!interactive || end == 0) {
                break;
            }
            if (numTotalHits >= end) {
                boolean var16 = false;
                while (true) {
                    System.out.print("Press ");
                    if (start - hitsPerPage >= 0) {
                        System.out.print("(p)revious page, ");
                    }
                    if (start + hitsPerPage < numTotalHits) {
                        System.out.print("(n)ext page, ");
                    }
                    System.out.println("(q)uit or enter number to jump to a page.");
                    String var17 = in.readLine();
                    if (var17.length() == 0 || var17.charAt(0) == 113) {
                        var16 = true;
                        break;
                    }
                    if (var17.charAt(0) == 112) {
                        start = Math.max(0, start - hitsPerPage);
                        break;
                    }
                    if (var17.charAt(0) == 110) {
                        if (start + hitsPerPage < numTotalHits) {
                            start += hitsPerPage;
                        }
                        break;
                    }
                    int var18 = Integer.parseInt(var17);
                    if ((var18 - 1) * hitsPerPage < numTotalHits) {
                        start = (var18 - 1) * hitsPerPage;
                        break;
                    }
                    System.out.println("No such page");
                }
                if (var16) {
                    break;
                }
                end = Math.min(numTotalHits, start + hitsPerPage);
            }
        }
    }
}
```
2.2.2 Running Result
As you can see, the result matches what the Luke tool showed above; a query only returns a hit when the term is exactly right.
2.2.3 Analysis
This class mainly collaborates with an IndexSearcher, a StandardAnalyzer (the same one used in the IndexFiles class), and a QueryParser. The query parser is constructed with an analyzer used to interpret your query text in the same way the documents were interpreted: finding word boundaries, downcasing, and removing useless words like "a", "an" and "the". The Query object contains the result produced by the QueryParser, which is then passed to the searcher. Note that it is also possible to programmatically construct a rich Query object without using the query parser; the query parser simply decodes the Lucene query syntax into the corresponding Query object.
SearchFiles uses the IndexSearcher.search(query, n) method, which returns at most the top n hits. The results are printed in pages, sorted by score (that is, by relevance).
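As a small illustration of the two ways of obtaining a Query mentioned above, the sketch below parses a query string with QueryParser and also builds a single-term query programmatically; the field name "contents" and the sample terms follow the earlier demo, while the class name is mine:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// Sketch: the same kind of search expressed through the query parser and programmatically.
class QueryConstruction {
    public static void main(String[] args) throws Exception {
        // 1) Let QueryParser interpret the text with the same analyzer used at index time
        QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
        Query parsed = parser.parse("numberA test");
        System.out.println(parsed); // e.g. contents:numbera contents:test

        // 2) Build an equivalent single-term query by hand, with no parsing involved
        Query manual = new TermQuery(new Term("contents", "test"));
        System.out.println(manual);
    }
}
```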
2.3 SimpleSortedSetFacetsExample
A simpler example, easier to understand than the previous two demos.
This example shows simple faceted indexing and search using SortedSetDocValuesFacetField and SortedSetDocValuesFacetCounts.
The code below is commented; reading the comments alongside the code makes it easier to follow.
2.3.1 Code
```java
package com.bingo.backstage.facet;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.facet.DrillDownQuery;
import org.apache.lucene.facet.FacetResult;
import org.apache.lucene.facet.FacetsCollector;
import org.apache.lucene.facet.FacetsConfig;
import org.apache.lucene.facet.sortedset.DefaultSortedSetDocValuesReaderState;
import org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts;
import org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

/**
 * Created by MoSon on 2017/6/30.
 */
public class SimpleSortedSetFacetsExample {
    // RAMDirectory: a memory-resident Directory implementation.
    // By default its locking implementation is SingleInstanceLockFactory.
    private final Directory indexDir = new RAMDirectory();
    private final FacetsConfig config = new FacetsConfig();

    public SimpleSortedSetFacetsExample() {}

    private void index() throws IOException {
        // Initialize the index writer.
        // WhitespaceAnalyzer only splits on whitespace: it does not lowercase, does not
        // support Chinese, and performs no further normalization of the tokens.
        // OpenMode: CREATE overwrites an existing index; APPEND appends to it.
        // IndexWriter creates and maintains the index.
        IndexWriter indexWriter = new IndexWriter(this.indexDir,
                (new IndexWriterConfig(new WhitespaceAnalyzer())).setOpenMode(OpenMode.CREATE));
        // Build a document
        Document doc = new Document();
        // Create Field objects and add them to the document
        doc.add(new SortedSetDocValuesFacetField("Author", "Bob"));
        doc.add(new SortedSetDocValuesFacetField("Publish Year", "2010"));
        // Write it through the IndexWriter
        indexWriter.addDocument(this.config.build(doc));
        doc = new Document();
        doc.add(new SortedSetDocValuesFacetField("Author", "Lisa"));
        doc.add(new SortedSetDocValuesFacetField("Publish Year", "2010"));
        indexWriter.addDocument(this.config.build(doc));
        doc = new Document();
        doc.add(new SortedSetDocValuesFacetField("Author", "Lisa"));
        doc.add(new SortedSetDocValuesFacetField("Publish Year", "2012"));
        indexWriter.addDocument(this.config.build(doc));
        doc = new Document();
        doc.add(new SortedSetDocValuesFacetField("Author", "Susan"));
        doc.add(new SortedSetDocValuesFacetField("Publish Year", "2012"));
        indexWriter.addDocument(this.config.build(doc));
        doc = new Document();
        doc.add(new SortedSetDocValuesFacetField("Author", "Frank"));
        doc.add(new SortedSetDocValuesFacetField("Publish Year", "1999"));
        indexWriter.addDocument(this.config.build(doc));
        indexWriter.close();
    }

    // Search and compute facet counts over the documents
    private List<FacetResult> search() throws IOException {
        // Most of these objects simply wrap one another.
        // DirectoryReader is the CompositeReader implementation that reads an index from a Directory.
        DirectoryReader indexReader = DirectoryReader.open(this.indexDir);
        // Searching goes through an IndexReader.
        IndexSearcher searcher = new IndexSearcher(indexReader);
        DefaultSortedSetDocValuesReaderState state = new DefaultSortedSetDocValuesReaderState(indexReader);
        // Collect hits for subsequent faceting. Once a search has run and hits are collected,
        // a Facets subclass is instantiated to do the facet counting. The search utility method
        // performs an ordinary search while also collecting into the FacetsCollector.
        FacetsCollector fc = new FacetsCollector();
        // Utility method: search and collect all hits into the provided Collector.
        FacetsCollector.search(searcher, new MatchAllDocsQuery(), 10, fc);
        // Count facets across all of the collected hits.
        SortedSetDocValuesFacetCounts facets = new SortedSetDocValuesFacetCounts(state, fc);
        ArrayList results = new ArrayList();
        // getTopChildren: return the top child labels under the given path.
        results.add(facets.getTopChildren(10, "Author", new String[0]));
        results.add(facets.getTopChildren(10, "Publish Year", new String[0]));
        indexReader.close();
        return results;
    }

    private FacetResult drillDown() throws IOException {
        DirectoryReader indexReader = DirectoryReader.open(this.indexDir);
        IndexSearcher searcher = new IndexSearcher(indexReader);
        DefaultSortedSetDocValuesReaderState state = new DefaultSortedSetDocValuesReaderState(indexReader);
        DrillDownQuery q = new DrillDownQuery(this.config);
        // Add a drill-down condition
        q.add("Publish Year", new String[]{"2012"});
        FacetsCollector fc = new FacetsCollector();
        FacetsCollector.search(searcher, q, 10, fc);
        SortedSetDocValuesFacetCounts facets = new SortedSetDocValuesFacetCounts(state, fc);
        // Get the matching authors
        FacetResult result = facets.getTopChildren(10, "Author", new String[0]);
        indexReader.close();
        return result;
    }

    public List<FacetResult> runSearch() throws IOException {
        this.index();
        return this.search();
    }

    public FacetResult runDrillDown() throws IOException {
        this.index();
        return this.drillDown();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Facet counting example:");
        System.out.println("-----------------------");
        SimpleSortedSetFacetsExample example = new SimpleSortedSetFacetsExample();
        List results = example.runSearch();
        System.out.println("Author: " + results.get(0));
        System.out.println("Publish Year: " + results.get(1));
        System.out.println("\n");
        System.out.println("Facet drill-down example (Publish Year/2010):");
        System.out.println("---------------------------------------------");
        System.out.println("Author: " + example.runDrillDown());
    }
}
```
2.3.2 Running Result
3 A Simple Start
3.1 Creating an Index
This is an example I wrote myself; it is easy to understand.
It simply adds some content to the index.
3.1.1 Code
```java
package com.bingo.backstage;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LegacyLongField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Path;

import static org.apache.lucene.document.TextField.TYPE_STORED;

/**
 * Created by MoSon on 2017/6/30.
 */
public class CreateIndex {
    public static void main(String[] args) throws IOException {
        // Set up the IndexWriter.
        // "index" is a relative path under the current project.
        Path path = FileSystems.getDefault().getPath("", "index");
        Directory directory = FSDirectory.open(path);
        // Define the analyzer
        Analyzer analyzer = new StandardAnalyzer();
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer)
                .setOpenMode(IndexWriterConfig.OpenMode.CREATE);
        IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);
        // Define a document
        Document document = new Document();
        // Define the document's fields
        document.add(new LegacyLongField("id", 5499, Field.Store.YES));
        document.add(new Field("title", "小米6", TYPE_STORED));
        document.add(new Field("sellPoint", "骁龙835,6G内存,双摄!", TYPE_STORED));
        // Write the document
        indexWriter.addDocument(document);
        // Add another document
        document = new Document();
        document.add(new LegacyLongField("id", 8324, Field.Store.YES));
        document.add(new Field("title", "OnePlus5", TYPE_STORED));
        document.add(new Field("sellPoint", "8核,8G运行内存", TYPE_STORED));
        indexWriter.addDocument(document);
        // Commit
        indexWriter.commit();
        // Close
        indexWriter.close();
    }
}
```
3.1.2 Result
Below is the result viewed with Luke.
3.2 Searching by Term
Query for the content that matches a condition.
3.2.1 Code
```java
package com.bingo.backstage;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Path;

/**
 * Created by MoSon on 2017/7/1.
 */
public class Search {
    public static void main(String[] args) throws IOException {
        // Define the index directory
        Path path = FileSystems.getDefault().getPath("index");
        Directory directory = FSDirectory.open(path);
        // Define the index reader
        IndexReader indexReader = DirectoryReader.open(directory);
        // Define the searcher
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);
        // Search content: define the query term
        Term term = new Term("sellPoint", "存");
        Query query = new TermQuery(term);
        // Retrieve the top 10 hits
        TopDocs topDocs = indexSearcher.search(query, 10);
        // Print the hit count
        System.out.println("hits: " + topDocs.totalHits);
        // Take out the documents
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        // Iterate over the hits
        for (ScoreDoc scoreDoc : scoreDocs) {
            // Retrieve the document through indexSearcher.doc()
            Document doc = indexSearcher.doc(scoreDoc.doc);
            System.out.println("id: " + doc.get("id"));
            System.out.println("sellPoint: " + doc.get("sellPoint"));
        }
        // Close the index reader
        indexReader.close();
    }
}
```
3.2.2 Running Result
The matching results are retrieved and printed.
4 Core APIs for Building a Lucene Index
Directory: the directory the index lives in and is operated on
Analyzer: the analyzer/tokenizer
Document: a document object inside the index
IndexableField: a piece of data (a field) inside a document
IndexWriterConfig: the configuration used when building the index
IndexWriter: the object that builds and maintains the index
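A minimal sketch of how these six pieces fit together in one write path (the paths and field names here are arbitrary examples):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

// Sketch: Directory -> Analyzer -> IndexWriterConfig -> IndexWriter -> Document/IndexableField
class CoreApiFlow {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("index"));         // where the index lives
        Analyzer analyzer = new StandardAnalyzer();                    // how text is tokenized
        IndexWriterConfig cfg = new IndexWriterConfig(analyzer);       // writer configuration
        try (IndexWriter writer = new IndexWriter(dir, cfg)) {         // creates/maintains the index
            Document doc = new Document();                              // one document in the index
            doc.add(new TextField("title", "hello lucene", Field.Store.YES)); // an IndexableField
            writer.addDocument(doc);
            writer.commit();
        }
    }
}
```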
5 The IK Analyzer
5.1 Download
Download a build of IKAnalyzer that is compatible with this Lucene version.
Link: http://download.csdn.net/detail/fanpei_moukoy/9796612
5.2 Basic Usage
Use the IK analyzer to segment Chinese text into meaningful words.
Usage: replace the Analyzer used so far with IKAnalyzer.
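A minimal sketch of the swap, assuming the IKAnalyzer jar from the download link above is on the classpath (it provides org.wltea.analyzer.lucene.IKAnalyzer, the same class used in chapter 8):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.wltea.analyzer.lucene.IKAnalyzer;

import java.nio.file.Paths;

// Sketch: the only change from CreateIndex in section 3.1 is the Analyzer implementation.
class CreateIndexWithIk {
    public static void main(String[] args) throws Exception {
        Directory directory = FSDirectory.open(Paths.get("index"));
        Analyzer analyzer = new IKAnalyzer(); // was: new StandardAnalyzer()
        IndexWriterConfig config = new IndexWriterConfig(analyzer)
                .setOpenMode(IndexWriterConfig.OpenMode.CREATE);
        try (IndexWriter writer = new IndexWriter(directory, config)) {
            // add documents exactly as in section 3.1
        }
    }
}
```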
Result:
It recognizes and segments common words, but it is still not sufficient; for example, terms such as "双摄像头" (dual camera) and "骁龙" (Snapdragon) are not picked out as single words.
5.3 Customizing the Analyzer Dictionary
Create the configuration file.
Create a custom extension dictionary.
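The extension dictionary is wired in through the standard IKAnalyzer.cfg.xml on the classpath. A typical configuration might look like the following; the file names are placeholders and the entry keys are those used by common IKAnalyzer builds, so check the copy you downloaded:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- custom extension dictionary, one word per line, e.g. 骁龙 and 双摄像头 -->
    <entry key="ext_dict">ext.dic</entry>
    <!-- optional custom stopword dictionary -->
    <entry key="ext_stopwords">stopword.dic</entry>
</properties>
```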
Tokenization result:
5.4 Paged Queries
Code:
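A minimal sketch of one way to page through results (ten hits per page, over the index built in section 3.1; IndexSearcher.searchAfter would be an alternative approach):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.*;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

// Sketch: fetch enough hits to cover the requested page, then print only that page.
class PagedSearch {
    public static void main(String[] args) throws Exception {
        int page = 2;          // 1-based page number
        int hitsPerPage = 10;
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new MatchAllDocsQuery();
            TopDocs topDocs = searcher.search(query, page * hitsPerPage);
            ScoreDoc[] hits = topDocs.scoreDocs;
            int start = (page - 1) * hitsPerPage;
            int end = Math.min(hits.length, start + hitsPerPage);
            for (int i = start; i < end; i++) {
                Document doc = searcher.doc(hits[i].doc);
                System.out.println((i + 1) + ". " + doc.get("title"));
            }
        }
    }
}
```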
Result:
6 Building and Searching a File Index
Import one million records and build an index from them.
6.1 Creating the Index
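A sketch of how such a bulk load can be written; the data here is synthetic, but the "id" and "address" field names match what the later searches in this guide query:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.wltea.analyzer.lucene.IKAnalyzer;

import java.nio.file.Paths;

// Sketch: bulk-index one million synthetic address records into "id" / "address" fields.
class BulkCreateIndex {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new IKAnalyzer();
        IndexWriterConfig config = new IndexWriterConfig(analyzer)
                .setOpenMode(IndexWriterConfig.OpenMode.CREATE);
        long start = System.currentTimeMillis();
        try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("index")), config)) {
            for (int i = 0; i < 1_000_000; i++) {
                Document doc = new Document();
                doc.add(new StringField("id", String.valueOf(i), Field.Store.YES));
                doc.add(new TextField("address", "茂名市 demo address " + i, Field.Store.YES));
                writer.addDocument(doc); // no per-document commit needed
            }
            writer.commit();
        }
        System.out.println("indexed in " + (System.currentTimeMillis() - start) + " ms");
    }
}
```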
6.2 Result
The first run took about 100 seconds to index the one million records. Later runs became gradually faster, probably because only two applications were running, and eventually indexing finished in under a minute.
6.3 Fuzzy Search
Searching for "茂名" hits all of the documents, more than a million of them, in just over one second.
Result:
7 Getting an Analyzer's Tokenization Result
7.1 Using the IK Analyzer
Like Baidu does: first tokenize the sentence we want to search for, then search by the resulting keywords.
Code:
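A sketch of the idea; the class name AnalyzerResult, its package, and the getAnalyseResult method match how the chapter 8 listing calls this helper, where the same token-collection logic appears again in full:

```java
package com.bingo.backstage;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.wltea.analyzer.lucene.IKAnalyzer;

import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Sketch: run a sentence through an analyzer and collect the tokens it produces.
public class AnalyzerResult {
    public List<String> getAnalyseResult(String analyzeStr, Analyzer analyzer) {
        List<String> tokens = new ArrayList<String>();
        try (TokenStream stream = analyzer.tokenStream("address", new StringReader(analyzeStr))) {
            CharTermAttribute attr = stream.addAttribute(CharTermAttribute.class);
            stream.reset();                   // must be called before incrementToken()
            while (stream.incrementToken()) { // advance to the next token
                tokens.add(attr.toString());
            }
            stream.end();
        } catch (Exception e) {
            e.printStackTrace();
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(new AnalyzerResult().getAnalyseResult("骁龙835 6G内存", new IKAnalyzer()));
    }
}
```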
Tokenization result:
7.2 Using the Built-in CJK Analyzer
Simply replace IKAnalyzer with CJKAnalyzer in the class.
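For example, reusing the AnalyzerResult helper from 7.1 (CJKAnalyzer ships with lucene-analyzers-common, so no extra jar is needed):

```java
import com.bingo.backstage.AnalyzerResult;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;

// Sketch: the built-in CJK analyzer splits CJK text into two-character tokens (bigrams).
class CjkTokenDemo {
    public static void main(String[] args) {
        Analyzer analyzer = new CJKAnalyzer(); // was: new IKAnalyzer()
        System.out.println(new AnalyzerResult().getAnalyseResult("骁龙835 6G内存", analyzer));
    }
}
```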
Tokenization result:
It basically splits the text two characters at a time (bigrams), which is not as good as the IK analyzer.
8 Going Further
Combining everything so far: first tokenize the query sentence, then search by the resulting keywords, and output the most similar (highest-scoring) documents first.
A Boolean query is used.
Code:
```java
package com.bingo.backstage;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.search.*;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.wltea.analyzer.lucene.IKAnalyzer;

import java.io.IOException;
import java.io.StringReader;
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/**
 * Created by MoSon on 2017/7/5.
 */
public class BooleanSearchQuery {
    public static void main(String[] args) throws IOException, ParseException {
        long start = System.currentTimeMillis();
        System.out.println("start time: " + start);
        // Define the index directory
        Path path = FileSystems.getDefault().getPath("index");
        Directory directory = FSDirectory.open(path);
        // Define the index reader
        IndexReader indexReader = DirectoryReader.open(directory);
        // Define the searcher
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);
        // Search content: define the query terms.
        // A Boolean search over two fixed terms would look like this:
        /*
        TermQuery termQuery1 = new TermQuery(term1);
        TermQuery termQuery2 = new TermQuery(term2);
        BooleanClause booleanClause1 = new BooleanClause(termQuery1, BooleanClause.Occur.MUST);
        BooleanClause booleanClause2 = new BooleanClause(termQuery2, BooleanClause.Occur.SHOULD);
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        builder.add(booleanClause1);
        builder.add(booleanClause2);
        BooleanQuery query = builder.build();
        */

        /*
         * Going further: a Boolean search over many keywords.
         */
        // Define the Term list
        List<Term> termList = new ArrayList<Term>();
        // Get the tokenization result of the query sentence
        List<String> analyseResult = new AnalyzerResult().getAnalyseResult(
                "信宜市1234ewrq13asd丁堡镇丁堡镇片区丁堡街道181号301", new IKAnalyzer());
        for (String result : analyseResult) {
            termList.add(new Term("address", result));
            // System.out.println(result);
        }
        // Build a TermQuery for each term
        List<TermQuery> termQueries = new ArrayList<TermQuery>();
        for (Term term : termList) {
            termQueries.add(new TermQuery(term));
        }
        // Wrap each TermQuery in a SHOULD clause
        List<BooleanClause> booleanClauses = new ArrayList<BooleanClause>();
        for (TermQuery termQuery : termQueries) {
            booleanClauses.add(new BooleanClause(termQuery, BooleanClause.Occur.SHOULD));
        }
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        for (BooleanClause booleanClause : booleanClauses) {
            builder.add(booleanClause);
        }
        // Run the search
        BooleanQuery query = builder.build();
        // Retrieve the top 20 hits
        TopDocs topDocs = indexSearcher.search(query, 20);
        // Print the hit count
        System.out.println("hits: " + topDocs.totalHits);
        // Take out the documents
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        // Iterate over the hits
        for (ScoreDoc scoreDoc : scoreDocs) {
            float score = scoreDoc.score; // similarity score
            System.out.println("score: " + score);
            // Retrieve the document through indexSearcher.doc()
            Document doc = indexSearcher.doc(scoreDoc.doc);
            System.out.println("id: " + doc.get("id"));
            System.out.println("address: " + doc.get("address"));
        }
        // Close the index reader
        indexReader.close();
        long end = System.currentTimeMillis();
        System.out.println("end time: " + end);
        long time = end - start;
        System.out.println("elapsed: " + time + " ms");
    }

    /**
     * Get the tokenization result from the given analyzer.
     *
     * @param analyzeStr the string to tokenize
     * @param analyzer   the analyzer to use
     * @return the list of tokens
     */
    public List<String> getAnalyseResult(String analyzeStr, Analyzer analyzer) {
        List<String> response = new ArrayList<String>();
        TokenStream tokenStream = null;
        try {
            // Return a TokenStream for the field name, tokenizing the reader's content.
            tokenStream = analyzer.tokenStream("address", new StringReader(analyzeStr));
            // The text of each token
            CharTermAttribute attr = tokenStream.addAttribute(CharTermAttribute.class);
            // The consumer calls reset() before consuming with incrementToken().
            // It resets the stream to a clean state; stateful implementations must support it
            // so they can be reused as if freshly created.
            tokenStream.reset();
            // Consumers (e.g. the IndexWriter) use this method to advance the stream to the next token.
            while (tokenStream.incrementToken()) {
                response.add(attr.toString());
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (tokenStream != null) {
                try {
                    tokenStream.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        return response;
    }
}
```
Result:
The input sentence is the address string used in the code above.
Search results:
This concludes the introduction. If you are interested in going further, see the advanced articles listed under "my other articles" at the bottom.