Source Code Analysis: How ElasticSearch Loads Analyzers at Startup
This article describes how ElasticSearch creates and loads Analyzers at startup. The main references are the official Lucene documentation on Analyzers and the relevant classes in the ElasticSearch 6.3.2 source code: AnalysisModule, AnalysisPlugin, AnalyzerProvider, and the various Tokenizer classes together with their corresponding TokenizerFactory classes. A concrete plugin that uses HanLP for Chinese word segmentation on top of ElasticSearch, elasticsearch-analysis-hanlp, is also used as a reference.
The main goal of this article is to understand how AnalysisModule, AnalysisPlugin, AnalyzerProvider, a concrete Tokenizer such as HanLPStandardAnalyzer, and TokenizerFactory relate to one another. There are clearly one or more design patterns at work here; once you understand them, you can follow the same recipe and develop a custom plugin of your own.
The word segmentation plugin
1 Tokenizer
Comparing the HanLP Chinese tokenizer with the standard tokenizer built into ElasticSearch (StandardTokenizer) shows that elasticsearch-analysis-hanlp and ElasticSearch's own standard analysis plugin follow almost exactly the same recipe.
HanLP offers many kinds of Chinese word segmentation, such as standard segmentation, index segmentation, NLP segmentation, and so on. Accordingly, HanLPTokenizerFactory implements TokenizerFactory and provides a create() method that is responsible for creating the various tokenizers.
This is essentially the same pattern as StandardTokenizerFactory in the ElasticSearch source code.
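As a rough sketch of this factory pattern (modeled on my reading of StandardTokenizerFactory in the 6.x source; the class name MyTokenizerFactory is hypothetical, and exact constructor signatures can differ between ElasticSearch versions):
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.analysis.AbstractTokenizerFactory;

public class MyTokenizerFactory extends AbstractTokenizerFactory {

    private final int maxTokenLength;

    public MyTokenizerFactory(IndexSettings indexSettings, Environment environment, String name, Settings settings) {
        super(indexSettings, name, settings);
        // read a per-index setting for this tokenizer, falling back to a default
        maxTokenLength = settings.getAsInt("max_token_length", 255);
    }

    @Override
    public Tokenizer create() {
        // the factory's only job: build a fresh Tokenizer for each analysis chain
        StandardTokenizer tokenizer = new StandardTokenizer();
        tokenizer.setMaxTokenLength(maxTokenLength);
        return tokenizer;
    }
}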
2 Analyzer
Think of an Analyzer as a machine that produces Tokens: Text goes in, Tokens come out.
An Analyzer builds TokenStreams, which analyze text. It thus represents a policy for extracting index terms from text.
The machine can produce Tokens in different ways. For English text, whitespace generally serves as the delimiter: Text in, Tokens out.
Chinese text has no spaces, so a Chinese word segmentation algorithm is needed: Text in, Tokens out.
For text such as HTML, the HTML tags have to serve as delimiters: Text in, Tokens out.
The inner class TokenStreamComponents encapsulates how the Tokens are produced. As its source comment says, "This class encapsulates the outer components of a token stream. It provides access to the source Tokenizer and ....". It mainly wraps the Tokenizer:
/**
* This class encapsulates the outer components of a token stream. It provides
* access to the source ({@link Tokenizer}) and the outer end (sink), an
* instance of {@link TokenFilter} which also serves as the
* {@link TokenStream} returned by
* {@link Analyzer#tokenStream(String, Reader)}.
*/
public static class TokenStreamComponents {
/**
* Original source of the tokens.
*/
protected final Tokenizer source;
/**
* Sink tokenstream, such as the outer tokenfilter decorating
* the chain. This can be the source if there are no filters.
*/
protected final TokenStream sink;
To define a custom Analyzer, just extend the Analyzer class, override the createComponents() method, and supply a Tokenizer. For example, HanLPStandardAnalyzer overrides it as follows:
@Override
protected Analyzer.TokenStreamComponents createComponents(String fieldName) {
// AccessController.doPrivileged((PrivilegedAction) () -> HanLP.Config.Normalization = true);
Tokenizer tokenizer = new HanLPTokenizer(HanLP.newSegment(), configuration);
return new Analyzer.TokenStreamComponents(tokenizer);
}
You can also look at the StandardAnalyzer.java shipped with ElasticSearch, which implements the standard tokenization used during ElasticSearch's query analysis. It extends StopwordAnalyzerBase.java, so stop words can be filtered out while Tokens are being produced.
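To make the "Text in, Tokens out" behaviour concrete, here is a minimal, self-contained Lucene sketch (plain Lucene API, not code from the HanLP plugin) that pushes a string through StandardAnalyzer and prints every Token:
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerDemo {
    public static void main(String[] args) throws IOException {
        try (Analyzer analyzer = new StandardAnalyzer()) {
            // tokenStream() wires up the TokenStreamComponents built by createComponents()
            TokenStream stream = analyzer.tokenStream("name", "To sleep perchance to dream");
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();                     // mandatory before the first incrementToken()
            while (stream.incrementToken()) {
                System.out.println(term.toString());  // one line per Token
            }
            stream.end();
            stream.close();
        }
    }
}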
3 AnalyzerProvider
AnalyzerProvider wraps an Analyzer: its constructor instantiates an Analyzer and supplies some name- and version-related information for it:
public class HanLPAnalyzerProvider extends AbstractIndexAnalyzerProvider<Analyzer> {
private final Analyzer analyzer;
AbstractIndexAnalyzerProvider holds the name and Version information ("Constructs a new analyzer component, with the index name and its settings and the analyzer name."):
public abstract class AbstractIndexAnalyzerProvider<T extends Analyzer> extends AbstractIndexComponent implements AnalyzerProvider<T> {
private final String name;
protected final Version version;
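A minimal provider along these lines might look as follows. This is a sketch with hypothetical names: MyAnalyzerProvider and MyAnalyzer are not code from elasticsearch-analysis-hanlp; MyAnalyzer stands for a custom Analyzer like the one in section 2, and the constructor signature is the one AnalysisModule expects from an AnalysisProvider (see section 4):
import org.apache.lucene.analysis.Analyzer;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.analysis.AbstractIndexAnalyzerProvider;

public class MyAnalyzerProvider extends AbstractIndexAnalyzerProvider<Analyzer> {

    private final Analyzer analyzer;

    public MyAnalyzerProvider(IndexSettings indexSettings, Environment env, String name, Settings settings) {
        super(indexSettings, name, settings);
        // build the Analyzer once; get() hands out the same instance
        analyzer = new MyAnalyzer();
    }

    @Override
    public Analyzer get() {
        return analyzer;
    }
}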
4 AnalysisPlugin
AnalysisHanLPPlugin is responsible for registering the various tokenizers. When defining an index you specify an Analyzer name for a field; for example, the text in the name field below is segmented by the analyzer registered under the name hanlp_standard before being written into the ElasticSearch index.
"name": {
"type": "text",
"analyzer": "hanlp_standard",
"fields": {
"raw": {
"type": "keyword"
}
}
},
AnalysisPlugin mainly exposes the following three methods, which supply CharFilters, TokenFilters and Tokenizers (the difference between the three is covered in the section on the index analysis process below):
/**
* Override to add additional {@link CharFilter}s. See {@link #requriesAnalysisSettings(AnalysisProvider)}
* how to on get the configuration from the index.
*/
default Map<String, AnalysisProvider<CharFilterFactory>> getCharFilters() {
return emptyMap();
}
/**
* Override to add additional {@link TokenFilter}s. See {@link #requriesAnalysisSettings(AnalysisProvider)}
* how to on get the configuration from the index.
*/
default Map<String, AnalysisProvider<TokenFilterFactory>> getTokenFilters() {
return emptyMap();
}
/**
* Override to add additional {@link Tokenizer}s. See {@link #requriesAnalysisSettings(AnalysisProvider)}
* how to on get the configuration from the index.
*/
default Map<String, AnalysisProvider<TokenizerFactory>> getTokenizers() {
return emptyMap();
}
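Putting the pieces together, a plugin class might register its factories roughly like this (again a sketch with hypothetical names, reusing MyTokenizerFactory and MyAnalyzerProvider from the earlier sketches; the map keys are the names that index settings and mappings later refer to, just as hanlp_standard is used in the mapping above):
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.elasticsearch.index.analysis.AnalyzerProvider;
import org.elasticsearch.index.analysis.TokenizerFactory;
import org.elasticsearch.indices.analysis.AnalysisModule.AnalysisProvider;
import org.elasticsearch.plugins.AnalysisPlugin;
import org.elasticsearch.plugins.Plugin;

public class MyAnalysisPlugin extends Plugin implements AnalysisPlugin {

    @Override
    public Map<String, AnalysisProvider<TokenizerFactory>> getTokenizers() {
        Map<String, AnalysisProvider<TokenizerFactory>> tokenizers = new HashMap<>();
        // "my_tokenizer" becomes usable in index settings and mappings
        tokenizers.put("my_tokenizer", MyTokenizerFactory::new);
        return tokenizers;
    }

    @Override
    public Map<String, AnalysisProvider<AnalyzerProvider<? extends Analyzer>>> getAnalyzers() {
        Map<String, AnalysisProvider<AnalyzerProvider<? extends Analyzer>>> analyzers = new HashMap<>();
        // "my_analyzer" plays the same role as "hanlp_standard" in the mapping above
        analyzers.put("my_analyzer", MyAnalyzerProvider::new);
        return analyzers;
    }
}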
How ElasticSearch loads Analyzer plugins
This part follows the relevant source code in the ElasticSearch startup path. The various Analyzers are initialized while the PluginsService is being created, in Node.java:
// load the contents of the modules and plugins directories
this.pluginsService = new PluginsService(tmpSettings, environment.configFile(), environment.modulesFile(), environment.pluginsFile(), classpathPlugins);
It appears that this is done through dedicated ClassLoaders: both modules and plugins are treated as bundles and hooked into the underlying Lucene via SPI. PluginsService.java:
// load modules
if (modulesDirectory != null) {
Set<Bundle> modules = getModuleBundles(modulesDirectory);
for (Bundle bundle : modules) {
modulesList.add(bundle.plugin);
}
seenBundles.addAll(modules);
}
// now, find all the ones that are in plugins/
if (pluginsDirectory != null) {
List<BundleCollection> plugins = findBundles(pluginsDirectory, "plugin");
for (final BundleCollection plugin : plugins) {
final Collection<Bundle> bundles = plugin.bundles();
for (final Bundle bundle : bundles) {
pluginsList.add(bundle.plugin);
}
seenBundles.addAll(bundles);
pluginsNames.add(plugin.name());
}
Loading the module/plugin jar files:
try (DirectoryStream<Path> jarStream = Files.newDirectoryStream(dir, "*.jar")) {
for (Path jar : jarStream) {
// normalize with toRealPath to get symlinks out of our hair
URL url = jar.toRealPath().toUri().toURL();
if (urls.add(url) == false) {
throw new IllegalStateException("duplicate codebase: " + url);
}
}
}
//...
// create a child to load the plugin in this bundle
ClassLoader parentLoader = PluginLoaderIndirection.createLoader(getClass().getClassLoader(), extendedLoaders);
ClassLoader loader = URLClassLoader.newInstance(bundle.urls.toArray(new URL[0]), parentLoader);
Once the PluginsService has loaded all the plugins, the Analysis-related plugins are filtered out and an AnalysisModule is created:
// filter the Analysis-related plugins out of the plugins service
AnalysisModule analysisModule = new AnalysisModule(this.environment, pluginsService.filterPlugins(AnalysisPlugin.class));
The names of the various tokenizers, filters and analyzers are registered here. (When you create an index and specify an analyzer for a field, the name you use is one that was registered here.)
NamedRegistry<AnalysisProvider<CharFilterFactory>> charFilters = setupCharFilters(plugins);
NamedRegistry<AnalysisProvider<TokenFilterFactory>> tokenFilters = setupTokenFilters(plugins, hunspellService);
NamedRegistry<AnalysisProvider<TokenizerFactory>> tokenizers = setupTokenizers(plugins);
NamedRegistry<AnalysisProvider<AnalyzerProvider<?>>> analyzers = setupAnalyzers(plugins);
//....
private NamedRegistry<AnalysisProvider<AnalyzerProvider<?>>> setupAnalyzers(List<AnalysisPlugin> plugins) {
NamedRegistry<AnalysisProvider<AnalyzerProvider<?>>> analyzers = new NamedRegistry<>("analyzer");
analyzers.register("default", StandardAnalyzerProvider::new);
analyzers.register("standard", StandardAnalyzerProvider::new);
//....
public StandardAnalyzerProvider(IndexSettings indexSettings, Environment env, String name, Settings settings) {
//....
standardAnalyzer = new StandardAnalyzer(stopWords);
standardAnalyzer.setVersion(version);
}
Here is a passage from An Introduction to Information Retrieval explaining the concepts of token, type, term and dictionary. (The "type" here is not the type of an ElasticSearch index; index types are being removed in later versions anyway.)
A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing. A type is the class of all tokens containing the same character sequence. A term is a (perhaps normalized) type that is included in the IR system's dictionary.
For example, if the document to be indexed is "to sleep perchance to dream", then there are 5 tokens, but only 4 types (since there are 2 instances of to). However, if to is omitted from the index (as a stop word) then there will be only 3 terms: sleep, perchance, and dream.
The index analysis process
Personally I think tokenization and analysis overlap. In Lucene, analysis is defined as the process of converting strings into Tokens, and it revolves around four main classes:
The analysis package provides the mechanism to convert Strings and Readers into tokens that can be indexed by Lucene. There are four main classes in the package from which all analysis processes are derived. These are:
Analyzer
CharFilter
Tokenizer
TokenFilter
The difference between the four, using Chinese text as an example:
Take the Chinese sentence "这是一篇关于ElasticSearch Analyzer的文章" ("this is an article about the ElasticSearch Analyzer"). A CharFilter can filter individual characters out of the raw text. The Tokenizer segments the sentence into words: 这是 / 一篇 / 关于 / ElasticSearch / Analyzer / 的 / 文章; each resulting piece is a Token. A TokenFilter then filters out certain Tokens.
The Analyzer is a factory for analysis chains. Analyzers don't process text, Analyzers construct CharFilters, Tokenizers, and/or TokenFilters that process text. An Analyzer has two tasks: to produce TokenStreams that accept a reader and produces tokens, and to wrap or otherwise pre-process Reader objects.
See the Lucene 7.6.0 documentation for details. In Lucene, the Analyzer does not process text itself; it only constructs the CharFilters, Tokenizer and TokenFilters and lets them process the text.
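As an illustration of how the four classes fit together, here is a sketch (not code from ElasticSearch or the HanLP plugin) of an Analyzer that strips HTML with a CharFilter, tokenizes with StandardTokenizer, and then lowercases and removes English stop words with TokenFilters:
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class HtmlAwareAnalyzer extends Analyzer {

    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        // CharFilter: clean up the raw character stream before tokenization
        return new HTMLStripCharFilter(reader);
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();        // Tokenizer: text -> Tokens
        TokenStream sink = new LowerCaseFilter(source);    // TokenFilter: normalize case
        sink = new StopFilter(sink, EnglishAnalyzer.getDefaultStopSet()); // TokenFilter: drop stop words
        return new TokenStreamComponents(source, sink);
    }
}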
References
Lucene 7.6.0 Analysis package official documentation
ElasticSearch 6.3.2 source code
elasticsearch-analysis-hanlp, a HanLP-based Chinese word segmentation plugin
Original post: https://www.cnblogs.com/hapjin/p/10151887.html