Stanford Named Entity Recognizer (NER)
The following content is from: https://nlp.stanford.edu/software/CRF-NER.html
About
Stanford NER is a Java implementation of a Named Entity Recognizer. Named Entity Recognition (NER) labels sequences of words in a text which are the names of things, such as person and company names, or gene and protein names. It comes with well-engineered feature extractors for Named Entity Recognition, and many options for defining feature extractors. Included with the download are good named entity recognizers for English, particularly for the 3 classes (PERSON, ORGANIZATION, LOCATION), and we also make available on this page various other models for different languages and circumstances, including models trained on just the CoNLL 2003 English training data.
Stanford NER is also known as CRFClassifier. The software provides a general implementation of (arbitrary order) linear chain Conditional Random Field (CRF) sequence models. That is, by training your own models on labeled data, you can actually use this code to build sequence models for NER or any other task. (CRF models were pioneered by Lafferty, McCallum, and Pereira (2001); see Sutton and McCallum (2006) or Sutton and McCallum (2010) for more comprehensible introductions.)
The original CRF code is by Jenny Finkel. The feature extractors are by Dan Klein, Christopher Manning, and Jenny Finkel. Much of the documentation and usability is due to Anna Rafferty. More recent code development has been done by various Stanford NLP Group members.
Stanford NER is available for download, licensed under the GNU General Public License (v2 or later). Source is included. The package includes components for command-line invocation (look at the shell scripts and batch files included in the download), running as a server (look at NERServer in the sources jar file), and a Java API (look at the simple examples in the NERDemo.java file included in the download, and then at the javadocs). Stanford NER code is dual licensed (in a similar manner to MySQL, etc.). Open source licensing is under the full GPL, which allows many free uses. For distributors of proprietary software, commercial licensing is available. If you don't need a commercial license, but would like to support maintenance of these tools, we welcome gifts.
Citation
The CRF sequence models provided here do not precisely correspond to any published paper, but the correct paper to cite for the model and software is:
Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pp. 363-370. http://nlp.stanford.edu/~manning/papers/gibbscrf3.pdf
The software provided here is similar to the baseline local+Viterbi model in that paper, but adds new distributional similarity based features (in the -distSim classifiers). Distributional similarity features improve performance but the models require somewhat more memory. Our big English NER models were trained on a mixture of CoNLL, MUC-6, MUC-7 and ACE named entity corpora, and as a result the models are fairly robust across domains.
Getting started
You can try out Stanford NER CRF classifiers or Stanford NER as part of Stanford CoreNLP on the web, to understand what Stanford NER is and whether it will be useful to you.
To use the software on your computer, download the zip file. You then unzip the file by either double-clicking on the zip file, using a program for unpacking zip files, or by using the unzip command. This should create a stanford-ner folder. There is no installation procedure; you should be able to run Stanford NER from that folder. Normally, Stanford NER is run from the command line (i.e., shell or terminal). Current releases of Stanford NER require Java 1.8 or later. Either make sure you have or get Java 8, or consider running an earlier version of the software (versions through 3.4.1 support Java 6 and 7).
NER GUI
Providing java is on your PATH, you should be able to run an NER GUI demonstration by just clicking. It might work to double-click on the stanford-ner.jar archive but this may well fail as the operating system does not give Java enough memory for our NER system, so it is safer to instead double click on the ner-gui.bat icon (Windows) or ner-gui.sh (Linux/Unix/MacOSX). Then, using the top option from the Classifier menu, load a CRF classifier from the classifiers directory of the distribution. You can then either load a text file or web page from the File menu, or decide to use the default text in the window. Finally, you can now named entity tag the text by pressing the Run NER button.
Single CRF NER Classifier from command-line
From a command line, you need to have java on your PATH and the stanford-ner.jar file in your CLASSPATH. (The way of doing this depends on your OS/shell.) The supplied ner.bat and ner.sh should work to allow you to tag a single file, when running from inside the Stanford NER folder. For example, for Windows:
ner file
This corresponds to the full command:
java -mx600m -cp "*;lib\*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz -textFile sample.txt
Or on Unix/Linux you should be able to parse the test file in the distribution directory with the command:
java -mx600m -cp "*:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz -textFile sample.txt
Here's an output option that will print out entities and their class to the first two columns of a tab-separated columns output file:
java -mx600m -cp "*;lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz -outputFormat tabbedEntities -textFile sample.txt > sample.tsv
(The ; in the classpath is the Windows separator; on Unix/Linux, use : as in the preceding command.)
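The tab-separated file written by that command can be read back with a few lines of Java. The following is a minimal sketch, not part of Stanford NER: the class name TabbedEntities and its parse method are hypothetical, and it relies only on what is stated above, namely that the entity is in the first column and its class in the second. Skipping rows whose first column is empty is an assumption about how non-entity text is laid out; any further columns are ignored.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Stanford NER: reads tabbedEntities-style rows.
class TabbedEntities {

    /** Returns {entity, class} pairs taken from the first two tab-separated columns. */
    static List<String[]> parse(List<String> lines) {
        List<String[]> entities = new ArrayList<>();
        for (String line : lines) {
            // A limit of -1 keeps trailing empty columns instead of dropping them.
            String[] cols = line.split("\t", -1);
            // Assumption: rows that carry no entity leave the first column empty.
            if (cols.length < 2 || cols[0].trim().isEmpty()) {
                continue;
            }
            entities.add(new String[] {cols[0].trim(), cols[1].trim()});
        }
        return entities;
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the file name used in the command above.
        String file = args.length > 0 ? args[0] : "sample.tsv";
        for (String[] entity : parse(Files.readAllLines(Paths.get(file)))) {
            System.out.println(entity[1] + ": " + entity[0]);
        }
    }
}
```

Running it with the path to sample.tsv prints one "CLASS: entity" line per recognized entity.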
Full Stanford NER functionality
This standalone distribution also allows access to the full NER capabilities of the Stanford CoreNLP pipeline. These capabilities can be accessed via the NERClassifierCombiner class. NERClassifierCombiner allows for multiple CRFs to be used together, and has options for recognizing numeric sequence patterns and time patterns with the rule-based NER of SUTime.
To use NERClassifierCombiner at the command-line, the jars in the lib directory and stanford-ner.jar must be in the CLASSPATH. Here is an example command:
java -mx1g -cp "*:lib/*" edu.stanford.nlp.ie.NERClassifierCombiner -textFile sample.txt -ner.model classifiers/english.all.3class.distsim.crf.ser.gz,classifiers/english.conll.4class.distsim.crf.ser.gz,classifiers/english.muc.7class.distsim.crf.ser.gz
The one difference you should see from above is that Sunday is now recognized as a DATE.
Programmatic use via API
You can call Stanford NER from your own code. The file NERDemo.java included in the distribution illustrates several ways of calling the system programmatically. We suggest that you start from there, and then look at the javadocs, etc. as needed.
Programmatic use via a service
Stanford NER can also be set up to run as a server listening on a socket.
Questions
You can look at a PowerPoint Introduction to NER and the Stanford NER package [ppt] [pdf]. There is also a list of Frequently Asked Questions (FAQ), with answers! This includes some information on training models. Further documentation is provided in the included README.txt and in the javadocs.
Have a support question? Ask us on Stack Overflow using the tag stanford-nlp.
Feedback and bug reports / fixes can be sent to our mailing lists.
Mailing Lists
We have 3 mailing lists for the Stanford Named Entity Recognizer, all of which are shared with other JavaNLP tools (with the exclusion of the parser). Each address is at @lists.stanford.edu:
You have to subscribe to be able to use this list. Join the list via this webpage or by emailing java-nlp-user-join@lists.stanford.edu. (Leave the subject and message body empty.) You can also look at the list archives.
Download
Download Stanford Named Entity Recognizer version 3.9.2
The download is a 151M zipped file (mainly consisting of classifier data objects). If you unpack that file, you should have everything needed for English NER (or use as a general CRF). It includes batch files for running under Windows or Unix/Linux/MacOSX, a simple GUI, and the ability to run as a server. Stanford NER requires Java v1.8+. If you want to use Stanford NER for other languages, you'll also need to download model files for those languages; see further below.
Extensions: Packages by others using Stanford NER
For some (computer) languages, there are more up-to-date interfaces to Stanford NER available by using it inside Stanford CoreNLP, and you are better off getting those from the CoreNLP page and using them.
- Apache Tika: Named Entity Recognition (NER) with Tika.
- JavaScript/npm:
  - Pranav Herur has written ner-server. Source on GitHub.
  - Nikhil Srivastava has written ner. Source on GitHub.
  - Varun Chatterji has written stanford-ner. Source on GitHub.
- .NET/F#/C#: Sergey Tihon has ported Stanford NER to F# (and other .NET languages, such as C#), using IKVM. See also pages on: GitHub and NuGet.
- Perl: Kieren Diment has written Text-NLP-Stanford-EntityExtract, a Perl module that provides an interface to Stanford NER running as a server.
- PHP: Patrick Schur in 2017 wrote PHP wrapper for Stanford POS and NER taggers. Also on packagist. Second choice: PHP-Stanford-NLP. Supports POS Tagger, NER, Parser. By Anthony Gentile (agentile).
- Python:
  - Dat Hoang wrote pyner, a Python interface to Stanford NER. [Old version.]
  - NLTK (2.0+) contains an interface to Stanford NER written by Nitin Madnani: documentation (note: set the character encoding or you get ASCII by default!), code, on GitHub.
  - scrapy-corenlp, a Python Scrapy (web page scraping) middleware by Jithesh E. J. PyPI.
- Ruby: tiendung has written a Ruby Binding for the Stanford POS tagger and Named Entity Recognizer.
- UIMA: Florian Laws made a Stanford NER UIMA annotator using a modified version of Stanford NER, which is available on his homepage. [Old version.]
Models
Included with Stanford NER are a 4 class model trained on the CoNLL 2003 eng.train, a 7 class model trained on the MUC 6 and MUC 7 training data sets, and a 3 class model trained on both data sets and some additional data (including ACE 2002 and limited amounts of in-house data) on the intersection of those class sets. (The training data for the 3 class model does not include any material from the CoNLL eng.testa or eng.testb data sets, nor any of the MUC 6 or 7 test or devtest datasets, nor Alan Ritter's Twitter NER data, so all of these remain valid tests of its performance.)
| 3 class: | Location, Person, Organization |
| 4 class: | Location, Person, Organization, Misc |
| 7 class: | Location, Person, Organization, Money, Percent, Date, Time |
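The phrase "the intersection of those class sets" can be checked against the table: the 3 class set is exactly what the 4 class and 7 class sets have in common. A small illustrative sketch follows; the class name ClassSets and its members are hypothetical and not part of Stanford NER, and the labels are copied from the table above.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Illustration only: the label sets are copied from the table above.
class ClassSets {
    static final Set<String> FOUR_CLASS = new LinkedHashSet<>(
            Arrays.asList("Location", "Person", "Organization", "Misc"));
    static final Set<String> SEVEN_CLASS = new LinkedHashSet<>(
            Arrays.asList("Location", "Person", "Organization", "Money", "Percent", "Date", "Time"));

    /** Returns the elements of a that also occur in b, keeping the order of a. */
    static Set<String> intersection(Set<String> a, Set<String> b) {
        Set<String> result = new LinkedHashSet<>(a);
        result.retainAll(b);
        return result;
    }

    public static void main(String[] args) {
        // prints [Location, Person, Organization]
        System.out.println(intersection(FOUR_CLASS, SEVEN_CLASS));
    }
}
```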
These models each use distributional similarity features, which provide considerable performance gain at the cost of increasing their size and runtime. We also have models that are the same except without the distributional similarity features. You can find them in our English models jar. You can either unpack the jar file or add it to the classpath; if you add the jar file to the classpath, you can then load the models from the path edu/stanford/nlp/models/... You can run jar -tf <jar-file> to get the list of files in the jar file.
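If the jar tool is not at hand, the same listing can be produced from Java with the standard java.util.zip classes, since a jar file is a zip archive. This is a rough sketch using only the JDK; the class name ListJarModels is hypothetical, and the prefix passed in main is the models path mentioned above.

```java
import java.io.IOException;
import java.util.List;
import java.util.stream.Collectors;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper: a JDK-only equivalent of filtering the output of "jar -tf".
class ListJarModels {

    /** Returns the sorted names of all file entries in the jar that start with prefix. */
    static List<String> list(String jarPath, String prefix) throws IOException {
        try (ZipFile jar = new ZipFile(jarPath)) {
            return jar.stream()
                    .map(ZipEntry::getName)
                    // Directory entries end with "/" and are skipped.
                    .filter(name -> name.startsWith(prefix) && !name.endsWith("/"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to a models jar.
        for (String name : list(args[0], "edu/stanford/nlp/models/")) {
            System.out.println(name);
        }
    }
}
```

Each printed name is a path that can be given to -loadClassifier once the jar is on the classpath, as described above.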
Also available are caseless versions of these models, better for use on texts that are mainly lower or upper case, rather than follow the conventions of standard English.
CoreNLP models jars download page
Important note: There was a problem with the v3.6.0 English Caseless NER model. See this page.
German
A German NER model is available, based on work by Manaal Faruqui and Sebastian Padó. You can find it in the CoreNLP German models jar. For citation and other information relating to the German classifiers, please see Sebastian Pado's German NER page (but the models there are now many years old; you should use the better models that we have!). It is a 4 class IOB1 classifier (see, e.g., Memory-Based Shallow Parsing by Erik F. Tjong Kim Sang). The tags given to words are: I-LOC, I-PER, I-ORG, I-MISC, B-LOC, B-PER, B-ORG, B-MISC, O. It is trained over the CoNLL 2003 data with distributional similarity classes built from the Huge German Corpus.
CoreNLP models jars download page
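To make the IOB1 tag set above concrete: in IOB1, I-X marks a token inside an entity of type X, and B-X is used only for the first token of an entity that directly follows another entity of the same type. The sketch below turns a tag sequence into entity spans under that definition. The class name Iob1Decoder and the "TYPE:start-end" span notation are hypothetical choices for illustration and are not part of Stanford NER.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: groups IOB1 tags into entity spans.
class Iob1Decoder {

    /** Returns spans as "TYPE:start-end", with 0-based token indices and end exclusive. */
    static List<String> decode(List<String> tags) {
        List<String> spans = new ArrayList<>();
        String currentType = null; // type of the entity being built, or null outside entities
        int start = -1;
        for (int i = 0; i <= tags.size(); i++) {
            // A trailing "O" sentinel closes an entity that runs to the end of the input.
            String tag = i < tags.size() ? tags.get(i) : "O";
            String type = tag.equals("O") ? null : tag.substring(2);
            // A new entity starts on B-X, or on I-X when the previous token was O or another type.
            boolean begins = type != null && (tag.startsWith("B-") || !type.equals(currentType));
            boolean ends = currentType != null && (type == null || begins);
            if (ends) {
                spans.add(currentType + ":" + start + "-" + i);
                currentType = null;
            }
            if (begins) {
                currentType = type;
                start = i;
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        // prints [PER:0-2, LOC:3-4, LOC:4-5]
        System.out.println(decode(Arrays.asList("I-PER", "I-PER", "O", "I-LOC", "B-LOC")));
    }
}
```

The B-LOC at index 4 is what separates two adjacent locations; with I-LOC in that position the two tokens would be read as one entity.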
Here are a couple of commands using these models, two sample files, and a couple of notes. Running on TSV files: the models were saved with options for testing on German CoNLL NER files. While the models use just the surface word form, the input reader expects the word in the first column and the class in the fifth column (1-indexed columns). You can either make the input like that or else change the expectations with, say, the option -map "word=0,answer=1" (0-indexed columns). These models were also trained on data with straight ASCII quotes and BIO entity tags. Also, be careful of the text encoding: The default is Unicode; use -encoding iso-8859-15 if the text is in 8-bit encoding.
TSV mini test file: german-ner.tsv
Text mini test file: german-ner.txt
java -cp "*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier edu/stanford/nlp/models/ner/german.conll.hgc_175m_600.crf.ser.gz -testFile german-ner.tsv
java -cp "*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier edu/stanford/nlp/models/ner/german.conll.hgc_175m_600.crf.ser.gz -tokenizerOptions latexQuotes=false -textFile german-ner.txt
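The difference between the 1-indexed description (word in column 1, class in column 5) and the 0-indexed -map syntax is easy to get wrong, so here is a small illustration of how such a column map selects fields from a TSV row. It is only a sketch of the idea, not Stanford NER's own parsing code; the class name ColumnMap is hypothetical. In 0-indexed terms the default layout described above corresponds to word=0,answer=4.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: applies a "name=index" column map to one tab-separated row.
class ColumnMap {

    /** Parses a spec such as "word=0,answer=1" into field name to 0-based column index. */
    static Map<String, Integer> parse(String spec) {
        Map<String, Integer> columns = new LinkedHashMap<>();
        for (String pair : spec.split(",")) {
            String[] nameAndIndex = pair.split("=");
            columns.put(nameAndIndex[0].trim(), Integer.parseInt(nameAndIndex[1].trim()));
        }
        return columns;
    }

    /** Picks the mapped columns out of a tab-separated row. */
    static Map<String, String> read(String row, Map<String, Integer> columns) {
        String[] cells = row.split("\t", -1);
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : columns.entrySet()) {
            fields.put(entry.getKey(), cells[entry.getValue()]);
        }
        return fields;
    }

    public static void main(String[] args) {
        // Two-column input, matching -map "word=0,answer=1".
        System.out.println(read("Berlin\tI-LOC", parse("word=0,answer=1")));
        // Five-column input: the class sits in the fifth column, which is index 4.
        // The three middle cells are placeholders, not real CoNLL values.
        System.out.println(read("Berlin\tx\ty\tz\tI-LOC", parse("word=0,answer=4")));
    }
}
```

Both calls yield the same word and answer fields, which is the point of the option: the map describes where the columns are, not what they contain.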
Spanish
From version 3.4.1 forward, we have a Spanish model available for NER. It is included in the Spanish CoreNLP models jar.
CoreNLP models jars download page
Chinese
We also provide Chinese models built from the Ontonotes Chinese named entity data. There are two models, one using distributional similarity clusters and one without. These are designed to be run on word-segmented Chinese. So, if you want to use these on normal Chinese text, you will first need to run Stanford Word Segmenter or some other Chinese word segmenter, and then run NER on the output of that!
CoreNLP models jars download page
Online Demo
We have an online demo of several of our NER models. Special thanks to Dat Hoang, who provided the initial version. Note that the online demo demonstrates single CRF models; in order to see the effect of the time annotator or the combined models, see CoreNLP.
Release History
| 3.9.2 | 2018-10-16 | Updated for compatibility |
| 3.9.1 | 2018-02-27 | KBP ner models for Chinese and Spanish |
| 3.8.0 | 2017-06-09 | Updated for compatibility |
| 3.7.0 | 2016-10-31 | Improvements to Chinese and German NER |
| 3.6.0 | 2015-12-09 | Updated for compatibility |
| 3.5.2 | 2015-04-20 | synch standalone and CoreNLP functionality |
| 3.5.1 | 2015-01-29 | Substantial accuracy improvements |
| 3.5.0 | 2014-10-26 | Upgrade to Java 8 |
| 3.4.1 | 2014-08-27 | Added Spanish models |
| 3.4 | 2014-06-16 | Fix serialization of new models |
| 3.3.1 | 2014-01-04 | Bugfix release |
| 3.3.0 | 2013-11-12 | Updated for compatibility |
| 3.2.0 | 2013-06-20 | Improved line by line handling |
| 1.2.8 | 2013-04-04 | -nthreads option |
| 1.2.7 | 2012-11-11 | Add Chinese model, include Wikipedia data in 3-class English model |
| 1.2.6 | 2012-07-09 | Minor bug fixes |
| 1.2.5 | 2012-05-22 | Fix encoding issue |
| 1.2.4 | 2012-04-07 | Caseless versions of models supported |
| 1.2.3 | 2012-01-06 | Minor bug fixes |
| 1.2.2 | 2011-09-14 | Improved thread safety |
| 1.2.1 | 2011-06-19 | Models reduced in size but on average improved in accuracy (improved distsim clusters) |
| 1.2 | 2011-05-16 | Normal download includes 3, 4, and 7 class models. Updated for compatibility with other software releases. |
| 1.1.1 | 2009-01-16 | Minor bug and usability fixes, and changed API (in particular the methods to classify and output tagged text) |
| 1.1 | 2008-05-07 | Additional feature flags, various code updates |
| 1.0 | 2006-09-18 | Initial release |