Natural Language Processing (NLP): Don't Reinvent the Wheel
Introduction
Natural language processing (NLP) is an intimidating name for an intimidating field. Generating useful insight from unstructured text is hard, and there are countless techniques and algorithms out there, each with their own use-cases and complexities. As a developer with minimal NLP exposure, it can be difficult to know which methods to use, and how to implement them. I’m here to help.
If I were to offer perfect results for minimal effort, you would be right to be skeptical. Instead, using the 80/20 Principle, I’ll show you how to quickly (20%) deliver solutions, without significantly sacrificing outcomes (80%).
“The 80/20 Principle asserts that a minority of causes, inputs, or efforts usually leads to a majority of results, outputs, or rewards”
-Richard Koch, author of The 80/20 Principle
How exactly will we achieve this goal? With some fantastic Python libraries! Instead of reinventing the wheel, we may stand on the shoulders of giants and innovate quickly. With pre-tested implementations and pre-trained models, we will focus on applying these methods and creating value.
This article is intended for developers looking to quickly integrate natural language processing into their projects. With an emphasis on ease of use and rapid results comes the downside of reduced performance. In my experience, 80% of cutting-edge is plenty for projects, but look elsewhere for NLP research :)
Without further ado, let’s begin!
What is NLP?
Natural language processing is a subfield of linguistics, computer science, and artificial intelligence, allowing for the automatic processing of text by software. NLP gives machines the ability to read, understand, and respond to messy, unstructured text.
People often treat NLP as a subset of machine learning, but the reality is more nuanced.
(Image by Author)
Some NLP tools rely on machine learning, and some even use deep learning. However, these methods often rely on large datasets and are difficult to implement. Instead, we will focus on simpler, rule-based methods to speed up the development cycle.
Terminology
Starting with the smallest unit of data, a character is a single letter, number, or punctuation mark. A word is a list of characters, and a sentence is a list of words. A document is a list of sentences, and a corpus is a list of documents.
Pre-Processing
Pre-processing is perhaps the most important step in an NLP project, and involves cleaning your inputs so your models can ignore the noise and focus on what matters most. A strong pre-processing pipeline will improve the performance of all your models, so I cannot stress its value enough.
Below are some common pre-processing steps; a short code sketch follows the list:
Segmentation: Given a long list of characters, we might separate documents by white space, sentences by periods, and words by spaces. Implementation details will vary based on the dataset.
Make Lowercase: Capitalization generally does not add value, and makes string comparison trickier. Just make everything lowercase.
Remove Punctuation: We may want to remove commas, quotes, and other punctuation that does not add to the meaning.
Remove Stopwords: Stopwords are words like ‘she’, ‘the’, and ‘of’ that do not add to the meaning of a text, and can distract from the more relevant keywords.
Remove Other: Depending on your application, you may want to remove certain words that do not add value. For example, if evaluating course reviews, words like ‘professor’ and ‘course’ may not be useful.
Stemming/Lemmatization: Both stemming and lemmatization generate the root form of inflected words (ex: ‘running’ to ‘run’). Stemming is faster, but does not guarantee the root is an English word. Lemmatization uses a corpus to ensure the root is a word, at the expense of speed.
Part of Speech Tagging: POS tagging marks words with their part of speech (nouns, verbs, prepositions) based on definition and context. For example, we can focus on nouns for keyword extraction.
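To make these steps concrete, here is a minimal sketch (not from the original article) that chains a few of them together with NLTK; the sample sentence and the exact download calls are illustrative, and a real pipeline would be tuned to the dataset:

```python
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads of the NLTK resources used below
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

def preprocess(text):
    # Segmentation: split the raw text into word tokens
    tokens = nltk.word_tokenize(text)
    # Make lowercase and remove punctuation
    tokens = [t.lower() for t in tokens if t not in string.punctuation]
    # Remove stopwords
    stop_words = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stop_words]
    # Lemmatization: reduce each token to a dictionary root form
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(t) for t in tokens]

print(preprocess("The students were running to their courses."))
# roughly: ['student', 'running', 'course']
```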
For a more thorough introduction to these concepts, check out this amazing guide:
These steps are the foundation of a successful pre-processing pipeline. Depending on your dataset and task, you may skip certain steps or add new ones. Manually observe your data through pre-processing, and correct issues as they arise.
Python Libraries
Let’s take a look at a couple of leading Python libraries for NLP. These tools will handle most of the heavy lifting during, and especially after, pre-processing.
NLTK
The Natural Language Tool Kit is the most widely-used NLP library for Python. Developed at UPenn for academic purposes, NLTK has a plethora of features and corpora. NLTK is great for playing with data and running pre-processing.
Here is an example from the NLTK website showing how simple it is to tokenize a sentence and tag parts of speech.
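The embedded snippet did not survive the copy, so here is a small stand-in along the lines of the example on the NLTK site (the sentence is illustrative):

```python
import nltk

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

sentence = "At eight o'clock on Thursday morning, Arthur didn't feel very good."

# Segment the sentence into word tokens, then tag each token's part of speech
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)
print(tagged[:5])
# roughly: [('At', 'IN'), ('eight', 'CD'), ("o'clock", 'NN'), ('on', 'IN'), ('Thursday', 'NNP')]
```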
SpaCy
SpaCy is a modern and opinionated package. While NLTK has multiple implementations of each feature, SpaCy keeps only the best-performing ones. SpaCy supports a wide range of features; read the docs for more details:
In just a few lines, we are able to perform Named Entity Recognition with SpaCy. Many other tasks can be accomplished quickly using the SpaCy API.
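The original gist is not reproduced here; a minimal sketch of what that looks like follows. It assumes the small English model has already been installed with `python -m spacy download en_core_web_sm`, and the sentence is just an example:

```python
import spacy

# Load the small English pipeline (installed separately, see above)
nlp = spacy.load("en_core_web_sm")

doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

# Each recognized entity carries its text span and a label such as ORG, GPE, or MONEY
for ent in doc.ents:
    print(ent.text, ent.label_)
```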
GenSim
Unlike NLTK and SpaCy, GenSim specifically tackles the problem of information retrieval (IR). Developed with an emphasis on memory management, GenSim contains many models for document similarity, including Latent Semantic Indexing, Word2Vec, and FastText.
Below is an example of a pre-trained GenSim Word2Vec model that finds word similarities. Without worrying about the messy details, we can quickly get results.
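As a stand-in for the missing snippet, here is a sketch using GenSim's downloader API; the GloVe vectors below substitute for the pre-trained Word2Vec model mentioned above, and the query word is arbitrary:

```python
import gensim.downloader as api

# Download and load a small set of pre-trained word vectors (~65 MB on first run)
model = api.load("glove-wiki-gigaword-50")

# The words closest to "coffee" in the embedding space, with similarity scores
print(model.most_similar("coffee", topn=5))
```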
And More…
This list is by no means comprehensive, but covers a range of features and use-cases. I recommend checking this repository for more tools and references.
Applications
Now that we have discussed pre-processing methods and Python libraries, let’s put it all together with a few examples. For each, I’ll cover a couple of NLP algorithms, pick one based on our rapid development goals, and create a simple implementation using one of the libraries.
Application #1: Pre-Processing
Pre-processing is a critical part of any NLP solution, so let’s see how we can speed up the process with Python libraries. In my experience, NLTK has all the tools we need, with customization for unique use cases. Let’s load a sample corpus.
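The loading code is not shown in this copy; a minimal sketch with NLTK's built-in Brown corpus follows. Treating each category as one document is my assumption about how the corpus was organized for this example:

```python
import nltk
from nltk.corpus import brown

nltk.download("brown")

# One document per Brown category: 'adventure', 'editorial', 'news', ...
corpus = {cat: list(brown.words(categories=cat)) for cat in brown.categories()}
print(len(corpus), "documents")
```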
Following the pipeline defined above, we can use NLTK to implement segmentation, remove punctuation and stopwords, perform lemmatization, and more. Look how easy it is to remove stopwords:
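The embedded code is missing here, but a minimal equivalent with NLTK's stopword list looks like this (the `tokens` list stands in for the output of the segmentation step):

```python
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))

tokens = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]

# Keep only the tokens that are not stopwords
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```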
The entire pre-processing pipeline took me less than 40 lines of Python. See the full code here. Remember, this is a generalized example, and you should modify the process as needed for your specific use case.
Application #2: Document Clustering
Document clustering is a common task in natural language processing, so let’s discuss some ways to do it. The general idea here is to assign each document a vector representing the topics discussed:
(Image by Author)
If the vectors are two-dimensional, we can visualize the documents, like above. In this example, we see documents A and B are closely related, while D and F are loosely related. Using a distance metric, we can calculate similarity even when these vectors are three, one hundred, or one thousand dimensional.
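For example, cosine similarity scores the angle between two document vectors regardless of their dimensionality; a tiny illustration with made-up topic vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    # 1.0 means the vectors point the same way; 0.0 means they are unrelated
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

doc_a = np.array([0.9, 0.1, 0.0])  # hypothetical topic weights
doc_b = np.array([0.8, 0.2, 0.1])
doc_d = np.array([0.0, 0.3, 0.9])

print(cosine_similarity(doc_a, doc_b))  # high -> closely related
print(cosine_similarity(doc_a, doc_d))  # low  -> loosely related
```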
The next question is how to construct these vectors for each document, using the unstructured text input. Here there are a few options, from simplest to most complex:
Bag of Words: Assign each unique word a dimension. The vector for a given document counts how often each word occurs.
Term Frequency — Inverse Document Frequency (TF-IDF): Scale the Bag of Words representation by how common a word is in other documents. If two documents share a rare word, they are more similar than if they share a common one.
Latent Semantic Indexing (LSI): Bag of Words and TF-IDF can create high-dimensional vectors, which makes distance measures less accurate. LSI collapses these vectors to a more manageable size while minimizing information loss.
Word2Vec: Using a neural network, learn word associations from a large text corpus. Then add up the vectors for each word to get a document vector.
Doc2Vec: Building upon Word2Vec but using a better method to approximate the document vector from its list of word vectors.
Word2Vec and Doc2Vec are quite complicated and require large datasets to learn word embeddings. We could use pre-trained models, but they may not scale well to tasks within niche fields. Instead, we will use Bag of Words, TF-IDF, and LSI.
Now to choose our library. GenSim is specifically built for this task and contains easy implementations of all three algorithms, so let’s use GenSim.
For this example, let’s use the Brown corpus again. It has 15 documents, one for each category of text, such as ‘adventure’, ‘editorial’, ‘news’, etc. After running our NLTK pre-processing routine, we can begin applying the GenSim models.
First, we create a dictionary mapping tokens to unique indexes.
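A sketch of that step with GenSim, assuming `processed_docs` is the list of token lists produced by the pre-processing routine:

```python
from gensim.corpora import Dictionary

# Map every unique token to an integer id
dictionary = Dictionary(processed_docs)
print(list(dictionary.token2id.items())[:5])
```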
Next, we iteratively apply Bag of Words, Term Frequency — Inverse Document Frequency, and Latent Semantic Indexing:
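Roughly, the chain looks like this; the number of topics is an arbitrary choice for illustration:

```python
from gensim.models import TfidfModel, LsiModel

# Bag of Words: per-document counts of each dictionary token
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

# TF-IDF: down-weight tokens that are common across all documents
tfidf_model = TfidfModel(bow_corpus)
tfidf_corpus = tfidf_model[bow_corpus]

# LSI: collapse the sparse TF-IDF vectors into a small dense topic space
lsi_model = LsiModel(tfidf_corpus, id2word=dictionary, num_topics=10)
lsi_corpus = lsi_model[tfidf_corpus]
```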
In just ~10 lines of Python, we handled three separate models, and extracted vector representations for our documents. Using cosine similarity for vector comparison, we can find the most similar documents.
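One way to run that comparison is GenSim's built-in similarity index, which handles the cosine math for you (a sketch, reusing the variables defined above; the first document is whichever one your corpus lists first):

```python
from gensim.similarities import MatrixSimilarity

index = MatrixSimilarity(lsi_corpus, num_features=lsi_model.num_topics)

# Similarity of the first document against every document, highest scores first
query = lsi_model[tfidf_model[bow_corpus[0]]]
sims = sorted(enumerate(index[query]), key=lambda x: -x[1])
print(sims[:3])
```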
And just like that, we have results! Adventure texts are most similar to fiction and romance, while editorials are similar to news and government. It checks out. Check the full code here.
Application #3: Sentiment Analysis
Sentiment analysis is the interpretation of unstructured text as positive, negative, or neutral. Sentiment analysis is a useful tool for analyzing reviews, measuring brand, building AI chatbots, and more.
Unlike document clustering, where pre-processing was applied, we do not use pre-processing in sentiment analysis. The punctuation, flow, and context of a passage can reveal a lot about the sentiment, so we do not want to remove them. Instead, we jump straight into the models.
Keeping things simple and effective, I recommend using pattern-based sentiment analysis. By searching for specific keywords, sentence structure, and punctuation marks, these models measure the text’s polarity. Here are two libraries with built-in sentiment analyzers:
VADER Sentiment Analysis:
VADER stands for Valence Aware Dictionary and sEntiment Reasoner, and is an extension of NLTK for sentiment analysis. It uses patterns to calculate sentiment, and works especially well with emojis and texting slang. It’s also super easy to implement.
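A minimal sketch of the NLTK flavor of VADER (the review text is made up; the standalone `vaderSentiment` package exposes the same analyzer):

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")

analyzer = SentimentIntensityAnalyzer()

# 'compound' is the overall score in [-1, 1]; neg/neu/pos are proportions
print(analyzer.polarity_scores("This course was surprisingly great!! 😍"))
```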
TextBlob Sentiment Analysis:
A similar tool is TextBlob for sentiment analysis. TextBlob is actually a versatile library similar to NLTK and SpaCy. Regarding its sentiment analysis tool, it differs from VADER in reporting both polarity and subjectivity. From my personal experience, I prefer VADER, but each has its own strengths and weaknesses. TextBlob is also exceedingly easy to implement:
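A corresponding sketch with TextBlob, again on an invented sentence:

```python
from textblob import TextBlob

blob = TextBlob("The lectures were dull, but the projects were genuinely fun.")

# polarity is in [-1, 1] (negative -> positive); subjectivity is in [0, 1] (fact -> opinion)
print(blob.sentiment)
```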
Note: Pattern based models do not perform well on such small texts as in the examples above. I recommend sentiment analysis on texts averaging four sentences. For a quick demonstration of this, refer to the Jupyter Notebook.
Other Applications
Here are a couple of additional topics and some useful algorithms and tools to accelerate your development.
Keyword Extraction: Named Entity Recognition (NER) using SpaCy, Rapid Automatic Keyword Extraction (RAKE) using rake-nltk
Text Summarization: TextRank (similar to PageRank) using PyTextRank SpaCy extension, TF-IDF using GenSim
Spell Check: PyEnchant, SymSpell Python ports
Hopefully, these examples help demonstrate the plethora of resources available for natural language processing in Python. Regardless of the problem, chances are someone has developed a library to streamline the process. Using these libraries can yield great results in a short time frame.
Tips and Tricks
With an introduction to NLP, an overview of Python libraries, and some example applications, you’re almost ready to tackle your own challenges. Finally, I have a few tips and tricks to make the most of these resources.
Python Tooling: I recommend Poetry for dependency management, Jupyter Notebook for testing new models, Black and/or Flake8 for linting, and GitHub for version management.
Stay organized: It can be easy to jump around from library to library, copying in code to test a dozen ideas. Instead, I recommend a more measured approach. You don’t want to miss a great solution in your haste.
Pre-Processing: Garbage in, garbage out. It’s super important to implement a strong pre-processing pipeline to clean your inputs. Visually check the processed text to ensure everything is working as expected.
Presenting Results: Choosing how to present your results can make a huge difference. If your outputted text looks a little rough, consider presenting aggregate statistics or numeric results instead.
You should be well equipped to tackle some real-world NLP projects now. Good luck, and happy coding :)
If you do anything cool with this information, leave a response in the comments. If you have any feedback or insights, feel free to connect with me on LinkedIn. Thanks for reading!
Translated from: https://towardsdatascience.com/natural-language-processing-nlp-dont-reinvent-the-wheel-8cf3204383dd