An Overview of Natural Language Processing
Weren't we all surprised the first time a smart device understood what we were telling it? And it answered in the friendliest manner too, didn't it? Assistants like Apple’s Siri and Amazon’s Alexa understand when we ask about the weather, ask for directions, or ask them to play a certain genre of music. Ever since then I have wondered how these computers make sense of our language. That long-overdue curiosity finally got the better of me, and I decided to write about it as a newcomer to the field.
In this article, I will be using a popular NLP library called NLTK. The Natural Language Toolkit, or NLTK, is one of the most powerful and probably the most popular natural language processing libraries. Not only does it offer one of the most comprehensive toolkits for Python-based text processing, it also supports a large number of human languages.
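If you would like to follow along, a minimal setup might look like the sketch below; the individual download lines are repeated in the relevant sections later, so nothing here is strictly required up front.

# Assumed setup: NLTK installed beforehand with "pip install nltk"
import nltk

# Download the NLTK resources used throughout this article
nltk.download('punkt')                        # sentence/word tokenizer models
nltk.download('stopwords')                    # stop-word lists
nltk.download('wordnet')                      # WordNet lexical database
nltk.download('averaged_perceptron_tagger')   # POS tagger
nltk.download('universal_tagset')             # universal tag mappings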
What is Natural Language Processing?
Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human languages, in particular how to train computers to process and analyze large amounts of natural language data.
Why is sorting unstructured data so important?
With every tick of the clock, the world generates an overwhelming amount of data. Yes, this is mind-boggling! And the majority of that data is unstructured: formats such as text, audio, video, and images are classic examples of unstructured data. Unlike the traditional row-and-column structure of relational databases, unstructured data has no fixed dimensions or schema, which makes it harder to analyze and not easily searchable. That said, business organizations still need to find ways of addressing these challenges and embracing the opportunity to derive insights if they are to prosper in highly competitive environments. With the help of natural language processing and machine learning, this is changing fast.
Are Computers Confused by Our Natural Language?
Human language is one of our most powerful tools of communication. The words, the tone, the sentences, and the gestures we use all carry information. There are countless ways of assembling words into a phrase, and words can have many shades of meaning, so comprehending human language with its intended meaning is a challenge. A linguistic paradox is a phrase or sentence that contradicts itself, for example "oh, this is my open secret" or "can you please act naturally". Although these sound pointedly foolish, we humans can understand and use them in everyday speech; for machines, however, the ambiguity and imprecision of natural language are the hurdles to clear.
Most used NLP Libraries
In the past, only pioneers with superior knowledge of mathematics, machine learning, and linguistics could take part in NLP projects. Now developers can use ready-made libraries to simplify text pre-processing so that they can concentrate on building machine learning models. These libraries enable text comprehension, interpretation, and sentiment analysis with only a few lines of code. The most popular NLP libraries are:
Spark NLP, NLTK, PyTorch-Transformers, TextBlob, spaCy, Stanford CoreNLP, Apache OpenNLP, AllenNLP, Gensim, NLP Architecture, and scikit-learn.
The question is: where should we start, and how?
Have you ever observed how kids start to understand and learn a language? Yes, by picking up individual words first and then sentence formations, right! Making computers understand our language works in much the same way.
Pre-processing Steps:
1. Sentence Tokenization (Sentence Segmentation)
To make computers understand natural language, the first step is to break paragraphs into sentences. Punctuation marks are an easy way to split the sentences apart.
import nltk
nltk.download('punkt')

text = "Home Farm is one of the biggest junior football clubs in Ireland and their senior team, from 1970 up to the late 1990s, played in the League of Ireland. However, the link between Home Farm and the senior team was severed in the late 1990s. The senior side was briefly known as Home Farm Fingal in an effort to identify it with the north Dublin area."

# Break the paragraph into sentences
sentences = nltk.sent_tokenize(text)
print("The number of sentences in the paragraph:", len(sentences))
for sentence in sentences:
    print(sentence)

OUTPUT:
The number of sentences in the paragraph: 3
Home Farm is one of the biggest junior football clubs in Ireland and their senior team, from 1970 up to the late 1990s, played in the League of Ireland.
However, the link between Home Farm and the senior team was severed in the late 1990s.
The senior side was briefly known as Home Farm Fingal in an effort to identify it with the north Dublin area.
2. Word Tokenization (Word Segmentation)
By now we have separated the sentences, and the next step is to break them into words, which are often called tokens.
Just as creating a little space in one's own life helps, the space between words helps break a phrase into its parts. We can treat punctuation marks as separate tokens as well, since punctuation has a purpose too.
# Break each sentence into word tokens
for sentence in sentences:
    words = nltk.word_tokenize(sentence)
    print("The number of words in a sentence:", len(words))
    print(words)

OUTPUT:
The number of words in a sentence: 32
['Home', 'Farm', 'is', 'one', 'of', 'the', 'biggest', 'junior', 'football', 'clubs', 'in', 'Ireland', 'and', 'their', 'senior', 'team', ',', 'from', '1970', 'up', 'to', 'the', 'late', '1990s', ',', 'played', 'in', 'the', 'League', 'of', 'Ireland', '.']
The number of words in a sentence: 18
['However', ',', 'the', 'link', 'between', 'Home', 'Farm', 'and', 'the', 'senior', 'team', 'was', 'severed', 'in', 'the', 'late', '1990s', '.']
The number of words in a sentence: 22
['The', 'senior', 'side', 'was', 'briefly', 'known', 'as', 'Home', 'Farm', 'Fingal', 'in', 'an', 'effort', 'to', 'identify', 'it', 'with', 'the', 'north', 'Dublin', 'area', '.']
As a prerequisite for using the word_tokenize() or sent_tokenize() functions, we should have the punkt package downloaded.
3. Stemming and Text Lemmatization
In every text document, we usually come across different forms of a word, like write, writes, and writing, that share a similar meaning and the same base word. But how do we make a computer analyze such words? That's where stemming and text lemmatization come into the picture.
Stemming and text lemmatization are normalization techniques built around the same idea: chopping the ends off a word to get down to its core form. While both try to solve the same problem, they go about it in entirely different ways. Stemming is often a crude heuristic process, whereas lemmatization uses a vocabulary and morphological analysis to find the base word. Let's take a closer look!
Stemming: words are reduced to their stem. A word stem need not be the same as the dictionary-based morphological root (the smallest meaningful unit); it is simply an equal or shorter form of the word.
from nltk.stem import PorterStemmer

# Create an object of class PorterStemmer
porter = PorterStemmer()

# A list of words to be stemmed
word_list = ['running', ',', 'driving', 'sung', 'between', 'lasted', 'was', 'paticipated', 'before', 'severed', '1990s', '.']

print("{0:20}{1:20}".format("Word", "Porter Stemmer"))
for word in word_list:
    print("{0:20}{1:20}".format(word, porter.stem(word)))

OUTPUT:
Word Porter Stemmer
running run
, ,
driving drive
sung sung
between between
lasted last
was wa
paticipated paticip
before befor
severed sever
1990s 1990
. .
Stemming is not as easy as it looks :( We might run into two issues: under-stemming and over-stemming of a word.
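As a rough illustration (the exact stems depend on the stemmer used), over-stemming collapses unrelated words to the same stem, while under-stemming fails to collapse related ones:

from nltk.stem import PorterStemmer

porter = PorterStemmer()

# Over-stemming: three different words are all cut down to the same stem 'univers'
print(porter.stem("universe"), porter.stem("university"), porter.stem("universal"))

# Under-stemming: related forms end up with different stems ('alumnu' vs 'alumni')
print(porter.stem("alumnus"), porter.stem("alumni"))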
Lemmatization: where stemming is a best-estimate method that snips a word based on how it appears, lemmatization is a more deliberate way of pruning it, resolving each word through a dictionary. Indeed, a word's lemma is its dictionary or canonical form.
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()

# A list of words to lemmatize
word_list = ['running', ',', 'drives', 'sung', 'between', 'lasted', 'was', 'paticipated', 'before', 'severed', '1990s', '.']

print("{0:20}{1:20}".format("Word", "Lemma"))
for word in word_list:
    print("{0:20}{1:20}".format(word, wordnet_lemmatizer.lemmatize(word)))

OUTPUT:
Word Lemma
running running
, ,
drives drive
sung sung
between between
lasted lasted
was wa
paticipated paticipated
before before
severed severed
1990s 1990s
. .
If speed is what you need, stemming is the better choice; when accuracy matters, lemmatization is better.
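One caveat: WordNetLemmatizer treats every word as a noun unless told otherwise, which is why "was" came back as "wa" in the table above. Passing a part-of-speech hint gives the expected lemma:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("was"))               # 'wa'  (treated as a noun by default)
print(lemmatizer.lemmatize("was", pos="v"))      # 'be'  (resolved as a verb)
print(lemmatizer.lemmatize("running", pos="v"))  # 'run'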
4. Stop Words
Words like 'in', 'at', 'on', and 'so' are considered stop words. Stop words don't play an important role in NLP on their own, but the removal of stop words plays an important role during tasks such as sentiment analysis.
NLTK comes with stop-word lists for 16 different languages.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
print("The stop words in NLTK lib are:", stop_words)

para = """Home Farm is one of the biggest junior football clubs in Ireland and their senior team, from 1970 up to the late 1990s, played in the League of Ireland. However, the link between Home Farm and the senior team was severed in the late 1990s. The senior side was briefly known as Home Farm Fingal in an effort to identify it with the north Dublin area."""

tokenized_para = word_tokenize(para)

# Keep only the tokens that are not in the stop-word set
modified_token_list = [word for word in tokenized_para if word not in stop_words]
print("After removing the stop words in the sentence:")
print(modified_token_list)

OUTPUT:
The stop words in NLTK lib are: {'about', 'ma', "shouldn't", 's', 'does', 't', 'our', 'mightn', 'doing', 'while', 'ourselves', 'themselves', 'will', 'some', 'you', "aren't", 'by', "needn't", 'in', 'can', 'he', 'into', 'as', 'being', 'between', 'very', 'after', 'couldn', 'himself', 'herself', 'had', 'its', 've', 'him', 'll', "isn't", 'through', 'should', 'was', 'now', 'them', "you'll", 'again', 'who', 'don', 'been', 'they', 'weren', "you're", 'both', 'd', 'me', 'didn', "won't", "you'd", 'only', 'itself', 'hadn', "should've", 'than', 'how', 'few', 're', 'down', 'these', 'y', "haven't", "mightn't", 'won', "hadn't", 'other', 'above', 'all', "doesn't", 'isn', "that'll", 'not', 'yourselves', 'at', 'mustn', "it's", 'on', 'the', 'for', "didn't", 'what', "mustn't", 'his', 'haven', 'doesn', "you've", 'are', 'out', 'hers', 'with', 'has', 'she', 'most', 'ain', 'those', 'when', 'myself', 'before', 'their', 'during', 'there', 'or', 'until', 'that', 'more', "hasn't", 'o', 'we', 'and', "shan't", 'which', 'because', "don't", 'why', 'shan', 'an', 'my', 'if', 'did', 'having', "couldn't", 'your', 'theirs', 'aren', 'just', 'further', 'here', 'of', "wouldn't", 'be', 'too', 'her', 'no', 'same', 'it', 'is', 'were', 'yourself', 'have', 'off', 'this', 'needn', 'once', "wasn't", 'against', 'wouldn', 'up', 'a', 'i', 'below', "weren't", 'over', 'own', 'then', 'so', 'do', 'from', 'shouldn', 'am', 'under', 'any', 'yours', 'ours', 'hasn', 'such', 'nor', 'wasn', 'to', 'where', 'm', "she's", 'each', 'whom', 'but'}
After removing the stop words in the sentence:
['Home', 'Farm', 'one', 'biggest', 'junior', 'football', 'clubs', 'Ireland', 'senior', 'team', ',', '1970', 'late', '1990s', ',', 'played', 'League', 'Ireland', '.', 'However', ',', 'link', 'Home', 'Farm', 'senior', 'team', 'severed', 'late', '1990s', '.', 'The', 'senior', 'side', 'briefly', 'known', 'Home', 'Farm', 'Fingal', 'effort', 'identify', 'north', 'Dublin', 'area', '.']
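To see which languages are covered, or to work with a language other than English, the stopwords corpus can be queried directly; a quick sketch:

from nltk.corpus import stopwords

# Languages for which NLTK ships a stop-word list
print(stopwords.fileids())

# The first few Spanish stop words, as an illustration
print(stopwords.words('spanish')[:10])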
5. POS Tagging
Going down memory lane to our early English grammar classes, can we all remember how our teachers used to give us instruction on the basic parts of speech for effective communication? Yeah, the good old days! Let's teach parts of speech to our computers too. :)
The eight parts of speech are nouns, verbs, pronouns, adjectives, adverbs, prepositions, conjunctions, and interjections.
POS tagging is the ability to identify and assign parts of speech to the words in a sentence. There are different tagsets to choose from, but we will be using the universal tagset.
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')

# Tag each tokenized sentence (the 'sentences' list from the tokenization step above) with the universal tagset
pos_tags = [nltk.pos_tag(nltk.word_tokenize(sentence), tagset="universal") for sentence in sentences]
print(pos_tags)

OUTPUT (first sentence):
[[('Home', 'NOUN'), ('Farm', 'NOUN'), ('is', 'VERB'), ('one', 'NUM'), ('of', 'ADP'), ('the', 'DET'), ('biggest', 'ADJ'), ('junior', 'NOUN'), ('football', 'NOUN'), ('clubs', 'NOUN'), ('in', 'ADP'), ('Ireland', 'NOUN'), ('and', 'CONJ'), ('their', 'PRON'), ('senior', 'ADJ'), ('team', 'NOUN'), (',', '.'), ('from', 'ADP'), ('1970', 'NUM'), ('up', 'ADP'), ('to', 'PRT'), ('the', 'DET'), ('late', 'ADJ'), ('1990s', 'NUM'), (',', '.'), ('played', 'VERB'), ('in', 'ADP'), ('the', 'DET'), ('League', 'NOUN'), ('of', 'ADP'), ('Ireland', 'NOUN'), ('.', '.')]
One application of POS tagging is analyzing the qualities of a product in feedback: by picking out the adjectives in customers' reviews, we can evaluate the sentiment of the feedback. For example: how was your shopping experience with us?
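A small sketch of that idea, using a made-up review (the exact tags depend on the tagger, so treat the output as approximate):

import nltk

review = "The delivery was quick, the packaging was beautiful, but the charger felt cheap."
tagged = nltk.pos_tag(nltk.word_tokenize(review), tagset="universal")

# Keep only the adjectives as a crude signal of opinion words
adjectives = [word for word, tag in tagged if tag == "ADJ"]
print(adjectives)  # roughly ['quick', 'beautiful', 'cheap']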
6. Chunking
Chunking is used to add more structure to a sentence by grouping words on top of their part-of-speech (POS) tags. It is also called shallow parsing. The resulting word groups are called "chunks." There are no predefined rules for performing chunking; we define them ourselves.
Phrase structure conventions:
- S (Sentence) → NP VP.
- NP → {Determiner, Noun, Pronoun, Proper name}.
- VP → V (NP) (PP) (Adverb).
- PP → Preposition (NP).
- AP → Adjective (PP).
I never had a good time with complex regular expressions; I used to stay as far away from them as I could, but lately I have realized how important it is to have a grip on regular expressions in data science. Let's start by understanding a simple instance.
Suppose we need to chunk nouns, verbs (past tense), adjectives, and coordinating conjunctions from the sentence. We can use a rule like the one below:
chunk:{<NN.?>*<VBD.?>*<JJ.?>*<CC>?}
import nltk
from nltk import pos_tag, RegexpParser
from nltk.tokenize import word_tokenize

content = "Home Farm is one of the biggest junior football clubs in Ireland and their senior team, from 1970 up to the late 1990s, played in the League of Ireland. However, the link between Home Farm and the senior team was severed in the late 1990s. The senior side was briefly known as Home Farm Fingal in an effort to identify it with the north Dublin area."

tokenized_text = nltk.word_tokenize(content)
print("After Split:", tokenized_text)

tokens_tag = pos_tag(tokenized_text)
print("After Token:", tokens_tag)

# Chunk grammar: an optional run of nouns, past-tense verbs, adjectives and a coordinating conjunction
patterns = """mychunk:{<NN.?>*<VBD.?>*<JJ.?>*<CC>?}"""
chunker = RegexpParser(patterns)
print("After Regex:", chunker)

output = chunker.parse(tokens_tag)
print("After Chunking", output)

OUTPUT:
After Regex: chunk.RegexpParser with 1 stages:
RegexpChunkParser with 1 rules:
<ChunkRule: '<NN.?>*<VBD.?>*<JJ.?>*<CC>?'>
After Chunking
(S (mychunk Home/NN Farm/NN) is/VBZ one/CD of/IN the/DT
(mychunk biggest/JJS)
(mychunk junior/NN football/NN clubs/NNS) in/IN
(mychunk Ireland/NNP and/CC) their/PRP$
(mychunk senior/JJ)
(mychunk team/NN) ,/, from/IN 1970/CD up/IN to/TO the/DT (mychunk late/JJ) 1990s/CD ,/, played/VBN in/IN the/DT (mychunk League/NNP) of/IN (mychunk Ireland/NNP) ./.)
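Continuing from the parse tree above, the chunks themselves can be pulled back out by walking the tree; a minimal sketch:

# Print just the text of each 'mychunk' subtree found by the chunker
for subtree in output.subtrees(filter=lambda t: t.label() == "mychunk"):
    print(" ".join(word for word, tag in subtree.leaves()))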
7. Wordnet
WordNet is an NLTK corpus reader and a lexical database for English. It can be used to find the synonyms and antonyms of a word.
from nltk.corpus import wordnet

synonyms = []
antonyms = []

# Collect every lemma name of every synset of "active" as a synonym
for syn in wordnet.synsets("active"):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())

# Collect an antonym wherever a lemma has one
for syn in wordnet.synsets("active"):
    for lemma in syn.lemmas():
        if lemma.antonyms():
            antonyms.append(lemma.antonyms()[0].name())

print("Synonyms are:", synonyms)
print("Antonyms are:", antonyms)

OUTPUT:
Synonyms are: ['active_agent', 'active', 'active_voice', 'active', 'active', 'active', 'active', 'combat-ready', 'fighting', 'active', 'active', 'participating', 'active', 'active', 'active', 'active', 'alive', 'active', 'active', 'active', 'dynamic', 'active', 'active', 'active']
Antonyms are: ['passive_voice', 'inactive', 'passive', 'inactive', 'inactive', 'inactive', 'quiet', 'passive', 'stative', 'extinct', 'dormant', 'inactive']
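Beyond synonyms and antonyms, each WordNet synset also carries a dictionary definition, which is handy for telling the different senses listed above apart:

from nltk.corpus import wordnet

# Print the name and dictionary definition of the first few senses of "active"
for syn in wordnet.synsets("active")[:3]:
    print(syn.name(), "-", syn.definition())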
8. Bag of Words
A bag-of-words model turns raw text into individual words and counts how often each word occurs in the text.
import nltk
import re  # to match regular expressions
import numpy as np

text = "Home Farm is one of the biggest junior football clubs in Ireland and their senior team, from 1970 up to the late 1990s, played in the League of Ireland. However, the link between Home Farm and the senior team was severed in the late 1990s. The senior side was briefly known as Home Farm Fingal in an effort to identify it with the north Dublin area."

# Split into sentences, lowercase them and strip punctuation
sentences = nltk.sent_tokenize(text)
for i in range(len(sentences)):
    sentences[i] = sentences[i].lower()
    sentences[i] = re.sub(r'\W', ' ', sentences[i])
    sentences[i] = re.sub(r'\s+', ' ', sentences[i])

# Count how often each word occurs across the sentences
bag_of_words = {}
for sentence in sentences:
    words = nltk.word_tokenize(sentence)
    for word in words:
        if word not in bag_of_words.keys():
            bag_of_words[word] = 1
        else:
            bag_of_words[word] += 1

print(bag_of_words)

OUTPUT:
{'home': 3, 'farm': 3, 'is': 1, 'one': 1, 'of': 2, 'the': 8, 'biggest': 1, 'junior': 1, 'football': 1, 'clubs': 1, 'in': 4, 'ireland': 2, 'and': 2, 'their': 1, 'senior': 3, 'team': 2, 'from': 1, '1970': 1, 'up': 1, 'to': 2, 'late': 2, '1990s': 2, 'played': 1, 'league': 1, 'however': 1, 'link': 1, 'between': 1, 'was': 2, 'severed': 1, 'side': 1, 'briefly': 1, 'known': 1, 'as': 1, 'fingal': 1, 'an': 1, 'effort': 1, 'identify': 1, 'it': 1, 'with': 1, 'north': 1, 'dublin': 1, 'area': 1}
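The dictionary above counts words over the whole paragraph. To get the per-document count vectors that a bag-of-words model usually refers to, scikit-learn's CountVectorizer can be used; a minimal sketch with two toy documents:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["Home Farm is one of the biggest junior football clubs in Ireland",
        "However, the link between Home Farm and the senior team was severed"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)   # sparse document-term matrix

print(vectorizer.get_feature_names_out())  # vocabulary (use get_feature_names() on older scikit-learn versions)
print(X.toarray())                         # one row of word counts per document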
9. TF-IDF
TF-IDF stands for Term Frequency - Inverse Document Frequency.
Text data needs to be converted into a numerical format in which each word is represented in matrix form. In the simplest encoding, a given word is represented by a vector whose corresponding element is set to one while all other elements are zero. TF-IDF builds on such word counts by weighting them, and vector representations of this kind are sometimes loosely referred to as word embeddings.
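A toy illustration of that kind of encoding, over a made-up four-word vocabulary:

# One-hot encoding of a single word against a tiny vocabulary (illustration only)
vocabulary = ["home", "farm", "senior", "team"]
word = "senior"

one_hot = [1 if token == word else 0 for token in vocabulary]
print(one_hot)  # [0, 0, 1, 0]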
TF-IDF works on two concepts:
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
IDF(t) = log_e(Total number of documents / Number of documents with term t in it)
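As a quick worked example under these formulas (note that scikit-learn's TfidfTransformer below uses a smoothed variant, so its numbers differ slightly): the word "senior" appears once among the 32 tokens of the first sentence and occurs in all three sentences, so its IDF, and hence its TF-IDF score, is zero.

import math

n_documents = 3                   # treat the three sentences as three documents
tf = 1 / 32                       # 'senior' appears once among the 32 tokens of the first sentence
idf = math.log(n_documents / 3)   # 'senior' occurs in all 3 documents -> log(1) = 0

print(tf * idf)                   # 0.0: a word present in every document carries no weight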
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

docs = ["Home Farm is one of the biggest junior football clubs in Ireland and their senior team, from 1970 up to the late 1990s, played in the League of Ireland",
        "However, the link between Home Farm and the senior team was severed in the late 1990s",
        "The senior side was briefly known as Home Farm Fingal in an effort to identify it with the north Dublin area"]

# Instantiate CountVectorizer()
cv = CountVectorizer()

# This step generates word counts for the words in the docs
word_count_vector = cv.fit_transform(docs)

tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
tfidf_transformer.fit(word_count_vector)

# IDF values (on newer scikit-learn versions use cv.get_feature_names_out() instead)
df_idf = pd.DataFrame(tfidf_transformer.idf_, index=cv.get_feature_names(), columns=["idf_weights"])

# Sort ascending by IDF weight
df_idf.sort_values(by=['idf_weights'])

# Count matrix and TF-IDF scores
count_vector = cv.transform(docs)
tf_idf_vector = tfidf_transformer.transform(count_vector)

feature_names = cv.get_feature_names()

# TF-IDF vector for the first document
first_document_vector = tf_idf_vector[0]

# Print the scores, highest first
df = pd.DataFrame(first_document_vector.T.todense(), index=feature_names, columns=["tfidf"])
df.sort_values(by=["tfidf"], ascending=False)

OUTPUT:
tfidf
of 0.374810
ireland 0.374810
the 0.332054
in 0.221369
1970 0.187405
football 0.187405
up 0.187405
as 0.000000
an 0.000000
... and so on.
What are these scores telling us? The more common a word is across documents, the lower its score; the more unique a word is, the higher its score will be.
So far, we have learned the steps for cleaning and preprocessing text. What can we do with the processed data after all this? We could use it for sentiment analysis, chatbots, or market intelligence, or perhaps build a recommender system based on user purchases or item reviews, or do customer segmentation with clustering.
Computers are still not as accurate with human language as they are with numbers. With the massive amount of text data generated every day, NLP is becoming ever more significant for making sense of that data and is being used in many more applications. Hence there are endless ways to explore NLP.
Translated from: https://medium.com/analytics-vidhya/natural-language-processing-bedb2e1c8ceb