Similar Texts Search in Python With a Few Lines of Code: An NLP Project
Natural Language Processing
What is Natural Language Processing?
Natural Language Processing (NLP) refers to developing applications that understand human languages. There are many use cases for NLP nowadays, because people generate thousands of gigabytes of text data every day through blogs, social media comments, product reviews, news archives, official reports, and much more. Search engines are the biggest example of NLP; you will hardly find anyone around you who has never used one.
Project Overview
In my experience, the best way to learn is by doing a project. In this article, I will explain NLP with a real project. The dataset I will use is called 'people_wiki.csv'. I found this dataset on Kaggle. Feel free to download the dataset from here:
The dataset contains the names of some famous people, their Wikipedia URLs, and the text of their Wikipedia pages, so it is quite big. The goal of this project is to find people with related backgrounds: given the name of a famous person, the algorithm returns a predefined number of people who have a similar background according to their Wikipedia information. Does this sound a bit like a search engine?
Step-by-Step Implementation

1. Read the dataset into a pandas DataFrame.

import pandas as pd

df = pd.read_csv('people_wiki.csv')
df.head()
2. Vectorize the ‘text’ column
How to Vectorize?
In Python's scikit-learn library, there is a class named CountVectorizer. It assigns an index to each word and generates a vector that contains the number of appearances of each word in a piece of text. Here, I will demonstrate it on a small text for your understanding. Suppose this is our text:
text = ["Jen is a good student. Jen plays guitar as well"]

Let's import CountVectorizer from the scikit-learn library and fit it on the text.

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(text)
Here, I am printing the vocabulary:
print(vectorizer.vocabulary_)

#Output:
{'jen': 4, 'is': 3, 'good': 1, 'student': 6, 'plays': 5, 'guitar': 2, 'as': 0, 'well': 7}
Look, each word of the text received a number: the index of that word. (The single-character word 'a' is dropped by the default tokenizer.) There are eight significant words, so the indices run from 0 to 7. Next, we need to transform the text. I will print the transformed vector as an array.
vector = vectorizer.transform(text)
print(vector.toarray())
Here is the output: [[1 1 1 1 2 1 1 1]]. 'Jen' has index 4 and appeared twice, so in this output vector the element at index 4 is 2. All the other words appeared only once, so their elements are 1.
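Putting the demo above together, the whole vectorization example fits in a few lines (a minimal sketch, assuming scikit-learn is installed):

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ["Jen is a good student. Jen plays guitar as well"]

vectorizer = CountVectorizer()
vector = vectorizer.fit_transform(text)  # fit and transform in one call

print(vectorizer.vocabulary_)  # word -> column index
print(vector.toarray())        # counts per word; 'jen' (index 4) appears twice
```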
Now, vectorize the ‘text’ column of the dataset, using the same technique.
vect = CountVectorizer()
word_weight = vect.fit_transform(df['text'])
In the demonstration, I used 'fit' first and 'transform' afterwards, but conveniently, fit_transform does both at once. This word_weight holds the count vectors I explained before: one vector for each row of text in the 'text' column.
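To see that the two routes really agree, here is a quick check on a couple of made-up sample sentences (the sentences are hypothetical, not from the dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog sat on the cat"]  # made-up sample rows

# Route 1: fit, then transform.
v1 = CountVectorizer()
v1.fit(docs)
a = v1.transform(docs)

# Route 2: fit_transform in one call.
v2 = CountVectorizer()
b = v2.fit_transform(docs)

same = (a.toarray() == b.toarray()).all()
print(same)  # the count matrices are identical either way
```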
3. Fit the NearestNeighbors model on this 'word_weight' from the previous step.
The idea of the nearest-neighbors function is to find, for a given query point, the predefined number of training points closest to it. If that is not clear, do not worry: look at the implementation and it will be easier to follow.
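As a sketch of what NearestNeighbors does, here is a tiny made-up example with 2-D points instead of word-count vectors (the points are invented for illustration):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Five made-up points; we query with the first one.
points = np.array([[0.0, 0.0],
                   [1.0, 0.0],
                   [0.0, 2.0],
                   [5.0, 5.0],
                   [0.5, 0.0]])

nn = NearestNeighbors(metric='euclidean')
nn.fit(points)

# The 3 training points closest to points[0], including itself at distance 0.
distances, indices = nn.kneighbors(points[[0]], n_neighbors=3)
print(indices)    # [[0 4 1]]
print(distances)  # [[0.  0.5 1. ]]
```

The query point itself comes back first at distance zero, exactly as President Obama will in the project below.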
from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(metric = 'euclidean')
nn.fit(word_weight)
4. Find 10 people with similar backgrounds to President Barack Obama.
First, find the index of 'Barack Obama' in the dataset.
obama_index = df[df['name'] == 'Barack Obama'].index[0]

Calculate the distances and indices of the 10 people whose backgrounds are closest to President Obama's. In the word-weight matrix, the row that contains the information about 'Barack Obama' has the same index as in the dataset. We pass that row and the number of people we want, and the call returns the calculated distances of those people from 'Barack Obama' along with their indices.
distances, indices = nn.kneighbors(word_weight[obama_index], n_neighbors = 10)

Organize the result in a DataFrame.
neighbors = pd.DataFrame({'distance': distances.flatten(), 'id': indices.flatten()})
print(neighbors)
Let's find the names of the people behind those indices. There are several ways to do it; I used the merge function. I merged the 'neighbors' DataFrame above with the original DataFrame 'df', matching the id column against df's index, and sorted the values by distance. President Obama has zero distance from himself, so he comes out on top.
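One of those other ways, for comparison, is plain positional lookup with iloc; this works because the vectorizer preserved df's row order, so the positions returned by kneighbors line up with df's rows. A self-contained sketch with made-up stand-ins for the real df and kneighbors output:

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the real df / kneighbors results above.
df = pd.DataFrame({'name': ['Alice', 'Bob', 'Carol', 'Dan']})
indices = np.array([[2, 0, 3]])
distances = np.array([[0.0, 1.5, 2.2]])

# iloc looks rows up by position, which matches the positions
# returned by NearestNeighbors.
nearest = df.iloc[indices.flatten()][['name']].copy()
nearest['distance'] = distances.flatten()
print(nearest)
```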
nearest_info = (df.merge(neighbors, right_on = 'id', left_index = True)
                .sort_values('distance')[['id', 'name', 'distance']])
print(nearest_info)
These are the 10 people closest to President Obama according to the information provided in Wikipedia. The results make sense, right?
A similar-texts search could be useful in many areas, such as finding similar articles, similar resumes, similar profiles (as in this project), similar news items, or similar songs. I hope you find this small project useful.
Recommended Reading:
Translated from: https://medium.com/towards-artificial-intelligence/similar-texts-search-in-python-with-a-few-lines-of-code-an-nlp-project-9ace2861d261