Word Embeddings: Word2Vec with Gensim, NLTK, and t-SNE Visualization
What are Word Embeddings?
In very simple terms, word embeddings are text converted into numbers, and the same text can have many different numerical representations. Before we dive into the details of word embeddings, though, one question should be asked first: why do we need word embeddings at all?
Many machine learning algorithms and practically all deep learning architectures are incapable of processing strings or raw text in their native form. They require numbers as inputs to perform any job, be it classification, regression, or anything else in broader terms. Moreover, with the enormous amount of data available as text, it is valuable to be able to extract information from it and build applications on top of it.
Some real-world applications of text processing are sentiment analysis of reviews by Myntra, Amazon, and others, and document or news classification and clustering by Google. Several word embedding approaches currently exist, and all of them have their pros and cons. We will discuss one of them here: Word2Vec.
For instance, assume our corpus is a single sentence, "The quick brown fox jumps over the lazy dog". Our tokenized sentence is ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']. The one-hot encodings for the individual words are:
The   -> [1,0,0,0,0,0,0,0,0]
quick -> [0,1,0,0,0,0,0,0,0]
brown -> [0,0,1,0,0,0,0,0,0]
fox   -> [0,0,0,1,0,0,0,0,0]
jumps -> [0,0,0,0,1,0,0,0,0]
over  -> [0,0,0,0,0,1,0,0,0]
the   -> [0,0,0,0,0,0,1,0,0]
lazy  -> [0,0,0,0,0,0,0,1,0]
dog   -> [0,0,0,0,0,0,0,0,1]
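As a quick illustration (a sketch in plain Python, not part of the original walkthrough), these one-hot vectors can be generated directly from the token positions:

sentence = "The quick brown fox jumps over the lazy dog".lower().split()

one_hot = []
for position, word in enumerate(sentence):
    vector = [0] * len(sentence)   # one slot per token in the sentence
    vector[position] = 1           # mark this token's own position
    one_hot.append((word, vector))

print(one_hot[3])  # ('fox', [0, 0, 0, 1, 0, 0, 0, 0, 0])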
Figure: Word2Vec example

Word2Vec:
Word2vec is a family of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct the linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically of a few hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in that space.
Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located close to each other in that space.
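As a side note (a small sketch, not from the original article), "close to each other" is usually measured with cosine similarity, which can be computed directly with NumPy; the two 4-dimensional vectors below are made up purely for illustration:

import numpy as np

def cosine_similarity(u, v):
    # cosine of the angle between two word vectors: values near 1 mean very similar directions
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

v_king = np.array([0.9, 0.1, 0.8, 0.2])    # hypothetical embedding
v_queen = np.array([0.8, 0.2, 0.9, 0.1])   # hypothetical embedding
print(cosine_similarity(v_king, v_queen))  # close to 1 for words used in similar contexts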
Figure: a simple network with an input layer, a hidden layer, and an output layer

"A man is known by the company he keeps"; in the same way, a word can be characterized by the group of words that are frequently used alongside it. This is the idea Word2Vec is built on. Word2Vec has two variants, one based on the Skip-Gram model and the other based on the Continuous Bag of Words model.
Skip-Gram Model:
For the Skip-Gram model, the task of the underlying neural network is: given an input word in a sentence, the network predicts, for each word in the vocabulary, how likely it is to appear near that input word. The training examples fed to the network are word pairs consisting of the input word and one of its nearby words.
For instance, consider the sentence "The quick brown fox jumps over the lazy dog." and a window size of 2. The training samples are (input word, nearby word) pairs drawn from that window. For these samples to be usable by the neural network, we need to represent the words in some numerical form. We use one-hot vectors, in which the position corresponding to the input word is 1 and every other position is 0. So the inputs to the network are simply one-hot vectors, and the output is also a vector of the same dimensionality, containing, for each word in the vocabulary, the probability that a randomly chosen nearby word is that vocabulary word.
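To make the training pairs concrete, here is a small sketch (not from the original article) that generates Skip-Gram (input word, nearby word) pairs for the example sentence with a window size of 2:

sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2

pairs = []
for i, target in enumerate(sentence):
    # context words are up to `window` positions to the left and right of the target
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((target, sentence[j]))

print(pairs[:4])  # [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]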
Now let's take a look at the architecture of the neural network. For instance, assume we use a vocabulary of size V and a hidden layer of size N; the following diagram shows the network's architecture:
Figure: Skip-Gram model architecture

Continuous Bag of Words Model:
The Continuous Bag-of-Words model (CBOW) is just the opposite of Skip-Gram. For the CBOW model, the task of the simple neural network is: given a context of words (the words surrounding a position) in a sentence, the network predicts, for each word in the vocabulary, how likely it is to be the word at that position.
In the Continuous Bag-of-Words model, we attempt to predict a word using its surrounding words (context words). The input to the model is the one-hot encoded vectors of the context words inside the window; the window size is a hyperparameter and refers to the number of context words on either side (words occurring before and after the current word) that are used to predict it.
Take "The quick brown fox jumps over the lazy dog." again. Suppose the target word is 'lazy'; for a window size of 2, the input vector will have ones at the positions corresponding to the words 'over', 'the', and 'dog'.
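Here is a small sketch (again, an illustration rather than the original code) that extracts the CBOW context for the target word 'lazy' with a window size of 2:

sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2
target_index = sentence.index("lazy")

# context words are the neighbours inside the window, excluding the target itself
context = [sentence[j]
           for j in range(max(0, target_index - window),
                          min(len(sentence), target_index + window + 1))
           if j != target_index]

print(context)  # ['over', 'the', 'dog']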
Figure: CBOW model architecture

Implementation:
Below I describe the parameters we use to define a Word2Vec model (a short usage sketch follows the list):
· size: The dimensionality of the word vectors, i.e. the number of components used to represent each word. For example, take a look at the picture above: the size would be 4 in that example, and each input word would be represented by 4 components (King, Queen, Women, Princess). Rule of thumb: if the dataset is small, the size should be small too; if the dataset is large, the size should be larger as well. It is a question of tuning.
· window: The maximum distance between the target word and its neighboring words. For example, take the phrase "agama is a reptile", which has 4 words (suppose we do not exclude stop words). If the window size is 2, then the vector of the word "agama" is directly affected by the words "is" and "a". Rule of thumb: a smaller window yields terms that are more closely related (of course, excluding stop words should be considered).
· min_count: Ignores all words with a total frequency lower than this. For example, if a word's frequency is extremely low, it can be treated as unimportant.
· sg: Selects the training algorithm: 1 for Skip-Gram, 0 for CBOW (Continuous Bag of Words).
· workers: The number of worker threads used to train the model.
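As a rough sketch of how these parameters fit together in gensim (the toy sentences below are invented just for illustration; the real model for the hotel reviews is built later):

from gensim.models import word2vec

toy_corpus = [["the", "quick", "brown", "fox"],
              ["the", "lazy", "dog", "sleeps"]]

# 100-dimensional vectors, context window of 5, keep every word (min_count=1),
# sg=1 selects Skip-Gram (sg=0 would select CBOW), 4 worker threads
toy_model = word2vec.Word2Vec(toy_corpus, size=100, window=5, min_count=1, sg=1, workers=4)
print(toy_model.wv["fox"][:5])  # first five components of the vector for "fox"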
Building the Model:
I used the hotel-reviews dataset from the Kaggle repository. Click here for the dataset.
Steps:
import pandas as pd
import numpy as np
import re
import nltk
import gensim
from gensim.models import word2vec
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
%matplotlib inline

pd.options.mode.chained_assignment = None
nltk.download('stopwords')
Load the hotel-reviews dataset into a DataFrame and view its top 5 rows.
data = pd.read_csv('/content/hotel-reviews.csv', sep=',', encoding='utf-8', error_bad_lines=False)
data.head()

Figure: top 5 rows of the dataset

View the columns of the dataset:

data.columns

Figure: the 5 columns of the data

Remove all the stop words:
STOP_WORDS = nltk.corpus.stopwords.words()

Define a clean_sentence function to clean each description in the dataset:
def clean_sentence(val):
    "Remove chars that are not letters or numbers, downcase, then remove stop words."
    regex = re.compile(r'([^\s\w]|_)+')
    sentence = regex.sub('', val).lower()
    sentence = sentence.split(" ")
    for word in list(sentence):
        if word in STOP_WORDS:
            sentence.remove(word)
    sentence = " ".join(sentence)
    return sentence
Drop NaNs, then apply the clean_sentence function to the Description column:
def clean_dataframe(data):
    "Drop nans, then apply 'clean_sentence' function to Description."
    data = data.dropna(how="any")
    for col in ['Description']:
        data[col] = data[col].apply(clean_sentence)
    return data
Clean the data:

data = clean_dataframe(data)
data.head(5)

Figure: cleaned data, Description column
Build the corpus of the dataset: create a list of lists containing the words from each sentence.
def build_corpus(data):
    "Creates a list of lists containing words from each sentence."
    corpus = []
    for col in ['Description']:
        for sentence in data[col].iteritems():
            word_list = sentence[1].split(" ")
            corpus.append(word_list)
    return corpus
View the built corpus:

corpus = build_corpus(data)
corpus[0:10]

Figure: corpus built from the dataset
Import word2vec from gensim, train a model on the corpus, and look up the vector of a word:

model = word2vec.Word2Vec(corpus, size=100, window=20, min_count=2, workers=4)
model.wv['luxurious']

Figure: word vector of 'luxurious'
t-SNE: t-Distributed Stochastic Neighbor Embedding:
t-Distributed Stochastic Neighbor Embedding is a non-linear dimensionality-reduction algorithm used for exploring high-dimensional data. It maps multi-dimensional data down to two or three dimensions suitable for human observation.
How does t-SNE work?

The intuition of what t-SNE does and how it works:
Suppose you have a 50-dimensional dataset; visualizing it directly and getting a feel for it is practically impossible. We have to convert that 50-dimensional dataset into something we can visualize and play around with. This is where t-SNE comes into the picture: it converts the high-dimensional data into a low-dimensional map. The helper below fits a t-SNE model on the trained word vectors and plots every word in two dimensions:
def tsne_plot(model):
    "Creates a TSNE model of the word vectors and plots it."
    labels = []
    tokens = []
    for word in model.wv.vocab:
        tokens.append(model[word])
        labels.append(word)

    tsne_model = TSNE(perplexity=40, n_components=2, init='pca', n_iter=2500, random_state=23)
    new_values = tsne_model.fit_transform(tokens)

    x = []
    y = []
    for value in new_values:
        x.append(value[0])
        y.append(value[1])

    plt.figure(figsize=(16, 16))
    for i in range(len(x)):
        plt.scatter(x[i], y[i])
        plt.annotate(labels[i],
                     xy=(x[i], y[i]),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
    plt.show()

tsne_plot(model)
Now, let's look at a more selective model (a higher min_count keeps only the more frequent words):
# A more selective model
model1 = word2vec.Word2Vec(corpus, size=100, window=20, min_count=3, workers=4)
tsne_plot(model1)

Figure: t-SNE plot for the more selective model

Find the words most similar to a target word:
model.most_similar('walking')

Figure: words similar to 'walking'

model.most_similar('pretty')

Figure: words similar to 'pretty'

Further Improvements:
Training word2vec is a very computationally expensive process. With millions of words, training may take a lot of time. Two methods to counter this are negative sampling and hierarchical softmax. A good link to understand both can be found here.
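As a rough sketch (assuming the same gensim version used above), both tricks are exposed as parameters on the Word2Vec constructor:

# hierarchical softmax: hs=1 (setting negative=0 turns off negative sampling)
model_hs = word2vec.Word2Vec(corpus, size=100, window=20, min_count=2, workers=4, hs=1, negative=0)

# negative sampling: hs=0, and negative>0 controls how many "noise" words are sampled per update
model_neg = word2vec.Word2Vec(corpus, size=100, window=20, min_count=2, workers=4, hs=0, negative=10)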
Hope this helps :)
Follow if you like my posts.
For more help, check my GitHub: https://github.com/Afaf-Athar/Word2Vec
Additional resources I found useful:
1. https://www.kaggle.com/harmanpreet93/train-word2vec-on-hotel-reviews-dataset
2. https://towardsdatascience.com/an-introduction-to-t-sne-with-python-example-5a3a293108d1
3. https://github.com/nltk/nltk/blob/develop/nltk/test/gensim.doctest
4. Kullback-Leibler divergence: https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
5. Good hyperparameter information: https://distill.pub/2016/misread-tsne/
6. L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research 9(Nov):2579-2605, 2008.
Please leave comments for any clarifications or questions.
Happy learning 😃
Source: https://medium.com/@afafathar3007/word-embedding-word2vec-with-genism-nltk-and-t-sne-visualization-43eae8ab3e2e