Introduction to Word Embedding
Deep Learning, Natural Language Processing
Word embedding is a method of capturing the "meaning" of a word with a low-dimensional vector, and it can be used in a wide variety of Natural Language Processing (NLP) tasks.
Before beginning this word embedding tutorial, we should have an understanding of vector spaces and similarity measures.
Vector Space
A sequence of numbers used to identify a point in space is called a vector, and a collection of vectors that all belong to the same dataset is called a vector space.
Words in a text can also be represented as points in a higher-dimensional vector space, where words with similar meanings have similar representations. For example,
(Photo by Allison Parrish, from GitHub)
The above image shows a vector representation of words on a scale of cuteness and animal size. We can see that there is a semantic relationship between words based on shared properties. It is difficult to visualize higher-dimensional relationships between words, but the math behind them is the same, so it works the same way in higher dimensions as well.
Similarity Measures
A similarity measure is used to calculate the distance between vectors in the vector space, i.e., it measures how similar (or how far apart) two data points are. This allows us to capture words that are used in similar ways, so that they end up with similar representations that naturally capture their meaning. Many similarity measures are available, but here we will discuss Euclidean distance and cosine similarity.
Euclidean Distance
One way to calculate how far apart two data points are in vector space is to compute the Euclidean distance.
import math

def distance2d(x1, y1, x2, y2):
    # Euclidean distance between (x1, y1) and (x2, y2)
    return math.sqrt((x1 - x2)**2 + (y1 - y2)**2)
So, the distance between “capybara” (70, 30) and “panda” (74, 40) from the above image example:
… is less than the distance between “tarantula” and “elephant” from the above image example:
This shows that "panda" and "capybara" are more similar to each other than "tarantula" and "elephant" are.
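As a quick sanity check, here is a minimal sketch of that comparison using the distance2d function defined above. The coordinates for "tarantula" and "elephant" are not given in the text, so the values below are illustrative assumptions read off the figure.

# "capybara" (70, 30) vs. "panda" (74, 40), from the example above
print(distance2d(70, 30, 74, 40))   # about 10.8

# "tarantula" vs. "elephant"; these coordinates are assumptions, not from the article
print(distance2d(8, 3, 65, 90))     # roughly 104, a much larger distance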
Cosine Similarity
Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space; it is the cosine of the angle between them.
from numpy import dot
from numpy.linalg import norm

cos_sim = dot(a, b) / (norm(a) * norm(b))
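As a minimal worked example (the vectors a and b below are illustrative, reusing the animal coordinates from earlier rather than values given in this section):

import numpy as np
from numpy import dot
from numpy.linalg import norm

a = np.array([70, 30])   # e.g. "capybara"
b = np.array([74, 40])   # e.g. "panda"
print(dot(a, b) / (norm(a) * norm(b)))   # about 0.996: the vectors point in very similar directions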
Now the question is: what are word embeddings, and why do we use them?
In simple terms, they are vector representations of the words in sentences, documents, etc.
Word embedding is a learned representation of words in the form of numeric vectors. It learns a dense, distributed representation for a predefined fixed-size vocabulary from a corpus of text. This representation is able to reveal many hidden relationships between words. For example, vector("king") - vector("lords") is similar to vector("queen") - vector("princess").
It is an improvement over traditional word representations such as the bag-of-words model, which produces large sparse vectors that are computationally impractical for representing an entire vocabulary. Those representations are sparse because vocabularies are vast, so a given word or document is represented by a large vector composed mostly of zero values (a sparse representation).
Two popular methods of learning word embeddings from text are:
1. Word2Vec.
2. GloVe.
There are also pre-trained models that were trained on a large corpus of text, and we can use them for our own use case.
In addition to these methods, a word embedding can be learned as part of a deep learning model. This can be a slower approach, but we can design it for our own use case: the model will be trained on a specific training dataset according to our own requirements. Keras provides a very easy and flexible Embedding layer that can be used in neural networks on text data.
Importing Modules
Let's get started by importing our modules and the dataset and checking its head. I took the dataset from Kaggle (IMDB Movie Review, NLP).
import pandas as pd
import numpy as np
from numpy import array
from keras.preprocessing.text import one_hot, Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers.embeddings import Embedding
We’ll use Scikit-learn to divide our dataset into a training set and test set. We’ll train the word embedding on 70% of the data and test it on 30%.
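A minimal sketch of that split is shown below. The CSV file name and the column names are assumptions about the Kaggle dataset layout, not details given in the article, and the article's later evaluation output (6000/4000 samples) suggests its actual split may have been 60/40 rather than 70/30.

from sklearn.model_selection import train_test_split

df = pd.read_csv("imdb_reviews.csv")   # hypothetical file name
X = df["review"].values                # raw review text (assumed column name)
y = df["sentiment"].values             # assumed to be already encoded as 0/1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)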
INTEGER ENCODING ALL THE DOCUMENTS
After this step, every unique word will be represented by an integer. For this, we are using the one_hot function available in Keras. Note that vocab_size is specified as the total number of unique words so as to ensure a unique integer encoding for each and every word.
Note one important thing: the integer encoding for a word remains the same across different texts, e.g. 'year' is denoted by 23518 in each and every document.
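A minimal sketch of this encoding step, assuming X_train and X_test hold the raw review strings from the split above. The exact vocab_size is not shown in the article; the model summary below implies 65,960 (527,680 embedding parameters / 8 dimensions), so a round figure near that is used here.

vocab_size = 66000   # assumption; see note above

# one_hot hashes each word to an integer in [1, vocab_size)
encoded_train = [one_hot(text, vocab_size) for text in X_train]
encoded_test = [one_hot(text, vocab_size) for text in X_test]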
Let's now have a look at one of the reviews. We'll compare this text with its transformation as we move through the next steps.
I really didn't like this movie because it didn't really bring across the messages and ideas L'Engle brought out in her novel. We had read the novel in our English class and i absolutely loved it, i'm afraid i can't say the same for the film. There were some serious differences between the novel and the adapted version and it just didn't do any credit to the imaginative genius that is Madeleine L'Engle! This is the reason i gave it such a poor rating. Don't see this movie if you are a big fan of L'Engle's texts because you will be sorely disappointed. However, if you are watching the movie for entertainment purposes (or educational as was my case) then it is an alright movie!

This review will be converted into an integer representation where each number represents a unique word.
[24608, 32542, 30289, 58025, 50966, 19624, 43296, 35850, 30289, 32542, 31519, 11569, 30465, 7968, 12928, 34105, 8750, 49668, 38039, 40264, 3503, 45016, 63074, 41404, 53275, 30465, 45016, 40264, 28666, 47101, 44909, 12928, 24608, 62202, 46727, 35850, 24425, 5515, 24608, 25601, 35725, 30465, 10577, 55918, 30465, 13875, 62286, 22967, 5067, 9001, 33291, 1247, 30465, 45016, 12928, 30465, 23555, 44142, 12928, 35850, 41976, 30289, 20229, 15687, 7845, 50705, 30465, 58301, 14031, 11556, 1495, 26143, 8750, 50966, 1495, 30465, 63056, 24608, 39847, 35850, 30936, 54227, 33469, 55622, 8193, 3111, 50966, 19624, 9403, 51670, 40033, 54227, 42254, 52367, 44935, 63226, 17625, 43296, 51670, 65642, 30053, 42863, 34757, 32894, 9403, 51670, 40033, 1112, 30465, 19624, 55918, 55169, 57666, 10193, 50176, 59413, 10480, 63135, 56156, 64520, 35850, 1495, 49938, 59074, 19624]

Padding the Text (to make every text the same length)
The Keras Embedding layer requires all individual documents to be of the same length. Hence we will pad the shorter documents with 0 for now. Therefore, in the Keras Embedding layer, 'input_length' will be equal to the length (i.e., the number of words) of the longest document.
To pad the shorter documents I am using the pad_sequences function from the Keras library.
The maximum number of words in any document is : 1719

Here we found that the maximum number of words a review holds is 1719, so we will pad according to that. In padding, we add zeros (0) to any text shorter than max_length; for shorter texts, the "0"s are added at the beginning of the sequence.
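A minimal sketch of the padding step, continuing with the variable names from the sketches above:

# length of the longest encoded document (1719 in the article's run)
maxlen = max(len(doc) for doc in encoded_train)

# pad shorter documents with zeros at the beginning ('pre' padding, the pad_sequences default)
padded_train = pad_sequences(encoded_train, maxlen=maxlen, padding='pre')
padded_test = pad_sequences(encoded_test, maxlen=maxlen, padding='pre')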
For example:
array([ 0, 0, 0, ..., 32875, 18129, 60728])

WE WILL BE CREATING THE EMBEDDINGS USING THE KERAS EMBEDDING LAYER
Now all the texts are of the same length (after padding), so we are ready to create and use the embedding layer.
PARAMETERS OF THE EMBEDDING LAYER:
input_dim = the vocabulary size that we choose. It is the number of unique words in the vocabulary.
output_dim = the number of dimensions we wish to embed into. Each word will be represented by a vector with this many dimensions.
input_length = the length of the longest text, which is stored in the maxlen variable in the example.
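Below is a minimal sketch of a model consistent with the summary that follows, reusing the vocab_size and maxlen variables from the earlier sketches. The sigmoid activation and adam optimizer are assumptions; the 8-dimensional embedding, the Flatten layer, and the single-unit Dense output match the shapes in the summary.

model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=8, input_length=maxlen))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))   # binary sentiment prediction
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
print(model.summary())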
Model: "sequential_1"_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_1 (Embedding) (None, 1719, 8) 527680
_________________________________________________________________
flatten_1 (Flatten) (None, 13752) 0
_________________________________________________________________
dense_1 (Dense) (None, 1) 13753
=================================================================
Total params: 541,433
Trainable params: 541,433
Non-trainable params: 0
_________________________________________________________________
None
Let’s now check the model accuracy on our training set.
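A minimal sketch of training and evaluating on the training data, again with the variable names assumed from the earlier sketches (the epoch count is not stated in the article and is an assumption):

model.fit(padded_train, y_train, epochs=20, verbose=0)   # epoch count is an assumption
loss, accuracy = model.evaluate(padded_train, y_train, verbose=1)
print("Training Accuracy is", accuracy * 100)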
6000/6000 [==============================] - 1s 170us/step
Training Accuracy is 100.0
The next step is to check its accuracy on the test set.
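The corresponding check on the held-out data, under the same assumed variable names:

loss, accuracy = model.evaluate(padded_test, y_test, verbose=1)
print("Testing Accuracy is", accuracy * 100)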
4000/4000 [==============================] - 1s 179us/step
Testing Accuracy is 86.57500147819519
We get 100% training accuracy because the embedding was trained on that data; for the test data, some of the words are unseen, so the accuracy is a bit lower.
In practice, I would recommend using a fixed pre-trained embedding and learning on top of it. That generally improves performance on the test data.
What's Next
We have now learned how to represent words in the form of continuous numbers. Compared to other text representations such as bag-of-words or TF-IDF (term frequency-inverse document frequency), word embeddings capture much richer semantic relationships between words and can significantly improve the performance of natural language processing (NLP) tasks.
Now I would suggest that you try word embeddings on your own NLP task; you should see a significant improvement in performance. You can also experiment with the same dataset by using a pre-trained word embedding such as Word2Vec as a fixed layer and learning on top of it.
Most often you will notice that pre-trained models have higher accuracy on the test set; the reason is that they have already been trained on a large variety of text. But if you have enough data and a specific task to perform, training your own word embedding can be the better choice.
Code for Word Embedding is Available on GitHub.
Thanks for reading. I hope this helps you understand word embedding and its importance in natural language processing (NLP).
Follow me on Medium. As always, I welcome feedback and constructive criticism and can be reached on LinkedIn.
Translated from: https://medium.com/towards-artificial-intelligence/introduction-to-word-embedding-5ba5cf97d296