BERT Word Embeddings Tutorial

Published: 2025/3/21

This article is translated from the BERT Word Embeddings Tutorial; I have condensed parts of it. Please credit the source when reposting.

1. Loading Pre-Trained BERT

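Install the PyTorch interface for BERT from Hugging Face. The library also includes interfaces to other pre-trained language models, such as OpenAI's GPT and GPT-2.

If you are running this code on Google Colab, you will have to install the library each time you reconnect.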

!pip install transformers

BERT is a pre-trained model released by Google. It was trained on Wikipedia and Book Corpus (a dataset of more than 10,000 books of different genres). Google released several variants of BERT; here we use the smaller of the two available sizes ("base" and "large"), in the version that ignores casing ("uncased").

transformers provides a number of BERT classes for different tasks. Here we use the basic BertModel, whose output is not tailored to any specific task, which makes it a good choice for extracting embeddings.

Now let's import PyTorch, the pre-trained BERT model, and a BERT tokenizer.

import torch
from transformers import BertTokenizer, BertModel

# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
import logging
# logging.basicConfig(level=logging.INFO)

import matplotlib.pyplot as plt
%matplotlib inline

# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

2. Input Formatting

Because BERT is a pre-trained model, it expects input in a specific format. We will need:

A special token, [SEP], to mark the end of a sentence, or the separation between two sentences

A special token, [CLS], at the beginning of our text. This token is used for classification tasks, but BERT expects it no matter what your application is.

Tokens that conform with the fixed vocabulary used in BERT

The Token IDs for the tokens, from BERT's tokenizer

Mask IDs to indicate which elements in the sequence are tokens and which are padding elements

Segment IDs used to distinguish different sentences

Positional Embeddings used to show token position within the sequence

Fortunately, the tokenizer.encode_plus function takes care of all of this for us. However, since this is meant as an introduction to working with BERT, we will perform these steps mostly by hand.

For an example of using tokenizer.encode_plus, see this article.
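To make the mask IDs and padding from the list above concrete, here is a minimal plain-Python sketch of what a padded input looks like. The helper name `pad_to_length` is hypothetical, and the pad ID 0 is an assumption about the `bert-base-uncased` vocabulary (check `tokenizer.pad_token_id`). The token IDs are the ones that appear in the vocabulary listing later in this post.

```python
# Hypothetical helper: pad a list of token IDs and build the matching mask.
PAD_ID = 0  # assumption: the ID of [PAD]; verify with tokenizer.pad_token_id


def pad_to_length(token_ids, max_len, pad_id=PAD_ID):
    """Return (padded_ids, attention_mask), both of length max_len."""
    if len(token_ids) > max_len:
        raise ValueError("sequence is longer than max_len")
    n_pad = max_len - len(token_ids)
    padded_ids = token_ids + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [1] * len(token_ids) + [0] * n_pad
    return padded_ids, attention_mask


# IDs for "[CLS] the bank [SEP]".
ids = [101, 1996, 2924, 102]
padded_ids, attention_mask = pad_to_length(ids, max_len=6)
print(padded_ids)      # [101, 1996, 2924, 102, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 0, 0]
```

Padding only matters when several sequences of different lengths are batched together; the single-sentence walkthrough in this post needs none.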

2.1 Special Tokens

BERT can take either one or two sentences as input. With two sentences, [SEP] separates them and also closes the second one, and [CLS] always appears at the start of the text. With a single sentence both tokens are still required, and [SEP] marks the end of the sentence. For example:

Two-sentence input:

[CLS] The man went to the store. [SEP] He bought a gallon of milk. [SEP]

One-sentence input:

[CLS] The man went to the store. [SEP]
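The formatting rule can be sketched as a small function. `add_special_tokens` is a hypothetical helper written for illustration, not part of the `transformers` API, and it assumes the usual BERT convention that each sentence is closed by its own [SEP].

```python
def add_special_tokens(sentence_a, sentence_b=None):
    """Hypothetical helper: wrap one or two sentences in BERT's markers."""
    marked = "[CLS] " + sentence_a + " [SEP]"
    if sentence_b is not None:
        # A second sentence follows the first [SEP] and gets its own [SEP].
        marked += " " + sentence_b + " [SEP]"
    return marked


single = add_special_tokens("The man went to the store.")
pair = add_special_tokens("The man went to the store.",
                          "He bought a gallon of milk.")
print(single)
# [CLS] The man went to the store. [SEP]
print(pair)
# [CLS] The man went to the store. [SEP] He bought a gallon of milk. [SEP]
```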

2.2 Tokenization

The BERT tokenizer provides a tokenize method. Let's see how it handles the sentence below.

text = "Here is the sentence I want embeddings for."
marked_text = "[CLS] " + text + " [SEP]"

# Tokenize our sentence with the BERT tokenizer.
tokenized_text = tokenizer.tokenize(marked_text)

# Print out the tokens.
print(tokenized_text)

# ['[CLS]', 'here', 'is', 'the', 'sentence', 'i', 'want', 'em', '##bed', '##ding', '##s', 'for', '.', '[SEP]']

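Notice how the word "embeddings" is represented: ['em', '##bed', '##ding', '##s']

The original word has been split into smaller subwords and characters. The two hash signs in front of some of these subwords indicate that the subword or character is part of a larger word and is preceded by another subword. So, for example, '##bed' and 'bed' are different tokens: the first is used when the subword "bed" occurs inside a larger word, and the second is the standalone token.

Why does it look this way? Because BERT's tokenizer was created with a WordPiece model. This model greedily builds a fixed-size vocabulary of the individual characters, subwords, and words that best fit the language data. Since the tokenizer of our BERT model has a vocabulary limit of 30,000, the WordPiece model produced a vocabulary that contains all English characters plus the roughly 30,000 most common words and subwords found in the English corpus the model was trained on. The vocabulary contains four kinds of entries: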

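Whole words

Subwords occurring at the start of a word or in isolation (the "em" in "embeddings" gets the same vector as the standalone "em" in "go get em")

Subwords not at the start of a word, which are preceded by "##"

Individual characters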

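Specifically, the tokenizer first checks whether the whole word is in the vocabulary. If it is not, the tokenizer tries to break the word into the largest possible subwords contained in the vocabulary, and as a last resort it decomposes the word into individual characters. We can therefore always represent a word as, at the very least, a sequence of its individual characters. As a result, out-of-vocabulary words are not mapped to a catch-all token such as "UNK" but are decomposed into subword and character tokens.

So even though the word "embeddings" is not in the vocabulary, we do not treat it as an unknown word. Instead we split it into the subword tokens ['em', '##bed', '##ding', '##s'], which retain some of the contextual meaning of the original word. We can even average these subword embedding vectors to generate an approximate vector for the original word. For more information about WordPiece, see the original paper.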
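The lookup order described above can be sketched as a greedy longest-match-first loop. This is a simplified illustration under stated assumptions, not the library's implementation: `wordpiece_tokenize` and `toy_vocab` are hypothetical, the toy vocabulary is far smaller than BERT's, and returning `[UNK]` when no piece matches at all is an assumed fallback.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # Not even a single character matched the vocabulary.
            return [unk_token]
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary (an assumption for illustration, not BERT's real one).
toy_vocab = {"here", "em", "bed", "##bed", "##ding", "##s", "##d", "e", "m"}
print(wordpiece_tokenize("embeddings", toy_vocab))
# ['em', '##bed', '##ding', '##s']
print(wordpiece_tokenize("here", toy_vocab))
# ['here']
```

Because the longest candidate is tried first, "em" wins over the single character "e" even though both are in the toy vocabulary.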

Here are some examples of the tokens contained in our vocabulary.

list(tokenizer.vocab.keys())[5000:5020]

['knight',
'lap',
'survey',
'ma',
'##ow',
'noise',
'billy',
'##ium',
'shooting',
'guide',
'bedroom',
'priest',
'resistance',
'motor',
'homes',
'sounded',
'giant',
'##mer',
'150',
'scenes']

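After breaking the text into tokens, we have to convert the sentence from a list of strings into a list of vocabulary indices. From here on we will use the example sentence below, which uses the word "bank" in more than one sense.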

# Define a new example sentence with multiple meanings of the word "bank"
text = "After stealing money from the bank vault, the bank robber was seen " \
       "fishing on the Mississippi river bank."

# Add the special tokens.
marked_text = "[CLS] " + text + " [SEP]"

# Split the sentence into tokens.
tokenized_text = tokenizer.tokenize(marked_text)

# Map the token strings to their vocabulary indices.
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)

# Display the words with their indices.
for tup in zip(tokenized_text, indexed_tokens):
    print('{:<12} {:>6,}'.format(tup[0], tup[1]))

[CLS] 101
after 2,044
stealing 11,065
money 2,769
from 2,013
the 1,996
bank 2,924
vault 11,632
, 1,010
the 1,996
bank 2,924
robber 27,307
was 2,001
seen 2,464
fishing 5,645
on 2,006
the 1,996
mississippi 5,900
river 2,314
bank 2,924
. 1,012
[SEP] 102
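Conceptually, converting tokens to IDs is a plain dictionary lookup. The sketch below uses a handful of IDs copied from the table printed above; `tokens_to_ids` and `toy_vocab` are hypothetical names, and the `[UNK]` ID of 100 is an assumption about `bert-base-uncased` (check `tokenizer.unk_token_id`).

```python
# IDs copied from the table printed above; everything else is illustrative.
toy_vocab = {"[CLS]": 101, "[SEP]": 102, "the": 1996, "bank": 2924,
             "river": 2314, "robber": 27307}
UNK_ID = 100  # assumption: ID of [UNK]; verify with tokenizer.unk_token_id


def tokens_to_ids(tokens, vocab, unk_id=UNK_ID):
    """Map each token string to its vocabulary index."""
    return [vocab.get(token, unk_id) for token in tokens]


tokens = ["[CLS]", "the", "bank", "robber", "the", "river", "bank", "[SEP]"]
ids = tokens_to_ids(tokens, toy_vocab)
print(ids)
# [101, 1996, 2924, 27307, 1996, 2314, 2924, 102]
```

Both occurrences of "bank" map to the same ID, 2924. The lookup is context-free; any difference between the two senses can only arise inside the model.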

2.3 Segment ID

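BERT expects sentence pairs to be distinguished with 0s and 1s. That is, for each token in tokenized_text we must specify which sentence it belongs to. For a single sentence, it is enough to pass a series of 1s. For two sentences, assign 0 to every token of the first sentence (including its [SEP]) and 1 to every token of the second.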

# Mark each of the 22 tokens as belonging to sentence "1".
segments_ids = [1] * len(tokenized_text)
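For the two-sentence case, the rule above can be sketched as follows. `make_segment_ids` is a hypothetical helper intended for sentence pairs only, and the token list is a shortened, made-up example.

```python
def make_segment_ids(tokens):
    """Assign 0 up to and including the first [SEP], 1 afterwards."""
    segment_ids = []
    current = 0
    for token in tokens:
        segment_ids.append(current)
        if token == "[SEP]":
            current = 1  # later tokens belong to the second sentence
    return segment_ids


pair_tokens = ["[CLS]", "the", "man", "went", "[SEP]",
               "he", "bought", "milk", "[SEP]"]
print(make_segment_ids(pair_tokens))
# [0, 0, 0, 0, 0, 1, 1, 1, 1]
```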

3. Extracting Embeddings

3.1 Running BERT on our text

Next, we need to convert our data to PyTorch tensors.

# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])

Calling from_pretrained fetches the model from the internet. When we load bert-base-uncased, the definition of the model is printed in the logging output. The model is a deep neural network with 12 layers. Explaining what each layer does is outside the scope of this post; you can find related material in earlier posts on my blog.

model.eval() puts the model in evaluation mode rather than training mode. In evaluation mode, dropout regularization is turned off.

# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states = True, # Whether the model returns all hidden-states.
                                  )

# Put the model in "evaluation" mode, meaning feed-forward operation.
model.eval()

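Next, let's run our example text through the model and fetch the hidden states of the network.

torch.no_grad() tells PyTorch not to build the computation graph during the forward pass (since we will not be running backpropagation here), which reduces memory consumption and speeds things up.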

# Run the text through BERT, and collect all of the hidden states produced
# from all 12 layers.
with torch.no_grad():

    # Note: in current `transformers` releases the second positional argument
    # of `BertModel.forward` is `attention_mask`; pass `token_type_ids=` by
    # keyword if the segment IDs must be interpreted as segment IDs.
    outputs = model(tokens_tensor, segments_tensors)

    # Evaluating the model will return a different number of objects based on
    # how it's configured in the `from_pretrained` call earlier. In this case,
    # because we set `output_hidden_states = True`, the third item will be the
    # hidden states from all layers. See the documentation for more details:
    # https://huggingface.co/transformers/model_doc/bert.html#bertmodel
    hidden_states = outputs[2]

3.2 Understanding the Output

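The information in hidden_states is a little complicated. The variable has four dimensions, in the following order:

The layer number (13 layers)

The batch number (1 sentence)

The word / token number (22 tokens in our sentence)

The hidden unit / feature number (768 features)

Wait, 13 layers? Didn't we say BERT has only 12? The first entry is the output of the input embedding layer; the other 12 are the outputs of the encoder layers.

The second dimension, the batch size, is used when submitting multiple sentences to the model at once; here we have only one sentence.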

print("Number of layers:", len(hidden_states), " (initial embeddings + 12 BERT layers)")

layer_i = 0
print("Number of batches:", len(hidden_states[layer_i]))

batch_i = 0
print("Number of tokens:", len(hidden_states[layer_i][batch_i]))

token_i = 0
print("Number of hidden units:", len(hidden_states[layer_i][batch_i][token_i]))

Number of layers: 13 (initial embeddings + 12 BERT layers)
Number of batches: 1
Number of tokens: 22
Number of hidden units: 768

A quick look at the range of values for a given layer and token shows that most of them fall within [-2, 2], with a few near -12.

# For the token at index 5 in our sentence, select its feature values from layer 5.
token_i = 5
layer_i = 5
vec = hidden_states[layer_i][batch_i][token_i]

# Plot the values as a histogram to show their distribution.
plt.figure(figsize=(10,10))
plt.hist(vec, bins=200)
plt.show()

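Grouping the values by layer makes sense for the model, but for our purposes we want them grouped by token.

Current dimensions: [layers, batches, tokens, features]

Desired dimensions: [tokens, layers, features]

Fortunately, PyTorch's permute function makes it easy to rearrange the dimensions. However, hidden_states is currently a Python tuple of per-layer tensors rather than a single tensor, so we first combine the layers into one tensor.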

# Concatenate the tensors for all layers. We use `stack` here to
# create a new dimension in the tensor.
token_embeddings = torch.stack(hidden_states, dim=0)

token_embeddings.size()
# torch.Size([13, 1, 22, 768])

Next, let's drop the "batch" dimension, since we don't need it.

# Remove dimension 1, the "batches".
token_embeddings = token_embeddings.squeeze(dim=1)

token_embeddings.size()
# torch.Size([13, 22, 768])

Finally, we use permute to swap the "layers" and "tokens" dimensions.

# Swap dimensions 0 and 1.
token_embeddings = token_embeddings.permute(1,0,2)

token_embeddings.size()
# torch.Size([22, 13, 768])
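To see what swapping the first two dimensions means element by element, here is a dependency-free sketch using nested Python lists in place of tensors. The sizes (3 layers, 2 tokens, 4 features) and the variable names are made up for readability; with real BERT outputs you would keep using `permute` as above.

```python
# Toy sizes (assumptions for readability): 3 layers, 2 tokens, 4 features.
n_layers, n_tokens, n_features = 3, 2, 4

# layer_major[layer][tok] is the feature vector of token `tok` at `layer`.
layer_major = [
    [[100 * layer + 10 * tok + feat for feat in range(n_features)]
     for tok in range(n_tokens)]
    for layer in range(n_layers)
]

# Swapping the first two axes: token_major[tok][layer] == layer_major[layer][tok].
token_major = [list(per_token) for per_token in zip(*layer_major)]

print(len(token_major), len(token_major[0]), len(token_major[0][0]))
# 2 3 4
print(token_major[1][2])  # token 1 at layer 2
# [210, 211, 212, 213]
```

No values change; only the order in which they are indexed does, which is why the later loops can iterate over tokens directly.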

3.3 Creating word and sentence vectors from hidden states

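We would like a single vector for each token, or perhaps a single vector for the whole sentence. But for each token of our input we have 13 separate vectors, each of length 768. To get a single vector, we need to combine some of the layer vectors. But which layer, or which combination of layers, works best?

Word Vectors

We create word vectors in two ways. The first is to concatenate the last four layers, which gives each token a vector of length 4 x 768 = 3,072.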

# Stores the token vectors, with shape [22 x 3,072]
token_vecs_cat = []

# `token_embeddings` is a [22 x 13 x 768] tensor.

# For each token in the sentence...
for token in token_embeddings:

    # `token` is a [13 x 768] tensor

    # Concatenate the vectors (that is, append them together) from
    # the last four layers.
    # Each layer vector is 768 values, so `cat_vec` is length 3072.
    cat_vec = torch.cat((token[-1], token[-2], token[-3], token[-4]), dim=0)

    # Use `cat_vec` to represent `token`.
    token_vecs_cat.append(cat_vec)

print('Shape is: %d x %d' % (len(token_vecs_cat), len(token_vecs_cat[0])))
# Shape is: 22 x 3072

The second way is to sum the last four layers.

# Stores the token vectors, with shape [22 x 768]
token_vecs_sum = []

# `token_embeddings` is a [22 x 13 x 768] tensor.

# For each token in the sentence...
for token in token_embeddings:

    # `token` is a [13 x 768] tensor

    # Sum the vectors from the last four layers.
    sum_vec = torch.sum(token[-4:], dim=0)

    # Use `sum_vec` to represent `token`.
    token_vecs_sum.append(sum_vec)

print('Shape is: %d x %d' % (len(token_vecs_sum), len(token_vecs_sum[0])))
# Shape is: 22 x 768

Sentence Vectors

There are many strategies for obtaining a single vector for a whole sentence. A simple one is to average the second-to-last-layer vectors of all tokens.

# `hidden_states` has shape [13 x 1 x 22 x 768]

# `token_vecs` is a tensor with shape [22 x 768]
token_vecs = hidden_states[-2][0]

# Calculate the average of all 22 token vectors.
sentence_embedding = torch.mean(token_vecs, dim=0)

print("Our final sentence embedding vector of shape:", sentence_embedding.size())
# Our final sentence embedding vector of shape: torch.Size([768])

3.4 Confirming contextually dependent vectors

To confirm that the values of these vectors are in fact context dependent, let's look at the vectors for the word "bank" in our example sentence.

“After stealing money from the bank vault, the bank robber was seen fishing on the Mississippi river bank.”

for i, token_str in enumerate(tokenized_text):
    print(i, token_str)

0 [CLS]
1 after
2 stealing
3 money
4 from
5 the
6 bank
7 vault
8 ,
9 the
10 bank
11 robber
12 was
13 seen
14 fishing
15 on
16 the
17 mississippi
18 river
19 bank
20 .
21 [SEP]

In this example, the word vectors were built by summing the last four layers; we print them here for comparison.

print('First 5 vector values for each instance of "bank".')
print('')
print("bank vault ", str(token_vecs_sum[6][:5]))
print("bank robber ", str(token_vecs_sum[10][:5]))
print("river bank ", str(token_vecs_sum[19][:5]))

First 5 vector values for each instance of "bank".

bank vault tensor([ 3.3596, -2.9805, -1.5421, 0.7065, ...])
bank robber tensor([ 2.7359, -2.5577, -1.3094, 0.6797, ...])
river bank tensor([ 1.5266, -0.8895, -0.5152, -0.9298, ...])

The values clearly differ, but computing the cosine similarity between the vectors gives a more precise comparison.

from scipy.spatial.distance import cosine

# Calculate the cosine similarity between the word bank
# in "bank robber" vs "bank vault" (same meaning).
same_bank = 1 - cosine(token_vecs_sum[10], token_vecs_sum[6])

# Calculate the cosine similarity between the word bank
# in "bank robber" vs "river bank" (different meanings).
diff_bank = 1 - cosine(token_vecs_sum[10], token_vecs_sum[19])

print('Vector similarity for *similar* meanings: %.2f' % same_bank) # 0.94
print('Vector similarity for *different* meanings: %.2f' % diff_bank) # 0.69
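SciPy's `cosine` returns a distance, which is why the code above computes `1 - cosine(...)` to obtain a similarity. For reference, the similarity itself can be sketched in a few lines of plain Python; `cosine_similarity` is a hypothetical helper and the vectors are toy values, not BERT outputs.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors (assumptions, not BERT outputs).
print(round(cosine_similarity([1.0, 0.0], [1.0, 0.0]), 2))  # 1.0
print(round(cosine_similarity([1.0, 0.0], [0.0, 1.0]), 2))  # 0.0
print(round(cosine_similarity([1.0, 2.0], [2.0, 4.0]), 2))  # 1.0
```

The last pair shows that the measure ignores vector length: two vectors pointing in the same direction score 1.0 regardless of their magnitudes.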

3.5 Pooling Strategy & Layer Choice

BERT Authors

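The BERT authors tested word-embedding strategies by feeding different combinations of vectors as input features to a NER task and observing the resulting F1 scores.

While concatenating the last four layers produced the best results on this specific task, many of the other methods perform nearly as well. In general, it is advisable to test different versions for your specific application, since results may vary.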
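One convenient way to compare strategies for your own application is to put them behind a single function. The sketch below covers only the three combinations used in this post; `pool_layers` and the strategy names are hypothetical, and plain lists with made-up numbers stand in for one token's per-layer vectors.

```python
def pool_layers(token_layers, strategy):
    """Combine one token's per-layer vectors into a single vector.

    token_layers: list of equal-length lists, ordered from the embedding
    output (index 0) to the last encoder layer (index -1).
    """
    if strategy == "second_to_last":
        return list(token_layers[-2])
    if strategy == "sum_last_four":
        return [sum(values) for values in zip(*token_layers[-4:])]
    if strategy == "concat_last_four":
        # Same order as the loop earlier in this post: last layer first.
        return [x for layer in reversed(token_layers[-4:]) for x in layer]
    raise ValueError("unknown strategy: " + strategy)


# Toy token with 5 layers of 2 features each (made-up numbers).
toy = [[0, 0], [1, 10], [2, 20], [3, 30], [4, 40]]
for name in ("second_to_last", "sum_last_four", "concat_last_four"):
    print(name, pool_layers(toy, name))
# second_to_last [3, 30]
# sum_last_four [10, 100]
# concat_last_four [4, 40, 3, 30, 2, 20, 1, 10]
```

Keeping the strategy as a parameter makes it easy to rerun a downstream evaluation once per option and compare scores.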

Han Xiao's BERT-as-service

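Han Xiao created an open-source project on GitHub called bert-as-service, which is intended to create word embeddings for your text using BERT. He experimented with different approaches to combining these embeddings and shares some conclusions and rationale on the project's FAQ page.

Han Xiao's view is that:

The first layer is the embedding layer. Since it has no contextual information, the same word gets the same vector in every context.

As the embeddings move deeper into the network, they pick up more and more contextual information with each layer.

As you approach the final layer, however, the word embeddings start to pick up information specific to BERT's pre-training tasks (MLM and NSP).

The word embeddings from the second-to-last layer are therefore a reasonable choice.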
