Building a seq2seq Model with LSTM for Language Translation

Published: 2024/7/5

Contents

- 1. Data processing
- 2. Encoder and decoder data
  - 2.1 Encoder
  - 2.2 Decoder
  - 2.3 Model
- 3. Training
- 4. Inference model
- 5. Sampling

Reference: 基于深度學習的自然語言處理 (Natural Language Processing Based on Deep Learning)

1. Data processing

Read the data

    with open('deu.txt', 'r', encoding='utf-8') as f:
        lines = f.read().split('\n')
    print("The file has {} lines.".format(len(lines)))
    num_samples = 20000  # number of corpus lines to use
    lines_to_use = lines[:min(num_samples, len(lines) - 1)]
    print(lines_to_use)

Replace digits

    import re

    print(lines_to_use[19516])
    for i in range(len(lines_to_use)):
        # replace every digit (\d) with ' _NUMBER_ '
        lines_to_use[i] = re.sub(r'\d', ' _NUMBER_ ', lines_to_use[i])
    print(lines_to_use[19516])

Output (the digit has been replaced):

    Turn to channel 1. Wechsle auf Kanal eins.
    Turn to channel _NUMBER_ . Wechsle auf Kanal eins.
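Because `\d` matches one digit at a time, a multi-digit number produces several ` _NUMBER_ ` tokens. The sketch below compares that behaviour with `\d+`, which is an alternative pattern and not the one used in this article; the example sentence is made up for illustration.

```python
import re

sentence = "Turn to channel 10."  # hypothetical sentence with a two-digit number

# Pattern used in the article: each digit is replaced separately.
per_digit = re.sub(r'\d', ' _NUMBER_ ', sentence)

# Alternative (an assumption, not the article's choice):
# each run of digits is replaced once.
per_number = re.sub(r'\d+', ' _NUMBER_ ', sentence)

print(per_digit.split())
print(per_number.split())
```

Which variant is preferable depends on whether the model should see the number of digits; the article's pattern keeps that information as repeated tokens.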

Split into inputs and outputs

    input_texts = []      # input sentences
    target_texts = []     # output sentences
    input_words = set()   # input vocabulary
    target_words = set()  # output vocabulary
    for line in lines_to_use:
        x, y = line.split('\t')
        y = 'BEGIN_ ' + y + ' _END'  # add start and end markers to the output
        input_texts.append(x)
        target_texts.append(y)
        for word in x.split():
            if word not in input_words:
                input_words.add(word)
        for word in y.split():
            if word not in target_words:
                target_words.add(word)

Maximum length of the input and output sentences

    max_input_seq_len = max([len(seq.split()) for seq in input_texts])    # 11
    max_target_seq_len = max([len(seq.split()) for seq in target_texts])  # 15

Number of input and output tokens

    input_words = sorted(list(input_words))
    target_words = sorted(list(target_words))
    num_encoder_tokens = len(input_words)   # 5724
    num_decoder_tokens = len(target_words)  # 9126

Build the mappings between tokens and ids

    inputToken_idx = {token: i for (i, token) in enumerate(input_words)}
    outputToken_idx = {token: i for (i, token) in enumerate(target_words)}

    idx_inputToken = {i: token for (i, token) in enumerate(input_words)}
    idx_outputToken = {i: token for (i, token) in enumerate(target_words)}
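To make the two mappings concrete, here is a minimal sketch on a toy vocabulary (the six words are made-up stand-ins for `target_words`; `token_to_idx` and `idx_to_token` are illustrative names). It encodes a sentence to ids and decodes it back.

```python
# Toy vocabulary standing in for target_words (hypothetical data).
toy_words = sorted({'BEGIN_', '_END', 'Wo', 'ist', 'mein', 'Auto?'})

# Same construction as in the article: position in the sorted list is the id.
token_to_idx = {token: i for i, token in enumerate(toy_words)}
idx_to_token = {i: token for i, token in enumerate(toy_words)}

sentence = 'BEGIN_ Wo ist mein Auto? _END'
ids = [token_to_idx[w] for w in sentence.split()]
restored = ' '.join(idx_to_token[i] for i in ids)

print(ids)
print(restored == sentence)
```

One consequence of numbering from 0: the id 0 belongs to a real token, and it is also the value that zero-initialised arrays use for padding positions, so padding and that token are indistinguishable to the model.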

2. Encoder and decoder data

Pay attention to what each dimension means

    import numpy as np

    encoder_input_data = np.zeros(
        (len(input_texts), max_input_seq_len),
        # (number of sentences, max input sentence length)
        dtype=np.float32)
    decoder_input_data = np.zeros(
        (len(target_texts), max_target_seq_len),
        # (number of sentences, max output sentence length)
        dtype=np.float32)
    decoder_output_data = np.zeros(
        (len(target_texts), max_target_seq_len, num_decoder_tokens),
        # (number of sentences, max output sentence length, number of output token ids)
        dtype=np.float32)

Fill the arrays

    for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
        for t, word in enumerate(input_text.split()):
            encoder_input_data[i, t] = inputToken_idx[word]
        for t, word in enumerate(target_text.split()):
            decoder_input_data[i, t] = outputToken_idx[word]
            if t > 0:
                # the decoder output is one time step ahead of the decoder input
                decoder_output_data[i, t-1, outputToken_idx[word]] = 1.
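The one-step offset between decoder input and decoder output is easier to see on a single sentence. The following sketch uses a made-up target sentence and plain lists instead of the one-hot array; it only illustrates the alignment, not the actual array layout.

```python
# Hypothetical target sentence, already wrapped in the start/end markers.
tokens = 'BEGIN_ Wo ist mein Auto? _END'.split()

decoder_input = tokens       # what the decoder reads at each step
decoder_target = tokens[1:]  # what it should predict: the next token

# zip stops at the shorter list, so the final input token (_END) gets no target,
# which matches the all-zero last row in decoder_output_data.
pairs = list(zip(decoder_input, decoder_target))
for step, (inp, tgt) in enumerate(pairs):
    print(step, inp, '->', tgt)
```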

2.1 Encoder

    from keras.layers import Input, LSTM, Embedding, Dense
    from keras.models import Model

    embedding_size = 256  # embedding dimension
    rnn_size = 64

    # encoder
    encoder_inputs = Input(shape=(None,))
    encoder_after_embedding = Embedding(input_dim=num_encoder_tokens,  # vocabulary size
                                        output_dim=embedding_size)(encoder_inputs)
    # return_state: Boolean. Whether to return the last state in addition to the output.
    encoder_lstm = LSTM(units=rnn_size, return_state=True)
    _, state_h, state_c = encoder_lstm(encoder_after_embedding)
    encoder_states = [state_h, state_c]  # thought vector

2.2 Decoder

    # decoder
    decoder_inputs = Input(shape=(None,))
    decoder_after_embedding = Embedding(input_dim=num_decoder_tokens,  # vocabulary size
                                        output_dim=embedding_size)(decoder_inputs)
    decoder_lstm = LSTM(units=rnn_size, return_sequences=True, return_state=True)
    # initialise the decoder LSTM state with the thought vector produced by the encoder
    decoder_outputs, _, _ = decoder_lstm(decoder_after_embedding,
                                         initial_state=encoder_states)
    # one output unit per target token, multi-class classification
    decoder_dense = Dense(num_decoder_tokens, activation='softmax')
    decoder_outputs = decoder_dense(decoder_outputs)

2.3 Model

    model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    model.summary()

    from keras.utils import plot_model
    plot_model(model, to_file='model.png')

Output:

    Model: "model_1"
    __________________________________________________________________________________________________
    Layer (type)                    Output Shape         Param #     Connected to
    ==================================================================================================
    input_1 (InputLayer)            (None, None)         0
    __________________________________________________________________________________________________
    input_2 (InputLayer)            (None, None)         0
    __________________________________________________________________________________________________
    embedding_1 (Embedding)         (None, None, 256)    1465344     input_1[0][0]
    __________________________________________________________________________________________________
    embedding_2 (Embedding)         (None, None, 256)    2336256     input_2[0][0]
    __________________________________________________________________________________________________
    lstm_1 (LSTM)                   [(None, 64), (None,  82176       embedding_1[0][0]
    __________________________________________________________________________________________________
    lstm_2 (LSTM)                   [(None, None, 64), ( 82176       embedding_2[0][0]
                                                                     lstm_1[0][1]
                                                                     lstm_1[0][2]
    __________________________________________________________________________________________________
    dense_1 (Dense)                 (None, None, 9126)   593190      lstm_2[0][0]
    ==================================================================================================
    Total params: 4,559,142
    Trainable params: 4,559,142
    Non-trainable params: 0
    __________________________________________________________________________________________________

![Model structure diagram](https://img-blog.csdnimg.cn/20201215221559994.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzIxMjAxMjY3,size_16,color_FFFFFF,t_70)
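The parameter counts in the summary can be checked by hand. The sketch below recomputes them from the vocabulary sizes and layer widths given earlier, using the standard formulas for an embedding table, an LSTM (four gates, each with input weights, recurrent weights and a bias) and a dense layer; the helper function names are illustrative.

```python
num_encoder_tokens = 5724
num_decoder_tokens = 9126
embedding_size = 256
rnn_size = 64

def embedding_params(vocab_size, dim):
    # one vector of length dim per token
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias vector
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # weight matrix plus bias
    return input_dim * units + units

counts = {
    'embedding_1': embedding_params(num_encoder_tokens, embedding_size),
    'embedding_2': embedding_params(num_decoder_tokens, embedding_size),
    'lstm_1': lstm_params(embedding_size, rnn_size),
    'lstm_2': lstm_params(embedding_size, rnn_size),
    'dense_1': dense_params(rnn_size, num_decoder_tokens),
}
total = sum(counts.values())
print(counts)
print(total)
```

More than 80% of the parameters sit in the two embedding tables, so vocabulary size dominates the model size here.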

3. Training

Train with a callback that saves the best model

    from keras.callbacks import ModelCheckpoint

    filepath = 'weights.best.h5'
    # save_best_only=True: the file is overwritten each time the monitored metric improves
    # save_freq=2: the check runs every 2 batches instead of once per epoch
    checkpoint = ModelCheckpoint(filepath, monitor='accuracy', verbose=1,
                                 save_best_only=True, mode='max', save_freq=2)
    callbacks_list = [checkpoint]
    # https://keras.io/api/callbacks/model_checkpoint/
    history = model.fit(x=[encoder_input_data, decoder_input_data],
                        y=decoder_output_data,
                        batch_size=128,
                        epochs=200,
                        validation_split=0.1,
                        callbacks=callbacks_list)
    model.save('model.h5')

Plot the training curves

    from matplotlib import pyplot as plt

    loss = history.history['loss']
    val_loss = history.history['val_loss']
    acc = history.history['accuracy']
    val_acc = history.history['val_accuracy']

    plt.plot(loss, label='train Loss')
    plt.plot(val_loss, label='valid Loss')
    plt.title('Training and Validation Loss')
    plt.legend()
    plt.grid()
    plt.show()

    plt.plot(acc, label='train Acc')
    plt.plot(val_acc, label='valid Acc')
    plt.title('Training and Validation Accuracy')
    plt.legend()
    plt.grid()
    plt.show()

4. Inference model

Encoder

    # input (passed through the embedding) -> thought vector
    encoder_model = Model(encoder_inputs, encoder_states)

Decoder

    # the encoder output serves as the initial state of the decoder
    decoder_state_input_h = Input(shape=(rnn_size,))
    decoder_state_input_c = Input(shape=(rnn_size,))
    decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

The initial state and the embedded input are fed through the LSTM, which returns decoder_outputs_inf, state_h_inf and state_c_inf

    decoder_outputs_inf, state_h_inf, state_c_inf = decoder_lstm(
        decoder_after_embedding, initial_state=decoder_states_inputs)
    # h and c become the state inputs of the next inference step
    decoder_states_inf = [state_h_inf, state_c_inf]
    # the LSTM output goes through the fully connected layer to predict the next word
    decoder_outputs_inf = decoder_dense(decoder_outputs_inf)
    decoder_model = Model(
        [decoder_inputs] + decoder_states_inputs,
        [decoder_outputs_inf] + decoder_states_inf
    )

5. Sampling

    def decode_sequence(input_seq):
        # encoder_states = [state_h, state_c]
        states_value = encoder_model.predict(input_seq)  # list of 2 arrays, each 1 x rnn_size
        target_seq = np.zeros((1, 1))
        # the target input sequence starts with the idx of 'BEGIN_'
        target_seq[0, 0] = outputToken_idx['BEGIN_']
        stop = False
        decoded_sentence = ''
        while not stop:
            output_tokens, h, c = decoder_model.predict([target_seq] + states_value)
            # output_tokens: 1 x 1 x 9126; h, c: 1 x rnn_size
            sampled_token_idx = np.argmax(output_tokens)
            sampled_word = idx_outputToken[sampled_token_idx]
            decoded_sentence += ' ' + sampled_word
            # stop at the end marker or once the sentence exceeds 60 characters
            if sampled_word == '_END' or len(decoded_sentence) > 60:
                stop = True
            target_seq = np.zeros((1, 1))
            target_seq[0, 0] = sampled_token_idx  # input for the next prediction
            # Update states
            states_value = [h, c]  # state input for the next step
        return decoded_sentence

    # quick sampling test
    text_to_translate = 'Are you happy?'
    encoder_input_to_translate = np.zeros((1, max_input_seq_len), dtype=np.float32)
    for t, word in enumerate(text_to_translate.split()):
        encoder_input_to_translate[0, t] = inputToken_idx[word]
    # encoder_input_to_translate: [[ids, ..., 0, 0, 0, 0]]
    print(decode_sequence(encoder_input_to_translate))

Output:

    text_to_translate = 'Are you happy?'
    Output: Sind Sie glücklich? _END  # Are you happy?
    text_to_translate = 'Where is my car?'
    Output: Wo ist mein Auto? _END  # Where is my car?
    text_to_translate = 'When I see you, I fall in love with you!'
    Output: Sind Sie mit uns gehen. _END  # Are you coming with us?
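The sampling loop is a greedy decoder: at every step it takes the highest-scoring token and feeds it back in. The sketch below isolates that loop from Keras. `toy_step` is a hypothetical stand-in for `decoder_model.predict`, and unlike the article's version this sketch caps the number of steps rather than the character count and leaves the end token out of the result.

```python
def greedy_decode(step_fn, start_id, end_id, init_state, max_steps=20):
    """Repeatedly pick the arg-max token and feed it back as the next input."""
    token_id, state = start_id, init_state
    output = []
    for _ in range(max_steps):
        scores, state = step_fn(token_id, state)
        token_id = max(range(len(scores)), key=scores.__getitem__)  # argmax
        if token_id == end_id:
            break
        output.append(token_id)
    return output

def toy_step(token_id, state):
    """Hypothetical decoder step: always scores the next id highest."""
    vocab_size = 5
    scores = [0.0] * vocab_size
    scores[(token_id + 1) % vocab_size] = 1.0
    return scores, state + 1

print(greedy_decode(toy_step, start_id=0, end_id=4, init_state=0))
```

Greedy decoding commits to one token per step and never revisits a choice, which is one possible reason the third example above drifts into an unrelated sentence.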

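Notes:

The sentence to translate must not be longer than the maximum input length.
It must also not contain any word that is missing from the vocabulary. For example, dear is in the vocabulary, but dear! (with the punctuation attached) is not, so it raises an error.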
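Both restrictions can be checked before the sentence reaches the model. The helper below is a hypothetical addition, not part of the article's code: `encode_sentence` and the toy vocabulary are made up for illustration, and it raises a descriptive `ValueError` instead of the bare `KeyError` or `IndexError` the original loop would produce.

```python
def encode_sentence(text, token_to_idx, max_len):
    """Turn a sentence into a zero-padded id list, validating it first."""
    words = text.split()
    if len(words) > max_len:
        raise ValueError(
            'sentence has {} tokens, maximum is {}'.format(len(words), max_len))
    unknown = [w for w in words if w not in token_to_idx]
    if unknown:
        raise ValueError('unknown tokens: {}'.format(unknown))
    ids = [token_to_idx[w] for w in words]
    # pad with 0, the same value the zero-initialised input array uses
    return ids + [0] * (max_len - len(ids))

# Toy vocabulary (hypothetical); the article builds the real one from the corpus.
toy_vocab = {'Are': 0, 'you': 1, 'happy?': 2, 'dear': 3}

print(encode_sentence('Are you happy?', toy_vocab, max_len=5))
try:
    encode_sentence('dear!', toy_vocab, max_len=5)
except ValueError as err:
    print(err)
```

A more thorough fix would be to separate punctuation from words during tokenisation and to reserve an id for unknown words, but that changes the preprocessing described above.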
