Seq2Seq Models: PyTorch Learning Notes, Comparing Seq2Seq Models
PyTorch Learning Notes: torchtext and PyTorch, Example 4
0. Introduction to the PyTorch Seq2Seq project
After finishing the basics of torchtext, I found this tutorial, *Understanding and implementing seq2seq models with PyTorch and torchtext*. The project consists of six sub-projects: 1. ~~Seq2Seq with neural networks~~ 2. ~~Learning phrase representations with an RNN encoder-decoder for statistical machine translation~~ 3. ~~Neural machine translation by jointly learning to align and translate~~ 4. ~~Packed padded sequences, masking and inference~~ 5. ~~Convolutional Seq2Seq~~ 6. ~~Transformer~~
After finishing the Transformer I took two days off from studying. Over these two days I want to compare and summarize the models. I already wrote a summary after completing the first three; today the focus is on how the six models change and how they are implemented. Implementation is the key part: it took 15 days, and of the implementations I can still only really follow the plain Seq2Seq...
7. Summary: from plain Seq2Seq to Transformer
All six models are Seq2Seq: each has an Encoder and a Decoder. What differs is the core of the model, with new pieces continually added between layers or between the Encoder and Decoder: LSTM -> multi-layer GRU -> Attention -> PadMaskAttention -> CNN -> Transformer
- 1 and 2 are plain Seq2Seq, using an LSTM and its variant, the GRU, respectively
- 3 and 4 extend attention, adding packing, padding and masking
- 5 uses a CNN
- 6 is all attention; every fancy trick gets thrown in
7.1 Model architecture
- Encoder
- Odds and ends (attention, pad, mask, ...)
- Decoder
- Seq2Seq (tying it all together)
7.2 Comparing models 1 and 2
- In the Encoder these two models hardly differ: one uses an LSTM, the other a GRU. The inputs are the source sentence and the previous hidden state (the LSTM also carries a cell state).
- In the Decoder, both condition on the context vector z produced by the Encoder, the word predicted at the previous time step (or the ground-truth word from the previous time step, chosen according to the teacher forcing ratio), and the Decoder's previous hidden state.
Note the Decoder part in the figure. OK, let's implement it.
```python
import random

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, input_dim, hid_dim, emb_dim, dropout):
        super(Encoder, self).__init__()
        self.input_dim = input_dim
        self.emb_dim = emb_dim
        self.hid_dim = hid_dim
        self.dropout = dropout
        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src):
        embedded = self.dropout(self.embedding(src))
        outputs, hidden = self.rnn(embedded)
        return hidden

class Decoder(nn.Module):
    def __init__(self, output_dim, hid_dim, emb_dim, dropout):
        super(Decoder, self).__init__()
        self.output_dim = output_dim
        self.emb_dim = emb_dim
        self.hid_dim = hid_dim
        self.dropout = dropout
        self.embedding = nn.Embedding(output_dim, emb_dim)
        # y_t and z are concatenated before being passed to the GRU, so its input dimension is emb_dim + hid_dim
        self.rnn = nn.GRU(emb_dim + hid_dim, hid_dim)
        # the linear layer takes the concatenation of y_t, s_t and z; the hidden state and context vector
        # share the same dimension, so the input dimension is emb_dim + hid_dim * 2
        self.out = nn.Linear(emb_dim + hid_dim * 2, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, input, hidden, context):
        input = input.unsqueeze(0)
        embedded = self.dropout(self.embedding(input))
        emb_context = torch.cat((embedded, context), dim=2)
        output, hidden = self.rnn(emb_context, hidden)
        output = torch.cat((embedded.squeeze(0), hidden.squeeze(0), context.squeeze(0)), dim=1)
        prediction = self.out(output)
        return prediction, hidden

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super(Seq2Seq, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        assert encoder.hid_dim == decoder.hid_dim, \
            "Hidden dimensions of encoder and decoder must be equal!"

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        batch_size = trg.shape[1]
        max_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        outputs = torch.zeros(max_len, batch_size, trg_vocab_size).to(self.device)
        context = self.encoder(src)
        hidden = context
        input = trg[0, :]
        for t in range(1, max_len):
            output, hidden = self.decoder(input, hidden, context)
            outputs[t] = output
            teacher_force = random.random() < teacher_forcing_ratio
            top1 = output.max(1)[1]
            input = (trg[t] if teacher_force else top1)
        return outputs
```
The problem with this approach is that the context vector the Decoder receives summarizes every word the Encoder saw; when we need information about one specific time step, it isn't there, only the global information of the whole sentence. Hence, attention.
7.3 Comparing models 2, 3 and 4
Given the need above, the Encoder must output its hidden state at every time step, and we then take a weighted sum over them.
The weights are a set of values tied to the Decoder's current time step (meaning that, for each Decoder time step, the weighting coefficients over all of the Encoder's hidden states are different). These weights form the attention vector, denoted a.
The dimension of a equals the number of time steps in the Encoder sequence; each component lies between 0 and 1, a is different for every Decoder time step, but the weights over all time steps (i.e. all components of a) always sum to 1.
In other words, the attention vector tells us which Encoder time steps to focus on. Taking the weighted sum of the Encoder hidden states with the attention vector gives w(t), a context vector that is fed into the RNN and the linear prediction layer. (Note: at the Decoder's first time step, the hidden state fed into the RNN layer is not w but h, the hidden state output at the Encoder's last time step.)
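As a minimal sketch of this weighted sum (the toy shapes and tensors here are my own, not taken from the tutorial code further below):

```python
import torch

# toy shapes: 5 source steps, batch of 2, encoder hidden size 4
src_len, batch_size, enc_hid = 5, 2, 4
encoder_outputs = torch.randn(src_len, batch_size, enc_hid)   # one hidden state per source step
a = torch.softmax(torch.randn(batch_size, src_len), dim=1)    # attention vector, each row sums to 1

# weighted sum over the source time steps: w = sum_i a_i * h_i
weighted = torch.bmm(a.unsqueeze(1),                          # [batch, 1, src_len]
                     encoder_outputs.permute(1, 0, 2))        # [batch, src_len, enc_hid]
print(weighted.shape)                                         # torch.Size([2, 1, 4])
```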
The Encoder here is also a bidirectional RNN. With a bidirectional RNN, each layer has two RNNs:
- a forward RNN that processes the sentence left to right (green in the figure)
- a backward RNN that processes it right to left (yellow in the figure)

All we have to do is set bidirectional = True and pass in the embedded sentence.
Since the Decoder is not bidirectional, it only needs a single context vector $z$ as its initial hidden state $s_0$, yet the Encoder provides two, one forward and one backward ($z^\rightarrow = h_T^\rightarrow$ and $z^\leftarrow = h_T^\leftarrow$). We resolve this by concatenating the two contextting vectors together, passing them through a linear layer $g$ and applying a $\tanh$ activation: $$z = \tanh(g(h_T^\rightarrow, h_T^\leftarrow)) = \tanh(g(z^\rightarrow, z^\leftarrow)) = s_0$$
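A small sketch of how $s_0$ can be formed from the two final hidden states (toy dimensions of my own; the real Encoder below does the same thing):

```python
import torch
import torch.nn as nn

enc_hid_dim, dec_hid_dim, batch_size = 4, 6, 2
fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)

# hidden from a 1-layer bidirectional GRU has shape [2, batch, enc_hid_dim]:
# hidden[-2] is the final forward state, hidden[-1] the final backward state
hidden = torch.randn(2, batch_size, enc_hid_dim)
s0 = torch.tanh(fc(torch.cat((hidden[-2], hidden[-1]), dim=1)))  # [batch, dec_hid_dim]
```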
Since we want the model to be able to look back over the whole source sentence, we return outputs, the stacked forward and backward hidden states for every token in the source sentence. We also return hidden, which serves as our initial hidden state in the Decoder.
OK, let's implement it and look at the differences; you will notice the Decoder is largely similar.
```python
import random

import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout):
        super(Encoder, self).__init__()
        self.input_dim = input_dim
        self.emb_dim = emb_dim
        self.enc_hid_dim = enc_hid_dim
        self.dec_hid_dim = dec_hid_dim
        self.dropout = dropout
        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional=True)
        self.fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src):
        embedded = self.dropout(self.embedding(src))
        outputs, hidden = self.rnn(embedded)
        # hidden[-2,:,:] and hidden[-1,:,:] are the final forward and backward states; combine them through tanh
        hidden = torch.tanh(self.fc(torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)))
        return outputs, hidden

class Attention(nn.Module):
    def __init__(self, enc_hid_dim, dec_hid_dim):
        super(Attention, self).__init__()
        self.enc_hid_dim = enc_hid_dim
        self.dec_hid_dim = dec_hid_dim
        self.attn = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim)
        self.v = nn.Parameter(torch.rand(dec_hid_dim))

    def forward(self, hidden, encoder_outputs):
        batch_size = encoder_outputs.shape[1]
        src_len = encoder_outputs.shape[0]
        hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim=2)))
        energy = energy.permute(0, 2, 1)
        v = self.v.repeat(batch_size, 1).unsqueeze(1)
        attention = torch.bmm(v, energy).squeeze(1)
        return F.softmax(attention, dim=1)

class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout, attention):
        super(Decoder, self).__init__()
        self.emb_dim = emb_dim
        self.enc_hid_dim = enc_hid_dim
        self.dec_hid_dim = dec_hid_dim
        self.output_dim = output_dim
        self.dropout = dropout
        self.attention = attention
        self.embedding = nn.Embedding(output_dim, emb_dim)
        self.rnn = nn.GRU((enc_hid_dim * 2) + emb_dim, dec_hid_dim)
        self.out = nn.Linear((enc_hid_dim * 2) + dec_hid_dim + emb_dim, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, input, hidden, encoder_outputs):
        input = input.unsqueeze(0)
        embedded = self.dropout(self.embedding(input))
        a = self.attention(hidden, encoder_outputs)
        a = a.unsqueeze(1)
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        weighted = torch.bmm(a, encoder_outputs)
        weighted = weighted.permute(1, 0, 2)
        rnn_input = torch.cat((embedded, weighted), dim=2)
        output, hidden = self.rnn(rnn_input, hidden.unsqueeze(0))
        assert (output == hidden).all()
        embedded = embedded.squeeze(0)
        output = output.squeeze(0)
        weighted = weighted.squeeze(0)
        output = self.out(torch.cat((output, weighted, embedded), dim=1))
        # output = [batch size, output dim]
        return output, hidden.squeeze(0)

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        batch_size = src.shape[1]
        max_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        outputs = torch.zeros(max_len, batch_size, trg_vocab_size).to(self.device)
        encoder_outputs, hidden = self.encoder(src)
        output = trg[0, :]
        for t in range(1, max_len):
            output, hidden = self.decoder(output, hidden, encoder_outputs)
            outputs[t] = output
            teacher_force = random.random() < teacher_forcing_ratio
            top1 = output.max(1)[1]
            output = (trg[t] if teacher_force else top1)
        return outputs
```
Now let's look at how PadMaskAttention differs from plain Attention.
Why pack the sequences at all?
When we train on a batch of examples at once, the examples have different lengths, so we naturally pad the shorter sentences up to the length of the longest one. This creates a problem: a sentence with a single word and a sentence with 20 words end up the same length after padding, but the former now carries 19 pad tokens, so the LSTM builds its representation by running over a lot of useless characters, and the resulting sentence representation is distorted.
The fix is torch.nn.utils.rnn.pack_padded_sequence() together with torch.nn.utils.rnn.pad_packed_sequence(). In short, whenever an RNN has to process variable-length sentences, we can pair torch.nn.utils.rnn.pack_padded_sequence() with torch.nn.utils.rnn.pad_packed_sequence() to keep the padding from affecting the sentence representation.
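A minimal, self-contained sketch of how the two calls pair up, using toy data of my own:

```python
import torch
import torch.nn as nn

rnn = nn.GRU(input_size=3, hidden_size=5)

# a batch of 2 sequences padded to length 4; the true lengths are 4 and 2
padded = torch.randn(4, 2, 3)             # [seq len, batch, features]
lengths = torch.tensor([4, 2])

packed = nn.utils.rnn.pack_padded_sequence(padded, lengths)
packed_out, hidden = rnn(packed)          # the GRU never runs over the pad positions

# unpack back to a padded tensor; pad positions in `outputs` are zeros
outputs, out_lengths = nn.utils.rnn.pad_packed_sequence(packed_out)
print(outputs.shape, hidden.shape)        # torch.Size([4, 2, 5]) torch.Size([1, 2, 5])
```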
The change to the Encoder is in its forward method, which now also receives the source sentence lengths.
- After the source sentence has been embedded, we apply pack_padded_sequence. Feeding this to the RNN yields packed_outputs, a packed tensor holding the hidden states of the sentence. The returned hidden state is a standard tensor and not packed in any way; the only difference is that, because the input was a packed sequence, it comes from the final non-padded element of each sequence.
- We then unpack packed_outputs with pad_packed_sequence, which returns the outputs and the length of each, which we do not need.
- The first dimension of the outputs is the padded sequence length, but because we used a packed padded sequence, the values of the tensor are all zero wherever the input was a pad token.
The attention module computes the attention values over the source sentence.
The Attention module above lets the model attend to pad tokens in the source sentence; here a mask is used to force attention onto the non-padded elements.
The forward function takes the mask as an additional input, a tensor of shape [batch size, source sentence length] that is 1 where a source token is not padding and 0 where it is.
The mask is applied after computing the attention scores but before normalizing them with the softmax function. It is applied with masked_fill, which fills the tensor, at every position where the first argument (mask == 0) is true, with the value given by the second argument (-1e10). In other words, it takes the unnormalized attention scores and sets the scores over padded positions to -1e10. Since these numbers are negligible compared to the other values, they become zero after the softmax layer, which guarantees that pad tokens in the source sentence receive no attention.
The Decoder needs only a small change: it takes the mask over the source sentence and passes it on to the attention module. Since we want to inspect the attention values during inference, we also return the attention tensor.
```python
import random

import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout):
        super(Encoder, self).__init__()
        self.input_dim = input_dim
        self.emb_dim = emb_dim
        self.enc_hid_dim = enc_hid_dim
        self.dec_hid_dim = dec_hid_dim
        self.dropout = dropout
        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional=True)
        self.fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src, src_len):
        embedded = self.dropout(self.embedding(src))
        # pack the padded batch:
        # the returned hidden state is taken from the last non-padded element of each sentence,
        # i.e. the GRU only runs over the real tokens, not the useless padding characters
        # packed_outputs is a PackedSequence; the lengths returned by unpacking are discarded below
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, src_len)
        packed_outputs, hidden = self.rnn(packed_embedded)
        outputs, _ = nn.utils.rnn.pad_packed_sequence(packed_outputs)
        # hidden[-2,:,:] and hidden[-1,:,:] are the final forward and backward states; combine them through tanh
        hidden = torch.tanh(self.fc(torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)))
        return outputs, hidden

class Attention(nn.Module):
    def __init__(self, enc_hid_dim, dec_hid_dim):
        super(Attention, self).__init__()
        self.enc_hid_dim = enc_hid_dim
        self.dec_hid_dim = dec_hid_dim
        self.attn = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim)
        self.v = nn.Parameter(torch.rand(dec_hid_dim))

    def forward(self, hidden, encoder_outputs, mask):
        batch_size = encoder_outputs.shape[1]
        src_len = encoder_outputs.shape[0]
        hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim=2)))
        # permute reorders the tensor dimensions
        energy = energy.permute(0, 2, 1)
        v = self.v.repeat(batch_size, 1).unsqueeze(1)
        attention = torch.bmm(v, energy).squeeze(1)
        # identical to before, except that the mask is applied here
        attention = attention.masked_fill(mask == 0, -1e10)
        return F.softmax(attention, dim=1)

class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout, attention):
        super(Decoder, self).__init__()
        self.emb_dim = emb_dim
        self.enc_hid_dim = enc_hid_dim
        self.dec_hid_dim = dec_hid_dim
        self.output_dim = output_dim
        self.dropout = dropout
        self.attention = attention
        self.embedding = nn.Embedding(output_dim, emb_dim)
        self.rnn = nn.GRU((enc_hid_dim * 2) + emb_dim, dec_hid_dim)
        self.out = nn.Linear((enc_hid_dim * 2) + dec_hid_dim + emb_dim, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, input, hidden, encoder_outputs, mask):
        input = input.unsqueeze(0)
        embedded = self.dropout(self.embedding(input))
        # the mask is passed straight through to the attention module
        a = self.attention(hidden, encoder_outputs, mask)
        a = a.unsqueeze(1)
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        weighted = torch.bmm(a, encoder_outputs)
        weighted = weighted.permute(1, 0, 2)
        rnn_input = torch.cat((embedded, weighted), dim=2)
        output, hidden = self.rnn(rnn_input, hidden.unsqueeze(0))
        assert (output == hidden).all()
        embedded = embedded.squeeze(0)
        output = output.squeeze(0)
        weighted = weighted.squeeze(0)
        output = self.out(torch.cat((output, weighted, embedded), dim=1))
        # also return the attention tensor a
        return output, hidden.squeeze(0), a.squeeze(0)

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, pad_idx, sos_idx, eos_idx, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.pad_idx = pad_idx
        self.sos_idx = sos_idx
        self.eos_idx = eos_idx
        self.device = device

    def create_mask(self, src):
        mask = (src != self.pad_idx).permute(1, 0)
        return mask

    def forward(self, src, src_len, trg, teacher_forcing_ratio=0.5):
        #src = [src sent len, batch size]
        #src_len = [batch size]
        #trg = [trg sent len, batch size]
        #teacher_forcing_ratio is probability to use teacher forcing
        #e.g. if teacher_forcing_ratio is 0.75 we use teacher forcing 75% of the time
        if trg is None:
            assert teacher_forcing_ratio == 0, "Must be zero during inference"
            inference = True
            trg = torch.zeros((100, src.shape[1])).long().fill_(self.sos_idx).to(src.device)
        else:
            inference = False
        batch_size = src.shape[1]
        max_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        #tensor to store decoder outputs
        outputs = torch.zeros(max_len, batch_size, trg_vocab_size).to(self.device)
        #tensor to store attention
        attentions = torch.zeros(max_len, batch_size, src.shape[0]).to(self.device)
        #encoder_outputs is all hidden states of the input sequence, back and forwards
        #hidden is the final forward and backward hidden states, passed through a linear layer
        encoder_outputs, hidden = self.encoder(src, src_len)
        #first input to the decoder is the <sos> tokens
        output = trg[0, :]
        mask = self.create_mask(src)
        #mask = [batch size, src sent len]
        for t in range(1, max_len):
            output, hidden, attention = self.decoder(output, hidden, encoder_outputs, mask)
            outputs[t] = output
            attentions[t] = attention
            teacher_force = random.random() < teacher_forcing_ratio
            top1 = output.max(1)[1]
            output = (trg[t] if teacher_force else top1)
            if inference and output.item() == self.eos_idx:
                return outputs[:t], attentions[:t]
        return outputs, attentions
```
7.4 Comparing models 4 and 5
The English source sentence is encoded (top) and we compute all attention values for the four German target words (center) simultaneously. Our attentions are just dot products between decoder context representations (bottom left) and encoder representations. We add the conditional inputs computed by the attention (center right) to the decoder states which then predict the target words (bottom right). The sigmoid and multiplicative boxes illustrate Gated Linear Units.
- Upper-left, encoder: stacked convolutions extract features from the source (English) sequence; the figure shows only a single convolution layer. The convolution output passes through a GLU activation to become the encoder output.
- Lower-left, decoder: stacked convolutions likewise extract features of the target (German) sequence, with a GLU activation giving the decoder output.
- Center-left, attention: the dot product of the decoder and encoder outputs gives a weight for each word in the source (English) sequence.
- Center-right, residual connection: the attention weights are multiplied with the input sequence and added to the decoder output to produce the output sequence.

Compare this structure with the PadPackMaskAttention model from just before and the differences are considerable; looking ahead, the Transformer begins to share some similarities with it.
Normalization strategy
The sum of a residual block's input and output is multiplied by √0.5 to halve the variance of the sum. This assumes both addends have the same variance, which is not always true in practice but works.
OK, let's implement it. It's extremely complex.
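A minimal sketch of one such scaled residual step around a GLU convolution (toy dimensions of my own, mirroring the encoder loop in the code below):

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

hid_dim, batch_size, seq_len, kernel_size = 8, 2, 5, 3
conv = nn.Conv1d(hid_dim, 2 * hid_dim, kernel_size, padding=(kernel_size - 1) // 2)
scale = math.sqrt(0.5)

conv_input = torch.randn(batch_size, hid_dim, seq_len)    # [batch, channels, time]
conved = F.glu(conv(conv_input), dim=1)                   # GLU halves the channel dim back to hid_dim
conved = (conved + conv_input) * scale                    # residual sum scaled by sqrt(0.5)
print(conved.shape)                                       # torch.Size([2, 8, 5])
```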
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, n_layers, kernel_size, dropout, device):
        super(Encoder, self).__init__()
        assert kernel_size % 2 == 1, "Kernel size must be odd"
        self.input_dim = input_dim
        self.emb_dim = emb_dim
        self.hid_dim = hid_dim
        self.kernel_size = kernel_size
        self.dropout = dropout
        self.device = device
        # scaling factor for the residual connections (the normalization strategy above)
        self.scale = torch.sqrt(torch.FloatTensor([0.5])).to(device)
        self.tok_embedding = nn.Embedding(input_dim, emb_dim)
        self.pos_embedding = nn.Embedding(100, emb_dim)
        self.emb2hid = nn.Linear(emb_dim, hid_dim)
        self.hid2emb = nn.Linear(hid_dim, emb_dim)
        self.convs = nn.ModuleList([nn.Conv1d(in_channels=hid_dim,
                                              out_channels=2 * hid_dim,
                                              kernel_size=kernel_size,
                                              padding=(kernel_size - 1) // 2)
                                    for _ in range(n_layers)])
        self.dropout = nn.Dropout(dropout)

    def forward(self, src):
        # embed both the tokens and their positions
        pos = torch.arange(0, src.shape[1]).unsqueeze(0).repeat(src.shape[0], 1).to(self.device)
        tok_embedded = self.tok_embedding(src)
        pos_embedded = self.pos_embedding(pos)
        embedded = self.dropout(tok_embedded + pos_embedded)
        # project the embeddings to hid_dim with a linear layer
        conv_input = self.emb2hid(embedded)
        # conv_input = [batch size, hid dim, src sent len]
        conv_input = conv_input.permute(0, 2, 1)
        for i, conv in enumerate(self.convs):
            conved = conv(self.dropout(conv_input))
            conved = F.glu(conved, dim=1)
            conved = (conved + conv_input) * self.scale
            conv_input = conved
        conved = self.hid2emb(conved.permute(0, 2, 1))
        combined = (conved + embedded) * self.scale
        return conved, combined

class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, n_layers, kernel_size, dropout, pad_idx, device):
        super(Decoder, self).__init__()
        self.output_dim = output_dim
        self.emb_dim = emb_dim
        self.hid_dim = hid_dim
        self.kernel_size = kernel_size
        self.dropout = dropout
        self.pad_idx = pad_idx
        self.device = device
        self.scale = torch.sqrt(torch.FloatTensor([0.5])).to(device)
        self.tok_embedding = nn.Embedding(output_dim, emb_dim)
        self.pos_embedding = nn.Embedding(100, emb_dim)
        self.emb2hid = nn.Linear(emb_dim, hid_dim)
        self.hid2emb = nn.Linear(hid_dim, emb_dim)
        self.attn_hid2emb = nn.Linear(hid_dim, emb_dim)
        self.attn_emb2hid = nn.Linear(emb_dim, hid_dim)
        self.out = nn.Linear(emb_dim, output_dim)
        self.convs = nn.ModuleList([nn.Conv1d(hid_dim, 2 * hid_dim, kernel_size)
                                    for _ in range(n_layers)])
        self.dropout = nn.Dropout(dropout)

    def calculate_attention(self, embedded, conved, encoder_conved, encoder_combined):
        conved_emb = self.attn_hid2emb(conved.permute(0, 2, 1))
        combined = (embedded + conved_emb) * self.scale
        energy = torch.matmul(combined, encoder_conved.permute(0, 2, 1))
        attention = F.softmax(energy, dim=2)
        attended_encoding = torch.matmul(attention, (encoder_conved + encoder_combined))
        attended_encoding = self.attn_emb2hid(attended_encoding)
        attended_combined = (conved + attended_encoding.permute(0, 2, 1)) * self.scale
        return attention, attended_combined

    def forward(self, trg, encoder_conved, encoder_combined):
        pos = torch.arange(0, trg.shape[1]).unsqueeze(0).repeat(trg.shape[0], 1).to(self.device)
        tok_embedded = self.tok_embedding(trg)
        pos_embedded = self.pos_embedding(pos)
        # tok_embedded = [batch size, trg sent len, emb dim]
        # pos_embedded = [batch size, trg sent len, emb dim]
        embedded = self.dropout(tok_embedded + pos_embedded)
        conv_input = self.emb2hid(embedded)
        conv_input = conv_input.permute(0, 2, 1)
        for i, conv in enumerate(self.convs):
            conv_input = self.dropout(conv_input)
            # pad only on the left so the convolution cannot see future tokens
            padding = torch.zeros(conv_input.shape[0],
                                  conv_input.shape[1],
                                  self.kernel_size - 1).fill_(self.pad_idx).to(self.device)
            padded_conv_input = torch.cat((padding, conv_input), dim=2)
            conved = conv(padded_conv_input)
            conved = F.glu(conved, dim=1)
            attention, conved = self.calculate_attention(embedded, conved,
                                                         encoder_conved, encoder_combined)
            conved = (conved + conv_input) * self.scale
            conv_input = conved
        conved = self.hid2emb(conved.permute(0, 2, 1))
        output = self.out(self.dropout(conved))
        return output, attention

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src, trg):
        encoder_conved, encoder_combined = self.encoder(src)
        output, attention = self.decoder(trg, encoder_conved, encoder_combined)
        return output, attention
```
Summary