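Mu Li's Dive into Deep Learning V2 - BERT Fine-Tuning and Code Implementation
I. Fine-Tuning BERT
1. Introduction
Natural language inference is a sequence-level text pair classification problem. Fine-tuning BERT for it requires only an additional multilayer perceptron (MLP) on top of the pretrained BERT weights, as shown in the figure below. In what follows, we download a small pretrained version of BERT and fine-tune it for natural language inference on the SNLI dataset.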
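2. Loading Pretrained BERT
The earlier posts "BERT Pretraining, Part 2: Mu Li's Dive into Deep Learning V2 - BERT Pretraining Dataset and Code Implementation" and "BERT Pretraining, Part 3: Mu Li's Dive into Deep Learning V2 - BERT Pretraining and Code Implementation" covered how to pretrain BERT. (Note that the original BERT model was pretrained on a much larger corpus and has hundreds of millions of parameters.) Two versions of pretrained BERT are provided below: "bert.base" is as large as the original BERT base model and needs substantial compute to fine-tune, while "bert.small" is a small version intended for demonstration.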
    import json
    import multiprocessing
    import os

    import torch
    from torch import nn
    import d2l.torch

    # Each DATA_HUB entry is a (download URL, SHA-1 checksum) pair.
    d2l.torch.DATA_HUB['bert.base'] = (d2l.torch.DATA_URL + 'bert.base.torch.zip',
                                       '225d66f04cae318b841a13d32af3acc165f253ac')
    d2l.torch.DATA_HUB['bert.small'] = (d2l.torch.DATA_URL + 'bert.small.torch.zip',
                                        'c72329e68a732bef0452e4b96a1c341c8910f81f')

Both pretrained BERT models contain a "vocab.json" file that defines the vocabulary and a "pretrained.params" file that holds the pretrained parameters. The load_pretrained_model function loads the pretrained BERT parameters.
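The vocabulary-loading step inside load_pretrained_model can be sketched in isolation. The snippet assumes, as the loading code in this post implies, that vocab.json is a JSON list of tokens ordered by index. The toy token list and the load_vocab helper are hypothetical and exist only for illustration; the real file is far larger and its token order may differ.

```python
import json
import os
import tempfile

# Hypothetical miniature vocabulary (assumption: not the real vocab.json contents).
toy_idx_to_token = ['<unk>', '<pad>', '<mask>', '<cls>', '<sep>', 'a', 'dog', 'runs']


def load_vocab(path):
    """Read a JSON list of tokens and build the reverse lookup table."""
    with open(path, encoding='utf-8') as f:
        idx_to_token = json.load(f)
    token_to_idx = {token: idx for idx, token in enumerate(idx_to_token)}
    return idx_to_token, token_to_idx


# Write the toy vocabulary to a temporary vocab.json, then load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = os.path.join(tmp_dir, 'vocab.json')
    with open(vocab_path, 'w', encoding='utf-8') as f:
        json.dump(toy_idx_to_token, f)
    idx_to_token, token_to_idx = load_vocab(vocab_path)

# Map tokens to indices; unknown tokens fall back to the '<unk>' index.
sentence = ['<cls>', 'a', 'cat', 'runs', '<sep>']
ids = [token_to_idx.get(tok, token_to_idx['<unk>']) for tok in sentence]
print(ids)
```

Because the list position is the token index, the file only needs to store the tokens themselves; the reverse dictionary is rebuilt at load time.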
    def load_pretrained_model(pretrained_model, num_hiddens, ffn_num_hiddens,
                              num_heads, num_layers, dropout, max_len, devices):
        data_dir = d2l.torch.download_extract(pretrained_model)
        # Define an empty vocabulary, then fill it from the predefined vocabulary file
        vocab = d2l.torch.Vocab()
        with open(os.path.join(data_dir, 'vocab.json')) as f:
            vocab.idx_to_token = json.load(f)
        vocab.token_to_idx = {token: idx for idx, token in enumerate(vocab.idx_to_token)}
        # The 256-dimensional sizes below are hard-coded for bert.small
        bert = d2l.torch.BERTModel(len(vocab), num_hiddens=num_hiddens, norm_shape=[256],
                                   ffn_num_input=256, ffn_num_hiddens=ffn_num_hiddens,
                                   num_heads=num_heads, num_layers=num_layers,
                                   dropout=dropout, max_len=max_len, key_size=256,
                                   query_size=256, value_size=256, hid_in_features=256,
                                   mlm_in_features=256, nsp_in_features=256)
        # bert = nn.DataParallel(bert, device_ids=devices).to(devices[0])
        # bert.module.load_state_dict(torch.load(os.path.join(data_dir, 'pretrained.params')), strict=False)
        # Load the pretrained BERT parameters
        bert.load_state_dict(torch.load(os.path.join(data_dir, 'pretrained.params')))
        return bert, vocab

To make the demonstration feasible on most machines, we load and fine-tune the small version of pretrained BERT ("bert.small").
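    # Use every available GPU (falls back to the CPU when none is found).
    # Slice the list, e.g. [2:4], to restrict training to particular GPUs.
    devices = d2l.torch.try_all_gpus()
    bert, vocab = load_pretrained_model('bert.small', num_hiddens=256, ffn_num_hiddens=512,
                                        num_heads=4, num_layers=2, dropout=0.1,
                                        max_len=512, devices=devices)

3. Dataset for Fine-Tuning BERT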
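For the downstream task of natural language inference on the SNLI dataset, we define a customized dataset class SNLIBERTDataset. In each example, the premise and the hypothesis form a pair of text sequences that is packed into one BERT input sequence, and segment indices distinguish the premise from the hypothesis within that sequence. Given a predefined maximum length for the BERT input sequence (max_len), the last token of the longer of the two texts is removed repeatedly until max_len is met. To speed up generation of the SNLI dataset for fine-tuning BERT, 4 worker processes generate the training or testing examples in parallel.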
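The truncation, packing, and padding steps described above can be traced on toy inputs. This is a minimal stand-alone sketch, not the d2l implementation: truncate_pair, pack_pair, and pad are hypothetical helper names, and pack_pair mirrors what d2l.torch.get_tokens_and_segments does only as far as I understand it ('<cls>' and the first '<sep>' belong to segment 0, the hypothesis and the final '<sep>' to segment 1).

```python
def truncate_pair(p_tokens, h_tokens, max_len):
    """Drop the last token of the longer text until the pair fits.

    Three slots are reserved for '<cls>' and the two '<sep>' tokens.
    """
    p_tokens, h_tokens = list(p_tokens), list(h_tokens)  # work on copies
    while len(p_tokens) + len(h_tokens) > max_len - 3:
        if len(p_tokens) > len(h_tokens):
            p_tokens.pop()
        else:
            h_tokens.pop()
    return p_tokens, h_tokens


def pack_pair(p_tokens, h_tokens):
    """Build '<cls>' premise '<sep>' hypothesis '<sep>' with segment ids 0 and 1."""
    tokens = ['<cls>'] + p_tokens + ['<sep>'] + h_tokens + ['<sep>']
    segments = [0] * (len(p_tokens) + 2) + [1] * (len(h_tokens) + 1)
    return tokens, segments


def pad(tokens, segments, max_len):
    """Right-pad to max_len; valid_len records the unpadded length."""
    valid_len = len(tokens)
    tokens = tokens + ['<pad>'] * (max_len - valid_len)
    segments = segments + [0] * (max_len - valid_len)
    return tokens, segments, valid_len


max_len = 12

# Case 1: the pair is too long, so the premise (the longer text) is shortened.
premise = 'a man is playing a guitar on stage'.split()  # 8 tokens
hypothesis = 'a man plays music'.split()                # 4 tokens
p, h = truncate_pair(premise, hypothesis, max_len)
tokens, segments, valid_len = pad(*pack_pair(p, h), max_len)
print(tokens, segments, valid_len)

# Case 2: the pair is short, so nothing is removed and padding fills the rest.
p2, h2 = truncate_pair('a dog runs'.split(), 'an animal moves'.split(), max_len)
tokens2, segments2, valid_len2 = pad(*pack_pair(p2, h2), max_len)
print(tokens2, segments2, valid_len2)
```

When the two texts have equal length, the `else` branch removes a token from the hypothesis, which matches the tie-breaking in the class defined in this post.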
    class SNLIBERTDataset(torch.utils.data.Dataset):
        def __init__(self, dataset, max_len, vocab=None):
            # dataset[0] holds the premises, dataset[1] the hypotheses, dataset[2] the labels
            all_premises_hypotheses_tokens = [
                [p_tokens, h_tokens] for p_tokens, h_tokens in zip(
                    *[d2l.torch.tokenize([s.lower() for s in sentences])
                      for sentences in dataset[:2]])]
            self.vocab = vocab
            self.max_len = max_len
            self.labels = torch.tensor(dataset[2])
            (self.all_tokens_id, self.all_segments,
             self.all_valid_lens) = self._preprocess(all_premises_hypotheses_tokens)
            print(f'read {len(self.all_tokens_id)} examples')

        def _preprocess(self, all_premises_hypotheses_tokens):
            pool = multiprocessing.Pool(4)  # use 4 worker processes
            out = pool.map(self._mp_worker, all_premises_hypotheses_tokens)
            all_tokens_id = [tokens_id for tokens_id, segments, valid_len in out]
            all_segments = [segments for tokens_id, segments, valid_len in out]
            all_valid_lens = [valid_len for tokens_id, segments, valid_len in out]
            return (torch.tensor(all_tokens_id, dtype=torch.long),
                    torch.tensor(all_segments, dtype=torch.long),
                    torch.tensor(all_valid_lens))

        def _mp_worker(self, premises_hypotheses_tokens):
            p_tokens, h_tokens = premises_hypotheses_tokens
            self._truncate_pair_of_tokens(p_tokens, h_tokens)
            tokens, segments = d2l.torch.get_tokens_and_segments(p_tokens, h_tokens)
            valid_len = len(tokens)
            # Pad token ids and segment ids up to max_len
            tokens_id = self.vocab[tokens] + [self.vocab['<pad>']] * (self.max_len - valid_len)
            segments = segments + [0] * (self.max_len - valid_len)
            return (tokens_id, segments, valid_len)

        def _truncate_pair_of_tokens(self, p_tokens, h_tokens):
            # Reserve slots for the '<CLS>', '<SEP>', and '<SEP>' tokens of the BERT input
            while (len(p_tokens) + len(h_tokens)) > self.max_len - 3:
                if len(p_tokens) > len(h_tokens):
                    p_tokens.pop()
                else:
                    h_tokens.pop()

        def __getitem__(self, idx):
            return (self.all_tokens_id[idx], self.all_segments[idx],
                    self.all_valid_lens[idx]), self.labels[idx]

        def __len__(self):
            return len(self.all_tokens_id)

After downloading the SNLI dataset, we generate training and testing examples by instantiating the SNLIBERTDataset class. These examples are read in minibatches during training and testing of natural language inference.
    # In the original BERT model, max_len = 512.
    # Reduce batch_size if you run out of memory.
    batch_size, max_len, num_workers = 512, 128, d2l.torch.get_dataloader_workers()
    data_dir = d2l.torch.download_extract('SNLI')
    train_set = SNLIBERTDataset(d2l.torch.read_snli(data_dir, is_train=True), max_len, vocab)
    test_set = SNLIBERTDataset(d2l.torch.read_snli(data_dir, is_train=False), max_len, vocab)
    train_iter = torch.utils.data.DataLoader(train_set, batch_size,
                                             num_workers=num_workers, shuffle=True)
    test_iter = torch.utils.data.DataLoader(test_set, batch_size,
                                            num_workers=num_workers, shuffle=False)

4. Fine-Tuning BERT
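**Fine-tuning BERT for natural language inference requires only one extra MLP consisting of two fully connected layers.** It has the same structure as the self.hidden and self.output layers used for next sentence prediction (NSP) in the BERT implementation from the earlier post "BERT Pretraining, Part 1: Mu Li's Dive into Deep Learning V2 - BERT and Code Implementation". This MLP transforms the BERT representation of the special "<cls>" token, which encodes information from both the premise and the hypothesis, into the three outputs of natural language inference: entailment, contradiction, and neutral.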
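To make the shape of this two-layer head concrete, here is a minimal sketch in plain Python on a toy 4-dimensional "<cls>" vector instead of the real 256-dimensional one. It assumes the hidden layer is a linear map followed by tanh (which is how bert.hidden is defined in the d2l BERTModel as far as I know) and that the class order follows the text above. The weights, the linear and classify_cls helpers, and the LABELS list are all made up for illustration.

```python
import math

# Assumed class order, taken from the order in which the text lists the labels.
LABELS = ['entailment', 'contradiction', 'neutral']


def linear(x, weight, bias):
    """Compute y = W x + b for one vector x; weight has one row per output."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def classify_cls(cls_vector, hidden_w, hidden_b, output_w, output_b):
    """Hidden layer with tanh, then a 3-way output layer, then argmax."""
    hidden = [math.tanh(v) for v in linear(cls_vector, hidden_w, hidden_b)]
    logits = linear(hidden, output_w, output_b)
    best = max(range(len(logits)), key=logits.__getitem__)
    return logits, LABELS[best]


# Toy '<cls>' representation and hand-picked weights (illustrative values only).
cls_vector = [0.5, -1.0, 0.25, 2.0]
hidden_w = [[1.0, 0.0, 0.0, 0.0],
            [0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0]]  # identity, so hidden = tanh(cls_vector)
hidden_b = [0.0, 0.0, 0.0, 0.0]
output_w = [[1.0, 0.0, 0.0, 0.0],
            [0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 0.0, 1.0]]  # 3 rows: one per class
output_b = [0.0, 0.0, 0.0]

logits, label = classify_cls(cls_vector, hidden_w, hidden_b, output_w, output_b)
print(logits, label)
```

Only the first position of the encoder output (the "<cls>" token) enters the head, so the classifier cost does not depend on the sequence length.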
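    class BERTClassifier(nn.Module):
        def __init__(self, bert):
            super(BERTClassifier, self).__init__()
            self.encoder = bert.encoder      # pretrained encoder, fine-tuned
            self.hidden = bert.hidden        # pretrained hidden layer, fine-tuned
            self.output = nn.Linear(256, 3)  # new output layer, trained from scratch

        def forward(self, inputs):
            tokens_X, segments_X, valid_lens_X = inputs
            encoded_X = self.encoder(tokens_X, segments_X, valid_lens_X)
            # Classify from the representation of the first ('<cls>') token
            return self.output(self.hidden(encoded_X[:, 0, :]))

Next, the pretrained BERT model bert is fed into the BERTClassifier instance net for the downstream application. In common implementations of BERT fine-tuning, only the parameters of the output layer of the additional MLP (net.output) are learned from scratch. All parameters of the pretrained BERT encoder (net.encoder) and of the hidden layer of the additional MLP (net.hidden) are fine-tuned.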
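    net = BERTClassifier(bert)

In BERT pretraining, both the MaskLM class and the NextSentencePred class have parameters in the MLPs they use. These parameters are part of the pretrained BERT model bert, but they are only used to compute the masked language modeling loss and the next sentence prediction loss during pretraining. Because these two loss functions are irrelevant to fine-tuning downstream applications, the parameters of the MLPs in MaskLM and NextSentencePred are not updated during fine-tuning (they become stale).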
We train and evaluate the model net with the d2l.torch.train_ch13() function, using the SNLI training set (train_iter) and testing set (test_iter). The results are shown in the figure below.
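The complete listing later in this post creates the loss as nn.CrossEntropyLoss(reduction='none'), which returns one loss value per example instead of a batch mean; to my understanding, the d2l training loop then sums these values before backpropagation. The plain-Python sketch below shows what such per-example values and the accuracy metric look like for three-class logits. The function names and the toy logits are hypothetical, and this is not the PyTorch implementation.

```python
import math


def cross_entropy_per_example(logits_batch, labels):
    """Softmax cross-entropy for each example, with no averaging."""
    losses = []
    for logits, label in zip(logits_batch, labels):
        m = max(logits)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
        losses.append(log_sum_exp - logits[label])
    return losses


def accuracy(logits_batch, labels):
    """Fraction of examples whose highest-scoring class equals the label."""
    correct = 0
    for logits, label in zip(logits_batch, labels):
        predicted = max(range(len(logits)), key=logits.__getitem__)
        correct += int(predicted == label)
    return correct / len(labels)


# Toy batch: 3 examples, 3 classes each (illustrative values only).
logits_batch = [[2.0, 0.5, 0.1], [0.0, 0.0, 0.0], [0.2, 1.5, 0.3]]
labels = [0, 2, 2]

losses = cross_entropy_per_example(logits_batch, labels)
acc = accuracy(logits_batch, labels)
print(losses, sum(losses), acc)
```

With all-equal logits the loss is log(3), the value an untrained three-class classifier starts near, which is a handy sanity check in the first training iterations.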
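5. Summary
- A pretrained BERT model can be fine-tuned for downstream applications, such as natural language inference on the SNLI dataset.
- During fine-tuning, the BERT model becomes part of the downstream model, and an additional MLP is added to train and evaluate the model on the downstream task.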
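6. Fine-Tuning with the Original-Size Pretrained BERT Model
Here we fine-tune a much larger pretrained BERT model, one that is as large as the original BERT base model. Change the arguments passed to load_pretrained_model: replace "bert.small" with "bert.base", and increase num_hiddens=256, ffn_num_hiddens=512, num_heads=4, and num_layers=2 to 768, 3072, 12, and 12, respectively. Also change the output layer of the MLP to nn.Linear(768, 3), because the features produced by the BERT model are now 768-dimensional, and increase the number of fine-tuning epochs. The code is shown below.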
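To get a feel for how much larger the base configuration is, the arithmetic below counts the parameters in the stacked encoder blocks for both settings. It is a rough sketch under stated assumptions: each block is a standard Transformer encoder block with biased query, key, value, and output projections of size num_hiddens x num_hiddens, a two-layer position-wise feed-forward network, and two layer normalizations. Embeddings and the pretraining heads are left out because the vocabulary size is not given in this post. encoder_block_params is a hypothetical helper.

```python
def encoder_block_params(num_hiddens, ffn_num_hiddens):
    """Count weights and biases in one Transformer encoder block."""
    # Four projections (Q, K, V, output), each with a weight matrix and a bias
    attention = 4 * (num_hiddens * num_hiddens + num_hiddens)
    # Position-wise FFN: num_hiddens -> ffn_num_hiddens -> num_hiddens
    ffn = (num_hiddens * ffn_num_hiddens + ffn_num_hiddens
           + ffn_num_hiddens * num_hiddens + num_hiddens)
    # Two LayerNorms, each with a scale and a shift vector
    layer_norms = 2 * (2 * num_hiddens)
    return attention + ffn + layer_norms


# Hyperparameters as stated in this post for the two pretrained models.
configs = {
    'bert.small': dict(num_hiddens=256, ffn_num_hiddens=512, num_heads=4, num_layers=2),
    'bert.base': dict(num_hiddens=768, ffn_num_hiddens=3072, num_heads=12, num_layers=12),
}

summary = {}
for name, cfg in configs.items():
    per_block = encoder_block_params(cfg['num_hiddens'], cfg['ffn_num_hiddens'])
    summary[name] = {
        'head_dim': cfg['num_hiddens'] // cfg['num_heads'],
        'block_params': per_block * cfg['num_layers'],
    }
    print(name, summary[name])
```

Both configurations use 64 dimensions per attention head. The growth comes from the wider layers and the deeper stack, which under these assumptions makes the encoder blocks of the base model roughly 80 times larger than those of the small one.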
    import json
    import multiprocessing
    import os

    import torch
    from torch import nn
    import d2l.torch

    d2l.torch.DATA_HUB['bert.base'] = (d2l.torch.DATA_URL + 'bert.base.torch.zip',
                                       '225d66f04cae318b841a13d32af3acc165f253ac')
    d2l.torch.DATA_HUB['bert.small'] = (d2l.torch.DATA_URL + 'bert.small.torch.zip',
                                        'c72329e68a732bef0452e4b96a1c341c8910f81f')

    devices = d2l.torch.try_all_gpus()


    def load_pretrained_model1(pretrained_model, num_hiddens, ffn_num_hiddens,
                               num_heads, num_layers, dropout, max_len, devices):
        data_dir = d2l.torch.download_extract(pretrained_model)
        vocab = d2l.torch.Vocab()
        with open(os.path.join(data_dir, 'vocab.json')) as f:
            vocab.idx_to_token = json.load(f)
        vocab.token_to_idx = {token: idx for idx, token in enumerate(vocab.idx_to_token)}
        # The 768-dimensional sizes below are hard-coded for bert.base
        bert = d2l.torch.BERTModel(len(vocab), num_hiddens=num_hiddens, norm_shape=[768],
                                   ffn_num_input=768, ffn_num_hiddens=ffn_num_hiddens,
                                   num_heads=num_heads, num_layers=num_layers,
                                   dropout=dropout, max_len=max_len, key_size=768,
                                   query_size=768, value_size=768, hid_in_features=768,
                                   mlm_in_features=768, nsp_in_features=768)
        # bert = nn.DataParallel(bert, device_ids=devices).to(devices[0])
        # bert.module.load_state_dict(torch.load(os.path.join(data_dir, 'pretrained.params')), strict=False)
        bert.load_state_dict(torch.load(os.path.join(data_dir, 'pretrained.params')))
        return bert, vocab


    bert, vocab = load_pretrained_model1('bert.base', num_hiddens=768, ffn_num_hiddens=3072,
                                         num_heads=12, num_layers=12, dropout=0.1,
                                         max_len=512, devices=devices)


    class SNLIBERTDataset(torch.utils.data.Dataset):
        def __init__(self, dataset, max_len, vocab=None):
            all_premises_hypotheses_tokens = [
                [p_tokens, h_tokens] for p_tokens, h_tokens in zip(
                    *[d2l.torch.tokenize([s.lower() for s in sentences])
                      for sentences in dataset[:2]])]
            self.vocab = vocab
            self.max_len = max_len
            self.labels = torch.tensor(dataset[2])
            (self.all_tokens_id, self.all_segments,
             self.all_valid_lens) = self._preprocess(all_premises_hypotheses_tokens)
            print(f'read {len(self.all_tokens_id)} examples')

        def _preprocess(self, all_premises_hypotheses_tokens):
            pool = multiprocessing.Pool(4)  # use 4 worker processes
            out = pool.map(self._mp_worker, all_premises_hypotheses_tokens)
            all_tokens_id = [tokens_id for tokens_id, segments, valid_len in out]
            all_segments = [segments for tokens_id, segments, valid_len in out]
            all_valid_lens = [valid_len for tokens_id, segments, valid_len in out]
            return (torch.tensor(all_tokens_id, dtype=torch.long),
                    torch.tensor(all_segments, dtype=torch.long),
                    torch.tensor(all_valid_lens))

        def _mp_worker(self, premises_hypotheses_tokens):
            p_tokens, h_tokens = premises_hypotheses_tokens
            self._truncate_pair_of_tokens(p_tokens, h_tokens)
            tokens, segments = d2l.torch.get_tokens_and_segments(p_tokens, h_tokens)
            valid_len = len(tokens)
            tokens_id = self.vocab[tokens] + [self.vocab['<pad>']] * (self.max_len - valid_len)
            segments = segments + [0] * (self.max_len - valid_len)
            return (tokens_id, segments, valid_len)

        def _truncate_pair_of_tokens(self, p_tokens, h_tokens):
            # Reserve slots for the '<CLS>', '<SEP>', and '<SEP>' tokens of the BERT input
            while (len(p_tokens) + len(h_tokens)) > self.max_len - 3:
                if len(p_tokens) > len(h_tokens):
                    p_tokens.pop()
                else:
                    h_tokens.pop()

        def __getitem__(self, idx):
            return (self.all_tokens_id[idx], self.all_segments[idx],
                    self.all_valid_lens[idx]), self.labels[idx]

        def __len__(self):
            return len(self.all_tokens_id)


    # In the original BERT model, max_len = 512
    batch_size, max_len, num_workers = 512, 128, d2l.torch.get_dataloader_workers()
    data_dir = d2l.torch.download_extract('SNLI')
    train_set = SNLIBERTDataset(d2l.torch.read_snli(data_dir, is_train=True), max_len, vocab)
    test_set = SNLIBERTDataset(d2l.torch.read_snli(data_dir, is_train=False), max_len, vocab)
    train_iter = torch.utils.data.DataLoader(train_set, batch_size,
                                             num_workers=num_workers, shuffle=True)
    test_iter = torch.utils.data.DataLoader(test_set, batch_size,
                                            num_workers=num_workers, shuffle=False)


    class BERTClassifier(nn.Module):
        def __init__(self, bert):
            super(BERTClassifier, self).__init__()
            self.encoder = bert.encoder
            self.hidden = bert.hidden
            self.output = nn.Linear(768, 3)  # BERT features are now 768-dimensional

        def forward(self, inputs):
            tokens_X, segments_X, valid_lens_X = inputs
            encoded_X = self.encoder(tokens_X, segments_X, valid_lens_X)
            return self.output(self.hidden(encoded_X[:, 0, :]))


    net = BERTClassifier(bert)
    lr, num_epochs = 1e-4, 20
    optim = torch.optim.Adam(params=net.parameters(), lr=lr)
    loss = nn.CrossEntropyLoss(reduction='none')
    d2l.torch.train_ch13(net, train_iter, test_iter, loss, optim, num_epochs, devices)

7. Complete Code
    import json
    import multiprocessing
    import os

    import torch
    from torch import nn
    import d2l.torch

    d2l.torch.DATA_HUB['bert.base'] = (d2l.torch.DATA_URL + 'bert.base.torch.zip',
                                       '225d66f04cae318b841a13d32af3acc165f253ac')
    d2l.torch.DATA_HUB['bert.small'] = (d2l.torch.DATA_URL + 'bert.small.torch.zip',
                                        'c72329e68a732bef0452e4b96a1c341c8910f81f')


    def load_pretrained_model(pretrained_model, num_hiddens, ffn_num_hiddens,
                              num_heads, num_layers, dropout, max_len, devices):
        data_dir = d2l.torch.download_extract(pretrained_model)
        # Define an empty vocabulary, then fill it from the predefined vocabulary file
        vocab = d2l.torch.Vocab()
        with open(os.path.join(data_dir, 'vocab.json')) as f:
            vocab.idx_to_token = json.load(f)
        vocab.token_to_idx = {token: idx for idx, token in enumerate(vocab.idx_to_token)}
        # The 256-dimensional sizes below are hard-coded for bert.small
        bert = d2l.torch.BERTModel(len(vocab), num_hiddens=num_hiddens, norm_shape=[256],
                                   ffn_num_input=256, ffn_num_hiddens=ffn_num_hiddens,
                                   num_heads=num_heads, num_layers=num_layers,
                                   dropout=dropout, max_len=max_len, key_size=256,
                                   query_size=256, value_size=256, hid_in_features=256,
                                   mlm_in_features=256, nsp_in_features=256)
        # bert = nn.DataParallel(bert, device_ids=devices).to(devices[0])
        # bert.module.load_state_dict(torch.load(os.path.join(data_dir, 'pretrained.params')), strict=False)
        # Load the pretrained BERT parameters
        bert.load_state_dict(torch.load(os.path.join(data_dir, 'pretrained.params')))
        return bert, vocab


    # Use every available GPU (falls back to the CPU when none is found).
    # Slice the list, e.g. [2:4], to restrict training to particular GPUs.
    devices = d2l.torch.try_all_gpus()
    bert, vocab = load_pretrained_model('bert.small', num_hiddens=256, ffn_num_hiddens=512,
                                        num_heads=4, num_layers=2, dropout=0.1,
                                        max_len=512, devices=devices)


    class SNLIBERTDataset(torch.utils.data.Dataset):
        def __init__(self, dataset, max_len, vocab=None):
            all_premises_hypotheses_tokens = [
                [p_tokens, h_tokens] for p_tokens, h_tokens in zip(
                    *[d2l.torch.tokenize([s.lower() for s in sentences])
                      for sentences in dataset[:2]])]
            self.vocab = vocab
            self.max_len = max_len
            self.labels = torch.tensor(dataset[2])
            (self.all_tokens_id, self.all_segments,
             self.all_valid_lens) = self._preprocess(all_premises_hypotheses_tokens)
            print(f'read {len(self.all_tokens_id)} examples')

        def _preprocess(self, all_premises_hypotheses_tokens):
            pool = multiprocessing.Pool(4)  # use 4 worker processes
            out = pool.map(self._mp_worker, all_premises_hypotheses_tokens)
            all_tokens_id = [tokens_id for tokens_id, segments, valid_len in out]
            all_segments = [segments for tokens_id, segments, valid_len in out]
            all_valid_lens = [valid_len for tokens_id, segments, valid_len in out]
            return (torch.tensor(all_tokens_id, dtype=torch.long),
                    torch.tensor(all_segments, dtype=torch.long),
                    torch.tensor(all_valid_lens))

        def _mp_worker(self, premises_hypotheses_tokens):
            p_tokens, h_tokens = premises_hypotheses_tokens
            self._truncate_pair_of_tokens(p_tokens, h_tokens)
            tokens, segments = d2l.torch.get_tokens_and_segments(p_tokens, h_tokens)
            valid_len = len(tokens)
            tokens_id = self.vocab[tokens] + [self.vocab['<pad>']] * (self.max_len - valid_len)
            segments = segments + [0] * (self.max_len - valid_len)
            return (tokens_id, segments, valid_len)

        def _truncate_pair_of_tokens(self, p_tokens, h_tokens):
            # Reserve slots for the '<CLS>', '<SEP>', and '<SEP>' tokens of the BERT input
            while (len(p_tokens) + len(h_tokens)) > self.max_len - 3:
                if len(p_tokens) > len(h_tokens):
                    p_tokens.pop()
                else:
                    h_tokens.pop()

        def __getitem__(self, idx):
            return (self.all_tokens_id[idx], self.all_segments[idx],
                    self.all_valid_lens[idx]), self.labels[idx]

        def __len__(self):
            return len(self.all_tokens_id)


    # In the original BERT model, max_len = 512
    batch_size, max_len, num_workers = 512, 128, d2l.torch.get_dataloader_workers()
    data_dir = d2l.torch.download_extract('SNLI')
    train_set = SNLIBERTDataset(d2l.torch.read_snli(data_dir, is_train=True), max_len, vocab)
    test_set = SNLIBERTDataset(d2l.torch.read_snli(data_dir, is_train=False), max_len, vocab)
    train_iter = torch.utils.data.DataLoader(train_set, batch_size,
                                             num_workers=num_workers, shuffle=True)
    test_iter = torch.utils.data.DataLoader(test_set, batch_size,
                                            num_workers=num_workers, shuffle=False)


    class BERTClassifier(nn.Module):
        def __init__(self, bert):
            super(BERTClassifier, self).__init__()
            self.encoder = bert.encoder
            self.hidden = bert.hidden
            self.output = nn.Linear(256, 3)

        def forward(self, inputs):
            tokens_X, segments_X, valid_lens_X = inputs
            encoded_X = self.encoder(tokens_X, segments_X, valid_lens_X)
            return self.output(self.hidden(encoded_X[:, 0, :]))


    net = BERTClassifier(bert)
    lr, num_epochs = 1e-4, 5
    optim = torch.optim.Adam(params=net.parameters(), lr=lr)
    loss = nn.CrossEntropyLoss(reduction='none')
    d2l.torch.train_ch13(net, train_iter, test_iter, loss, optim, num_epochs, devices)

8. Related Links
BERT Pretraining, Part 1: Mu Li's Dive into Deep Learning V2 - BERT and Code Implementation
BERT Pretraining, Part 2: Mu Li's Dive into Deep Learning V2 - BERT Pretraining Dataset and Code Implementation
BERT Pretraining, Part 3: Mu Li's Dive into Deep Learning V2 - BERT Pretraining and Code Implementation
BERT Fine-Tuning, Part 1: Mu Li's Dive into Deep Learning V2 - Natural Language Inference, the SNLI Dataset, and Code Implementation
BERT Fine-Tuning, Part 2: Mu Li's Dive into Deep Learning V2 - BERT Fine-Tuning and Code Implementation