SentencePiece, subword-nmt, and the BPE algorithm
BPE (Byte Pair Encoding). Applied to machine translation in 2016 to handle out-of-vocabulary (OOV) and rare words. Paper: 《Neural Machine Translation of Rare Words with Subword Units》, published at ACL 2016.
http://www.sohu.com/a/115373230_465975
tensor2tensor uses BPE; the relevant code can be extracted from:
data_generators/problem.py
data_generators/translate_ende.py
BPE algorithm implementations:
1. Reference: https://plmsmile.github.io/2017/10/19/subword-units/
import re

def process_raw_words(words, endtag='-'):
    """Split each word into individual symbols and append an end-of-word tag."""
    vocabs = {}
    for word, count in words.items():
        # insert a space before each letter
        word = re.sub(r'([a-zA-Z])', r' \1', word)
        word += ' ' + endtag
        vocabs[word] = count
    return vocabs

def get_symbol_pairs(vocabs):
    """Collect all adjacent symbol pairs and count their frequencies.
    Args:
        vocabs: dict of (word, count); each word is already split into symbols.
    Returns:
        pairs: dict of ((symbol1, symbol2), count)
    """
    pairs = dict()
    for word, freq in vocabs.items():
        # the symbols inside this word
        symbols = word.split()
        for i in range(len(symbols) - 1):
            p = (symbols[i], symbols[i + 1])
            pairs[p] = pairs.get(p, 0) + freq
    return pairs

def merge_symbols(symbol_pair, vocabs):
    """Replace the string 'a b' with 'ab' in every word of vocabs.
    Args:
        symbol_pair: (a, b), the two symbols to merge
        vocabs: words represented as space-separated subwords, (word, count)
    Returns:
        vocabs_new: new vocabulary with 'a b' replaced by 'ab'
    """
    vocabs_new = {}
    raw = ' '.join(symbol_pair)
    merged = ''.join(symbol_pair)
    # escape any characters that are not letters or digits
    bigram = re.escape(raw)
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word, count in vocabs.items():
        word_new = p.sub(merged, word)
        vocabs_new[word_new] = count
    return vocabs_new

raw_words = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
vocabs = process_raw_words(raw_words)
num_merges = 10
print(vocabs)
for i in range(num_merges):
    pairs = get_symbol_pairs(vocabs)
    # pick the most frequent pair
    symbol_pair = max(pairs, key=pairs.get)
    vocabs = merge_symbols(symbol_pair, vocabs)
print(vocabs)

Output:
Original word frequencies: {"low": 5, "lower": 2, "newest": 6, "widest": 3}. After BPE (10 merges): {' low-': 5, ' low e r -': 2, ' newest-': 6, ' wi d est-': 3}. Splitting the result on spaces gives the modeling units, here: low, e, r, newest, wi, d, est. Any output text can then be mapped onto a sequence of these units.
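To segment new text with the learned units, the merge operations can simply be replayed in the order they were learned. Below is a minimal sketch (not from the original post); it assumes the training loop above is extended with a merges list that records each chosen symbol_pair:

# Sketch: segment a new word by replaying the learned merges in order.
# Assumes the training loop above also ran: merges.append(symbol_pair)
def apply_bpe(word, merges, endtag='-'):
    symbols = list(word) + [endtag]
    for a, b in merges:  # replay merges in the order they were learned
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]  # merge the adjacent pair in place
            else:
                i += 1
    return symbols

# With the 10 merges learned above, apply_bpe('lowest', merges)
# yields ['low', 'est-'].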
2. Reference: 《Neural Machine Translation of Rare Words with Subword Units》
Paper walkthrough: http://www.sohu.com/a/115373230_465975
import re, collections

def get_stats(vocab):
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, v_in):
    v_out = {}
    bigram = re.escape(' '.join(pair))
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        w_out = p.sub(''.join(pair), word)
        v_out[w_out] = v_in[word]
    return v_out

vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
num_merges = 10
for i in range(num_merges):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)
    print("best ", best)
    vocab = merge_vocab(best, vocab)
    print("vocab ", vocab)

Personally, I still find SentencePiece the most convenient tokenizer to use.
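This snippet is also packaged as the subword-nmt library by the paper's authors. A hedged sketch of its Python API follows; the file names train.txt and codes.bpe are placeholders, not from the original post:

# Sketch using the subword-nmt package; paths are placeholders.
import codecs
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# learn 10k merge operations from a raw-text corpus
with codecs.open('train.txt', encoding='utf-8') as infile, \
        codecs.open('codes.bpe', 'w', encoding='utf-8') as outfile:
    learn_bpe(infile, outfile, num_symbols=10000)

# apply the learned codes to new text ('@@ ' marks non-final subwords)
with codecs.open('codes.bpe', encoding='utf-8') as codes:
    bpe = BPE(codes)
print(bpe.process_line('the lowest newest rates'))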
SentencePiece
Reference: https://github.com/google/sentencepiece/tree/master/python
Tokenizing into 20k label ids:
>>> import sentencepiece as spm
>>> spm.SentencePieceTrainer.Train('--input=/data/yelong/bpe_test/lib.txt --model_prefix=/data/yelong/bpe_test/bpe --vocab_size=20000 --model_type=bpe')

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("/data/yelong/bpe_test/bpe.model")
with open('/data/yelong/bpe_test/wav/train/text.txt', 'a') as fid, \
        open('/data/yelong/bpe_test/wav/train/train.txt') as did:
    for line in did:
        a = line.strip().split()[1:]  # transcript, e.g. "TWO COME MUSE MIGRATE"
        aa = ' '.join([t for t in a])
        listid = sp.EncodeAsIds(aa)
        strid = ' '.join([str(t) for t in listid])
        b = line.strip().split()[:1]  # utterance id
        b = ''.join([t for t in b])
        fid.write(b + ' ' + strid + '\n')

Training produces two files, bpe.model and bpe.vocab.
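As a sanity check, the ids can be decoded back to text. A small sketch reusing the model trained above (the example sentence is illustrative):

# Round-trip check with the model trained above
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("/data/yelong/bpe_test/bpe.model")
pieces = sp.EncodeAsPieces("TWO COME MUSE MIGRATE")  # subword strings
ids = sp.EncodeAsIds("TWO COME MUSE MIGRATE")        # same pieces, as label ids
print(pieces)
print(ids)
print(sp.DecodeIds(ids))  # -> "TWO COME MUSE MIGRATE"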
bpe.vocab:
<unk>	0
<s>	0
</s>	0
▁T	-0
HE	-1
▁A	-2
▁THE	-3
IN	-4
▁S	-5
▁W	-6

This is a piece-to-score mapping; the number in the right column is not the id. model_type can be any of several options (unigram (default), bpe, char, or word), and choosing unigram, for example, yields fractional values in the right column, so it is a score rather than an id.
So in the nabu config I should not have written only 0-19996 in the alphabet (the last score in bpe.vocab is -19996) but 0-19999: the three special tokens plus 19997 merged pieces scored -0 through -19996 make 20000 entries in total, so the ids run from 0 to 19999.
Verified: every id in 0-19999 has a corresponding piece. Verification:
% python
>>> import sentencepiece as spm
>>> sp = spm.SentencePieceProcessor()
>>> sp.Load("/data/yelong/bpe_test/bpe.model")
>>> for i in range(20000):
...     sp.IdToPiece(i)

Every id prints a piece. (An out-of-range id would raise an error and exit.)
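Alternatively, assuming the same loaded model, the vocabulary size can be queried directly instead of probing every id:

>>> sp.GetPieceSize()
20000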