TensorFlow Natural Language Processing
Table of Contents
- Preface
- Basic Knowledge
- Using the API
- Text to sequences
- Padding
- News Headlines Dataset for Sarcasm Detection
Preface
Basic Knowledge
Using the API
```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!'
]

tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)   # take in the data and encode it
word_index = tokenizer.word_index   # key: word, value: the token of the word
print(word_index)
```
Output:
```
{'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
```
- num_words: the maximum number of words to keep, based on word frequency. Only the most frequent num_words (unique) words are kept. (See the documentation for details.)
- tokenizer.fit_on_texts(): the Tokenizer method that fits the tokenizer on the given texts and performs the tokenization.
- The Tokenizer automatically strips punctuation for you: the exclamation mark does not appear in word_index. Uppercase letters are also converted to lowercase.
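One detail worth noting: num_words does not shrink word_index itself; it only limits which tokens texts_to_sequences emits. A minimal sketch of this behaviour, assuming the standard Keras Tokenizer:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = ['I love my dog', 'I love my cat', 'You love my dog!']

tokenizer = Tokenizer(num_words=3)   # keep only the 2 most frequent words (index < 3)
tokenizer.fit_on_texts(sentences)

print(tokenizer.word_index)                      # all 6 words are still indexed
print(tokenizer.texts_to_sequences(sentences))   # only 'love' (1) and 'my' (2) survive
# -> [[1, 2], [1, 2], [1, 2]]
```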
Text to sequences
```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)   # take in the data and encode it
word_index = tokenizer.word_index   # key: word, value: the token of the word

sequences = tokenizer.texts_to_sequences(sentences)

print(word_index)
print(sequences)
```
Output:
```
{'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5, 'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}
[[4, 2, 1, 3], [4, 2, 1, 6], [5, 2, 1, 3], [7, 5, 8, 1, 3, 9, 10]]
```
Now append the following to the code above:
```python
test_data = [
    'I really love my dog',
    'my dog loves my manatee'
]

test_seq = tokenizer.texts_to_sequences(test_data)
print(test_seq)
```
Output:
```
[[4, 2, 1, 3], [1, 3, 1]]
```
Conclusion: the tokenizer needs to be fit on a lot of data, otherwise we end up with results like the above: "my dog loves my manatee" collapses to "my dog my", and "really" is simply dropped from the first sentence.
What happens if we use a special token to represent unknown words instead of just ignoring them?
Modify the tokenizer: tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
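Putting the pieces together, the modified example looks roughly like this:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]
test_data = [
    'I really love my dog',
    'my dog loves my manatee'
]

# Words not seen during fit_on_texts are mapped to the "<OOV>" token
# instead of being silently dropped.
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

test_seq = tokenizer.texts_to_sequences(test_data)
print(word_index)
print(test_seq)
```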
Output:
```
{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
[[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]
```
Padding
```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

sequences = tokenizer.texts_to_sequences(sentences)

padded1 = pad_sequences(sequences)
padded2 = pad_sequences(sequences, padding='post')
padded3 = pad_sequences(sequences, padding='post', maxlen=5)

print(padded1)
print(padded2)
print(padded3)
```
Output:
```
[[ 0  0  0  5  3  2  4]
 [ 0  0  0  5  3  2  7]
 [ 0  0  0  6  3  2  4]
 [ 8  6  9  2  4 10 11]]
[[ 5  3  2  4  0  0  0]
 [ 5  3  2  7  0  0  0]
 [ 6  3  2  4  0  0  0]
 [ 8  6  9  2  4 10 11]]
[[ 5  3  2  4  0]
 [ 5  3  2  7  0]
 [ 6  3  2  4  0]
 [ 9  2  4 10 11]]
```
- pad_sequences: truncates or pads multiple sequences to the same length. (See the documentation for details.)
- padding: string, 'pre' or 'post': whether to pad at the start or at the end of each sequence.
- maxlen: integer, the maximum length of all sequences (the related truncation behaviour is sketched below).
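pad_sequences also accepts a truncating argument ('pre' by default) that decides which end of a too-long sequence is cut when maxlen is set; that is why the last row of padded3 above lost its first two tokens. A small sketch:

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

sequences = [[8, 6, 9, 2, 4, 10, 11]]   # the longest sentence from above

# Default truncating='pre' drops tokens from the front ...
print(pad_sequences(sequences, maxlen=5))
# -> [[ 9  2  4 10 11]]

# ... while truncating='post' drops tokens from the end instead.
print(pad_sequences(sequences, maxlen=5, truncating='post'))
# -> [[8 6 9 2 4]]
```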
News Headlines Dataset for Sarcasm Detection
Dataset: a CC0 public domain dataset for sarcasm detection.
News Headlines Dataset For Sarcasm Detection
Each record consists of three attributes:
- is_sarcastic: 1 if the record is sarcastic otherwise 0
- headline: the headline of the news article
- article_link: link to the original news article. Useful for collecting supplementary data
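For illustration, a single record in sarcasm.json has roughly the following shape (the values below are placeholders, not actual entries from the dataset):

```python
# Placeholder record for illustration only; not an actual entry from the dataset.
record = {
    "is_sarcastic": 1,                               # 1 = sarcastic, 0 = not
    "headline": "example headline text",             # the news headline
    "article_link": "https://example.com/article"    # link to the original article
}
```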
Note: Laurence modified the dataset slightly for convenience.
```python
import json
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

with open('sarcasm.json', 'r') as f:
    datastore = json.load(f)   # returns a list of records, each with the three attributes above

sentences = []
labels = []
urls = []
for item in datastore:
    sentences.append(item['headline'])
    labels.append(item['is_sarcastic'])
    urls.append(item['article_link'])

tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, padding='post')

print(padded[0])
print(padded.shape)
```
Output:
```
[  308 15115   679  3337  2298    48   382  2576 15116     6  2577  8434
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0]
(26709, 40)
```
The shape (26709, 40) means there are 26,709 headlines in total, and the longest headline has 40 words, so every sequence is padded to length 40. The words in word_index are indexed by frequency, from most to least common.
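To double-check these numbers, the actual vocabulary size and the padded shape can be inspected directly; a quick sketch, assuming the code above has just been run:

```python
# word_index holds every unique word seen in the headlines, so its length
# is the vocabulary size; padded.shape describes the padded matrix instead.
print(len(word_index))    # number of unique words in the headlines
print(padded.shape[0])    # number of headlines: 26709
print(padded.shape[1])    # length of the longest headline after padding: 40
```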
Summary