TensorFlow Natural Language Processing
Table of Contents
- Preface
- Basic Knowledge
- Using the API
- Text to sequences
- Padding
- News Headlines Dataset for Sarcasm Detection
Preface
Basic Knowledge
Using the API
```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!'
]

tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)   # take in the data and encode it
word_index = tokenizer.word_index   # key: word, value: the token of the word
print(word_index)
```
Output:
```
{'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
```
- num_words: the maximum number of words to keep, based on word frequency. Only the most frequent num_words (unique) words are kept. (See the documentation for details.)
- tokenizer.fit_on_texts(): the Tokenizer method that fits the tokenizer on the given texts and performs the tokenization.
- The Tokenizer automatically strips punctuation for you: the exclamation mark does not appear in word_index. Uppercase letters are also converted to lowercase.
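One detail worth noting: num_words does not shrink word_index itself; it only limits which tokens texts_to_sequences emits. A minimal sketch of this behaviour, assuming the standard Keras Tokenizer:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = ['I love my dog', 'I love my cat', 'You love my dog!']

tokenizer = Tokenizer(num_words=3)   # keep only the 2 most frequent words (index < 3)
tokenizer.fit_on_texts(sentences)

print(tokenizer.word_index)                      # all 6 words are still indexed
print(tokenizer.texts_to_sequences(sentences))   # only 'love' (1) and 'my' (2) survive
# -> [[1, 2], [1, 2], [1, 2]]
```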
Text to sequences
```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)   # take in the data and encode it
word_index = tokenizer.word_index   # key: word, value: the token of the word

sequences = tokenizer.texts_to_sequences(sentences)

print(word_index)
print(sequences)
```
Output:
```
{'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5, 'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}
[[4, 2, 1, 3], [4, 2, 1, 6], [5, 2, 1, 3], [7, 5, 8, 1, 3, 9, 10]]
```
Now append the following to the code above:
```python
test_data = [
    'I really love my dog',
    'my dog loves my manatee'
]

test_seq = tokenizer.texts_to_sequences(test_data)
print(test_seq)
```
Output:
```
[[4, 2, 1, 3], [1, 3, 1]]
```
Conclusion: the tokenizer needs to be fit on a lot of data, otherwise we end up with results like the above: "my dog loves my manatee" collapses to "my dog my", and "really" is simply dropped from the first sentence.
What happens if we use a special token to represent unknown words instead of just ignoring them?
Modify the tokenizer: tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
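Putting the pieces together, the modified example looks roughly like this:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]
test_data = [
    'I really love my dog',
    'my dog loves my manatee'
]

# Words not seen during fit_on_texts are mapped to the "<OOV>" token
# instead of being silently dropped.
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

test_seq = tokenizer.texts_to_sequences(test_data)
print(word_index)
print(test_seq)
```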
Output:
```
{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
[[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]
```
Padding
```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

sequences = tokenizer.texts_to_sequences(sentences)

padded1 = pad_sequences(sequences)
padded2 = pad_sequences(sequences, padding='post')
padded3 = pad_sequences(sequences, padding='post', maxlen=5)

print(padded1)
print(padded2)
print(padded3)
```
Output:
```
[[ 0  0  0  5  3  2  4]
 [ 0  0  0  5  3  2  7]
 [ 0  0  0  6  3  2  4]
 [ 8  6  9  2  4 10 11]]
[[ 5  3  2  4  0  0  0]
 [ 5  3  2  7  0  0  0]
 [ 6  3  2  4  0  0  0]
 [ 8  6  9  2  4 10 11]]
[[ 5  3  2  4  0]
 [ 5  3  2  7  0]
 [ 6  3  2  4  0]
 [ 9  2  4 10 11]]
```
- pad_sequences: truncates or pads multiple sequences to the same length. (See the documentation for details.)
- padding: string, 'pre' or 'post': whether to pad at the start or at the end of each sequence.
- maxlen: integer, the maximum length of all sequences (the related truncation behaviour is sketched below).
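pad_sequences also accepts a truncating argument ('pre' by default) that decides which end of a too-long sequence is cut when maxlen is set; that is why the last row of padded3 above lost its first two tokens. A small sketch:

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

sequences = [[8, 6, 9, 2, 4, 10, 11]]   # the longest sentence from above

# Default truncating='pre' drops tokens from the front ...
print(pad_sequences(sequences, maxlen=5))
# -> [[ 9  2  4 10 11]]

# ... while truncating='post' drops tokens from the end instead.
print(pad_sequences(sequences, maxlen=5, truncating='post'))
# -> [[8 6 9 2 4]]
```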
News Headlines Dataset for Sarcasm Detection
Dataset: a CC0 public domain dataset for sarcasm detection.
News Headlines Dataset For Sarcasm Detection
Each record consists of three attributes:
- is_sarcastic: 1 if the record is sarcastic otherwise 0
- headline: the headline of the news article
- article_link: link to the original news article. Useful for collecting supplementary data
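For illustration, a single record in sarcasm.json has roughly the following shape (the values below are placeholders, not actual entries from the dataset):

```python
# Placeholder record for illustration only; not an actual entry from the dataset.
record = {
    "is_sarcastic": 1,                               # 1 = sarcastic, 0 = not
    "headline": "example headline text",             # the news headline
    "article_link": "https://example.com/article"    # link to the original article
}
```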
Note: Laurence modified the dataset slightly for convenience.
```python
import json
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

with open('sarcasm.json', 'r') as f:
    datastore = json.load(f)   # returns a list of records, each with the three attributes above

sentences = []
labels = []
urls = []
for item in datastore:
    sentences.append(item['headline'])
    labels.append(item['is_sarcastic'])
    urls.append(item['article_link'])

tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, padding='post')

print(padded[0])
print(padded.shape)
```
Output:
```
[  308 15115   679  3337  2298    48   382  2576 15116     6  2577  8434
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0]
(26709, 40)
```
The shape (26709, 40) means there are 26,709 headlines in total, and the longest headline has 40 words, so every sequence is padded to length 40. The words in word_index are indexed by frequency, from most to least common.
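To double-check these numbers, the actual vocabulary size and the padded shape can be inspected directly; a quick sketch, assuming the code above has just been run:

```python
# word_index holds every unique word seen in the headlines, so its length
# is the vocabulary size; padded.shape describes the padded matrix instead.
print(len(word_index))    # number of unique words in the headlines
print(padded.shape[0])    # number of headlines: 26709
print(padded.shape[1])    # length of the longest headline after padding: 40
```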
Summary