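Text-CNN: Text Classification with Keras
Text CNN
1. Introduction
TextCNN is an algorithm that uses convolutional neural networks to classify text. It was proposed by Yoon Kim in 2014 in the paper "Convolutional Neural Networks for Sentence Classification".
We will implement a model similar to Yoon Kim's convolutional neural network for sentence classification. The model proposed in that paper achieves good classification performance across a range of text classification tasks (such as sentiment analysis) and has become a standard baseline for new text classification architectures.
2. Prepare the required libraries and dataset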
- tensorflow
- h5py
- hdf5
- keras
- numpy
- itertools
- collections
- re
- sklearn 0.19.0
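Prepare the dataset:
Link: https://pan.baidu.com/s/1oO4pDHeu3xIgkDtkLgQEVA Password: 6wrv
3. Data and preprocessing
The dataset we use is the Movie Review data from Rotten Tomatoes, one of the datasets used in the original paper. It contains 10,662 example review sentences, half positive and half negative, and is about 1 MB in size. Note that because the dataset is this small, a powerful model is likely to overfit it. The dataset also does not come with a predefined train/test split, so we simply hold out 20% of the data as the test set.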
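The 20% hold-out can be sketched as follows. This is a minimal sketch rather than the post's own splitting code: the helper name split_train_test, the fixed seed, and the zero-filled stand-in arrays are assumptions made for illustration. Shuffling before the split matters because the loader places all positive sentences before all negative ones.

```python
import numpy as np


def split_train_test(x, y, test_fraction=0.2, seed=0):
    """Shuffle x and y with the same permutation, then hold out the tail as the test set."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    x, y = x[order], y[order]
    n_test = int(len(x) * test_fraction)
    n_train = len(x) - n_test
    return x[:n_train], x[n_train:], y[:n_train], y[n_train:]


# Stand-in arrays with the dataset's size: 10,662 sentences of 56 token indices each.
x = np.zeros((10662, 56), dtype=np.int64)
y = np.zeros((10662, 2), dtype=np.int64)
x_train, x_test, y_train, y_test = split_train_test(x, y)
print(x_train.shape, x_test.shape)  # (8530, 56) (2132, 56)
```

sklearn's train_test_split would do the same job; the hand-written version only makes the shuffle explicit.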
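The data preprocessing functions (in data_helpers.py) do the following:
- load_data_and_labels() loads the positive and negative sentences from the raw data files and gives each sentence a one-hot label: [0, 1] for positive and [1, 0] for negative.
- clean_str() cleans each sentence with regular expressions: it replaces characters other than letters, digits, and a few punctuation marks with spaces, splits contractions and the remaining punctuation into separate tokens, and lowercases the text.
- pad_sentences() pads every sentence to the length of the longest sentence by appending <PAD/> tokens. Padding lets us batch the data efficiently, because every example in a batch must have the same length.
- build_vocab() builds the vocabulary: it deduplicates the words, sorts them, and uses each word's position in the sorted list as its index, so every word maps to an integer from 0 to the vocabulary size minus 1. Each sentence then becomes a vector of integers.
- build_input_data() converts the processed sentences and labels to numpy arrays.
- load_data() combines the steps above into a single function.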
import numpy as np
import re
import itertools
from collections import Counter


def clean_str(string):
    """
    Tokenization/string cleaning for datasets.
    Original taken from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py
    """
    # Replace every character outside the allowed set with a space.
    string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string)
    # Split contractions into separate tokens.
    string = re.sub(r"\'s", " \'s", string)
    string = re.sub(r"\'ve", " \'ve", string)
    string = re.sub(r"n\'t", " n\'t", string)
    string = re.sub(r"\'re", " \'re", string)
    string = re.sub(r"\'d", " \'d", string)
    string = re.sub(r"\'ll", " \'ll", string)
    # Surround the remaining punctuation with spaces.
    string = re.sub(r",", " , ", string)
    string = re.sub(r"!", " ! ", string)
    string = re.sub(r"\(", r" \( ", string)
    string = re.sub(r"\)", r" \) ", string)
    string = re.sub(r"\?", r" \? ", string)
    # Collapse runs of whitespace.
    string = re.sub(r"\s{2,}", " ", string)
    return string.strip().lower()


def load_data_and_labels():
    """
    Loads polarity data from files, splits the data into words and generates labels.
    Returns split sentences and labels.
    """
    # Load data from files
    with open("./data/rt-polarity.pos", "r", encoding="latin-1") as f:
        positive_examples = [s.strip() for s in f]
    with open("./data/rt-polarity.neg", "r", encoding="latin-1") as f:
        negative_examples = [s.strip() for s in f]
    # Split by words
    x_text = positive_examples + negative_examples
    x_text = [clean_str(sent) for sent in x_text]
    x_text = [s.split(" ") for s in x_text]
    # Generate labels
    positive_labels = [[0, 1] for _ in positive_examples]
    negative_labels = [[1, 0] for _ in negative_examples]
    y = np.concatenate([positive_labels, negative_labels], 0)
    return [x_text, y]


def pad_sentences(sentences, padding_word="<PAD/>"):
    """
    Pads all sentences to the same length. The length is defined by the longest sentence.
    Returns padded sentences.
    """
    sequence_length = max(len(x) for x in sentences)
    padded_sentences = []
    for sentence in sentences:
        num_padding = sequence_length - len(sentence)
        padded_sentences.append(sentence + [padding_word] * num_padding)
    return padded_sentences


def build_vocab(sentences):
    """
    Builds a vocabulary mapping from word to index based on the sentences.
    Returns vocabulary mapping and inverse vocabulary mapping.
    """
    # Build vocabulary
    word_counts = Counter(itertools.chain(*sentences))
    # Mapping from index to word (sorted alphabetically)
    vocabulary_inv = [x[0] for x in word_counts.most_common()]
    vocabulary_inv = list(sorted(vocabulary_inv))
    # Mapping from word to index
    vocabulary = {x: i for i, x in enumerate(vocabulary_inv)}
    return [vocabulary, vocabulary_inv]


def build_input_data(sentences, labels, vocabulary):
    """
    Maps sentences and labels to vectors based on a vocabulary.
    """
    x = np.array([[vocabulary[word] for word in sentence] for sentence in sentences])
    y = np.array(labels)
    return [x, y]


def load_data():
    """
    Loads and preprocesses the data for the dataset.
    Returns input vectors, labels, vocabulary, and inverse vocabulary.
    """
    # Load and preprocess data
    sentences, labels = load_data_and_labels()
    sentences_padded = pad_sentences(sentences)
    vocabulary, vocabulary_inv = build_vocab(sentences_padded)
    x, y = build_input_data(sentences_padded, labels, vocabulary)
    return [x, y, vocabulary, vocabulary_inv]
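To see what padding and vocabulary indexing produce, here is a small self-contained sketch on two made-up sentences. It repeats the same steps inline instead of importing data_helpers.py, and the toy sentences are an assumption chosen for illustration.

```python
import itertools
from collections import Counter

import numpy as np

# Two toy tokenized sentences (hypothetical, not from the dataset).
sentences = [["a", "good", "movie"], ["bad"]]

# Pad every sentence to the length of the longest one.
max_len = max(len(s) for s in sentences)
padded = [s + ["<PAD/>"] * (max_len - len(s)) for s in sentences]

# Deduplicate, sort, and use each word's sorted position as its index.
vocabulary_inv = sorted(Counter(itertools.chain(*padded)))
vocabulary = {word: i for i, word in enumerate(vocabulary_inv)}

# Each sentence becomes a row of integer indices.
x = np.array([[vocabulary[word] for word in s] for s in padded])
print(vocabulary_inv)  # ['<PAD/>', 'a', 'bad', 'good', 'movie']
print(x.tolist())      # [[1, 3, 4], [2, 0, 0]]
```

Here <PAD/> happens to get index 0 only because it sorts before lowercase letters. On the real data, tokens that start with digits or with punctuation such as "!" or "," sort ahead of it, so the padding index is not guaranteed to be 0.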
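4. Model
The first layer embeds each word into a low-dimensional vector. The next layer convolves over the embedded word vectors with several filter sizes, for example sliding over 3, 4, or 5 words at a time. Each convolution is followed by a max-pooling layer.
The outputs of the three convolution and pooling branches are then concatenated. A Flatten layer merges the max-pooled features into one long feature vector, dropout is applied for regularization, and a softmax layer classifies the result.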
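The architecture described above can be sketched with the Keras functional API. This is a sketch under stated assumptions, not the post's original model code: the sizes (sequence length 56, embedding dimension 256, 512 filters per filter size) are read off the layer summary in this section, the vocabulary size 18765 is inferred from the embedding parameter count (4,803,840 / 256), and the ReLU activation and dropout rate of 0.5 are assumptions. The compile settings follow the choices listed after the summary.

```python
from tensorflow.keras.layers import (Concatenate, Conv2D, Dense, Dropout,
                                     Embedding, Flatten, Input, MaxPooling2D,
                                     Reshape)
from tensorflow.keras.models import Model

sequence_length = 56     # from the summary's input shape
vocabulary_size = 18765  # inferred from the embedding parameter count
embedding_dim = 256      # from the summary's embedding output shape
num_filters = 512        # from the summary's Conv2D output shapes
filter_sizes = [3, 4, 5]
drop_rate = 0.5          # assumption: the rate is not stated in the post

inputs = Input(shape=(sequence_length,), dtype="int32")
embedded = Embedding(input_dim=vocabulary_size, output_dim=embedding_dim)(inputs)
# Add a channel axis so each sentence looks like a 56 x 256 single-channel image.
reshaped = Reshape((sequence_length, embedding_dim, 1))(embedded)

pooled = []
for size in filter_sizes:
    # Each filter spans `size` words and the full embedding width.
    conv = Conv2D(num_filters, kernel_size=(size, embedding_dim),
                  padding="valid", activation="relu")(reshaped)  # relu is an assumption
    # Max over all positions the filter visited: one value per filter.
    pool = MaxPooling2D(pool_size=(sequence_length - size + 1, 1),
                        strides=(1, 1), padding="valid")(conv)
    pooled.append(pool)

merged = Concatenate(axis=1)(pooled)  # (None, 3, 1, 512)
flat = Flatten()(merged)              # (None, 1536)
dropped = Dropout(drop_rate)(flat)
outputs = Dense(2, activation="softmax")(dropped)

model = Model(inputs=inputs, outputs=outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```

Under these assumptions the model has 6,381,314 parameters, which matches the total in the summary.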
________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to
========================================================================================
input_1 (InputLayer)             (None, 56)            0
________________________________________________________________________________________
embedding_1 (Embedding)          (None, 56, 256)       4803840     input_1[0][0]
________________________________________________________________________________________
reshape_1 (Reshape)              (None, 56, 256, 1)    0           embedding_1[0][0]
________________________________________________________________________________________
conv2d_1 (Conv2D)                (None, 54, 1, 512)    393728      reshape_1[0][0]
________________________________________________________________________________________
conv2d_2 (Conv2D)                (None, 53, 1, 512)    524800      reshape_1[0][0]
________________________________________________________________________________________
conv2d_3 (Conv2D)                (None, 52, 1, 512)    655872      reshape_1[0][0]
________________________________________________________________________________________
max_pooling2d_1 (MaxPooling2D)   (None, 1, 1, 512)     0           conv2d_1[0][0]
________________________________________________________________________________________
max_pooling2d_2 (MaxPooling2D)   (None, 1, 1, 512)     0           conv2d_2[0][0]
________________________________________________________________________________________
max_pooling2d_3 (MaxPooling2D)   (None, 1, 1, 512)     0           conv2d_3[0][0]
________________________________________________________________________________________
concatenate_1 (Concatenate)      (None, 3, 1, 512)     0           max_pooling2d_1[0][0]
                                                                   max_pooling2d_2[0][0]
                                                                   max_pooling2d_3[0][0]
________________________________________________________________________________________
flatten_1 (Flatten)              (None, 1536)          0           concatenate_1[0][0]
________________________________________________________________________________________
dropout_1 (Dropout)              (None, 1536)          0           flatten_1[0][0]
________________________________________________________________________________________
dense_1 (Dense)                  (None, 2)             3074        dropout_1[0][0]
========================================================================================
Total params: 6,381,314
Trainable params: 6,381,314
Non-trainable params: 0
________________________________________________________________________________________
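The parameter counts in the summary can be checked by hand. The arithmetic below is a sketch that assumes a vocabulary of 18765 words, a value inferred from the embedding parameter count rather than stated in the post; the other sizes come directly from the output shapes.

```python
vocabulary_size = 18765  # inferred: 4,803,840 embedding params / 256 dimensions
embedding_dim = 256
num_filters = 512
filter_sizes = [3, 4, 5]
num_classes = 2

# Embedding: one 256-dimensional vector per vocabulary entry.
embedding_params = vocabulary_size * embedding_dim

# Conv2D: kernel height * kernel width * input channels * filters, plus one bias per filter.
conv_params = [size * embedding_dim * 1 * num_filters + num_filters
               for size in filter_sizes]

# Dense: 3 * 512 = 1536 flattened features to 2 classes, plus one bias per class.
dense_params = len(filter_sizes) * num_filters * num_classes + num_classes

total = embedding_params + sum(conv_params) + dense_params
print(embedding_params, conv_params, dense_params, total)
# 4803840 [393728, 524800, 655872] 3074 6381314
```

The embedding layer holds roughly three quarters of all parameters, which is worth keeping in mind given how small the dataset is.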
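- Optimizer: adam
- Loss: binary_crossentropy (this is a binary classification problem)
- Metric: the standard classification metric (whether each example is classified correctly)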
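With one-hot labels and a two-unit softmax output, binary cross-entropy averaged over the two units gives the same value as categorical cross-entropy, because the probability of the wrong class is one minus the probability of the right class. The NumPy sketch below illustrates this on made-up predictions. The function names and sample values are assumptions for illustration, and the code mirrors the textbook formulas rather than Keras internals.

```python
import numpy as np


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Per-example binary cross-entropy, averaged over the output units."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    per_unit = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return per_unit.mean(axis=-1)


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Per-example categorical cross-entropy for one-hot labels."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -(y_true * np.log(y_pred)).sum(axis=-1)


def accuracy(y_true, y_pred):
    """Fraction of examples whose most probable class matches the label."""
    return float((y_true.argmax(axis=-1) == y_pred.argmax(axis=-1)).mean())


# Hypothetical one-hot labels and softmax outputs for three sentences.
y_true = np.array([[0, 1], [1, 0], [0, 1]], dtype=float)
y_pred = np.array([[0.2, 0.8], [0.6, 0.4], [0.7, 0.3]])

bce = binary_crossentropy(y_true, y_pred)
cce = categorical_crossentropy(y_true, y_pred)
print(np.allclose(bce, cce))     # True
print(accuracy(y_true, y_pred))  # 2 of the 3 examples are classified correctly
```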
Reposted from: https://www.cnblogs.com/ansang/p/9010370.html