Natural Language Processing: Classifying US Politicians' Social Media Messages
數(shù)據(jù)簡介: Disasters on social media
美國政客的社交媒體消息分類
內(nèi)容:收集了來自美國參議員和其他美國政客的數(shù)千條社交媒體消息,可按內(nèi)容分類為目標群眾(國家或選民)、政治主張(中立/兩黨或偏見/黨派)和實際內(nèi)容(如攻擊政敵等)
社交媒體上有些討論是關于災難,疾病,暴亂的,有些只是開玩笑或者是電影情節(jié),我們該如何讓機器能分辨出這兩種討論呢?
```python
import keras
import nltk
import pandas as pd
import numpy as np
import re
import codecs

questions = pd.read_csv("socialmedia_relevant_cols_clean.csv")
questions.columns = ['text', 'choose_one', 'class_label']
questions.head()
```

| text | choose_one | class_label |
| --- | --- | --- |
| Just happened a terrible car crash | Relevant | 1 |
| Our Deeds are the Reason of this #earthquake M... | Relevant | 1 |
| Heard about #earthquake is different cities, s... | Relevant | 1 |
| there is a forest fire at spot pond, geese are... | Relevant | 1 |
| Forest fire near La Ronge Sask. Canada | Relevant | 1 |
Summary statistics of `class_label` (this appears to be the output of `questions.describe()`):

| statistic | class_label |
| --- | --- |
| count | 10876.000000 |
| mean | 0.432604 |
| std | 0.498420 |
| min | 0.000000 |
| 25% | 0.000000 |
| 50% | 0.000000 |
| 75% | 1.000000 |
| max | 2.000000 |
數(shù)據(jù)清洗,去掉無用字符
```python
def standardize_text(df, text_field):
    # strip URLs and @mentions, keep only basic characters, lowercase everything
    # note: on newer pandas (2.x) pass regex=True so these patterns are treated as regexes
    df[text_field] = df[text_field].str.replace(r"http\S+", "")
    df[text_field] = df[text_field].str.replace(r"http", "")
    df[text_field] = df[text_field].str.replace(r"@\S+", "")
    df[text_field] = df[text_field].str.replace(r"[^A-Za-z0-9(),!?@\'\`\"\_\n]", " ")
    df[text_field] = df[text_field].str.replace(r"@", "at")
    df[text_field] = df[text_field].str.lower()
    return df

questions = standardize_text(questions, "text")
questions.to_csv("clean_data.csv")
questions.head()
```

| text | choose_one | class_label |
| --- | --- | --- |
| just happened a terrible car crash | Relevant | 1 |
| our deeds are the reason of this earthquake m... | Relevant | 1 |
| heard about earthquake is different cities, s... | Relevant | 1 |
| there is a forest fire at spot pond, geese are... | Relevant | 1 |
| forest fire near la ronge sask canada | Relevant | 1 |
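The cells below work with a `clean_questions` DataFrame and show the tail of the cleaned data, but the step that creates it is not in the captured output. Presumably the cleaned CSV is simply read back in, roughly like this (a minimal sketch; the file and variable names follow the surrounding cells):

```python
# Reload the cleaned data; later cells reference this DataFrame.
# (Assumed step -- not shown in the original notebook output.)
clean_questions = pd.read_csv("clean_data.csv")
clean_questions.tail()
```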
| index | text | choose_one | class_label |
| --- | --- | --- | --- |
| 10871 | m1 94 01 04 utc ?5km s of volcano hawaii | Relevant | 1 |
| 10872 | police investigating after an e bike collided ... | Relevant | 1 |
| 10873 | the latest more homes razed by northern calif... | Relevant | 1 |
| 10874 | meg issues hazardous weather outlook (hwo) | Relevant | 1 |
| 10875 | cityofcalgary has activated its municipal eme... | Relevant | 1 |
數(shù)據(jù)分布情況
數(shù)據(jù)是否傾斜
```python
clean_questions.groupby("class_label").count()
```

| class_label | count |
| --- | --- |
| 0 | 6187 |
| 1 | 4673 |
| 2 | 16 |

(Each row of the original output repeats the same count for every remaining column.)
The two main classes (0 and 1) look reasonably balanced; class 2 ("Unsure") has only 16 examples, which will matter later.
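A quick way to see the same thing as proportions (a small sketch using the DataFrame above):

```python
# Fraction of examples per class; class 2 is a tiny sliver of the data.
print(clean_questions["class_label"].value_counts(normalize=True))
```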
Processing pipeline

- Tokenization (a sketch of this step follows the list, since only its result appears in the captured output)
- Train/test split
- Inspection and validation
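The table below already contains a `tokens` column, but the cell that produces it is missing from the captured output. A minimal sketch of that step, assuming NLTK's `RegexpTokenizer` (the exact tokenizer used originally is an assumption):

```python
from nltk.tokenize import RegexpTokenizer

# Split each cleaned message into word tokens (assumed tokenizer choice).
tokenizer = RegexpTokenizer(r'\w+')
clean_questions["tokens"] = clean_questions["text"].apply(tokenizer.tokenize)
clean_questions.head()
```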
| index | text | choose_one | class_label | tokens |
| --- | --- | --- | --- | --- |
| 0 | just happened a terrible car crash | Relevant | 1 | [just, happened, a, terrible, car, crash] |
| 1 | our deeds are the reason of this earthquake m... | Relevant | 1 | [our, deeds, are, the, reason, of, this, earth... |
| 2 | heard about earthquake is different cities, s... | Relevant | 1 | [heard, about, earthquake, is, different, citi... |
| 3 | there is a forest fire at spot pond, geese are... | Relevant | 1 | [there, is, a, forest, fire, at, spot, pond, g... |
| 4 | forest fire near la ronge sask canada | Relevant | 1 | [forest, fire, near, la, ronge, sask, canada] |
Corpus statistics
```python
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

all_words = [word for tokens in clean_questions["tokens"] for word in tokens]
sentence_lengths = [len(tokens) for tokens in clean_questions["tokens"]]
VOCAB = sorted(list(set(all_words)))
print("%s words total, with a vocabulary size of %s" % (len(all_words), len(VOCAB)))
print("Max sentence length is %s" % max(sentence_lengths))
```

```
154724 words total, with a vocabulary size of 18101
Max sentence length is 34
```

Sentence length distribution
```python
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(10, 10))
plt.xlabel('Sentence length')
plt.ylabel('Number of sentences')
plt.hist(sentence_lengths)
plt.show()
```

How do we build features?
Bag of Words Counts
```python
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

def cv(data):
    count_vectorizer = CountVectorizer()
    emb = count_vectorizer.fit_transform(data)
    return emb, count_vectorizer

list_corpus = clean_questions["text"].tolist()
list_labels = clean_questions["class_label"].tolist()

X_train, X_test, y_train, y_test = train_test_split(list_corpus, list_labels, test_size=0.2, random_state=40)

X_train_counts, count_vectorizer = cv(X_train)
X_test_counts = count_vectorizer.transform(X_test)
```

Visualizing the Bag of Words with PCA
```python
from sklearn.decomposition import PCA, TruncatedSVD
import matplotlib
import matplotlib.patches as mpatches

def plot_LSA(test_data, test_labels, savepath="PCA_demo.csv", plot=True):
    # project the sparse features down to 2D with truncated SVD (LSA) for visualization
    lsa = TruncatedSVD(n_components=2)
    lsa.fit(test_data)
    lsa_scores = lsa.transform(test_data)
    color_mapper = {label: idx for idx, label in enumerate(set(test_labels))}
    color_column = [color_mapper[label] for label in test_labels]
    colors = ['orange', 'blue', 'blue']
    if plot:
        plt.scatter(lsa_scores[:, 0], lsa_scores[:, 1], s=8, alpha=.8, c=test_labels,
                    cmap=matplotlib.colors.ListedColormap(colors))
        red_patch = mpatches.Patch(color='orange', label='Irrelevant')
        green_patch = mpatches.Patch(color='blue', label='Disaster')
        plt.legend(handles=[red_patch, green_patch], prop={'size': 30})

fig = plt.figure(figsize=(16, 16))
plot_LSA(X_train_counts, y_train)
plt.show()
```
The plot does not separate the two classes at all.
Let's fit a logistic regression and look at the results
```python
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(C=30.0, class_weight='balanced', solver='newton-cg',
                         multi_class='multinomial', n_jobs=-1, random_state=40)
clf.fit(X_train_counts, y_train)

y_predicted_counts = clf.predict(X_test_counts)
```

Evaluation
```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report

def get_metrics(y_test, y_predicted):
    # true positives / (true positives + false positives)
    precision = precision_score(y_test, y_predicted, pos_label=None, average='weighted')
    # true positives / (true positives + false negatives)
    recall = recall_score(y_test, y_predicted, pos_label=None, average='weighted')
    # harmonic mean of precision and recall
    f1 = f1_score(y_test, y_predicted, pos_label=None, average='weighted')
    # (true positives + true negatives) / total
    accuracy = accuracy_score(y_test, y_predicted)
    return accuracy, precision, recall, f1

accuracy, precision, recall, f1 = get_metrics(y_test, y_predicted_counts)
print("accuracy = %.3f, precision = %.3f, recall = %.3f, f1 = %.3f" % (accuracy, precision, recall, f1))
```

```
accuracy = 0.754, precision = 0.752, recall = 0.754, f1 = 0.753
```

Inspecting the confusion matrix
```python
import numpy as np
import itertools
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(cm, classes, normalize=False, title='Confusion matrix', cmap=plt.cm.winter):
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title, fontsize=30)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, fontsize=20)
    plt.yticks(tick_marks, classes, fontsize=20)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] < thresh else "black", fontsize=40)

    plt.tight_layout()
    plt.ylabel('True label', fontsize=30)
    plt.xlabel('Predicted label', fontsize=30)
    return plt

cm = confusion_matrix(y_test, y_predicted_counts)
fig = plt.figure(figsize=(10, 10))
plot = plot_confusion_matrix(cm, classes=['Irrelevant', 'Disaster', 'Unsure'], normalize=False, title='Confusion matrix')
plt.show()
print(cm)
```

```
[[970 251   3]
 [274 670   1]
 [  3   4   0]]
```

Why does the third class never get predicted correctly? Because there are barely any examples of it in the data.
Digging into what the model pays attention to
```python
def get_most_important_features(vectorizer, model, n=5):
    index_to_word = {v: k for k, v in vectorizer.vocabulary_.items()}

    # loop over each class
    classes = {}
    for class_index in range(model.coef_.shape[0]):
        word_importances = [(el, index_to_word[i]) for i, el in enumerate(model.coef_[class_index])]
        sorted_coeff = sorted(word_importances, key=lambda x: x[0], reverse=True)
        tops = sorted(sorted_coeff[:n], key=lambda x: x[0])
        bottom = sorted_coeff[-n:]
        classes[class_index] = {'tops': tops, 'bottom': bottom}
    return classes

importance = get_most_important_features(count_vectorizer, clf, 10)

def plot_important_words(top_scores, top_words, bottom_scores, bottom_words, name):
    y_pos = np.arange(len(top_words))
    top_pairs = [(a, b) for a, b in zip(top_words, top_scores)]
    top_pairs = sorted(top_pairs, key=lambda x: x[1])

    bottom_pairs = [(a, b) for a, b in zip(bottom_words, bottom_scores)]
    bottom_pairs = sorted(bottom_pairs, key=lambda x: x[1], reverse=True)

    top_words = [a[0] for a in top_pairs]
    top_scores = [a[1] for a in top_pairs]

    bottom_words = [a[0] for a in bottom_pairs]
    bottom_scores = [a[1] for a in bottom_pairs]

    fig = plt.figure(figsize=(10, 10))

    plt.subplot(121)
    plt.barh(y_pos, bottom_scores, align='center', alpha=0.5)
    plt.title('Irrelevant', fontsize=20)
    plt.yticks(y_pos, bottom_words, fontsize=14)
    plt.suptitle('Key words', fontsize=16)
    plt.xlabel('Importance', fontsize=20)

    plt.subplot(122)
    plt.barh(y_pos, top_scores, align='center', alpha=0.5)
    plt.title('Disaster', fontsize=20)
    plt.yticks(y_pos, top_words, fontsize=14)
    plt.suptitle(name, fontsize=16)
    plt.xlabel('Importance', fontsize=20)

    plt.subplots_adjust(wspace=0.8)
    plt.show()

top_scores = [a[0] for a in importance[1]['tops']]
top_words = [a[1] for a in importance[1]['tops']]
bottom_scores = [a[0] for a in importance[1]['bottom']]
bottom_words = [a[1] for a in importance[1]['bottom']]

plot_important_words(top_scores, top_words, bottom_scores, bottom_words, "Most important words for relevance")
```

Our model has picked up some patterns, but they don't look good enough yet.
TFIDF Bag of Words
TF-IDF weights words by how informative they are, so we no longer treat every word equally.
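As a toy illustration of the weighting idea (not part of the original notebook), a word that appears in every document gets a low weight, while a rarer word gets a higher one:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# "fire" appears in every toy document, so it carries little information;
# "earthquake" appears in only one, so TF-IDF weights it more heavily there.
toy_docs = ["forest fire near the lake",
            "fire drill at the office",
            "earthquake felt across the city fire dept responding"]
vec = TfidfVectorizer()
toy_tfidf = vec.fit_transform(toy_docs)
print(dict(zip(vec.get_feature_names_out(), toy_tfidf.toarray()[2].round(2))))
```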
```python
def tfidf(data):
    tfidf_vectorizer = TfidfVectorizer()
    train = tfidf_vectorizer.fit_transform(data)
    return train, tfidf_vectorizer

X_train_tfidf, tfidf_vectorizer = tfidf(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

fig = plt.figure(figsize=(16, 16))
plot_LSA(X_train_tfidf, y_train)
plt.show()
```

It looks a tiny bit better.
```python
clf_tfidf = LogisticRegression(C=30.0, class_weight='balanced', solver='newton-cg',
                               multi_class='multinomial', n_jobs=-1, random_state=40)
clf_tfidf.fit(X_train_tfidf, y_train)

y_predicted_tfidf = clf_tfidf.predict(X_test_tfidf)

accuracy_tfidf, precision_tfidf, recall_tfidf, f1_tfidf = get_metrics(y_test, y_predicted_tfidf)
print("accuracy = %.3f, precision = %.3f, recall = %.3f, f1 = %.3f" % (accuracy_tfidf, precision_tfidf, recall_tfidf, f1_tfidf))

cm2 = confusion_matrix(y_test, y_predicted_tfidf)
fig = plt.figure(figsize=(10, 10))
plot = plot_confusion_matrix(cm2, classes=['Irrelevant', 'Disaster', 'Unsure'], normalize=False, title='Confusion matrix')
plt.show()
print("TFIDF confusion matrix")
print(cm2)
print("BoW confusion matrix")
print(cm)
```

```
accuracy = 0.762, precision = 0.760, recall = 0.762, f1 = 0.761
TFIDF confusion matrix
[[974 249   1]
 [261 684   0]
 [  3   4   0]]
BoW confusion matrix
[[970 251   3]
 [274 670   1]
 [  3   4   0]]
```

Interpreting the words
```python
importance_tfidf = get_most_important_features(tfidf_vectorizer, clf_tfidf, 10)

top_scores = [a[0] for a in importance_tfidf[1]['tops']]
top_words = [a[1] for a in importance_tfidf[1]['tops']]
bottom_scores = [a[0] for a in importance_tfidf[1]['bottom']]
bottom_words = [a[1] for a in importance_tfidf[1]['bottom']]

plot_important_words(top_scores, top_words, bottom_scores, bottom_words, "Most important words for relevance")
```

These words look a bit more meaningful than before.
The problem
So far every word is represented only by its frequency. What if the vocabulary shifts in a new test setting? Some words express roughly the same meaning but look completely different (for example, "good" and "positive"), and a frequency-based model has trouble capturing that.
word2vec
One-sentence summary: it's seriously good. More concretely, word2vec maps every word to a dense vector so that words with similar meanings end up close together, which is exactly what the frequency-based features above miss.
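A small sketch of why this helps (not in the original), using the same pretrained GoogleNews vectors that the next cell loads; the query words here are just illustrative, and the load line is repeated so the snippet is self-contained:

```python
import gensim

# Load the pretrained 300-d GoogleNews vectors (same file as the next cell).
word2vec_path = "GoogleNews-vectors-negative300.bin"
word2vec = gensim.models.KeyedVectors.load_word2vec_format(word2vec_path, binary=True)

# Words with related meanings get high cosine similarity even though
# they share no characters -- something bag-of-words cannot see.
print(word2vec.similarity("good", "positive"))
print(word2vec.most_similar("earthquake", topn=3))
```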
```python
import gensim

word2vec_path = "GoogleNews-vectors-negative300.bin"
word2vec = gensim.models.KeyedVectors.load_word2vec_format(word2vec_path, binary=True)

def get_average_word2vec(tokens_list, vector, generate_missing=False, k=300):
    # Represent a message as the average of its word vectors.
    if len(tokens_list) < 1:
        return np.zeros(k)
    if generate_missing:
        vectorized = [vector[word] if word in vector else np.random.rand(k) for word in tokens_list]
    else:
        vectorized = [vector[word] if word in vector else np.zeros(k) for word in tokens_list]
    length = len(vectorized)
    summed = np.sum(vectorized, axis=0)
    averaged = np.divide(summed, length)
    return averaged

def get_word2vec_embeddings(vectors, clean_questions, generate_missing=False):
    embeddings = clean_questions['tokens'].apply(lambda x: get_average_word2vec(x, vectors, generate_missing=generate_missing))
    return list(embeddings)

embeddings = get_word2vec_embeddings(word2vec, clean_questions)
X_train_word2vec, X_test_word2vec, y_train_word2vec, y_test_word2vec = train_test_split(embeddings, list_labels, test_size=0.2, random_state=40)

X_train_word2vec[0]
```

```
array([ 0.05639939,  0.02053833,  0.07635207,  0.06914993, -0.01007262,
       -0.04978943,  0.02546038, -0.06045968,  0.04264323,  0.02419935,
       ...
       -0.03136626, -0.04338237, -0.08235096,  0.02723331, -0.01401483])
```

(Each message is now a 300-dimensional averaged word2vec vector; the full array dump is truncated here.)

```python
fig = plt.figure(figsize=(16, 16))
plot_LSA(embeddings, list_labels)
plt.show()
```

This looks much better!
```python
clf_w2v = LogisticRegression(C=30.0, class_weight='balanced', solver='newton-cg',
                             multi_class='multinomial', random_state=40)
clf_w2v.fit(X_train_word2vec, y_train_word2vec)
y_predicted_word2vec = clf_w2v.predict(X_test_word2vec)

accuracy_word2vec, precision_word2vec, recall_word2vec, f1_word2vec = get_metrics(y_test_word2vec, y_predicted_word2vec)
print("accuracy = %.3f, precision = %.3f, recall = %.3f, f1 = %.3f" % (accuracy_word2vec, precision_word2vec, recall_word2vec, f1_word2vec))

cm_w2v = confusion_matrix(y_test_word2vec, y_predicted_word2vec)
fig = plt.figure(figsize=(10, 10))
# plot the word2vec confusion matrix (the original cell plotted `cm`, the BoW matrix, by mistake)
plot = plot_confusion_matrix(cm_w2v, classes=['Irrelevant', 'Disaster', 'Unsure'], normalize=False, title='Confusion matrix')
plt.show()
print("Word2Vec confusion matrix")
print(cm_w2v)
print("TFIDF confusion matrix")
print(cm2)
print("BoW confusion matrix")
print(cm)
```

```
accuracy = 0.777, precision = 0.776, recall = 0.777, f1 = 0.777
Word2Vec confusion matrix
[[980 242   2]
 [232 711   2]
 [  2   5   0]]
TFIDF confusion matrix
[[974 249   1]
 [261 684   0]
 [  3   4   0]]
BoW confusion matrix
[[970 251   3]
 [274 670   1]
 [  3   4   0]]
```

This is the best result so far.
Deep learning for NLP (CNN and RNN)
```python
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

EMBEDDING_DIM = 300
MAX_SEQUENCE_LENGTH = 35
VOCAB_SIZE = len(VOCAB)
VALIDATION_SPLIT = .2

tokenizer = Tokenizer(num_words=VOCAB_SIZE)
tokenizer.fit_on_texts(clean_questions["text"].tolist())
sequences = tokenizer.texts_to_sequences(clean_questions["text"].tolist())

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

cnn_data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
labels = to_categorical(np.asarray(clean_questions["class_label"]))

indices = np.arange(cnn_data.shape[0])
np.random.shuffle(indices)
cnn_data = cnn_data[indices]
labels = labels[indices]
num_validation_samples = int(VALIDATION_SPLIT * cnn_data.shape[0])

# Seed the embedding layer with pretrained word2vec vectors where available.
embedding_weights = np.zeros((len(word_index)+1, EMBEDDING_DIM))
for word, index in word_index.items():
    embedding_weights[index, :] = word2vec[word] if word in word2vec else np.random.rand(EMBEDDING_DIM)
print(embedding_weights.shape)
```

```
Found 19098 unique tokens.
(19099, 300)
```

Now, we will define a simple Convolutional Neural Network.
```python
from keras.layers import Dense, Input, Flatten, Dropout, concatenate
from keras.layers import Conv1D, MaxPooling1D, Embedding
from keras.layers import LSTM, Bidirectional
from keras.models import Model

def ConvNet(embeddings, max_sequence_length, num_words, embedding_dim, labels_index,
            trainable=False, extra_conv=True):

    embedding_layer = Embedding(num_words,
                                embedding_dim,
                                weights=[embeddings],
                                input_length=max_sequence_length,
                                trainable=trainable)

    sequence_input = Input(shape=(max_sequence_length,), dtype='int32')
    embedded_sequences = embedding_layer(sequence_input)

    # Yoon Kim model (https://arxiv.org/abs/1408.5882)
    convs = []
    filter_sizes = [3, 4, 5]

    for filter_size in filter_sizes:
        l_conv = Conv1D(filters=128, kernel_size=filter_size, activation='relu')(embedded_sequences)
        l_pool = MaxPooling1D(pool_size=3)(l_conv)
        convs.append(l_pool)

    # Keras 2 equivalent of the removed Merge(mode='concat', concat_axis=1) layer used originally
    l_merge = concatenate(convs, axis=1)

    # add a 1D convnet with global maxpooling, instead of Yoon Kim model
    conv = Conv1D(filters=128, kernel_size=3, activation='relu')(embedded_sequences)
    pool = MaxPooling1D(pool_size=3)(conv)

    if extra_conv == True:
        x = Dropout(0.5)(l_merge)
    else:
        # Original Yoon Kim model
        x = Dropout(0.5)(pool)
    x = Flatten()(x)
    x = Dense(128, activation='relu')(x)
    # x = Dropout(0.5)(x)

    preds = Dense(labels_index, activation='softmax')(x)

    model = Model(sequence_input, preds)
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['acc'])

    return model
```

Training the network
```python
x_train = cnn_data[:-num_validation_samples]
y_train = labels[:-num_validation_samples]
x_val = cnn_data[-num_validation_samples:]
y_val = labels[-num_validation_samples:]

model = ConvNet(embedding_weights, MAX_SEQUENCE_LENGTH, len(word_index)+1, EMBEDDING_DIM,
                len(list(clean_questions["class_label"].unique())), False)

model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=3, batch_size=128)
```

```
Train on 8701 samples, validate on 2175 samples
Epoch 1/3
8701/8701 [==============================] - 11s - loss: 0.5964 - acc: 0.7067 - val_loss: 0.4970 - val_acc: 0.7848
Epoch 2/3
8701/8701 [==============================] - 11s - loss: 0.4434 - acc: 0.8019 - val_loss: 0.4722 - val_acc: 0.8005
Epoch 3/3
8701/8701 [==============================] - 11s - loss: 0.3968 - acc: 0.8283 - val_loss: 0.4985 - val_acc: 0.7880
<keras.callbacks.History at 0x12237bc88>
```

Summary

Richer features steadily improved the classifier: bag-of-words counts reached 0.754 accuracy, TF-IDF 0.762, averaged word2vec embeddings 0.777, and the simple CNN reached roughly 0.79-0.80 validation accuracy.
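To round things off, here is a minimal sketch (not in the original) of how the trained CNN could score a new message, reusing the `tokenizer` and `MAX_SEQUENCE_LENGTH` defined above; the example text is made up:

```python
# Classify a new, unseen message with the trained CNN.
new_messages = ["forest fire spreading fast near the highway, stay away"]

# Same preprocessing as the training data: integer sequences padded to a fixed length.
new_sequences = tokenizer.texts_to_sequences(new_messages)
new_data = pad_sequences(new_sequences, maxlen=MAX_SEQUENCE_LENGTH)

# One probability per class (Irrelevant / Disaster / Unsure).
probs = model.predict(new_data)
print(probs)
print("predicted class:", np.argmax(probs, axis=1))
```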