A Text-Mining Model for Assessing the Quality of Enterprise Hazard Self-Inspection Reports
1. The competition
Competition page: https://www.sodic.com.cn/competitions/900010
Background: Enterprises self-report workplace-safety hazards, which matters for eliminating risks while an accident is still in the bud. In practice, however, reports are often filled in carelessly, with fabricated or false hazard descriptions, which makes regulatory oversight harder. Analyzing the reported hazard text with data-driven methods to find enterprises that do not genuinely fulfill their primary safety responsibility, and pushing them to the regulator for targeted enforcement, can make supervision more effective and strengthen enterprises' sense of safety responsibility.
Task: Given the hazard records filled in by enterprises, contestants must use automated methods to identify whether a record is a fabricated or false report.
2. Approach
Abstracted into a modeling problem, this is essentially a binary text-classification task. Below it is tackled with a baseline model, classic NLP models, deep NLP models, and a few further ideas for improvement.
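Before modeling, it helps to peek at the data layout. A minimal sketch, assuming the file paths and column names used by the scripts below (id, level_1 to level_4 for the hazard-category hierarchy, content for the free-text description, and a 0/1 label in the training file):

```python
import pandas as pd

# Paths as used by the scripts below; adjust to your local layout.
train = pd.read_csv('../data/train.csv')
test = pd.read_csv('../data/test.csv')

print(train.columns.tolist())                   # expected: id, level_1..level_4, content, label
print(train['label'].value_counts())            # class balance of the 0/1 target
print(train['content'].str.len().describe())    # length distribution of the hazard descriptions
```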
2.1 Baseline model (ALBERT)
The baseline fine-tunes a small Chinese ALBERT (albert_small_zh_google) with a softmax head on the concatenation of the four hazard-category levels and the free-text content.

```python
# Baseline: ALBERT classifier (bert4keras)
# encoding: utf-8
import random

import numpy as np
import pandas as pd
from bert4keras.backend import keras, set_gelu
from bert4keras.tokenizers import Tokenizer
from bert4keras.models import build_transformer_model
from bert4keras.optimizers import Adam
from bert4keras.snippets import sequence_padding, DataGenerator
from keras.layers import Lambda, Dense

# Hyperparameters and pretrained-model paths
set_gelu('tanh')
num_classes = 2
maxlen = 128
batch_size = 32
config_path = '../model/albert_small_zh_google/albert_config_small_google.json'
checkpoint_path = '../model/albert_small_zh_google/albert_model.ckpt'
dict_path = '../model/albert_small_zh_google/vocab.txt'

# Tokenizer
tokenizer = Tokenizer(dict_path, do_lower_case=True)

# Model: ALBERT encoder -> [CLS] vector -> softmax
bert = build_transformer_model(
    config_path=config_path,
    checkpoint_path=checkpoint_path,
    model='albert',
    return_keras_model=False,
)
output = Lambda(lambda x: x[:, 0], name='CLS-token')(bert.model.output)
output = Dense(
    units=num_classes,
    activation='softmax',
    kernel_initializer=bert.initializer,
)(output)
model = keras.models.Model(bert.model.input, output)
model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer=Adam(1e-5),
    metrics=['accuracy'],
)

# Load the data; each record becomes "level_1 \t level_2 \t level_3 \t level_4 \t content"
df_train_data = pd.read_csv('../data/train.csv')
df_test_data = pd.read_csv('../data/test.csv')

train_data, valid_data, test_data = [], [], []
valid_rate = 0.3
for row_i, data in df_train_data.iterrows():
    id, level_1, level_2, level_3, level_4, content, label = data
    text = '\t'.join([str(level_1), str(level_2), str(level_3), str(level_4), str(content)])
    if random.random() > valid_rate:
        train_data.append((id, text, int(label)))
    else:
        valid_data.append((id, text, int(label)))

for row_i, data in df_test_data.iterrows():
    id, level_1, level_2, level_3, level_4, content = data
    text = '\t'.join([str(level_1), str(level_2), str(level_3), str(level_4), str(content)])
    test_data.append((id, text, 0))

# Batch generator
class data_generator(DataGenerator):
    def __iter__(self, random=False):
        batch_token_ids, batch_segment_ids, batch_labels = [], [], []
        for is_end, (id, text, label) in self.sample(random):
            token_ids, segment_ids = tokenizer.encode(text, maxlen=maxlen)
            batch_token_ids.append(token_ids)
            batch_segment_ids.append(segment_ids)
            batch_labels.append([label])
            if len(batch_token_ids) == self.batch_size or is_end:
                batch_token_ids = sequence_padding(batch_token_ids)
                batch_segment_ids = sequence_padding(batch_segment_ids)
                batch_labels = sequence_padding(batch_labels)
                yield [batch_token_ids, batch_segment_ids], batch_labels
                batch_token_ids, batch_segment_ids, batch_labels = [], [], []

train_generator = data_generator(train_data, batch_size)
valid_generator = data_generator(valid_data, batch_size)

# Accuracy over a generator (this helper was missing from the original snippet)
def evaluate(data):
    total, right = 0., 0.
    for x_true, y_true in data:
        y_pred = model.predict(x_true).argmax(axis=1)
        y_true = y_true[:, 0]
        total += len(y_true)
        right += (y_true == y_pred).sum()
    return right / total

# Keep the weights with the best validation accuracy
class Evaluator(keras.callbacks.Callback):
    def __init__(self):
        self.best_val_acc = 0.

    def on_epoch_end(self, epoch, logs=None):
        val_acc = evaluate(valid_generator)
        if val_acc > self.best_val_acc:
            self.best_val_acc = val_acc
            model.save_weights('best_model.weights')
        print(u'val_acc: %.5f, best_val_acc: %.5f\n' % (val_acc, self.best_val_acc))

# Train the model
evaluator = Evaluator()
model.fit(
    train_generator.forfit(),
    steps_per_epoch=len(train_generator),
    epochs=10,
    callbacks=[evaluator],
)

# Reload the best weights and report validation accuracy
model.load_weights('best_model.weights')
print(u'final val acc: %.5f\n' % (evaluate(valid_generator)))

# Predict on the test set
def data_pred(test_data):
    id_ids, y_pred_ids = [], []
    for id, text, label in test_data:
        token_ids, segment_ids = tokenizer.encode(text, maxlen=maxlen)
        token_ids = sequence_padding([token_ids])
        segment_ids = sequence_padding([segment_ids])
        y_pred = int(model.predict([token_ids, segment_ids]).argmax(axis=1)[0])
        id_ids.append(id)
        y_pred_ids.append(y_pred)
    return id_ids, y_pred_ids

# Collect predictions into a submission frame and preview it
id_ids, y_pred_ids = data_pred(test_data)
df_save = pd.DataFrame()
df_save['id'] = id_ids
df_save['label'] = y_pred_ids

df_save.head()
#    id  label
# 0   0      0
# 1   1      0
# 2   2      1
# 3   3      0
# 4   4      0
```
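The baseline only tracks accuracy, while the classic-model scripts below evaluate with F1. A small sketch of computing validation F1 for the ALBERT baseline, assuming scikit-learn is available and reusing `valid_data`, `tokenizer`, and `model` from the code above:

```python
from sklearn.metrics import f1_score

def valid_f1(valid_data):
    """F1 on the held-out split, predicting one example at a time for simplicity."""
    y_true, y_pred = [], []
    for id, text, label in valid_data:
        token_ids, segment_ids = tokenizer.encode(text, maxlen=maxlen)
        token_ids = sequence_padding([token_ids])
        segment_ids = sequence_padding([segment_ids])
        y_pred.append(int(model.predict([token_ids, segment_ids]).argmax(axis=1)[0]))
        y_true.append(label)
    return f1_score(y_true, y_pred)

print('valid F1: %.5f' % valid_f1(valid_data))
```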
""" tf-idf """import pandas as pd import numpy as np from sklearn.feature_extraction.text import TfidfVectorizer from sklearn.model_selection import StratifiedKFold from sklearn.metrics import f1_score import lightgbm as lgbbase_dir = "../" train = pd.read_csv(base_dir + "train.csv") test = pd.read_csv(base_dir + "test.csv") results = pd.read_csv(base_dir + "results.csv")# 數(shù)據(jù)去重 train = train.drop_duplicates(['level_1', 'level_2', 'level_3', 'level_4', 'content', 'label'])train['text'] = (train['content']).map(lambda x:' '.join(list(str(x)))) test['text'] = (test['content']).map(lambda x:' '.join(list(str(x))))vectorizer = TfidfVectorizer(analyzer='char') train_X = vectorizer.fit_transform(train['text']).toarray() test_X = vectorizer.transform(test['text']).toarray() train_y = train['label'].astype(int).values# 參數(shù) params = {'task':'train','boosting_type':'gbdt','num_leaves': 31,'objective': 'binary', 'learning_rate': 0.05, 'bagging_freq': 2, 'max_bin':256,'num_threads': 32, # 'metric':['binary_logloss','binary_error']} skf = StratifiedKFold(n_splits=5)for index,(train_index, test_index) in enumerate(skf.split(train_X, train_y)):X_train, X_test = train_X[train_index], train_X[test_index]y_train, y_test = train_y[train_index], train_y[test_index]lgb_train = lgb.Dataset(X_train, y_train)lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)gbm = lgb.train(params,lgb_train,num_boost_round=1000,valid_sets=lgb_eval,early_stopping_rounds=10)y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)pred = gbm.predict(test_X, num_iteration=gbm.best_iteration)if index == 0:pred_y_check, true_y_check = list(y_pred), list(y_test)pred_out=predelse:pred_y_check += list(y_pred)true_y_check += list(y_test)pred_out += pred#驗(yàn)證for i in range(10):pred = [int(x) for x in np.where(np.array(pred_y_check) >= i/10.0,1,0)]scores = f1_score(true_y_check,pred)print(i, scores) """ n-gram模型 """# encoding='utf-8'import pandas as pd import numpy as np from sklearn.feature_extraction.text import CountVectorizer from sklearn.model_selection import StratifiedKFold from sklearn.metrics import f1_score import lightgbm as lgb# 讀取數(shù)據(jù)base_dir = "../" train = pd.read_csv(base_dir + "train.csv") test = pd.read_csv(base_dir + "test.csv") results = pd.read_csv(base_dir + "results.csv")train = train.drop_duplicates(['level_1', 'level_2', 'level_3', 'level_4', 'content', 'label'])# 構(gòu)建特征 train['text'] = (train['content']).map(lambda x:' '.join(list(str(x)))) test['text'] = (test['content']).map(lambda x:' '.join(list(str(x))))vectorizer = CountVectorizer(analyzer='char', ngram_range=(1, 3), stop_words=[])train_X = vectorizer.fit_transform(train['text']).toarray() test_X = vectorizer.transform(test['text']).toarray()train_y = train['label'].astype(int).values# 交叉驗(yàn)證,訓(xùn)練模型params = {'task':'train','boosting_type':'gbdt','num_leaves': 31,'objective': 'binary', 'learning_rate': 0.05, 'bagging_freq': 2, 'max_bin':256,'num_threads': 32, # 'metric':['binary_logloss','binary_error']} skf = StratifiedKFold(n_splits=5)for index,(train_index, test_index) in enumerate(skf.split(train_X, train_y)):X_train, X_test = train_X[train_index], train_X[test_index]y_train, y_test = train_y[train_index], train_y[test_index]lgb_train = lgb.Dataset(X_train, y_train)lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)gbm = lgb.train(params,lgb_train,num_boost_round=1000,valid_sets=lgb_eval,early_stopping_rounds=10)y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)pred = gbm.predict(test_X, 
num_iteration=gbm.best_iteration)if index == 0:pred_y_check, true_y_check = list(y_pred), list(y_test)pred_out=predelse:pred_y_check += list(y_pred)true_y_check += list(y_test)pred_out += pred# 驗(yàn)證 for i in range(10):pred = [int(x) for x in np.where(np.array(pred_y_check) >= i/10.0,1,0)]scores = f1_score(true_y_check,pred)print(i, scores) """word2vec"""import pandas as pd import numpy as np import jieba import lightgbm as lgbfrom sklearn.model_selection import StratifiedKFold from sklearn.metrics import f1_score from gensim.models import Word2Vec# 讀取數(shù)據(jù) base_dir = "../" train = pd.read_csv(base_dir + "train.csv") test = pd.read_csv(base_dir + "test.csv") results = pd.read_csv(base_dir + "results.csv")# 訓(xùn)練集去重 train = train.drop_duplicates(['level_1', 'level_2', 'level_3', 'level_4', 'content', 'label'])# 構(gòu)建特征,使用word2vec train['text'] = (train['content']).map(lambda x:' '.join(jieba.cut(str(x)))) test['text'] = (test['content']).map(lambda x:' '.join(jieba.cut(str(x))))model_word = Word2Vec(train['text'].values.tolist(), size=100, window=5, min_count=1, workers=4)def get_vec(word_list, model):init = np.array([0.0]*100)index = 0for word in word_list:if word in model.wv: init += np.array(model.wv[word])index += 1if index == 0:return initreturn list(init / index)# 向量取平均值 train['vec'] = train['text'].map(lambda x: get_vec(x, model_word)) test['vec'] = test['text'].map(lambda x: get_vec(x, model_word))train_X = np.array(train['vec'].values.tolist()) test_X = np.array(test['vec'].values.tolist()) train_y = train['label'].astype(int).values# 交叉驗(yàn)證params = {'task':'train','boosting_type':'gbdt','num_leaves': 31,'objective': 'binary', 'learning_rate': 0.05, 'bagging_freq': 2, 'max_bin':256,'num_threads': 32, # 'metric':['binary_logloss','binary_error']} skf = StratifiedKFold(n_splits=5)for index,(train_index, test_index) in enumerate(skf.split(train_X, train_y)):X_train, X_test = train_X[train_index], train_X[test_index]y_train, y_test = train_y[train_index], train_y[test_index]lgb_train = lgb.Dataset(X_train, y_train)lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)gbm = lgb.train(params,lgb_train,num_boost_round=1000,valid_sets=lgb_eval,early_stopping_rounds=10)y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)pred = gbm.predict(test_X, num_iteration=gbm.best_iteration)if index == 0:pred_y_check, true_y_check = list(y_pred), list(y_test)pred_out=predelse:pred_y_check += list(y_pred)true_y_check += list(y_test)pred_out += pred# 驗(yàn)證for i in range(10):pred = [int(x) for x in np.where(np.array(pred_y_check) >= i/10.0,1,0)]scores = f1_score(true_y_check,pred)print(i/10.0, scores)三、NLP深度模型
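The cross-validation loops above accumulate test-set probabilities in `pred_out` but never write a submission file. A minimal sketch of turning them into labels, assuming 5 folds, an illustrative threshold taken from the F1 sweep, and that the test file carries an `id` column as in the baseline loader:

```python
n_splits = 5
threshold = 0.5  # illustrative; pick the best value from the F1 sweep above

test_prob = np.array(pred_out) / n_splits          # average the per-fold probabilities
test_label = (test_prob >= threshold).astype(int)  # binarize at the chosen threshold

df_save = pd.DataFrame({'id': test['id'], 'label': test_label})
df_save.to_csv('result_lgb.csv', index=False)
```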
"""TextCNN"""import random import numpy as np import pandas as pd from bert4keras.backend import keras, set_gelu from bert4keras.tokenizers import Tokenizer from bert4keras.models import build_transformer_model from bert4keras.optimizers import Adam, extend_with_piecewise_linear_lr from bert4keras.snippets import sequence_padding, DataGenerator from bert4keras.snippets import open from keras.layers import * import tensorflow as tfset_gelu('tanh') # 切換gelu版本num_classes = 2 maxlen = 128 batch_size = 32 config_path = '../model/albert_small_zh_google/albert_config_small_google.json' checkpoint_path = '../model/albert_small_zh_google/albert_model.ckpt' dict_path = '../model/albert_small_zh_google/vocab.txt'# 建立分詞器 tokenizer = Tokenizer(dict_path, do_lower_case=True)# 加載bert模型# 加載預(yù)訓(xùn)練模型 bert = build_transformer_model(config_path=config_path,checkpoint_path=checkpoint_path,model='albert',return_keras_model=False, )# keras輔助函數(shù) expand_dims = Lambda(lambda X: tf.expand_dims(X,axis=-1)) max_pool = Lambda(lambda X: tf.squeeze(tf.reduce_max(X,axis=1),axis=1)) concat = Lambda(lambda X: tf.concat(X, axis=-1))# 獲取bert的char embedding cnn_input = expand_dims(bert.layers['Embedding-Token'].output)# 定義cnn網(wǎng)絡(luò) filters = 2 sizes = [3,5,7,9] output = [] for size_i in sizes:X = Conv2D(filters=2,kernel_size=(size_i, 128),activation='relu',)(cnn_input)X = max_pool(X)output.append(X)cnn_output = concat(output)# 分類全連接 output = Dense(units=num_classes,activation='softmax' )(cnn_output)# 定義模型輸入輸出 model = keras.models.Model(bert.model.input[0], output)# 編譯模型model.compile(loss='sparse_categorical_crossentropy',optimizer=Adam(1e-5), metrics=['accuracy'], )# 加載數(shù)據(jù)def load_data(valid_rate=0.3):train_file = "../data/train.csv"test_file = "../data/test.csv"df_train_data = pd.read_csv("../data/train.csv").\drop_duplicates(['level_1', 'level_2', 'level_3', 'level_4', 'content', 'label'])df_test_data = pd.read_csv("../data/test.csv")train_data, valid_data, test_data = [], [], []for row_i, data in df_train_data.iterrows():id, level_1, level_2, level_3, level_4, content, label = dataid, text, label = id, str(level_1) + '\t' + str(level_2) + '\t' + \str(level_3) + '\t' + str(level_4) + '\t' + str(content), labelif random.random() > valid_rate:train_data.append( (id, text, int(label)) )else:valid_data.append( (id, text, int(label)) )for row_i, data in df_test_data.iterrows():id, level_1, level_2, level_3, level_4, content = dataid, text, label = id, str(level_1) + '\t' + str(level_2) + '\t' + \str(level_3) + '\t' + str(level_4) + '\t' + str(content), 0test_data.append( (id, text, int(label)) )return train_data, valid_data, test_datatrain_data, valid_data, test_data = load_data(valid_rate=0.3)# 迭代器生成class data_generator(DataGenerator):def __iter__(self, random=False):batch_token_ids, batch_labels = [], []for is_end, (id, text, label) in self.sample(random):token_ids, segment_ids = tokenizer.encode(text, maxlen=maxlen)batch_token_ids.append(token_ids)batch_labels.append([label])if len(batch_token_ids) == self.batch_size or is_end:batch_token_ids = sequence_padding(batch_token_ids)batch_labels = sequence_padding(batch_labels)yield [batch_token_ids], batch_labelsbatch_token_ids, batch_labels = [], []train_generator = data_generator(train_data, batch_size) valid_generator = data_generator(valid_data, batch_size)# 訓(xùn)練驗(yàn)證和預(yù)測(cè)def evaluate(data):total, right = 0., 0.for x_true, y_true in data:y_pred = model.predict(x_true).argmax(axis=1)y_true = y_true[:, 0]total += len(y_true)right += (y_true == 
y_pred).sum()return right / totalclass Evaluator(keras.callbacks.Callback):def __init__(self):self.best_val_acc = 0.def on_epoch_end(self, epoch, logs=None):val_acc = evaluate(valid_generator)if val_acc > self.best_val_acc:self.best_val_acc = val_accmodel.save_weights('best_model.weights')test_acc = evaluate(valid_generator)print(u'val_acc: %.5f, best_val_acc: %.5f, test_acc: %.5f\n' %(val_acc, self.best_val_acc, test_acc))def data_pred(test_data):id_ids, y_pred_ids = [], []for id, text, label in test_data:token_ids, segment_ids = tokenizer.encode(text, maxlen=maxlen)token_ids = sequence_padding([token_ids])y_pred = int(model.predict([token_ids]).argmax(axis=1)[0])id_ids.append(id)y_pred_ids.append(y_pred)return id_ids, y_pred_ids# 訓(xùn)練和驗(yàn)證模型evaluator = Evaluator() model.fit(train_generator.forfit(),steps_per_epoch=len(train_generator),epochs=1,callbacks=[evaluator])# 加載最好的模型model.load_weights('best_model.weights')# 驗(yàn)證集結(jié)果 print(u'final test acc: %05f\n' % (evaluate(valid_generator)))# 訓(xùn)練集結(jié)果 print(u'final test acc: %05f\n' % (evaluate(train_generator)))# 模型預(yù)測(cè)保存結(jié)果id_ids, y_pred_ids = data_pred(test_data) df_save = pd.DataFrame() df_save['id'] = id_ids df_save['label'] = y_pred_idsdf_save.to_csv('result.csv') """Bi-LSTM"""import random import numpy as np import pandas as pd from bert4keras.backend import keras, set_gelu from bert4keras.tokenizers import Tokenizer from bert4keras.models import build_transformer_model from bert4keras.optimizers import Adam, extend_with_piecewise_linear_lr from bert4keras.snippets import sequence_padding, DataGenerator from bert4keras.snippets import open from keras.layers import * import tensorflow as tfset_gelu('tanh') # 切換gelu版本 num_classes = 2 maxlen = 128 batch_size = 32 config_path = '../model/albert_small_zh_google/albert_config_small_google.json' checkpoint_path = '../model/albert_small_zh_google/albert_model.ckpt' dict_path = '../model/albert_small_zh_google/vocab.txt'# 建立分詞器 tokenizer = Tokenizer(dict_path, do_lower_case=True) # 加載預(yù)訓(xùn)練模型 bert = build_transformer_model(config_path=config_path,checkpoint_path=checkpoint_path,model='albert',return_keras_model=False, ) lstm_input = bert.layers['Embedding-Token'].output X = Bidirectional(LSTM(128, return_sequences=True))(lstm_input) lstm_output = Bidirectional(LSTM(128))(X)output = Dense(units=num_classes,activation='softmax' )(lstm_output)model = keras.models.Model(bert.model.input[0], output)model.compile(loss='sparse_categorical_crossentropy',optimizer=Adam(1e-5), metrics=['accuracy'], )def load_data(valid_rate=0.3):train_file = "../data/train.csv"test_file = "../data/test.csv"df_train_data = pd.read_csv("../data/train.csv").\drop_duplicates(['level_1', 'level_2', 'level_3', 'level_4', 'content', 'label'])df_test_data = pd.read_csv("../data/test.csv")train_data, valid_data, test_data = [], [], []for row_i, data in df_train_data.iterrows():id, level_1, level_2, level_3, level_4, content, label = dataid, text, label = id, str(level_1) + '\t' + str(level_2) + '\t' + \str(level_3) + '\t' + str(level_4) + '\t' + str(content), labelif random.random() > valid_rate:train_data.append( (id, text, int(label)) )else:valid_data.append( (id, text, int(label)) )for row_i, data in df_test_data.iterrows():id, level_1, level_2, level_3, level_4, content = dataid, text, label = id, str(level_1) + '\t' + str(level_2) + '\t' + \str(level_3) + '\t' + str(level_4) + '\t' + str(content), 0test_data.append( (id, text, int(label)) )return train_data, valid_data, test_datatrain_data, 
valid_data, test_data = load_data(valid_rate=0.3)class data_generator(DataGenerator):def __iter__(self, random=False):batch_token_ids, batch_labels = [], []for is_end, (id, text, label) in self.sample(random):token_ids, segment_ids = tokenizer.encode(text, maxlen=maxlen)batch_token_ids.append(token_ids)batch_labels.append([label])if len(batch_token_ids) == self.batch_size or is_end:batch_token_ids = sequence_padding(batch_token_ids)batch_labels = sequence_padding(batch_labels)yield [batch_token_ids], batch_labelsbatch_token_ids, batch_labels = [], []train_generator = data_generator(train_data, batch_size) valid_generator = data_generator(valid_data, batch_size)def evaluate(data):total, right = 0., 0.for x_true, y_true in data:y_pred = model.predict(x_true).argmax(axis=1)y_true = y_true[:, 0]total += len(y_true)right += (y_true == y_pred).sum()return right / totalclass Evaluator(keras.callbacks.Callback):def __init__(self):self.best_val_acc = 0.def on_epoch_end(self, epoch, logs=None):val_acc = evaluate(valid_generator)if val_acc > self.best_val_acc:self.best_val_acc = val_accmodel.save_weights('best_model.weights')test_acc = evaluate(valid_generator)print(u'val_acc: %.5f, best_val_acc: %.5f, test_acc: %.5f\n' %(val_acc, self.best_val_acc, test_acc))def data_pred(test_data):id_ids, y_pred_ids = [], []for id, text, label in test_data:token_ids, segment_ids = tokenizer.encode(text, maxlen=maxlen)token_ids = sequence_padding([token_ids])y_pred = int(model.predict([token_ids]).argmax(axis=1)[0])id_ids.append(id)y_pred_ids.append(y_pred)return id_ids, y_pred_idsevaluator = Evaluator()model.fit(train_generator.forfit(),steps_per_epoch=len(train_generator),epochs=1,callbacks=[evaluator])model.load_weights('best_model.weights')print(u'final test acc: %05f\n' % (evaluate(valid_generator))) print(u'final test acc: %05f\n' % (evaluate(train_generator)))id_ids, y_pred_ids = data_pred(test_data) df_save = pd.DataFrame() df_save['id'] = id_ids df_save['label'] = y_pred_idsdf_save.to_csv('result.csv')最終結(jié)果: 開始的時(shí)候提交幾版,后來沒有時(shí)間優(yōu)化也就不了了之啦
Notes:
Related resources:
    1. PKU word-segmentation corpus (SIGHAN Bakeoff 2005): http://sighan.cs.uchicago.edu/bakeoff2005/
    2. Tencent AI Lab word embeddings: https://ai.tencent.com/ailab/nlp/zh/embedding.html
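The Tencent embeddings are listed as a resource but are not used in the scripts above. If one wanted to swap them in for the self-trained word2vec vectors in the classic-model pipeline, a rough sketch follows; the file name and the 200-dimension assumption depend on the downloaded release, not on anything in this post:

```python
import numpy as np
from gensim.models import KeyedVectors

# Hypothetical file name; use whichever release you downloaded from the link above.
wv = KeyedVectors.load_word2vec_format('Tencent_AILab_ChineseEmbedding.txt', binary=False)

def get_vec_pretrained(words, wv, dim=200):  # dim must match the release you use
    """Average the pretrained vectors of the in-vocabulary words."""
    vecs = [wv[w] for w in words if w in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# Drop-in replacement for get_vec() in the word2vec + LightGBM script above
train['vec'] = train['text'].map(lambda x: list(get_vec_pretrained(x.split(), wv)))
test['vec'] = test['text'].map(lambda x: list(get_vec_pretrained(x.split(), wv)))
```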