ECG Heartbeat Signal Multi-Class Prediction (Part 1)
Contents
- 1. Understanding the Competition
- 1.1 Competition Overview
- 1.2 Data Overview
- 1.3 Evaluation Metrics
- 1.4 Code Examples
- 1.4.1 Reading the Data
- 1.4.2 Computing Classification Metrics
- 2. Baseline
- 2.1 Importing Third-Party Packages
- 2.2 Reading the Data
- 2.3 Data Preprocessing
- 2.4 Preparing Training/Test Data
- 2.5 Model Training
- 2.6 Prediction Results
1. Understanding the Competition
Tip: this is the ECG heartbeat signal multi-class prediction challenge.
In June 2016, the General Office of the State Council issued the "Guiding Opinions on Promoting and Regulating the Application and Development of Health and Medical Big Data," which points out that the application of health and medical big data will bring profound changes to healthcare models and help improve the efficiency and quality of medical services.
The competition is set against the backdrop of electrocardiogram (ECG) data: contestants must predict heartbeat signal classes from ECG sensor data. The classes correspond to normal cases as well as cases affected by different arrhythmias and myocardial infarction, making this a multi-class classification problem. The competition is designed to introduce participants to applications of medical big data and to help newcomers practice and improve on their own.
Competition link: https://tianchi.aliyun.com/competition/entrance/531883/introduction
1.1 Competition Overview
The competition asks participants to build a model on the given dataset and predict the class of each heartbeat signal. The data, visible and downloadable after registration, come from ECG recordings on a certain platform. The full dataset contains over 200,000 records, consisting mainly of a single column of heartbeat signal sequences, where every sample is sampled at the same frequency and has the same length. To keep the competition fair, 100,000 records are drawn as the training set and 20,000 each as test sets A and B, and the heartbeat class (label) information is anonymized.
1.2 Data Overview
In general, the competition page provides a description of each column of the data (except anonymized features), explaining its nature. Understanding the columns helps with interpreting the data and with later analysis.
Tip: an anonymized feature is a feature column whose meaning has not been disclosed.
train.csv
- id: unique identifier assigned to each heartbeat signal
- heartbeat_signals: the heartbeat signal sequence (values separated by ",")
- label: heartbeat signal class (0, 1, 2, 3)
testA.csv
- id: unique identifier assigned to each heartbeat signal
- heartbeat_signals: the heartbeat signal sequence (values separated by ",")
1.3 Evaluation Metrics
Contestants submit predicted probabilities for the 4 heartbeat classes. Submissions are compared against the true heartbeat classes, and the score is the sum of absolute differences between the predicted probabilities and the true values.
The exact formula is as follows:
Suppose there are n cases. For one signal, if the true (one-hot) label is [y1, y2, y3, y4] and the model's predicted probabilities are [a1, a2, a3, a4], the evaluation metric abs-sum is
$$abs\text{-}sum = \sum_{j=1}^{n} \sum_{i=1}^{4} \left| y_i - a_i \right|$$
For example, if a heartbeat signal belongs to class 1, its one-hot encoding is [0, 1, 0, 0]. If the predicted probabilities are [0.1, 0.7, 0.1, 0.1], the abs-sum for this signal is
$$abs\text{-}sum = \left| 0.1-0 \right| + \left| 0.7-1 \right| + \left| 0.1-0 \right| + \left| 0.1-0 \right| = 0.6$$
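This worked example can be checked with a few lines of Python (a minimal sketch; the function name `abs_sum` is illustrative):

```python
import numpy as np

def abs_sum(y_true, y_pred):
    """Sum of absolute differences between a one-hot label and predicted probabilities."""
    return float(np.abs(np.array(y_pred) - np.array(y_true)).sum())

y_true = [0, 1, 0, 0]          # class 1, one-hot encoded
y_pred = [0.1, 0.7, 0.1, 0.1]  # predicted probabilities
print(round(abs_sum(y_true, y_pred), 6))  # → 0.6
```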
Common evaluation metrics for multi-class classification are listed below.
The metrics are computed exactly as in the binary case; the only difference is that recall, precision, accuracy, and the F1 score are computed per class.
1. Confusion Matrix
- (1) If an instance is positive and is predicted positive, it is a true positive (TP).
- (2) If an instance is positive but is predicted negative, it is a false negative (FN).
- (3) If an instance is negative but is predicted positive, it is a false positive (FP).
- (4) If an instance is negative and is predicted negative, it is a true negative (TN).
The first letter, T/F, indicates whether the prediction is correct; the second letter, P/N, indicates whether the prediction is positive or negative. For example, TP means the prediction is correct and positive, i.e., a positive instance predicted as positive.
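As a quick sanity check, the four counts can be tallied directly for a small binary example (the labels below are made up for illustration, not competition data):

```python
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

# count each of the four confusion-matrix cells
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
print(tp, fn, fp, tn)  # → 3 1 1 1
```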
2. Accuracy
Accuracy is a commonly used metric, but it is unsuitable for imbalanced data, and most medical datasets are imbalanced.
$$Accuracy = \frac{Correct}{Total} = \frac{TP + TN}{TP + TN + FP + FN}$$
3. Precision (abbreviated P)
Precision is defined with respect to the predictions: it is the fraction of samples predicted positive that are actually positive. Precision and accuracy may look similar, but they are two entirely different concepts: precision measures how accurate the positive predictions are, while accuracy measures overall prediction correctness across both positive and negative samples.
$$Precision = \frac{TP}{TP + FP}$$
4. Recall (abbreviated R)
Recall is defined with respect to the original samples: it is the fraction of actually positive samples that are predicted positive.
$$Recall = \frac{TP}{TP + FN}$$
Let's look at precision and recall through a simple example. Suppose there are 10 articles, 4 of which are the ones you want. Your model retrieves 5 articles, but only 3 of those 5 are ones you actually want.
The precision is then 3/5 = 60%: of the 5 articles retrieved, 3 are correct. The recall is 3/4 = 75%: of the 4 articles you needed, you found 3. Whether to prefer precision or recall as the evaluation metric depends on the problem at hand.
5. Macro Precision (macro-P)
Compute the precision for each class, then take the average:
$$macroP = \frac{1}{n} \sum_{i=1}^{n} P_i$$
6. Macro Recall (macro-R)
Compute the recall for each class, then take the average:
$$macroR = \frac{1}{n} \sum_{i=1}^{n} R_i$$
7. Macro F1 (macro-F1)
$$macroF1 = \frac{2 \times macroP \times macroR}{macroP + macroR}$$
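A small sketch of the macro formulas with made-up per-class values (the numbers are illustrative, not from the competition):

```python
# hypothetical per-class precision and recall for a 4-class problem
P = [0.8, 0.6, 0.7, 0.9]
R = [0.9, 0.5, 0.6, 0.8]

macro_p = sum(P) / len(P)                               # average per-class precision
macro_r = sum(R) / len(R)                               # average per-class recall
macro_f1 = 2 * macro_p * macro_r / (macro_p + macro_r)
print(round(macro_p, 4), round(macro_r, 4), round(macro_f1, 4))
```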
Unlike the macro versions above, the micro metrics first average the TP, FP, TN, and FN counts elementwise across the per-class confusion matrices, then apply the precision and recall formulas to those averages to get micro-P and micro-R, and finally compute micro-F1 from micro-P and micro-R.
8. Micro Precision (micro-P)
$$microP = \frac{\overline{TP}}{\overline{TP} + \overline{FP}}$$
9. Micro Recall (micro-R)
$$microR = \frac{\overline{TP}}{\overline{TP} + \overline{FN}}$$
10. Micro F1 (micro-F1)
$$microF1 = \frac{2 \times microP \times microR}{microP + microR}$$
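The micro versions follow the same pattern, starting from the averaged confusion-matrix counts (again, the counts below are hypothetical):

```python
# averaged TP/FP/FN across the per-class confusion matrices (hypothetical counts)
tp_bar, fp_bar, fn_bar = 30.0, 10.0, 20.0

micro_p = tp_bar / (tp_bar + fp_bar)                    # 30 / 40
micro_r = tp_bar / (tp_bar + fn_bar)                    # 30 / 50
micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r)
print(micro_p, micro_r, round(micro_f1, 4))  # → 0.75 0.6 0.6667
```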
1.4 Code Examples
This section gives examples of reading the data and computing the evaluation metrics.
1.4.1 Reading the Data
```python
import pandas as pd
import numpy as np

path = './data/'
train_data = pd.read_csv(path + 'train.csv')
test_data = pd.read_csv(path + 'testA.csv')
print('Train data shape:', train_data.shape)
print('TestA data shape:', test_data.shape)
```
Output:
```
Train data shape: (100000, 3)
TestA data shape: (20000, 2)
```
```python
train_data.head()
```
1.4.2 Computing Classification Metrics
An example of computing the multi-class evaluation metrics:
```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_test = [1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 6, 6, 0, 0, 0, 0]     # true labels
y_predict = [1, 1, 1, 3, 3, 2, 2, 3, 3, 3, 4, 3, 4, 3, 5, 1, 3, 6, 6, 1, 1, 0, 6]  # predicted labels

# accuracy
print("accuracy:", accuracy_score(y_test, y_predict))
# precision (macro and micro)
print("macro_precision:", precision_score(y_test, y_predict, average='macro'))
print("micro_precision:", precision_score(y_test, y_predict, average='micro'))
# recall (macro and micro)
print("macro_recall:", recall_score(y_test, y_predict, average='macro'))
print("micro_recall:", recall_score(y_test, y_predict, average='micro'))
# F1 (macro and micro)
print("macro_f1:", f1_score(y_test, y_predict, average='macro'))
print("micro_f1:", f1_score(y_test, y_predict, average='micro'))
```
Output:
```
accuracy: 0.5217391304347826
macro_precision: 0.7023809523809524
micro_precision: 0.5217391304347826
macro_recall: 0.5261904761904762
micro_recall: 0.5217391304347826
macro_f1: 0.5441558441558441
micro_f1: 0.5217391304347826
```
```python
def abs_sum(y_pre, y_tru):
    # y_pre: predicted probability matrix
    # y_tru: true one-hot label matrix
    y_pre = np.array(y_pre)
    y_tru = np.array(y_tru)
    loss = sum(sum(abs(y_pre - y_tru)))
    return loss

y_pre = [[0.1, 0.1, 0.7, 0.1], [0.1, 0.1, 0.7, 0.1]]
y_tru = [[0, 0, 1, 0], [0, 0, 1, 0]]
print(abs_sum(y_pre, y_tru))
```
Output:
```
1.2
```
2. Baseline
2.1 Importing Third-Party Packages
```python
import os
import gc
import math
import time
import warnings

import pandas as pd
import numpy as np
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostRegressor
from sklearn.linear_model import SGDRegressor, LinearRegression, Ridge
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.model_selection import StratifiedKFold, KFold, train_test_split
from sklearn.metrics import log_loss
from tqdm import tqdm
import matplotlib.pyplot as plt

warnings.filterwarnings('ignore')
```
2.2 Reading the Data
```python
train = pd.read_csv('./data/train.csv')
test = pd.read_csv('./data/testA.csv')
train.head()
test.head()
```
2.3 Data Preprocessing
```python
def reduce_mem_usage(df):
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    for col in df.columns:
        col_type = df[col].dtype
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')
    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df

# Simple preprocessing: split each comma-separated signal string into float columns
train_list = []
for items in train.values:
    train_list.append([items[0]] + [float(i) for i in items[1].split(',')] + [items[2]])
train = pd.DataFrame(np.array(train_list))
train.columns = ['id'] + ['s_' + str(i) for i in range(len(train_list[0]) - 2)] + ['label']
train = reduce_mem_usage(train)

test_list = []
for items in test.values:
    test_list.append([items[0]] + [float(i) for i in items[1].split(',')])
test = pd.DataFrame(np.array(test_list))
test.columns = ['id'] + ['s_' + str(i) for i in range(len(test_list[0]) - 1)]
test = reduce_mem_usage(test)
```
2.4 Preparing Training/Test Data
```python
x_train = train.drop(['id', 'label'], axis=1)
y_train = train['label']
x_test = test.drop(['id'], axis=1)
```
2.5 Model Training
```python
def abs_sum(y_pre, y_tru):
    y_pre = np.array(y_pre)
    y_tru = np.array(y_tru)
    loss = sum(sum(abs(y_pre - y_tru)))
    return loss

def cv_model(clf, train_x, train_y, test_x, clf_name):
    folds = 5
    seed = 2021
    kf = KFold(n_splits=folds, shuffle=True, random_state=seed)
    test = np.zeros((test_x.shape[0], 4))
    cv_scores = []
    onehot_encoder = OneHotEncoder(sparse=False)
    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print('************************************ {} ************************************'.format(str(i + 1)))
        trn_x, trn_y = train_x.iloc[train_index], train_y[train_index]
        val_x, val_y = train_x.iloc[valid_index], train_y[valid_index]
        if clf_name == "lgb":
            train_matrix = clf.Dataset(trn_x, label=trn_y)
            valid_matrix = clf.Dataset(val_x, label=val_y)
            params = {
                'boosting_type': 'gbdt',
                'objective': 'multiclass',
                'num_class': 4,
                'num_leaves': 2 ** 5,
                'feature_fraction': 0.8,
                'bagging_fraction': 0.8,
                'bagging_freq': 4,
                'learning_rate': 0.1,
                'seed': seed,
                'nthread': 28,
                'n_jobs': 24,
                'verbose': -1,
            }
            model = clf.train(params, train_set=train_matrix, valid_sets=valid_matrix,
                              num_boost_round=2000, verbose_eval=100, early_stopping_rounds=200)
            val_pred = model.predict(val_x, num_iteration=model.best_iteration)
            test_pred = model.predict(test_x, num_iteration=model.best_iteration)
        val_y = np.array(val_y).reshape(-1, 1)
        val_y = onehot_encoder.fit_transform(val_y)
        print('Predicted probability matrix:')
        print(test_pred)
        test += test_pred
        score = abs_sum(val_y, val_pred)
        cv_scores.append(score)
        print(cv_scores)
    print("%s_scotrainre_list:" % clf_name, cv_scores)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores))
    print("%s_score_std:" % clf_name, np.std(cv_scores))
    test = test / kf.n_splits
    return test

def lgb_model(x_train, y_train, x_test):
    lgb_test = cv_model(lgb, x_train, y_train, x_test, "lgb")
    return lgb_test

lgb_test = lgb_model(x_train, y_train, x_test)
```
Output (training log):
```
************************************ 1 ************************************
Training until validation scores don't improve for 200 rounds.
[100]   valid_0's multi_logloss: 0.064112
[200]   valid_0's multi_logloss: 0.0459167
[300]   valid_0's multi_logloss: 0.0408373
[400]   valid_0's multi_logloss: 0.0392399
[500]   valid_0's multi_logloss: 0.0392175
[600]   valid_0's multi_logloss: 0.0405053
Early stopping, best iteration is:
[463]   valid_0's multi_logloss: 0.0389485
Predicted probability matrix:
[[9.99961153e-01 3.79594757e-05 5.40269890e-07 3.47753326e-07]
 [4.99636350e-05 5.92518438e-04 9.99357510e-01 8.40463147e-09]
 [1.13534131e-06 6.17413877e-08 5.21828651e-07 9.99998281e-01]
 ...
 [1.22083700e-01 2.13622465e-04 8.77693686e-01 8.99180470e-06]
 [9.99980119e-01 1.98198246e-05 3.10446404e-08 2.99597008e-08]
 [9.94006248e-01 1.02828877e-03 3.81770253e-03 1.14776091e-03]]
[579.1476207255505]
************************************ 2 ************************************
Training until validation scores don't improve for 200 rounds.
[100]   valid_0's multi_logloss: 0.0687167
[200]   valid_0's multi_logloss: 0.0502461
[300]   valid_0's multi_logloss: 0.0448571
[400]   valid_0's multi_logloss: 0.0434752
[500]   valid_0's multi_logloss: 0.0437877
[600]   valid_0's multi_logloss: 0.0451813
Early stopping, best iteration is:
[436]   valid_0's multi_logloss: 0.0432897
Predicted probability matrix:
[[9.99991739e-01 7.88513434e-06 2.75344752e-07 1.00915251e-07]
 [2.49685427e-05 2.43497381e-04 9.99731529e-01 4.61614730e-09]
 [1.44652230e-06 5.27814431e-08 8.75834875e-07 9.99997625e-01]
 ...
 [1.73518192e-02 6.25526922e-04 9.82020954e-01 1.69956047e-06]
 [9.99964916e-01 3.49310996e-05 6.38724919e-08 8.85421858e-08]
 [9.49334496e-01 2.90531517e-03 4.51978522e-02 2.56233644e-03]]
[579.1476207255505, 604.2307776963927]
************************************ 3 ************************************
Training until validation scores don't improve for 200 rounds.
[100]   valid_0's multi_logloss: 0.0618889
[200]   valid_0's multi_logloss: 0.0430953
[300]   valid_0's multi_logloss: 0.0375819
[400]   valid_0's multi_logloss: 0.0352853
[500]   valid_0's multi_logloss: 0.0351532
[600]   valid_0's multi_logloss: 0.0358523
Early stopping, best iteration is:
[430]   valid_0's multi_logloss: 0.0349329
Predicted probability matrix:
[[9.99983568e-01 1.56724846e-05 1.94436783e-07 5.64651188e-07]
 [2.99349484e-05 2.43726437e-04 9.99726329e-01 9.67287310e-09]
 [2.31218700e-06 1.82129075e-07 6.88966798e-07 9.99996817e-01]
 ...
 [2.96181314e-02 2.17104772e-04 9.70149950e-01 1.48136138e-05]
 [9.99966536e-01 3.34076405e-05 4.80303648e-08 7.93476494e-09]
 [9.73829094e-01 5.30951041e-03 1.50529670e-02 5.80842848e-03]]
[579.1476207255505, 604.2307776963927, 555.3013640683623]
************************************ 4 ************************************
Training until validation scores don't improve for 200 rounds.
[100]   valid_0's multi_logloss: 0.0686223
[200]   valid_0's multi_logloss: 0.0500697
[300]   valid_0's multi_logloss: 0.045032
[400]   valid_0's multi_logloss: 0.0440971
[500]   valid_0's multi_logloss: 0.0446604
[600]   valid_0's multi_logloss: 0.0464719
Early stopping, best iteration is:
[441]   valid_0's multi_logloss: 0.0439851
Predicted probability matrix:
[[9.99993398e-01 5.63383991e-06 4.13276650e-07 5.54430625e-07]
 [4.23117526e-05 1.07414935e-03 9.98883518e-01 2.07032580e-08]
 [1.48865216e-06 1.48204734e-07 6.84974277e-07 9.99997678e-01]
 ...
 [1.76412184e-02 1.88982715e-04 9.82166158e-01 3.64074385e-06]
 [9.99938870e-01 6.10399865e-05 4.18069924e-08 4.79151711e-08]
 [8.63192676e-01 1.44352297e-02 1.12629861e-01 9.74223271e-03]]
[579.1476207255505, 604.2307776963927, 555.3013640683623, 605.9808495854531]
************************************ 5 ************************************
Training until validation scores don't improve for 200 rounds.
[100]   valid_0's multi_logloss: 0.0620219
[200]   valid_0's multi_logloss: 0.0442736
[300]   valid_0's multi_logloss: 0.038578
[400]   valid_0's multi_logloss: 0.0372811
[500]   valid_0's multi_logloss: 0.03731
[600]   valid_0's multi_logloss: 0.0382158
Early stopping, best iteration is:
[439]   valid_0's multi_logloss: 0.036957
Predicted probability matrix:
[[9.99982194e-01 1.70578170e-05 4.38130927e-07 3.10112892e-07]
 [5.29666403e-05 1.31372254e-03 9.98633279e-01 3.15655986e-08]
 [2.00659963e-06 1.05574382e-07 1.42520417e-06 9.99996463e-01]
 ...
 [1.00123126e-02 4.74424633e-05 9.89939074e-01 1.17041518e-06]
 [9.99920877e-01 7.90255835e-05 7.54952546e-08 2.23046247e-08]
 [9.65319082e-01 3.09196193e-03 2.49116248e-02 6.67733096e-03]]
[579.1476207255505, 604.2307776963927, 555.3013640683623, 605.9808495854531, 570.2883889772514]
lgb_scotrainre_list: [579.1476207255505, 604.2307776963927, 555.3013640683623, 605.9808495854531, 570.2883889772514]
lgb_score_mean: 582.9898002106021
lgb_score_std: 19.60869787836831
```
2.6 Prediction Results
```python
temp = pd.DataFrame(lgb_test)
result = pd.read_csv('sample_submit.csv')
result['label_0'] = temp[0]
result['label_1'] = temp[1]
result['label_2'] = temp[2]
result['label_3'] = temp[3]
result.to_csv('submit.csv', index=False)
```
Submission score:
Summary