kaggle-Santander Customer Transaction Prediction: Summary
1 Plotting
sns.kdeplot() — kernel density estimate (KDE) plot
sns.distplot() — combines matplotlib's hist() with the kernel density estimate of kdeplot()
See the Seaborn introductory series on kdeplot and distplot.
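As a rough sketch of what these plots estimate under the hood (sns.kdeplot fits a Gaussian kernel density before drawing the curve; scipy's gaussian_kde is a comparable estimator, used here so the example needs no plotting backend — the toy data is illustrative):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=1000)  # toy data, N(0, 1)

# Fit a kernel density estimate -- the same thing sns.kdeplot computes
kde = gaussian_kde(sample)
grid = np.linspace(-4, 4, 81)
density = kde(grid)  # estimated density at each grid point

# A proper density is non-negative and peaks near the sample mean (~0 here)
print(float(grid[density.argmax()]))
```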
2 Permutation Importance
When building tree models (XGBoost, LightGBM, etc.), if we want to know which variables matter, we can read feature importances from the model's feature_importances_ attribute. For example, LightGBM's feature_importances_ can measure a feature either by its number of splits or by the gain obtained from splitting on it. Different criteria generally produce different importance orderings, so I usually cross-select features using several criteria: a feature that ranks high under all of them has good predictive power for the label.
Permutation importance takes a different angle: if replacing a feature with random noise (shuffling it) degrades the model a lot, the feature is important; otherwise it is not.
import eli5
from eli5.sklearn import PermutationImportance
from sklearn.feature_selection import SelectFromModel
# Imports the original snippet relied on but did not show:
import pandas as pd
import lightgbm as lgb
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

def PermutationImportance_(clf, X_train, y_train, X_valid, X_test):
    # Shuffle each feature 5 times and average the score drop, with 5-fold CV
    perm = PermutationImportance(clf, n_iter=5, random_state=1024, cv=5)
    perm.fit(X_train, y_train)
    result_ = {'var': X_train.columns.values,
               'feature_importances_': perm.feature_importances_,
               'feature_importances_std_': perm.feature_importances_std_}
    feature_importances_ = pd.DataFrame(result_, columns=['var', 'feature_importances_', 'feature_importances_std_'])
    feature_importances_ = feature_importances_.sort_values('feature_importances_', ascending=False)
    # Keep only features whose permutation importance is at least 0
    sel = SelectFromModel(perm, threshold=0.00, prefit=True)
    X_train_ = sel.transform(X_train)
    X_valid_ = sel.transform(X_valid)
    X_test_ = sel.transform(X_test)
    return feature_importances_, X_train_, X_valid_, X_test_

model_1 = RandomForestClassifier(random_state=1024)
feature_importances_1, X_train_1, X_valid_1, X_test_1 = PermutationImportance_(model_1, X_train, y_train, X_valid, X_test)

model_2 = lgb.LGBMClassifier(objective='binary', random_state=1024)
feature_importances_2, X_train_2, X_valid_2, X_test_2 = PermutationImportance_(model_2, X_train, y_train, X_valid, X_test)

model_3 = LogisticRegression(random_state=1024)
feature_importances_3, X_train_3, X_valid_3, X_test_3 = PermutationImportance_(model_3, X_train, y_train, X_valid, X_test)
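eli5 is no longer actively maintained; scikit-learn ships the same idea as sklearn.inspection.permutation_importance. A minimal self-contained sketch (the toy dataset and model below are illustrative assumptions, not the competition data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Toy binary-classification data: 5 features, only 2 of them informative
X, y = make_classification(n_samples=500, n_features=5, n_informative=2,
                           n_redundant=0, random_state=1024)
clf = RandomForestClassifier(random_state=1024).fit(X, y)

# Shuffle each column 5 times and record the mean accuracy drop
result = permutation_importance(clf, X, y, n_repeats=5, random_state=1024)
print(result.importances_mean)  # one score drop per feature
```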
3 Partial Dependence Plots
Partial dependence plots show how each variable or predictor affects the model's predictions. They are useful for questions such as:

How much of the wage gap between men and women is due to gender itself, rather than to differences in education or work experience? Controlling for house characteristics, what effect do longitude and latitude have on house prices? To restate that: we want to know how houses of the same size would be priced in different areas, even if the houses actually found in those areas differ in size. Are health differences between two groups due to differences in diet, or to some other factor?
# Note: this import path is from an old scikit-learn; in modern versions the
# partial dependence utilities live in sklearn.inspection
# (PartialDependenceDisplay.from_estimator).
from sklearn.ensemble.partial_dependence import plot_partial_dependence

my_plots = plot_partial_dependence(my_model,
                                   feature_names=clo_to_use,
                                   features=[0, 2],  # which features to plot
                                   X=imputed_X)
4 tqdm
from tqdm import tqdm_notebook as tqdm
tqdm is a fast, extensible Python progress bar: it adds a live progress indicator to long-running Python loops, and the user only needs to wrap any iterator as tqdm(iterator).
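For example (a minimal sketch; tqdm_notebook is the Jupyter-flavoured bar, while the plain tqdm used here prints to the terminal):

```python
from tqdm import tqdm

total = 0
for i in tqdm(range(1000)):  # wraps the iterator; a progress bar updates as it runs
    total += i
print(total)
```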
5 Feature Engineering
Find the unique values in each column; where a value occurs exactly once, mark it with 1.
If a sample contains at least one unique value, treat it as a real sample; if none of its feature values are unique, treat it as a synthetic (fake) sample.
Concatenate the real test samples with the actual training samples.
import numpy as np

unique_samples = []
unique_count = np.zeros_like(df_test)
for feature in range(df_test.shape[1]):
    # index_ is the first occurrence of each unique value, count_ its frequency
    _, index_, count_ = np.unique(df_test[:, feature], return_counts=True, return_index=True)
    unique_count[index_[count_ == 1], feature] += 1

# Samples with at least one unique value are real; the rest are synthetic
real_samples_indexes = np.argwhere(np.sum(unique_count, axis=1) > 0)[:, 0]
synthetic_samples_indexes = np.argwhere(np.sum(unique_count, axis=1) == 0)[:, 0]
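On a toy matrix the trick looks like this (illustrative values): row 0 holds a unique entry in column 0, while rows 1 and 2 share all of their values, so they would be flagged as synthetic.

```python
import numpy as np

df_test = np.array([[1.0, 5.0],
                    [2.0, 5.0],
                    [2.0, 5.0]])

unique_count = np.zeros_like(df_test)
for feature in range(df_test.shape[1]):
    _, index_, count_ = np.unique(df_test[:, feature], return_index=True, return_counts=True)
    unique_count[index_[count_ == 1], feature] += 1  # mark values seen exactly once

real = np.argwhere(np.sum(unique_count, axis=1) > 0)[:, 0]
synthetic = np.argwhere(np.sum(unique_count, axis=1) == 0)[:, 0]
print(real, synthetic)  # row 0 is real, rows 1 and 2 are synthetic
```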
The "vc" column: the number of times the value repeats, capped at 10.
The "sum" column: for values that occur more than once (vc > 1), the original value minus the column mean; zero elsewhere.
for feat in feats:
    temp = df[feat].value_counts(dropna=True)
    df_train[feat + "vc"] = df_train[feat].map(temp).map(lambda x: min(10, x)).astype(np.uint8)
    df_test[feat + "vc"] = df_test[feat].map(temp).map(lambda x: min(10, x)).astype(np.uint8)
    print(feat, temp.shape[0],
          df_train[feat + "vc"].map(lambda x: int(x > 2)).sum(),
          df_train[feat + "vc"].map(lambda x: int(x > 3)).sum())
    df_train[feat + "sum"] = ((df_train[feat] - df[feat].mean()) * df_train[feat + "vc"].map(lambda x: int(x > 1))).astype(np.float32)
    df_test[feat + "sum"] = ((df_test[feat] - df[feat].mean()) * df_test[feat + "vc"].map(lambda x: int(x > 1))).astype(np.float32)
    df_train[feat + "sum2"] = (df_train[feat] * df_train[feat + "vc"].map(lambda x: int(x > 2))).astype(np.float32)
    df_test[feat + "sum2"] = (df_test[feat] * df_test[feat + "vc"].map(lambda x: int(x > 2))).astype(np.float32)
    df_train[feat + "sum3"] = (df_train[feat] * df_train[feat + "vc"].map(lambda x: int(x > 4))).astype(np.float32)
    df_test[feat + "sum3"] = (df_test[feat] * df_test[feat + "vc"].map(lambda x: int(x > 4))).astype(np.float32)
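A toy illustration of the "vc" count feature on a single series (illustrative values, not the competition data):

```python
import numpy as np
import pandas as pd

s = pd.Series([3.1, 3.1, 2.5, 7.0, 7.0, 7.0])
temp = s.value_counts(dropna=True)                       # frequency of each value
vc = s.map(temp).map(lambda x: min(10, x)).astype(np.uint8)  # per-row count, capped at 10
print(vc.tolist())
```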
def encode_FE(df, col, test):
    # Frequency-encode col: replace each value with how often it appears in df
    cv = df[col].value_counts()
    nm = col + '_FE'
    df[nm] = df[col].map(cv)
    test[nm] = test[col].map(cv)
    test[nm].fillna(0, inplace=True)  # values unseen in df get frequency 0
    # Downcast to the smallest unsigned integer type that fits
    if cv.max() <= 255:
        df[nm] = df[nm].astype('uint8')
        test[nm] = test[nm].astype('uint8')
    else:
        df[nm] = df[nm].astype('uint16')
        test[nm] = test[nm].astype('uint16')
    return

test['target'] = -1
# Count frequencies over train plus only the *real* test rows,
# so the synthetic test rows do not distort the counts
comb = pd.concat([train, test.loc[real_samples_indexes]], axis=0, sort=True)
for i in range(200):
    encode_FE(comb, 'var_' + str(i), test)
train = comb[:len(train)]; del comb
print('Added 200 new magic features!')
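A self-contained run of the frequency-encoding idea on toy frames (column name and values are illustrative), showing how an unseen test value falls back to frequency 0:

```python
import pandas as pd

train = pd.DataFrame({'var_0': [1.0, 1.0, 2.0, 3.0]})
test = pd.DataFrame({'var_0': [1.0, 9.0]})

cv = train['var_0'].value_counts()  # frequency of each training value
train['var_0_FE'] = train['var_0'].map(cv).astype('uint8')
test['var_0_FE'] = test['var_0'].map(cv).fillna(0).astype('uint8')
print(test['var_0_FE'].tolist())  # 9.0 never appears in train, so it maps to 0
```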