Adversarial Validation: a walkthrough of a kernel from the Microsoft Malware Prediction competition
English write-up 🔗
Competition page 🔗
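What adversarial validation is for
Adversarial validation builds a new dataset that follows the same distribution as the data to be classified (the test set) and uses it as the validation set. A model tuned against such a validation set tends to classify the test set better.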
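The idea can be sketched in three steps: label every row by where it came from, train a classifier to tell the two sources apart, then use the training rows that look most like test rows as the validation set. This is only a minimal illustration under stated assumptions: the data is a toy one-dimensional sample, a hand-written logistic regression stands in for the LightGBM model used in the kernel, and the names `fit_logistic`, `score_row`, `train_x` and `target_x` are made up for this example. Taking the most test-like rows is one common way to use the scores, not necessarily what the kernel does.

```python
import math
import random


def fit_logistic(xs, ys, lr=0.5, epochs=300):
    """Fit p(is_test | x) = sigmoid(w * x + b) with batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x
            grad_b += p - y
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def score_row(x, w, b):
    """Predicted probability that a row comes from the test set."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))


# Toy data (assumption): the test-set stand-in is shifted relative to train.
rng = random.Random(0)
train_x = [rng.gauss(0.0, 1.0) for _ in range(200)]
target_x = [rng.gauss(2.0, 1.0) for _ in range(200)]  # stand-in for test.csv

# Step 1: stack both sets and label each row by origin (0 = train, 1 = test).
xs = train_x + target_x
ys = [0] * len(train_x) + [1] * len(target_x)

# Step 2: train the "adversarial" classifier on the origin labels.
w, b = fit_logistic(xs, ys)

# Step 3: rank training rows by how test-like they look and
# hold out the top 20% as the validation set.
scores = [score_row(x, w, b) for x in train_x]
order = sorted(range(len(train_x)), key=lambda i: scores[i], reverse=True)
n_valid = len(train_x) // 5
valid_rows = [train_x[i] for i in order[:n_valid]]
fit_rows = [train_x[i] for i in order[n_valid:]]
```

In this toy setup the held-out rows sit closer to the shifted test distribution than the rows kept for fitting, which is the property the validation set is meant to have.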
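A brief introduction to AUC
AUC is the area under the ROC curve: 0.5 corresponds to random guessing and 1.0 to a perfect ranking. The larger the AUC of the final model's predictions on new data, the better the model's ability to classify.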
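To make the metric concrete, the sketch below computes AUC straight from its pairwise definition: the fraction of (positive, negative) pairs in which the positive sample gets the higher score, with ties counted as half. The function name `pairwise_auc` is made up for this illustration, and the quadratic loop is only suitable for small inputs; libraries typically use a rank-based computation instead.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))


example_auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In the example, three of the four positive/negative pairs are ordered correctly, so `example_auc` is 0.75.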
Project walkthrough
Code:
(code listing not preserved)
Output:
['microsoft-malware-prediction', 'malware-feature-engineering-full-train-and-test']
['train.csv', 'sample_submission.csv', 'test.csv']
['__output__.json', 'custom.css', 'new_test.csv', '__results__.html', 'new_train.csv']
Output:
(1000000, 80)
| | ProductName | EngineVersion | AppVersion | AvSigVersion | IsBeta | RtpStateBitfield | IsSxsPassiveMode | DefaultBrowsersIdentifier | AVProductStatesIdentifier | AVProductsInstalled | AVProductsEnabled | HasTpm | CountryIdentifier | CityIdentifier | OrganizationIdentifier | GeoNameIdentifier | LocaleEnglishNameIdentifier | Platform | Processor | OsVer | OsBuild | OsSuite | OsPlatformSubRelease | OsBuildLab | SkuEdition | IsProtected | AutoSampleOptIn | SMode | IeVerIdentifier | SmartScreen | Firewall | UacLuaenable | Census_MDC2FormFactor | Census_DeviceFamily | Census_OEMNameIdentifier | Census_OEMModelIdentifier | Census_ProcessorCoreCount | Census_ProcessorManufacturerIdentifier | Census_ProcessorModelIdentifier | Census_ProcessorClass | Census_PrimaryDiskTotalCapacity | Census_PrimaryDiskTypeName | Census_SystemVolumeTotalCapacity | Census_HasOpticalDiskDrive | Census_TotalPhysicalRAM | Census_ChassisTypeName | Census_InternalPrimaryDiagonalDisplaySizeInInches | Census_InternalPrimaryDisplayResolutionHorizontal | Census_InternalPrimaryDisplayResolutionVertical | Census_PowerPlatformRoleName | Census_InternalBatteryType | Census_InternalBatteryNumberOfCharges | Census_OSVersion | Census_OSArchitecture | Census_OSBranch | Census_OSBuildNumber | Census_OSBuildRevision | Census_OSEdition | Census_OSSkuName | Census_OSInstallTypeName | Census_OSInstallLanguageIdentifier | Census_OSUILocaleIdentifier | Census_OSWUAutoUpdateOptionsName | Census_IsPortableOperatingSystem | Census_GenuineStateName | Census_ActivationChannel | Census_IsFlightingInternal | Census_IsFlightsDisabled | Census_FlightRing | Census_ThresholdOptIn | Census_FirmwareManufacturerIdentifier | Census_FirmwareVersionIdentifier | Census_IsSecureBootEnabled | Census_IsWIMBootEnabled | Census_IsVirtualDevice | Census_IsTouchEnabled | Census_IsPenCapable | Census_IsAlwaysOnAlwaysConnectedCapable | Wdft_IsGamer | Wdft_RegionIdentifier |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -1 | 0 | 0 | 0 | 1 | 0 | 202.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.0 | 0 | 0.0 | 0 | -1 | 1.0 | 0 | 0 | 0 | 0 | 20832.0 | 4.0 | 0 | 0 | -1 | 476940.0 | 0 | 299451.0 | 0 | 4096.0 | 0 | 18.9 | 1440.0 | 900.0 | 0 | -1 | 4.294967e+09 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | 0.0 | 0 | NaN | 0 | 2516.0 | 0 | NaN | 0.0 | 0 | 0 | 0.0 | 0.0 | 0 |
| 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | -1 | 0 | 0 | 0 | 1 | 1 | 164.0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.0 | 0 | 0.0 | 0 | -1 | 1.0 | 0 | 1 | 0 | 0 | 98328.0 | 4.0 | 0 | 1 | -1 | 476940.0 | 0 | 102385.0 | 0 | 4096.0 | 1 | 13.9 | 1366.0 | 768.0 | 1 | -1 | 1.000000e+00 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | NaN | 0.0 | 1 | NaN | 0 | 1767.0 | 0 | NaN | 0.0 | 0 | 0 | 0.0 | 0.0 | 1 |
| 2 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | -1 | 0 | 0 | 0 | 1 | 2 | 685.0 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1.0 | 0 | 0.0 | 0 | 0 | 1.0 | 0 | 0 | 0 | 1 | 2.0 | 4.0 | 0 | 2 | -1 | 114473.0 | 1 | 113907.0 | 0 | 4096.0 | 0 | 21.5 | 1920.0 | 1080.0 | 0 | -1 | 4.294967e+09 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 2 | 2 | 1 | 0 | 0 | 1 | NaN | 0.0 | 0 | NaN | 1 | 190.0 | 0 | NaN | 0.0 | 0 | 0 | 0.0 | 0.0 | 2 |
| 3 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | -1 | 0 | 0 | 0 | 1 | 3 | 20.0 | -1 | 3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.0 | 0 | 0.0 | 0 | 1 | 1.0 | 0 | 0 | 0 | 2 | 171.0 | 4.0 | 0 | 3 | -1 | 238475.0 | 2 | 227116.0 | 0 | 4096.0 | 2 | 18.5 | 1366.0 | 768.0 | 0 | -1 | 4.294967e+09 | 2 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 3 | 3 | 1 | 0 | 0 | 1 | NaN | 0.0 | 0 | NaN | 2 | 33.0 | 0 | NaN | 0.0 | 0 | 0 | 0.0 | 0.0 | 2 |
| 4 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | -1 | 0 | 0 | 0 | 1 | 4 | 15.0 | -1 | 4 | 4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1.0 | 0 | 0.0 | 0 | 0 | 1.0 | 0 | 1 | 0 | 2 | 2263.0 | 4.0 | 0 | 4 | -1 | 476940.0 | 0 | 101900.0 | 0 | 6144.0 | 3 | 14.0 | 1366.0 | 768.0 | 1 | 0 | 0.000000e+00 | 3 | 0 | 0 | 0 | 3 | 1 | 1 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 2 | 124.0 | 0 | 0.0 | 0.0 | 0 | 0 | 0.0 | 0.0 | 3 |
Output:
(1000000, 80)
Training until validation scores don't improve for 25 rounds.
[10] training's auc: 0.977506 valid_1's auc: 0.977521
[20] training's auc: 0.978298 valid_1's auc: 0.978195
[30] training's auc: 0.978955 valid_1's auc: 0.978624
[40] training's auc: 0.979589 valid_1's auc: 0.979024
[50] training's auc: 0.980195 valid_1's auc: 0.979331
[60] training's auc: 0.980738 valid_1's auc: 0.979562
[70] training's auc: 0.981254 valid_1's auc: 0.979729
[80] training's auc: 0.981701 valid_1's auc: 0.979824
[90] training's auc: 0.982138 valid_1's auc: 0.979934
[100] training's auc: 0.982507 valid_1's auc: 0.979991
[110] training's auc: 0.98287 valid_1's auc: 0.980026
[120] training's auc: 0.983184 valid_1's auc: 0.980058
[130] training's auc: 0.98349 valid_1's auc: 0.980061
[140] training's auc: 0.983802 valid_1's auc: 0.980066
[150] training's auc: 0.984118 valid_1's auc: 0.980061
[160] training's auc: 0.984421 valid_1's auc: 0.980064
Early stopping, best iteration is:
[136] training's auc: 0.983674 valid_1's auc: 0.980071
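The log above ends with early stopping after 25 rounds without improvement. The rule behind that message can be sketched as follows; this is a simplified illustration of the stopping criterion, not LightGBM's own code, and the function name `early_stopping_round` is made up for this example.

```python
def early_stopping_round(valid_scores, patience):
    """Return (best_iteration, best_score, stopped_at) for a score to maximise.

    Iterations are 1-based. Training stops at the first iteration that is
    `patience` rounds past the best one without any improvement.
    """
    best_score = float("-inf")
    best_iter = 0
    for i, score in enumerate(valid_scores, start=1):
        if score > best_score:
            best_score, best_iter = score, i
        elif i - best_iter >= patience:
            return best_iter, best_score, i
    # Ran out of iterations before the patience window was exhausted.
    return best_iter, best_score, len(valid_scores)


result = early_stopping_round([0.5, 0.6, 0.7, 0.65, 0.64, 0.63], patience=2)
```

Here the best score is reached at iteration 3 and training stops at iteration 5. Under the same rule, a best iteration of 136 with a patience of 25 would stop at round 161, which fits the log ending just after the [160] line.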
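Once adversarial validation has run, the trained model gives a ranking of how much each feature of the original dataset contributes to building a dataset distributed like test.csv, and the ranking can be plotted as a bar chart.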
import warnings

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

warnings.simplefilter(action='ignore', category=FutureWarning)

# clf (the trained LightGBM model) and columns_to_use (its feature list) come from the earlier steps
feature_imp = pd.DataFrame(
    sorted(zip(clf.feature_importance(), columns_to_use), reverse=True),
    columns=['Value', 'Feature'],
)

plt.figure(figsize=(20, 10))
sns.barplot(x="Value", y="Feature", data=feature_imp.sort_values(by="Value", ascending=False))
plt.title('LightGBM Features (avg over folds)')
plt.tight_layout()
plt.savefig('lgbm_importances-01.png')  # save before show(); saving afterwards can produce a blank image
plt.show()
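Finally, we can use this feature-importance ranking to choose the samples that make up the validation set.
How to use the ranking:
The raw data contains many missing values, and many values are inconsistently formatted (string-valued features, for example). So which samples should we pick? This is where the ranking helps.
For example, the top-ranked feature is 'AvSigVersion'. We remove every sample whose value for this feature is missing and draw the validation set from the samples that remain.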
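The selection step described above can be sketched in a few lines. This is a minimal illustration, assuming rows are plain dictionaries and that the rows with a missing value stay in the training part (the text does not say what happens to them). The helper names `is_missing` and `split_without_missing` and the toy rows are made up; only the column name `AvSigVersion` comes from the table header.

```python
import math
import random


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def split_without_missing(rows, feature, valid_fraction=0.2, seed=0):
    """Draw a validation set only from rows where `feature` is present."""
    candidates = [i for i, row in enumerate(rows) if not is_missing(row.get(feature))]
    n_valid = int(len(candidates) * valid_fraction)
    picked = set(random.Random(seed).sample(candidates, n_valid))
    valid = [rows[i] for i in sorted(picked)]
    # Assumption: everything not picked, including rows with a missing
    # value, remains available for training.
    train = [row for i, row in enumerate(rows) if i not in picked]
    return train, valid


# Toy rows (made up): every fifth row lacks a value for the top feature.
rows = [
    {"AvSigVersion": None if i % 5 == 0 else float(i), "HasTpm": 1}
    for i in range(50)
]
train_part, valid_part = split_without_missing(rows, "AvSigVersion")
```

With a pandas DataFrame, the same idea is usually expressed with `dropna(subset=[...])` followed by `sample(...)`.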