Code source: the "7th-place-solution-microsoft-malware-prediction" kernel from the Kaggle Microsoft Malware Prediction competition.
Preface
Reading other people's excellent code is a good way to improve your own: you pick up programming knowledge, borrow good habits, and learn techniques you would not have come up with yourself. This post is my personal summary of, and commentary on, the 7th-place solution to Microsoft's 2019 malware prediction competition. Some of the code already carries comments, and I add further notes of my own along the way. My abilities are limited and mistakes are inevitable, so corrections from fellow enthusiasts are very welcome.
Main Content
Overview
As is well known, building a machine-learning classifier consists of two main parts: 1. data preprocessing (data cleaning, feature engineering, and so on) and 2. model building (training and hyperparameter tuning). Preprocessing is the groundwork: the quality of the training data largely determines the quality of the final model, which is why most of the code in a typical machine-learning project — including this one — is data-handling code. Personally, I don't think the data processing in this kernel is outstanding, but it is serviceable (for a more interesting preprocessing example, see another post of mine). The learning algorithm used here is LightGBM.
Code Walkthrough
Note: I will explain the code in separate chunks. Due to hardware limitations I cannot show the output of every step, so readers who can are encouraged to copy the code and run it themselves; the data files used in the code can be downloaded from the competition page. Although the walkthrough is split into steps, concatenating the snippets from top to bottom gives the complete program.
Data Preprocessing
```python
import numpy as np
import pandas as pd
import gc
import time
import random
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import StratifiedKFold
import matplotlib.pyplot as plot
import seaborn as sb
```
Preparation before the main logic
```python
dataFolder = '../input/'
submissionFileName = 'submission'
trainFile = 'train.csv'
testFile = 'test.csv'
numberOfRows = 4000000
seed = 6001

np.random.seed(seed)
random.seed(seed)


def displayImportances(featureImportanceDf, submissionFileName):
    # Average each feature's importance over the folds and plot them, best first.
    cols = featureImportanceDf[["feature", "importance"]].groupby("feature").mean() \
        .sort_values(by="importance", ascending=False).index
    bestFeatures = featureImportanceDf.loc[featureImportanceDf.feature.isin(cols)]
    plot.figure(figsize=(14, 14))
    sb.barplot(x="importance", y="feature",
               data=bestFeatures.sort_values(by="importance", ascending=False))
    plot.title('LightGBM Features')
    plot.tight_layout()
    plot.savefig(submissionFileName + '.png')
```
In this block, I don't think the paths really need to be spread across several variables (perhaps it is just the author's habit). The purpose of `numberOfRows = 4000000` only becomes clear once you have read the whole program: the author concatenates the official train and test files and then takes the first 4,000,000 samples as the training data (which is later split into training and validation folds). `seed = 6001` and the two lines below it fix the random seeds; I did wonder at first why `random.seed(seed)` is needed in addition to `np.random.seed(seed)`. As for the custom function, it is used at the very end to plot and save the feature importances.
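As it turns out, NumPy and Python's built-in `random` module keep completely independent generator state, so seeding one has no effect on the other; both calls are needed for full reproducibility. A quick illustration (my own snippet, not part of the kernel):

```python
import random
import numpy as np

np.random.seed(6001)   # seeds NumPy's global generator only
random.seed(6001)      # seeds the stdlib generator only

print(np.random.rand())  # reproducible across runs thanks to np.random.seed
print(random.random())   # reproducible across runs thanks to random.seed
# Seeding only one of the two would leave the other generator unseeded.
```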
Setting dtypes for the features in the official files. The raw data only contains the feature values; the organizers do not declare what type each column is, so we have to specify the types ourselves.
```python
dtypes = {
    'MachineIdentifier': 'category', 'ProductName': 'category', 'EngineVersion': 'category',
    'AppVersion': 'category', 'AvSigVersion': 'category', 'IsBeta': 'int8',
    'RtpStateBitfield': 'float16', 'IsSxsPassiveMode': 'int8', 'DefaultBrowsersIdentifier': 'float16',
    'AVProductStatesIdentifier': 'float32', 'AVProductsInstalled': 'float16', 'AVProductsEnabled': 'float16',
    'HasTpm': 'int8', 'CountryIdentifier': 'int16', 'CityIdentifier': 'float32',
    'OrganizationIdentifier': 'float16', 'GeoNameIdentifier': 'float16', 'LocaleEnglishNameIdentifier': 'int8',
    'Platform': 'category', 'Processor': 'category', 'OsVer': 'category',
    'OsBuild': 'int16', 'OsSuite': 'int16', 'OsPlatformSubRelease': 'category',
    'OsBuildLab': 'category', 'SkuEdition': 'category', 'IsProtected': 'float16',
    'AutoSampleOptIn': 'int8', 'PuaMode': 'category', 'SMode': 'float16',
    'IeVerIdentifier': 'float16', 'SmartScreen': 'category', 'Firewall': 'float16',
    'UacLuaenable': 'float32', 'Census_MDC2FormFactor': 'category', 'Census_DeviceFamily': 'category',
    'Census_OEMNameIdentifier': 'float16', 'Census_OEMModelIdentifier': 'float32',
    'Census_ProcessorCoreCount': 'float16', 'Census_ProcessorManufacturerIdentifier': 'float16',
    'Census_ProcessorModelIdentifier': 'float16', 'Census_ProcessorClass': 'category',
    'Census_PrimaryDiskTotalCapacity': 'float32', 'Census_PrimaryDiskTypeName': 'category',
    'Census_SystemVolumeTotalCapacity': 'float32', 'Census_HasOpticalDiskDrive': 'int8',
    'Census_TotalPhysicalRAM': 'float32', 'Census_ChassisTypeName': 'category',
    'Census_InternalPrimaryDiagonalDisplaySizeInInches': 'float16',
    'Census_InternalPrimaryDisplayResolutionHorizontal': 'float16',
    'Census_InternalPrimaryDisplayResolutionVertical': 'float16',
    'Census_PowerPlatformRoleName': 'category', 'Census_InternalBatteryType': 'category',
    'Census_InternalBatteryNumberOfCharges': 'float32', 'Census_OSVersion': 'category',
    'Census_OSArchitecture': 'category', 'Census_OSBranch': 'category',
    'Census_OSBuildNumber': 'int16', 'Census_OSBuildRevision': 'int32',
    'Census_OSEdition': 'category', 'Census_OSSkuName': 'category',
    'Census_OSInstallTypeName': 'category', 'Census_OSInstallLanguageIdentifier': 'float16',
    'Census_OSUILocaleIdentifier': 'int16', 'Census_OSWUAutoUpdateOptionsName': 'category',
    'Census_IsPortableOperatingSystem': 'int8', 'Census_GenuineStateName': 'category',
    'Census_ActivationChannel': 'category', 'Census_IsFlightingInternal': 'float16',
    'Census_IsFlightsDisabled': 'float16', 'Census_FlightRing': 'category',
    'Census_ThresholdOptIn': 'float16', 'Census_FirmwareManufacturerIdentifier': 'float16',
    'Census_FirmwareVersionIdentifier': 'float32', 'Census_IsSecureBootEnabled': 'int8',
    'Census_IsWIMBootEnabled': 'float16', 'Census_IsVirtualDevice': 'float16',
    'Census_IsTouchEnabled': 'int8', 'Census_IsPenCapable': 'int8',
    'Census_IsAlwaysOnAlwaysConnectedCapable': 'float16', 'Wdft_IsGamer': 'float16',
    'Wdft_RegionIdentifier': 'float16', 'HasDetections': 'int8'
}

selectedFeatures = [
    'AVProductStatesIdentifier', 'AVProductsEnabled', 'IsProtected', 'Processor', 'OsSuite',
    'IsProtected',  # listed twice in the original kernel
    'RtpStateBitfield', 'AVProductsInstalled', 'Wdft_IsGamer', 'DefaultBrowsersIdentifier',
    'OsBuild', 'Wdft_RegionIdentifier', 'SmartScreen', 'CityIdentifier', 'AppVersion',
    'Census_IsSecureBootEnabled', 'Census_PrimaryDiskTypeName', 'Census_SystemVolumeTotalCapacity',
    'Census_HasOpticalDiskDrive', 'Census_IsWIMBootEnabled', 'Census_IsVirtualDevice',
    'Census_IsTouchEnabled', 'Census_FirmwareVersionIdentifier', 'GeoNameIdentifier',
    'IeVerIdentifier', 'Census_FirmwareManufacturerIdentifier',
    'Census_InternalPrimaryDisplayResolutionHorizontal', 'Census_InternalPrimaryDisplayResolutionVertical',
    'Census_OEMModelIdentifier', 'Census_ProcessorModelIdentifier', 'Census_OSVersion',
    'Census_InternalPrimaryDiagonalDisplaySizeInInches', 'Census_OEMNameIdentifier',
    'Census_ChassisTypeName', 'Census_OSInstallLanguageIdentifier', 'EngineVersion',
    'OrganizationIdentifier', 'CountryIdentifier', 'Census_ActivationChannel',
    'Census_ProcessorCoreCount', 'Census_OSWUAutoUpdateOptionsName', 'Census_InternalBatteryType'
]
```
代碼作者因?yàn)榫邆浞浅7浅I詈竦臄?shù)據(jù)處理技術(shù)功底,他可能是根據(jù)以前對(duì)惡意代碼數(shù)據(jù)處理的經(jīng)驗(yàn)直接選擇了這些特征來給機(jī)器學(xué)習(xí)模型進(jìn)行訓(xùn)練。所以說,特征是不能亂選的,如果沒有代碼作者那樣的技術(shù),還是借鑒別人的數(shù)據(jù)預(yù)處理方法進(jìn)行特征篩選吧。
```python
trainDf = pd.read_csv(dataFolder + trainFile, dtype=dtypes, usecols=selectedFeatures,
                      low_memory=True, nrows=numberOfRows)
labels = pd.read_csv(dataFolder + trainFile, usecols=['HasDetections'], nrows=numberOfRows)
testDf = pd.read_csv(dataFolder + testFile, dtype=dtypes, usecols=selectedFeatures,
                     low_memory=True)

print('== Dataset Shapes ==')
print('Train : ' + str(trainDf.shape))
print('Labels : ' + str(labels.shape))
print('Test : ' + str(testDf.shape))

# Stack train and test so that the value counts and dummies below are computed consistently.
df = trainDf.append(testDf).reset_index()
del trainDf, testDf
gc.collect()
```
`df` is the new DataFrame obtained by concatenating train and test.
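As a side note, `DataFrame.append` was deprecated in pandas 1.4 and removed in 2.0; on current pandas versions the same concatenation would be written with `pd.concat` — a drop-in sketch, not part of the original kernel:

```python
# Equivalent concatenation on modern pandas (DataFrame.append no longer exists in 2.x):
df = pd.concat([trainDf, testDf]).reset_index()
```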
Normalizing the values of the 'SmartScreen' feature
```python
# Collapse the many inconsistent spellings of SmartScreen into canonical values.
df.loc[df.SmartScreen == 'off', 'SmartScreen'] = 'Off'
df.loc[df.SmartScreen == 'of', 'SmartScreen'] = 'Off'
df.loc[df.SmartScreen == 'OFF', 'SmartScreen'] = 'Off'
df.loc[df.SmartScreen == '00000000', 'SmartScreen'] = 'Off'
df.loc[df.SmartScreen == '0', 'SmartScreen'] = 'Off'
df.loc[df.SmartScreen == 'ON', 'SmartScreen'] = 'On'
df.loc[df.SmartScreen == 'on', 'SmartScreen'] = 'On'
df.loc[df.SmartScreen == 'Enabled', 'SmartScreen'] = 'On'
df.loc[df.SmartScreen == 'BLOCK', 'SmartScreen'] = 'Block'
df.loc[df.SmartScreen == 'requireadmin', 'SmartScreen'] = 'RequireAdmin'
df.loc[df.SmartScreen == 'requireAdmin', 'SmartScreen'] = 'RequireAdmin'
df.loc[df.SmartScreen == 'RequiredAdmin', 'SmartScreen'] = 'RequireAdmin'
df.loc[df.SmartScreen == 'Promt', 'SmartScreen'] = 'Prompt'
df.loc[df.SmartScreen == 'Promprt', 'SmartScreen'] = 'Prompt'
df.loc[df.SmartScreen == 'prompt', 'SmartScreen'] = 'Prompt'
df.loc[df.SmartScreen == 'warn', 'SmartScreen'] = 'Warn'
df.loc[df.SmartScreen == 'Deny', 'SmartScreen'] = 'Block'
df.loc[df.SmartScreen == '', 'SmartScreen'] = 'Off'
```
Here we can pick up a handy way of targeting particular values within a feature: boolean indexing with `.loc`, which selects the rows that satisfy a condition and overwrites just those values (a more compact alternative is sketched below).
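The same normalization can also be written with a mapping dict and `Series.replace`; this is only an alternative sketch, not the author's code (SmartScreen was read as a categorical dtype, so it is cast to object first):

```python
# Map every observed spelling onto its canonical form in one pass.
smartscreen_map = {
    'off': 'Off', 'of': 'Off', 'OFF': 'Off', '00000000': 'Off', '0': 'Off', '': 'Off',
    'ON': 'On', 'on': 'On', 'Enabled': 'On',
    'BLOCK': 'Block', 'Deny': 'Block',
    'requireadmin': 'RequireAdmin', 'requireAdmin': 'RequireAdmin', 'RequiredAdmin': 'RequireAdmin',
    'Promt': 'Prompt', 'Promprt': 'Prompt', 'prompt': 'Prompt',
    'warn': 'Warn',
}
df['SmartScreen'] = df['SmartScreen'].astype('object').replace(smartscreen_map)
# Optionally cast back to categorical: df['SmartScreen'] = df['SmartScreen'].astype('category')
```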
Count how often each value of every feature occurs, then build a new DataFrame
```python
# Replace each value with how often it occurs in its column (count/frequency encoding);
# the 'index' helper column, the label and one capacity column are left out of the loop.
for col in [f for f in df.columns if f not in ['index', 'HasDetections', 'Census_SystemVolumeTotalCapacity']]:
    df[col] = df[col].map(df[col].value_counts())

dfDummy = pd.get_dummies(df, dummy_na=True)
print('Dummy: ' + str(dfDummy.shape))

del df
gc.collect()
```
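The later code refers to `train` and `test`, but this excerpt never shows how `dfDummy` is split back into them. Presumably the kernel does something along the following lines — a sketch based on the append order above (training rows come first), not the author's exact code:

```python
# Assumption: the first labels.shape[0] rows of dfDummy came from train.csv,
# the remainder from test.csv (trainDf.append(testDf) preserved that order).
train = dfDummy.iloc[:labels.shape[0]].copy()
test = dfDummy.iloc[labels.shape[0]:].copy()

del dfDummy
gc.collect()
```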
```python
print('== Dataset Shapes ==')
print('Train: ' + str(train.shape))
print('Test: ' + str(test.shape))

print('== Dataset Columns ==')
features = [f for f in train.columns if f not in ['index']]
for feature in features:
    print(feature)
```
`df[col].map(df[col].value_counts())` uses `.map()` to replace each value with the number of times that value occurs in the column (if the function itself is unfamiliar, look it up; here I only explain what the line means). It is a clever one-liner: with a single statement every feature column is converted from raw values into occurrence counts, i.e. count/frequency encoding (a proper relative frequency would be the count divided by the total number of rows, not by 100). Why do this? Count encoding turns high-cardinality categorical columns into plain numeric features that a histogram-based tree model such as LightGBM can split on directly, and the counts themselves often carry useful signal.
`features`: when we concatenated train and test above we called `.reset_index()`, which adds a new column named 'index' holding the original row labels, so here we have to exclude it with `not in ['index']`.
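For reference, here is a tiny illustration (made-up frames, not the competition data) of how `reset_index()` produces that extra 'index' column after concatenation:

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'x': [3]})

combined = pd.concat([a, b]).reset_index()
print(combined.columns.tolist())   # ['index', 'x']
print(combined['index'].tolist())  # [0, 1, 0] -- the old per-frame row labels
```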
`df[col] = df[col].map(df[col].value_counts())` is the trickiest line here, so here is a small example.
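A minimal illustration with made-up values (my own example, not the author's):

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c', 'a', 'b'])

counts = s.value_counts()  # a -> 3, b -> 2, c -> 1
encoded = s.map(counts)    # look up each value's occurrence count

print(encoded.tolist())    # [3, 2, 3, 1, 3, 2]
```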
Model Building
```python
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
oofPreds = np.zeros(train.shape[0])
subPreds = np.zeros(test.shape[0])
featureImportanceDf = pd.DataFrame()

for n_fold, (trainXId, validXId) in enumerate(folds.split(train[features], labels)):
    trainX, trainY = train[features].iloc[trainXId], labels.iloc[trainXId]
    validX, validY = train[features].iloc[validXId], labels.iloc[validXId]
    print('== Fold: ' + str(n_fold))

    lgbm = LGBMClassifier(
        objective='binary', boosting_type='gbdt', n_estimators=2500,
        learning_rate=0.05, num_leaves=250, min_data_in_leaf=125,
        bagging_fraction=0.901, max_depth=13, reg_alpha=2.5, reg_lambda=2.5,
        min_split_gain=0.0001, min_child_weight=25, feature_fraction=0.5,
        silent=-1, verbose=-1, n_jobs=-1)

    lgbm.fit(trainX, trainY,
             eval_set=[(trainX, trainY), (validX, validY)],
             eval_metric='auc', verbose=250, early_stopping_rounds=100)

    # Out-of-fold predictions on this fold's validation split.
    oofPreds[validXId] = lgbm.predict_proba(validX, num_iteration=lgbm.best_iteration_)[:, 1]
    print('Fold %2d AUC : %.6f' % (n_fold + 1, roc_auc_score(validY, oofPreds[validXId])))

    print('Cleanup')
    del trainX, trainY, validX, validY
    gc.collect()

    # Average the test-set predictions across the folds.
    subPreds += lgbm.predict_proba(test[features], num_iteration=lgbm.best_iteration_)[:, 1] / folds.n_splits

    # Record this fold's feature importances for the final plot.
    fold_importance_df = pd.DataFrame()
    fold_importance_df["feature"] = features
    fold_importance_df["importance"] = lgbm.feature_importances_
    fold_importance_df["fold"] = n_fold + 1
    featureImportanceDf = pd.concat([featureImportanceDf, fold_importance_df], axis=0)

    print('Cleanup. Post-Fold')
    del lgbm
    gc.collect()

print('Full AUC score %.6f' % roc_auc_score(labels, oofPreds))
```
1. `oofPreds = np.zeros(train.shape[0])` creates an all-zero array with one entry per row of `train`; `subPreds = np.zeros(test.shape[0])` does the same for `test`.
2. Both `oofPreds` and `subPreds` are of type `numpy.ndarray`, because `roc_auc_score()` expects array-like arguments.
3. After training we can compute the AUC to assess classification performance: `oofPreds[validXId] = lgbm.predict_proba(validX, num_iteration=lgbm.best_iteration_)[:, 1]` stores the predicted probability that each validation sample is class 1 (positive), and `roc_auc_score(validY, oofPreds[validXId])` computes the fold's AUC from the validation labels and those probabilities.
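A tiny sanity check of how `roc_auc_score` pairs labels with predicted probabilities (made-up numbers, purely illustrative):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])  # predicted probability of class 1

print(roc_auc_score(y_true, y_prob))  # 0.75: 3 of the 4 positive/negative pairs are ranked correctly
```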
Saving the submission file and visualization (the plotting function was defined at the top of the code)
```python
displayImportances(featureImportanceDf, submissionFileName)

kaggleSubmission = pd.read_csv(dataFolder + 'sample_submission.csv')
kaggleSubmission['HasDetections'] = subPreds
kaggleSubmission.to_csv(submissionFileName + '.csv', index=False)
```
Summary
This kernel spends most of its effort on data preprocessing — a hand-picked feature subset, normalization of the SmartScreen values, and count encoding of the remaining columns — and then trains a 5-fold LightGBM classifier, using the out-of-fold AUC to judge the model before writing the averaged test predictions to the submission file.