Code source: the "7th-place-solution-microsoft-malware-prediction" kernel from the Kaggle Microsoft Malware Prediction competition.
Preface
Reading other people's excellent code is a good way to improve your own: you pick up programming knowledge, borrow good habits, and learn techniques you would not have come up with yourself. This post is my personal summary of, and commentary on, the 7th-place solution to Microsoft's 2019 malware prediction competition. Some of the code already carries comments, and I add further notes of my own along the way. My abilities are limited and mistakes are inevitable, so corrections from fellow enthusiasts are very welcome.
Main Content
Overview
As is well known, building a machine-learning classifier consists of two main parts: 1. data preprocessing (data cleaning, feature engineering, and so on) and 2. model building (training and hyperparameter tuning). Preprocessing is the groundwork: the quality of the training data largely determines the quality of the final model, which is why most of the code in a typical machine-learning project — including this one — is data-handling code. Personally, I don't think the data processing in this kernel is outstanding, but it is serviceable (for a more interesting preprocessing example, see another post of mine). The learning algorithm used here is LightGBM.
Code Walkthrough
Note: I will explain the code in separate chunks. Due to hardware limitations I cannot show the output of every step, so readers who can are encouraged to copy the code and run it themselves; the data files used in the code can be downloaded from the competition page. Although the walkthrough is split into steps, concatenating the snippets from top to bottom gives the complete program.
Data Preprocessing
```python
import numpy as np
import pandas as pd
import gc
import time
import random
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import StratifiedKFold
import matplotlib.pyplot as plot
import seaborn as sb
```
Preparation before the main logic
```python
dataFolder = '../input/'
submissionFileName = 'submission'
trainFile = 'train.csv'
testFile = 'test.csv'
numberOfRows = 4000000
seed = 6001

np.random.seed(seed)
random.seed(seed)


def displayImportances(featureImportanceDf, submissionFileName):
    # Average each feature's importance over the folds and plot them, best first.
    cols = featureImportanceDf[["feature", "importance"]].groupby("feature").mean() \
        .sort_values(by="importance", ascending=False).index
    bestFeatures = featureImportanceDf.loc[featureImportanceDf.feature.isin(cols)]
    plot.figure(figsize=(14, 14))
    sb.barplot(x="importance", y="feature",
               data=bestFeatures.sort_values(by="importance", ascending=False))
    plot.title('LightGBM Features')
    plot.tight_layout()
    plot.savefig(submissionFileName + '.png')
```
In this block, I don't think the paths really need to be spread across several variables (perhaps it is just the author's habit). The purpose of `numberOfRows = 4000000` only becomes clear once you have read the whole program: the author concatenates the official train and test files and then takes the first 4,000,000 samples as the training data (which is later split into training and validation folds). `seed = 6001` and the two lines below it fix the random seeds; I did wonder at first why `random.seed(seed)` is needed in addition to `np.random.seed(seed)`. As for the custom function, it is used at the very end to plot and save the feature importances.
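As it turns out, NumPy and Python's built-in `random` module keep completely independent generator state, so seeding one has no effect on the other; both calls are needed for full reproducibility. A quick illustration (my own snippet, not part of the kernel):

```python
import random
import numpy as np

np.random.seed(6001)   # seeds NumPy's global generator only
random.seed(6001)      # seeds the stdlib generator only

print(np.random.rand())  # reproducible across runs thanks to np.random.seed
print(random.random())   # reproducible across runs thanks to random.seed
# Seeding only one of the two would leave the other generator unseeded.
```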
Setting dtypes for the features in the official files. The raw data only contains the feature values; the organizers do not declare what type each column is, so we have to specify the types ourselves.
```python
dtypes = {
    'MachineIdentifier': 'category', 'ProductName': 'category', 'EngineVersion': 'category',
    'AppVersion': 'category', 'AvSigVersion': 'category', 'IsBeta': 'int8',
    'RtpStateBitfield': 'float16', 'IsSxsPassiveMode': 'int8', 'DefaultBrowsersIdentifier': 'float16',
    'AVProductStatesIdentifier': 'float32', 'AVProductsInstalled': 'float16', 'AVProductsEnabled': 'float16',
    'HasTpm': 'int8', 'CountryIdentifier': 'int16', 'CityIdentifier': 'float32',
    'OrganizationIdentifier': 'float16', 'GeoNameIdentifier': 'float16', 'LocaleEnglishNameIdentifier': 'int8',
    'Platform': 'category', 'Processor': 'category', 'OsVer': 'category',
    'OsBuild': 'int16', 'OsSuite': 'int16', 'OsPlatformSubRelease': 'category',
    'OsBuildLab': 'category', 'SkuEdition': 'category', 'IsProtected': 'float16',
    'AutoSampleOptIn': 'int8', 'PuaMode': 'category', 'SMode': 'float16',
    'IeVerIdentifier': 'float16', 'SmartScreen': 'category', 'Firewall': 'float16',
    'UacLuaenable': 'float32', 'Census_MDC2FormFactor': 'category', 'Census_DeviceFamily': 'category',
    'Census_OEMNameIdentifier': 'float16', 'Census_OEMModelIdentifier': 'float32',
    'Census_ProcessorCoreCount': 'float16', 'Census_ProcessorManufacturerIdentifier': 'float16',
    'Census_ProcessorModelIdentifier': 'float16', 'Census_ProcessorClass': 'category',
    'Census_PrimaryDiskTotalCapacity': 'float32', 'Census_PrimaryDiskTypeName': 'category',
    'Census_SystemVolumeTotalCapacity': 'float32', 'Census_HasOpticalDiskDrive': 'int8',
    'Census_TotalPhysicalRAM': 'float32', 'Census_ChassisTypeName': 'category',
    'Census_InternalPrimaryDiagonalDisplaySizeInInches': 'float16',
    'Census_InternalPrimaryDisplayResolutionHorizontal': 'float16',
    'Census_InternalPrimaryDisplayResolutionVertical': 'float16',
    'Census_PowerPlatformRoleName': 'category', 'Census_InternalBatteryType': 'category',
    'Census_InternalBatteryNumberOfCharges': 'float32', 'Census_OSVersion': 'category',
    'Census_OSArchitecture': 'category', 'Census_OSBranch': 'category',
    'Census_OSBuildNumber': 'int16', 'Census_OSBuildRevision': 'int32',
    'Census_OSEdition': 'category', 'Census_OSSkuName': 'category',
    'Census_OSInstallTypeName': 'category', 'Census_OSInstallLanguageIdentifier': 'float16',
    'Census_OSUILocaleIdentifier': 'int16', 'Census_OSWUAutoUpdateOptionsName': 'category',
    'Census_IsPortableOperatingSystem': 'int8', 'Census_GenuineStateName': 'category',
    'Census_ActivationChannel': 'category', 'Census_IsFlightingInternal': 'float16',
    'Census_IsFlightsDisabled': 'float16', 'Census_FlightRing': 'category',
    'Census_ThresholdOptIn': 'float16', 'Census_FirmwareManufacturerIdentifier': 'float16',
    'Census_FirmwareVersionIdentifier': 'float32', 'Census_IsSecureBootEnabled': 'int8',
    'Census_IsWIMBootEnabled': 'float16', 'Census_IsVirtualDevice': 'float16',
    'Census_IsTouchEnabled': 'int8', 'Census_IsPenCapable': 'int8',
    'Census_IsAlwaysOnAlwaysConnectedCapable': 'float16', 'Wdft_IsGamer': 'float16',
    'Wdft_RegionIdentifier': 'float16', 'HasDetections': 'int8'
}

selectedFeatures = [
    'AVProductStatesIdentifier', 'AVProductsEnabled', 'IsProtected', 'Processor', 'OsSuite',
    'IsProtected',  # listed twice in the original kernel
    'RtpStateBitfield', 'AVProductsInstalled', 'Wdft_IsGamer', 'DefaultBrowsersIdentifier',
    'OsBuild', 'Wdft_RegionIdentifier', 'SmartScreen', 'CityIdentifier', 'AppVersion',
    'Census_IsSecureBootEnabled', 'Census_PrimaryDiskTypeName', 'Census_SystemVolumeTotalCapacity',
    'Census_HasOpticalDiskDrive', 'Census_IsWIMBootEnabled', 'Census_IsVirtualDevice',
    'Census_IsTouchEnabled', 'Census_FirmwareVersionIdentifier', 'GeoNameIdentifier',
    'IeVerIdentifier', 'Census_FirmwareManufacturerIdentifier',
    'Census_InternalPrimaryDisplayResolutionHorizontal', 'Census_InternalPrimaryDisplayResolutionVertical',
    'Census_OEMModelIdentifier', 'Census_ProcessorModelIdentifier', 'Census_OSVersion',
    'Census_InternalPrimaryDiagonalDisplaySizeInInches', 'Census_OEMNameIdentifier',
    'Census_ChassisTypeName', 'Census_OSInstallLanguageIdentifier', 'EngineVersion',
    'OrganizationIdentifier', 'CountryIdentifier', 'Census_ActivationChannel',
    'Census_ProcessorCoreCount', 'Census_OSWUAutoUpdateOptionsName', 'Census_InternalBatteryType'
]
```
代碼作者因?yàn)榫邆浞浅7浅I詈竦臄?shù)據(jù)處理技術(shù)功底,他可能是根據(jù)以前對(duì)惡意代碼數(shù)據(jù)處理的經(jīng)驗(yàn)直接選擇了這些特征來給機(jī)器學(xué)習(xí)模型進(jìn)行訓(xùn)練。所以說,特征是不能亂選的,如果沒有代碼作者那樣的技術(shù),還是借鑒別人的數(shù)據(jù)預(yù)處理方法進(jìn)行特征篩選吧。
```python
trainDf = pd.read_csv(dataFolder + trainFile, dtype=dtypes, usecols=selectedFeatures,
                      low_memory=True, nrows=numberOfRows)
labels = pd.read_csv(dataFolder + trainFile, usecols=['HasDetections'], nrows=numberOfRows)
testDf = pd.read_csv(dataFolder + testFile, dtype=dtypes, usecols=selectedFeatures,
                     low_memory=True)

print('== Dataset Shapes ==')
print('Train : ' + str(trainDf.shape))
print('Labels : ' + str(labels.shape))
print('Test : ' + str(testDf.shape))

# Stack train and test so that the value counts and dummies below are computed consistently.
df = trainDf.append(testDf).reset_index()
del trainDf, testDf
gc.collect()
```
`df` is the new DataFrame obtained by concatenating train and test.
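As a side note, `DataFrame.append` was deprecated in pandas 1.4 and removed in 2.0; on current pandas versions the same concatenation would be written with `pd.concat` — a drop-in sketch, not part of the original kernel:

```python
# Equivalent concatenation on modern pandas (DataFrame.append no longer exists in 2.x):
df = pd.concat([trainDf, testDf]).reset_index()
```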
Normalizing the values of the 'SmartScreen' feature
```python
# Collapse the many inconsistent spellings of SmartScreen into canonical values.
df.loc[df.SmartScreen == 'off', 'SmartScreen'] = 'Off'
df.loc[df.SmartScreen == 'of', 'SmartScreen'] = 'Off'
df.loc[df.SmartScreen == 'OFF', 'SmartScreen'] = 'Off'
df.loc[df.SmartScreen == '00000000', 'SmartScreen'] = 'Off'
df.loc[df.SmartScreen == '0', 'SmartScreen'] = 'Off'
df.loc[df.SmartScreen == 'ON', 'SmartScreen'] = 'On'
df.loc[df.SmartScreen == 'on', 'SmartScreen'] = 'On'
df.loc[df.SmartScreen == 'Enabled', 'SmartScreen'] = 'On'
df.loc[df.SmartScreen == 'BLOCK', 'SmartScreen'] = 'Block'
df.loc[df.SmartScreen == 'requireadmin', 'SmartScreen'] = 'RequireAdmin'
df.loc[df.SmartScreen == 'requireAdmin', 'SmartScreen'] = 'RequireAdmin'
df.loc[df.SmartScreen == 'RequiredAdmin', 'SmartScreen'] = 'RequireAdmin'
df.loc[df.SmartScreen == 'Promt', 'SmartScreen'] = 'Prompt'
df.loc[df.SmartScreen == 'Promprt', 'SmartScreen'] = 'Prompt'
df.loc[df.SmartScreen == 'prompt', 'SmartScreen'] = 'Prompt'
df.loc[df.SmartScreen == 'warn', 'SmartScreen'] = 'Warn'
df.loc[df.SmartScreen == 'Deny', 'SmartScreen'] = 'Block'
df.loc[df.SmartScreen == '', 'SmartScreen'] = 'Off'
```
Here we can pick up a handy way of targeting particular values within a feature: boolean indexing with `.loc`, which selects the rows that satisfy a condition and overwrites just those values (a more compact alternative is sketched below).
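The same normalization can also be written with a mapping dict and `Series.replace`; this is only an alternative sketch, not the author's code (SmartScreen was read as a categorical dtype, so it is cast to object first):

```python
# Map every observed spelling onto its canonical form in one pass.
smartscreen_map = {
    'off': 'Off', 'of': 'Off', 'OFF': 'Off', '00000000': 'Off', '0': 'Off', '': 'Off',
    'ON': 'On', 'on': 'On', 'Enabled': 'On',
    'BLOCK': 'Block', 'Deny': 'Block',
    'requireadmin': 'RequireAdmin', 'requireAdmin': 'RequireAdmin', 'RequiredAdmin': 'RequireAdmin',
    'Promt': 'Prompt', 'Promprt': 'Prompt', 'prompt': 'Prompt',
    'warn': 'Warn',
}
df['SmartScreen'] = df['SmartScreen'].astype('object').replace(smartscreen_map)
# Optionally cast back to categorical: df['SmartScreen'] = df['SmartScreen'].astype('category')
```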
Count how often each value of every feature occurs, then build a new DataFrame
```python
# Replace each value with how often it occurs in its column (count/frequency encoding);
# the 'index' helper column, the label and one capacity column are left out of the loop.
for col in [f for f in df.columns if f not in ['index', 'HasDetections', 'Census_SystemVolumeTotalCapacity']]:
    df[col] = df[col].map(df[col].value_counts())

dfDummy = pd.get_dummies(df, dummy_na=True)
print('Dummy: ' + str(dfDummy.shape))

del df
gc.collect()
```
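The later code refers to `train` and `test`, but this excerpt never shows how `dfDummy` is split back into them. Presumably the kernel does something along the following lines — a sketch based on the append order above (training rows come first), not the author's exact code:

```python
# Assumption: the first labels.shape[0] rows of dfDummy came from train.csv,
# the remainder from test.csv (trainDf.append(testDf) preserved that order).
train = dfDummy.iloc[:labels.shape[0]].copy()
test = dfDummy.iloc[labels.shape[0]:].copy()

del dfDummy
gc.collect()
```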
```python
print('== Dataset Shapes ==')
print('Train: ' + str(train.shape))
print('Test: ' + str(test.shape))

print('== Dataset Columns ==')
features = [f for f in train.columns if f not in ['index']]
for feature in features:
    print(feature)
```
`df[col].map(df[col].value_counts())` uses `.map()` to replace each value with the number of times that value occurs in the column (if the function itself is unfamiliar, look it up; here I only explain what the line means). It is a clever one-liner: with a single statement every feature column is converted from raw values into occurrence counts, i.e. count/frequency encoding (a proper relative frequency would be the count divided by the total number of rows, not by 100). Why do this? Count encoding turns high-cardinality categorical columns into plain numeric features that a histogram-based tree model such as LightGBM can split on directly, and the counts themselves often carry useful signal.
`features`: when we concatenated train and test above we called `.reset_index()`, which adds a new column named 'index' holding the original row labels, so here we have to exclude it with `not in ['index']`.
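For reference, here is a tiny illustration (made-up frames, not the competition data) of how `reset_index()` produces that extra 'index' column after concatenation:

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'x': [3]})

combined = pd.concat([a, b]).reset_index()
print(combined.columns.tolist())   # ['index', 'x']
print(combined['index'].tolist())  # [0, 1, 0] -- the old per-frame row labels
```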
`df[col] = df[col].map(df[col].value_counts())` is the trickiest line here, so here is a small example.
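A minimal illustration with made-up values (my own example, not the author's):

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c', 'a', 'b'])

counts = s.value_counts()  # a -> 3, b -> 2, c -> 1
encoded = s.map(counts)    # look up each value's occurrence count

print(encoded.tolist())    # [3, 2, 3, 1, 3, 2]
```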
Model Building
```python
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
oofPreds = np.zeros(train.shape[0])
subPreds = np.zeros(test.shape[0])
featureImportanceDf = pd.DataFrame()

for n_fold, (trainXId, validXId) in enumerate(folds.split(train[features], labels)):
    trainX, trainY = train[features].iloc[trainXId], labels.iloc[trainXId]
    validX, validY = train[features].iloc[validXId], labels.iloc[validXId]
    print('== Fold: ' + str(n_fold))

    lgbm = LGBMClassifier(
        objective='binary', boosting_type='gbdt', n_estimators=2500,
        learning_rate=0.05, num_leaves=250, min_data_in_leaf=125,
        bagging_fraction=0.901, max_depth=13, reg_alpha=2.5, reg_lambda=2.5,
        min_split_gain=0.0001, min_child_weight=25, feature_fraction=0.5,
        silent=-1, verbose=-1, n_jobs=-1)

    lgbm.fit(trainX, trainY,
             eval_set=[(trainX, trainY), (validX, validY)],
             eval_metric='auc', verbose=250, early_stopping_rounds=100)

    # Out-of-fold predictions on this fold's validation split.
    oofPreds[validXId] = lgbm.predict_proba(validX, num_iteration=lgbm.best_iteration_)[:, 1]
    print('Fold %2d AUC : %.6f' % (n_fold + 1, roc_auc_score(validY, oofPreds[validXId])))

    print('Cleanup')
    del trainX, trainY, validX, validY
    gc.collect()

    # Average the test-set predictions across the folds.
    subPreds += lgbm.predict_proba(test[features], num_iteration=lgbm.best_iteration_)[:, 1] / folds.n_splits

    # Record this fold's feature importances for the final plot.
    fold_importance_df = pd.DataFrame()
    fold_importance_df["feature"] = features
    fold_importance_df["importance"] = lgbm.feature_importances_
    fold_importance_df["fold"] = n_fold + 1
    featureImportanceDf = pd.concat([featureImportanceDf, fold_importance_df], axis=0)

    print('Cleanup. Post-Fold')
    del lgbm
    gc.collect()

print('Full AUC score %.6f' % roc_auc_score(labels, oofPreds))
```
1. `oofPreds = np.zeros(train.shape[0])` creates an all-zero array with one entry per row of `train`; `subPreds = np.zeros(test.shape[0])` does the same for `test`.
2. Both `oofPreds` and `subPreds` are of type `numpy.ndarray`, because `roc_auc_score()` expects array-like arguments.
3. After training we can compute the AUC to assess classification performance: `oofPreds[validXId] = lgbm.predict_proba(validX, num_iteration=lgbm.best_iteration_)[:, 1]` stores the predicted probability that each validation sample is class 1 (positive), and `roc_auc_score(validY, oofPreds[validXId])` computes the fold's AUC from the validation labels and those probabilities.
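A tiny sanity check of how `roc_auc_score` pairs labels with predicted probabilities (made-up numbers, purely illustrative):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])  # predicted probability of class 1

print(roc_auc_score(y_true, y_prob))  # 0.75: 3 of the 4 positive/negative pairs are ranked correctly
```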
Saving the submission file and visualization (the plotting function was defined at the top of the code)
```python
displayImportances(featureImportanceDf, submissionFileName)

kaggleSubmission = pd.read_csv(dataFolder + 'sample_submission.csv')
kaggleSubmission['HasDetections'] = subPreds
kaggleSubmission.to_csv(submissionFileName + '.csv', index=False)
```
Summary
This kernel spends most of its effort on data preprocessing — a hand-picked feature subset, normalization of the SmartScreen values, and count encoding of the remaining columns — and then trains a 5-fold LightGBM classifier, using the out-of-fold AUC to judge the model before writing the averaged test predictions to the submission file.