KDD_cup99 pytorch
Note: I only applied simple normalization to this dataset; it is just for running a quick learning experiment. If you want to use it for paper experiments or similar, do not use this preprocessing, because the results are not good.
If you want to see proper data preprocessing, refer to the code on Kaggle: Intrusion Detection System | Kaggle
KDD Cup 99 dataset: since the full dataset is too large, we use only the official 10% subset as the training set, i.e. kddcup.data_10_percent.gz.
Complete code and datasets for both binary and multi-class classification: download link. If the link expires, please leave a comment below and I will update it promptly: https://www.lanzouw.com/iLBaOmyq3gb
Official source data download: KDD Cup 1999 Data
Personal download mirror: dataset download link
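As a side note, pandas can read the official .gz file directly without unpacking it, since it infers gzip compression. A minimal self-contained sketch (using an in-memory gzip buffer as a stand-in for kddcup.data_10_percent.gz, with made-up rows):

```python
import gzip
import io

import pandas as pd

# Two fake kddcup-style rows compressed in memory, standing in for the real file
raw = b"0,tcp,http,SF,181,5450,normal.\n0,udp,domain,SF,105,146,normal.\n"
buf = io.BytesIO(gzip.compress(raw))

# For a buffer the compression must be named; for a '.gz' path it is inferred
df = pd.read_csv(buf, header=None, compression='gzip')
print(df.shape)  # -> (2, 7)
```

With the real file, `pd.read_csv('kddcup.data_10_percent.gz', header=None)` should yield 494021 rows of 42 columns (41 features plus the label).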
訓練集:23種標簽,包含normal正常和22種攻擊類型標簽。包含494021條數據
測試集:38種標簽,包含normal正常和37中攻擊標簽。包含311029條數據
Important notes:
The attacks 'spy.' and 'warezclient.' appear only in the training set, while 17 attacks appear only in the test set, namely {'apache2.', 'httptunnel.', 'mailbomb.', 'mscan.', 'named.', 'processtable.', 'ps.', 'saint.', 'sendmail.', 'snmpgetattack.', 'snmpguess.', 'sqlattack.', 'udpstorm.', 'worm.', 'xlock.', 'xsnoop.', 'xterm.'}
In short: two attacks are unique to the training set and 17 are unique to the test set, so the label encoding must match the task. For binary classification, the training and test sets can share one encoding: normal is one class and everything else is the other. For multi-class classification, the two sets must have identical label sets, so you must remove the records whose labels are unique to either set, i.e. the 2 + 17 classes. Removing only the 17 test-only types would also work, but it hardly seems necessary. Below we process the data for both the binary and multi-class cases.
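For the multi-class case, one way to align the two label sets is to keep only the labels present in both frames before encoding. A minimal sketch on toy frames (the column is named 'label' here for readability; in the real data the label sits in unnamed column 41):

```python
import pandas as pd

# Toy stand-ins for the real train/test sets
train = pd.DataFrame({'label': ['normal.', 'smurf.', 'spy.']})       # 'spy.' is train-only
test = pd.DataFrame({'label': ['normal.', 'smurf.', 'apache2.']})    # 'apache2.' is test-only

# Keep only labels that appear in BOTH sets, so one encoding fits both
shared = set(train['label']) & set(test['label'])
train = train[train['label'].isin(shared)]
test = test[test['label'].isin(shared)]
print(sorted(shared))  # -> ['normal.', 'smurf.']
```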
41特征和class標簽:['duration', 'protocol_type', 'service', 'flag', 'src_bytes', 'dst_bytes', 'land', 'wrong_fragment', 'urgent', 'ho', 'num_failed_logins', 'logged_in', 'num_compromised', 'root_shell', 'su_attempted', 'num_root', 'num_file_creations', 'num_shells', 'num_access_files', 'num_outbound_cmds', 'is_host_login', 'is_guest_login', 'count', 'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count', 'dst_host_same_srv_rate', 'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate', 'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate', 'dst_host_srv_rerror_rate', 'class']
樣本:0 'tcp' 'http' 'SF' 181 5450 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 8 8 0.0 0.0 0.0 0.0 1.0 0.0 0.0 9 9 1.0 0.0 0.11 0.0 0.0 0.0 0.0 0.0 'normal.'
0 duration: continuous.
1 protocol_type: symbolic.
2 service: symbolic.
3 flag: symbolic.
4 src_bytes: continuous.
5 dst_bytes: continuous.
6 land: symbolic.
7 wrong_fragment: continuous.
8 urgent: continuous.
9 hot: continuous.
10 num_failed_logins: continuous.
11 logged_in: symbolic.
12 num_compromised: continuous.
13 root_shell: continuous.
14 su_attempted: continuous.
15 num_root: continuous.
16 num_file_creations: continuous.
17 num_shells: continuous.
18 num_access_files: continuous.
19 num_outbound_cmds: continuous.
20 is_host_login: symbolic.
21 is_guest_login: symbolic.
22 count: continuous.
23 srv_count: continuous.
24 serror_rate: continuous.
25 srv_serror_rate: continuous.
26 rerror_rate: continuous.
27 srv_rerror_rate: continuous.
28 same_srv_rate: continuous.
29 diff_srv_rate: continuous.
30 srv_diff_host_rate: continuous.
31 dst_host_count: continuous.
32 dst_host_srv_count: continuous.
33 dst_host_same_srv_rate: continuous.
34 dst_host_diff_srv_rate: continuous.
35 dst_host_same_src_port_rate: continuous.
36 dst_host_srv_diff_host_rate: continuous.
37 dst_host_serror_rate: continuous.
38 dst_host_srv_serror_rate: continuous.
39 dst_host_rerror_rate: continuous.
40 dst_host_srv_rerror_rate: continuous.
Simple data processing: every value in column 19 is 0, so that feature is useless and we delete it. Non-numeric features are label-encoded, and then all features of the combined dataset are normalized. (Again: this is only a demo; such preprocessing is certainly not reasonable in practice.)
Experiment code:
Below is the binary-classification code. If you want to run the multi-class experiment, set num_outputs to 23 and re-encode the labels.
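The "all zeros" claim for column 19 (num_outbound_cmds) is easy to verify before dropping the column. A minimal sketch on a tiny stand-in frame:

```python
import pandas as pd

# Stand-in for the real data: column 19 is constant zero, column 20 varies
df = pd.DataFrame({19: [0, 0, 0], 20: [1, 0, 1]})

assert (df[19] == 0).all()     # constant column carries no information
df = df.drop(columns=[19])     # so it can be dropped safely
print(list(df.columns))        # -> [20]
```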
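For the multi-class re-encoding, sklearn's LabelEncoder maps string labels to integer class ids. A minimal sketch on a toy label series (in practice, fit one encoder on the combined train+test labels so both sets share the same mapping):

```python
import pandas as pd
from sklearn import preprocessing

labels = pd.Series(['normal.', 'smurf.', 'neptune.', 'normal.'])
le = preprocessing.LabelEncoder()
codes = le.fit_transform(labels)       # integer ids 0..K-1, classes sorted alphabetically
print(list(codes))                     # -> [1, 2, 0, 1]
print(list(le.classes_))               # -> ['neptune.', 'normal.', 'smurf.']
```

The model's num_outputs then has to equal `len(le.classes_)`.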
import pandas as pd
import torch
import torch.nn as nn
import numpy as np
import torch.utils.data as Data
from sklearn import preprocessing
import matplotlib.pyplot as plt

epochs = 20
batch_size = 64
lr = 0.001

# I converted the official files to csv format beforehand
train_data = pd.read_csv('./data/KDD_cup99/train_10_percent.csv', header=None)
test_data = pd.read_csv('./data/KDD_cup99/test.csv', header=None)

Simple preprocessing: numericalize the features and labels, normalize the features
# Drop the 17 extra label types that appear only in the test set
test_data = test_data[test_data[41].isin(set(train_data[41]))]
data = pd.concat((train_data, test_data), ignore_index=True)

# Encode the symbolic features
le = preprocessing.LabelEncoder()
data[1] = le.fit_transform(data[1])
data[2] = le.fit_transform(data[2])
data[3] = le.fit_transform(data[3])

# Set 'normal.' labels to 1, all other (attack) labels to 0
data.loc[data[41] != 'normal.', 41] = 0
data.loc[data[41] == 'normal.', 41] = 1
data[41] = data[41].astype('int64')

# Column 19 is all zeros and useless, so delete it
del data[19]
data.columns = list(range(41))

# Min-max normalize the features
for i in range(40):
    Max, Min = max(data.loc[:, i]), min(data.loc[:, i])
    data.loc[:, i] = ((data.loc[:, i] - Min) / (Max - Min)).astype('float32')

Build the PyTorch dataset and define the model
# Build tensors: the first 494021 rows (indices 0..494020) are the training set
train_data, train_label = torch.Tensor(data.loc[:494020, :39].values), torch.Tensor(data.loc[:494020, 40].values).long()
test_data, test_label = torch.Tensor(data.loc[494021:, :39].values), torch.Tensor(data.loc[494021:, 40].values).long()

train_dataset = Data.TensorDataset(train_data, train_label)
test_dataset = Data.TensorDataset(test_data, test_label)

# Wrap in DataLoaders so we can iterate in batches
train_loader = Data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = Data.DataLoader(test_dataset, batch_size=128)

# Train on the GPU if one is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Define the model
num_inputs, num_hiddens, num_outputs = 40, 128, 2
net = nn.Sequential(
    nn.Linear(num_inputs, num_hiddens),
    nn.ReLU(),
    nn.Linear(num_hiddens, 2 * num_hiddens),
    nn.ReLU(),
    nn.Linear(2 * num_hiddens, num_outputs)
).to(device)

Train the model
# Loss function and optimizer
loss = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(net.parameters(), lr=lr)

# Training loop for one epoch
def train():
    net.train()
    batch_loss, correct, total = 0.0, 0.0, 0.0
    for data, label in train_loader:
        data, label = data.to(device), label.to(device)
        net.zero_grad()
        output = net(data)
        l = loss(output, label)
        l.backward()
        optimizer.step()
        predict_label = torch.argmax(output, dim=1)
        correct += torch.sum(predict_label == label).cpu().item()
        total += len(label)
        batch_loss += l.cpu().item()
    return correct / total, batch_loss / len(train_loader)

# Plotting helper
def pltfigure(x, y, title, id, data):
    plt.subplot(2, 2, id)
    plt.plot(range(len(data)), data)
    plt.xlabel(x)
    plt.ylabel(y)
    plt.title(title)
    plt.show()

# Evaluation on the test set
def test():
    net.eval()
    batch_loss, correct, total = 0.0, 0.0, 0.0
    for data, label in test_loader:
        data, label = data.to(device), label.to(device)
        output = net(data)
        batch_loss += loss(output, label).cpu().item()
        predict_label = torch.argmax(output, dim=1)
        correct += torch.sum(predict_label == label).cpu().item()
        total += len(label)
    return correct / total, batch_loss / len(test_loader)

# Main program
def main():
    print('training on: ', device)
    print('batch_size:', batch_size)
    print('epochs:', epochs)
    print('learning_rate:', lr)
    plt.figure()
    train_acc_list, train_loss_list, test_acc_list, test_loss_list = [], [], [], []
    for epoch in range(epochs):
        train_acc, train_loss = train()
        test_acc, test_loss = test()
        print('epoch %d: train acc: %.2f%% train loss:%.4f, test acc: %.2f%%, test loss:%.4f'
              % (epoch, 100 * train_acc, train_loss, 100 * test_acc, test_loss))
        train_acc_list.append(train_acc)
        train_loss_list.append(train_loss)
        test_acc_list.append(test_acc)
        test_loss_list.append(test_loss)
    # # Plot the curves
    # pltfigure(x='epoch', y='acc', title='epoch-train_acc', id=1, data=train_acc_list)
    # pltfigure(x='epoch', y='loss', title='epoch-train_loss', id=2, data=train_loss_list)
    # pltfigure(x='epoch', y='acc', title='epoch-test_acc', id=3, data=test_acc_list)
    # pltfigure(x='epoch', y='loss', title='epoch-test_loss', id=4, data=test_loss_list)

main()

Experimental results
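Because attacks heavily outnumber normal traffic in KDD Cup 99, plain accuracy can look good even for a poor model; per-class precision and recall give a fuller picture. A sketch using sklearn on toy predictions (made-up values, not the model's actual output):

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical predictions vs. ground truth (1 = normal, 0 = attack)
y_true = [1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1, 1]

cm = confusion_matrix(y_true, y_pred)  # rows = true class, columns = predicted class
print(cm)
print(classification_report(y_true, y_pred, target_names=['attack', 'normal']))
```

In the real experiment you would collect `predict_label` and `label` across all test batches and feed them in place of the toy lists.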
Summary