Kaggle Beginner Competition: Titanic Survival Prediction (XGBoost)
Original post: https://www.missshi.cn/api/view/blog/5a06a441e519f50d0400035e
In this article, we walk through in detail how to use XGBoost to predict which passengers survived the sinking of the Titanic.
All examples are implemented in Python.
Evaluation Criteria
Our goal is to predict which passengers on the Titanic survived.
The evaluation metric is prediction accuracy.
The file to upload must have the following format:
a CSV file containing 418 data rows plus one header row.
Each row has two columns: the first is the passenger ID, and the second indicates survival (1 if the passenger survived, 0 otherwise).
For example:
PassengerId,Survived
 892,0
 893,1
 894,0
 Etc.
Dataset
https://pan.baidu.com/s/1pxgXW4s075j7zLWQpeoc4w
The dataset provides the following input fields: PassengerId, Survived (the label; training set only), Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, and Embarked.
Importing Third-Party Libraries
First, let's look at the third-party libraries we need in order to solve this problem with XGBoost:
# Load in our libraries
import pandas as pd
import numpy as np
import re
import sklearn
import xgboost as xgb
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls

import warnings
warnings.filterwarnings('ignore')

# Going to use these 5 base models for the stacking
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.svm import SVC
from sklearn.model_selection import KFold  # sklearn.cross_validation in older sklearn versions

Here, numpy and pandas are the most commonly used third-party libraries for numerical computation and data analysis.
re is the standard-library module for regular expressions.
sklearn is a third-party library dedicated to machine learning.
matplotlib, seaborn, and plotly are third-party plotting libraries for Python.
xgboost is the Python package implementing the XGBoost algorithm.
Feature Analysis and Extraction
With traditional machine learning algorithms, we first need to analyze the data and understand its internal structure.
# Load in the train and test datasets
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')

# Store our passenger ID for easy access
PassengerId = test['PassengerId']

train.head(3)

After reading the two CSV files with pandas, the first three rows of the training set look like this:

[Table: first three rows of train.csv]
Next, we need to remove the features we cannot use directly.
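Note that the drop code below also removes CategoricalAge and CategoricalFare, helper columns that must have been created during feature-engineering steps omitted from this article. As a rough idea of what that preprocessing might look like, here is a minimal, hypothetical sketch (not the author's exact code):

# Hypothetical preprocessing sketch -- the article does not show these steps
for dataset in [train, test]:
    # Encode categorical columns as integers
    dataset['Sex'] = dataset['Sex'].map({'female': 0, 'male': 1}).astype(int)
    dataset['Embarked'] = dataset['Embarked'].fillna('S').map({'S': 0, 'C': 1, 'Q': 2}).astype(int)
    # Fill missing numeric values with the median
    dataset['Age'] = dataset['Age'].fillna(dataset['Age'].median())
    dataset['Fare'] = dataset['Fare'].fillna(dataset['Fare'].median())

# Binned helper columns for analysis; they are dropped again right below
train['CategoricalAge'] = pd.cut(train['Age'], 5)
train['CategoricalFare'] = pd.qcut(train['Fare'], 4)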
# 12. Drop features we cannot use directly
drop_elements = ['PassengerId', 'Name', 'Ticket', 'Cabin', 'SibSp']
train = train.drop(drop_elements, axis=1)
train = train.drop(['CategoricalAge', 'CategoricalFare'], axis=1)
test = test.drop(drop_elements, axis=1)

At this point we have finished transforming, processing, and filtering the features.
Next, let's build a few simple visualizations of the current data to support further analysis.
print(train.head(3))
Next, let's examine how the remaining features correlate with one another.
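The plotting code is omitted in the article; a minimal sketch with seaborn (assuming the imports above) could look like this:

colormap = plt.cm.RdBu
plt.figure(figsize=(14, 12))
plt.title('Pearson Correlation of Features', y=1.05, size=15)
# Heatmap of the pairwise Pearson correlations of the remaining columns
sns.heatmap(train.astype(float).corr(), linewidths=0.1, vmax=1.0,
            square=True, cmap=colormap, linecolor='white', annot=True)
plt.show()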
The resulting correlation plot:

[Figure: Pearson correlation heatmap of the features]
The Pearson correlation coefficient measures how closely two sets of data fall along a straight line; in other words, it quantifies the linear relationship between interval variables.
The closer the coefficient is to ±1, the stronger the linear relationship between two features; the closer it is to 0, the weaker the relationship.
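For reference, the sample Pearson coefficient of two features x and y is

$$r_{xy} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$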
Computing Other Parameters
Next, we compute a few parameters that the training procedure will need:
ntrain = train.shape[0]
ntest = test.shape[0]
SEED = 0  # for reproducibility
NFOLDS = 5  # set folds for out-of-fold prediction
# The original code used the deprecated sklearn.cross_validation API:
# KFold(ntrain, n_folds=NFOLDS, random_state=SEED)
kf = KFold(n_splits=NFOLDS, shuffle=True, random_state=SEED)

Wrapping the Classifiers
Next, we wrap the sklearn classifiers in a small helper class so they are easier to call later:
class SklearnHelper(object):
    def __init__(self, clf, seed=0, params=None):
        params['random_state'] = seed
        self.clf = clf(**params)

    def train(self, x_train, y_train):
        self.clf.fit(x_train, y_train)

    def predict(self, x):
        return self.clf.predict(x)

    def fit(self, x, y):
        return self.clf.fit(x, y)

    def feature_importances(self, x, y):
        # Return the importances so they can be stored for later analysis
        return self.clf.fit(x, y).feature_importances_

Next, we define a helper that produces out-of-fold (OOF) predictions for any wrapped model. Each model is trained on four of the five folds and then predicts the held-out fold, so every training example receives a prediction from a model that never saw it; the test-set predictions are averaged over the five folds:
def get_oof(clf, x_train, y_train, x_test):
    oof_train = np.zeros((ntrain,))
    oof_test = np.zeros((ntest,))
    oof_test_skf = np.empty((NFOLDS, ntest))

    for i, (train_index, test_index) in enumerate(kf.split(x_train)):
        x_tr = x_train[train_index]
        y_tr = y_train[train_index]
        x_te = x_train[test_index]

        clf.train(x_tr, y_tr)

        oof_train[test_index] = clf.predict(x_te)
        oof_test_skf[i, :] = clf.predict(x_test)

    oof_test[:] = oof_test_skf.mean(axis=0)
    return oof_train.reshape(-1, 1), oof_test.reshape(-1, 1)

Below, we run five different base models through this procedure.
Building the Models
First, we set each model's parameters:
## Model parameters
# 1. Random Forest
rf_params = {
    'n_jobs': -1,
    'n_estimators': 500,
    'warm_start': True,
    # 'max_features': 0.2,
    'max_depth': 6,
    'min_samples_leaf': 2,
    'max_features': 'sqrt',
    'verbose': 0
}

# 2. Extra Trees
et_params = {
    'n_jobs': -1,
    'n_estimators': 500,
    # 'max_features': 0.5,
    'max_depth': 8,
    'min_samples_leaf': 2,
    'verbose': 0
}

# 3. AdaBoost
ada_params = {
    'n_estimators': 500,
    'learning_rate': 0.75
}

# 4. Gradient Boosting
gb_params = {
    'n_estimators': 500,
    'max_features': 0.2,
    'max_depth': 5,
    'min_samples_leaf': 2,
    'verbose': 0
}

# 5. SVM
svc_params = {
    'kernel': 'linear',
    'C': 0.025
}

Next, we create the model objects from these parameter sets:
# Build the models
rf = SklearnHelper(clf=RandomForestClassifier, seed=SEED, params=rf_params)
et = SklearnHelper(clf=ExtraTreesClassifier, seed=SEED, params=et_params)
ada = SklearnHelper(clf=AdaBoostClassifier, seed=SEED, params=ada_params)
gb = SklearnHelper(clf=GradientBoostingClassifier, seed=SEED, params=gb_params)
svc = SklearnHelper(clf=SVC, seed=SEED, params=svc_params)

Next, we convert the data into the NumPy arrays the models expect:
# Prepare the NumPy arrays
y_train = train['Survived'].ravel()
train = train.drop(['Survived'], axis=1)
x_train = train.values  # Creates an array of the train data
x_test = test.values    # Creates an array of the test data

Now we run each of the five base models through the out-of-fold procedure; their predictions will later serve as the input features of the second-level XGBoost model:
et_oof_train, et_oof_test = get_oof(et, x_train, y_train, x_test)     # Extra Trees
rf_oof_train, rf_oof_test = get_oof(rf, x_train, y_train, x_test)     # Random Forest
ada_oof_train, ada_oof_test = get_oof(ada, x_train, y_train, x_test)  # AdaBoost
gb_oof_train, gb_oof_test = get_oof(gb, x_train, y_train, x_test)     # Gradient Boost
svc_oof_train, svc_oof_test = get_oof(svc, x_train, y_train, x_test)  # Support Vector Classifier

print("Training is complete")

Next, we extract the feature importances from each model (the linear SVC exposes none, so it is skipped):
rf_feature = rf.feature_importances(x_train, y_train)
et_feature = et.feature_importances(x_train, y_train)
ada_feature = ada.feature_importances(x_train, y_train)
gb_feature = gb.feature_importances(x_train, y_train)
We then collect the resulting importance values into one DataFrame:
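The construction of this DataFrame is not shown in the article; a plausible reconstruction, with column names chosen to match the plotting code below, is:

# Hypothetical reconstruction of the importance summary DataFrame
cols = train.columns.values
feature_dataframe = pd.DataFrame({
    'features': cols,
    'Random Forest feature importances': rf_feature,
    'Extra Trees feature importances': et_feature,
    'AdaBoost feature importances': ada_feature,
    'Gradient Boost feature importances': gb_feature,
})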
Plotting the importances makes the comparison much clearer:
# Scatter plots of the per-model feature importances, one plot per model
for col, title in [
        ('Random Forest feature importances', 'Random Forest Feature Importance'),
        ('Extra Trees feature importances', 'Extra Trees Feature Importance'),
        ('AdaBoost feature importances', 'AdaBoost Feature Importance'),
        ('Gradient Boost feature importances', 'Gradient Boosting Feature Importance')]:
    trace = go.Scatter(
        y=feature_dataframe[col].values,
        x=feature_dataframe['features'].values,
        mode='markers',
        marker=dict(
            sizemode='diameter',
            sizeref=1,
            size=25,
            color=feature_dataframe[col].values,
            colorscale='Portland',
            showscale=True
        ),
        text=feature_dataframe['features'].values
    )
    layout = go.Layout(
        autosize=True,
        title=title,
        hovermode='closest',
        yaxis=dict(title='Feature Importance', ticklen=5, gridwidth=2),
        showlegend=False
    )
    fig = go.Figure(data=[trace], layout=layout)
    py.iplot(fig, filename='scatter2010')

[Figures: feature-importance scatter plots for Random Forest, Extra Trees, AdaBoost, and Gradient Boosting]
Next, we add a new column to the DataFrame holding each feature's mean importance across the four models:
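A minimal sketch, assuming feature_dataframe was built as above:

# Row-wise mean over the four numeric importance columns
feature_dataframe['mean'] = feature_dataframe.mean(axis=1, numeric_only=True)
feature_dataframe.head(3)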
Let's view each feature's mean importance as a bar chart:
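The plotting code is again omitted; a sketch in the same plotly style as the scatter plots above:

y = feature_dataframe['mean'].values
x = feature_dataframe['features'].values
data = [go.Bar(x=x, y=y, width=0.5,
               marker=dict(color=y, colorscale='Portland', showscale=True),
               opacity=0.6)]
layout = go.Layout(autosize=True,
                   title='Barplot of Mean Feature Importance',
                   hovermode='closest',
                   yaxis=dict(title='Feature Importance', ticklen=5, gridwidth=2),
                   showlegend=False)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='bar-direct-labels')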
Now let's take a look at the first-level predictions themselves:
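The code is not shown in the article; presumably the out-of-fold training predictions of the four tree-based models are gathered into a DataFrame, roughly like this:

# Hypothetical sketch: first-level OOF predictions as a DataFrame
base_predictions_train = pd.DataFrame({
    'RandomForest': rf_oof_train.ravel(),
    'ExtraTrees': et_oof_train.ravel(),
    'AdaBoost': ada_oof_train.ravel(),
    'GradientBoost': gb_oof_train.ravel()
})
base_predictions_train.head()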
Finally, let's look at how correlated these four sets of first-level predictions are (the less correlated the base models, the more a second-level model can gain by combining them):
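A sketch of the heatmap, assuming base_predictions_train from the previous step:

# Correlation heatmap of the first-level predictions
data = [go.Heatmap(z=base_predictions_train.astype(float).corr().values,
                   x=base_predictions_train.columns.values,
                   y=base_predictions_train.columns.values,
                   colorscale='Viridis', showscale=True, reversescale=True)]
py.iplot(data, filename='labelled-heatmap')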
Finally, we feed the first-level predictions into an XGBoost model, predict on the test set, and generate the CSV file to upload:
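The article omits this final block; here is a sketch consistent with the variables defined above (the XGBoost hyperparameters are illustrative, not the author's exact choices):

# Second-level train/test sets: the first-level OOF predictions become features
x_train = np.concatenate((et_oof_train, rf_oof_train, ada_oof_train,
                          gb_oof_train, svc_oof_train), axis=1)
x_test = np.concatenate((et_oof_test, rf_oof_test, ada_oof_test,
                         gb_oof_test, svc_oof_test), axis=1)

# Train the XGBoost meta-model on the stacked features
gbm = xgb.XGBClassifier(
    n_estimators=2000,
    max_depth=4,
    min_child_weight=2,
    gamma=0.9,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic',
    scale_pos_weight=1).fit(x_train, y_train)
predictions = gbm.predict(x_test)

# Write the submission file in the required two-column format
StackingSubmission = pd.DataFrame({'PassengerId': PassengerId,
                                   'Survived': predictions})
StackingSubmission.to_csv('StackingSubmission.csv', index=False)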
Once it finishes running, we obtain the StackingSubmission.csv file.
This file is exactly the result format the Kaggle competition asks for!
How about it? Now that you have a feel for Kaggle competitions, go take part yourself!