Basic Usage of XGBoost, with a Kaggle Store Sales Prediction Example
Basic Usage of XGBoost
Import XGBoost and the related packages:
```python
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
```

Load the data and extract the features and labels:
```python
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=',')
X = dataset[:, 0:8]
y = dataset[:, 8]
dataset
# array([[  6.   , 148.   ,  72.   , ...,   0.627,  50.   ,   1.   ],
#        [  1.   ,  85.   ,  66.   , ...,   0.351,  31.   ,   0.   ],
#        [  8.   , 183.   ,  64.   , ...,   0.672,  32.   ,   1.   ],
#        ...,
#        [  5.   , 121.   ,  72.   , ...,   0.245,  30.   ,   0.   ],
#        [  1.   , 126.   ,  60.   , ...,   0.349,  47.   ,   1.   ],
#        [  1.   ,  93.   ,  70.   , ...,   0.315,  23.   ,   0.   ]])
```

Split the data into a training set and a test set:
```python
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
# ((514, 8), (254, 8), (514,), (254,))
```

Create and train the model:
```python
model = XGBClassifier(n_jobs=-1)
model.fit(X_train, y_train)
# XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
#        colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
#        max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
#        n_jobs=-1, nthread=None, objective='binary:logistic',
#        random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
#        seed=None, silent=True, subsample=1)
```

Use the trained model to predict on the test set and compute the accuracy against the true labels:
```python
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
# Accuracy: 77.95%
```

Predict on the test set again, this time getting the predicted probability of each class:
```python
y_pred = model.predict(X_test)
y_pred
# array([0., 1., 1., 0., 1., 1., 0., 0., 1., 0., 1., 0., 1., 1., 0., 0., 0.,
#        1., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0.,
#        0., 1., 1., 0., 1., 0., 1., 1., 1., 0., 0., 0., 1., 0., 0., 1., 0.,
#        ...
#        0., 0., 1., 0., 0., 0., 1., 1., 0., 1., 0., 0., 0., 1., 0., 0., 0.,
#        0., 1., 0., 1., 1., 1., 1., 1., 0., 0., 1., 0., 0., 0., 0., 0., 1.,
#        0., 0., 1., 1., 0., 0., 1., 0., 1., 0., 0., 0., 0., 1., 1., 1.])
y_pred_proba = model.predict_proba(X_test)
y_pred_proba
# array([[0.9545844 , 0.04541559],
#        [0.05245447, 0.9475455 ],
#        [0.41897488, 0.5810251 ],
#        ...
#        [0.42821795, 0.57178205],
#        [0.2364142 , 0.7635858 ],
#        [0.05780089, 0.9421991 ]], dtype=float32)
```

Monitoring model performance: during training, XGBoost can evaluate the model on a held-out set and print the score after each boosting round.
```python
model = XGBClassifier()
eval_set = [(X_test, y_test)]
model.fit(X_train, y_train, early_stopping_rounds=10, eval_metric="logloss",
          eval_set=eval_set, verbose=True)
# Training stops once the validation score has not improved for 10 rounds;
# the logloss is printed after each tree is added:
# [0]  validation_0-logloss:0.60491
# [1]  validation_0-logloss:0.55934
# [2]  validation_0-logloss:0.53068
# [3]  validation_0-logloss:0.51795
# [4]  validation_0-logloss:0.51153
# [5]  validation_0-logloss:0.50935
# [6]  validation_0-logloss:0.50818
# [7]  validation_0-logloss:0.51097
# [8]  validation_0-logloss:0.51760
# [9]  validation_0-logloss:0.51912
# [10] validation_0-logloss:0.52503
# [11] validation_0-logloss:0.52697
# [12] validation_0-logloss:0.53335
# [13] validation_0-logloss:0.53905
# [14] validation_0-logloss:0.54546
# [15] validation_0-logloss:0.54613
# [16] validation_0-logloss:0.54982
```
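Since the best logloss above came at round 6 while training ran to round 16, it is often worth predicting with only the trees up to the best round. A minimal sketch, assuming the older xgboost sklearn API used in this post (newer releases replace `ntree_limit` with `iteration_range`):

```python
# Early stopping records the best round on the model object.
print(model.best_score)        # best validation logloss
print(model.best_iteration)    # the round that achieved it
# Predict with only the trees up to the best round (legacy API):
y_pred = model.predict(X_test, ntree_limit=model.best_ntree_limit)
```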
Plot the importance of each feature:

```python
from xgboost import plot_importance
from matplotlib import pyplot
%matplotlib inline

plot_importance(model)
pyplot.show()
```
XGBoost chooses which feature to split on based on the gain in the structure score; a feature's importance is then the total number of times it appears across all trees. In other words, the more often an attribute is used to build the decision trees in the model, the higher its relative importance.
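That count-based definition ("weight") is only one option; the underlying booster can also rank features by the average gain of their splits. A short sketch, assuming an xgboost version that exposes `get_booster()` and the `importance_type` argument:

```python
booster = model.get_booster()
# 'weight' = number of times a feature is used to split the data;
# 'gain'   = average improvement in the objective brought by its splits.
print(booster.get_score(importance_type='weight'))
print(booster.get_score(importance_type='gain'))

plot_importance(model, importance_type='gain')  # rank by gain instead
pyplot.show()
```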
Parameter Tuning
How should we tune? The values below are good practical starting points for three key hyperparameters. Set them within these ranges first, plot learning curves (see the sketch after this list), and then keep adjusting to find the best model:
- learning_rate = 0.1 or smaller; the smaller it is, the more weak learners (trees) you need to add
- tree_depth = 2~8
- subsample = 30%~80% of the training rows
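Here is a minimal learning-curve sketch, reusing `X_train`/`X_test` from above and the same older sklearn API (the hyperparameter values are just illustrative picks from these ranges):

```python
model = XGBClassifier(learning_rate=0.1, max_depth=4, subsample=0.8, n_estimators=200)
model.fit(X_train, y_train,
          eval_set=[(X_train, y_train), (X_test, y_test)],
          eval_metric="logloss", verbose=False)

# evals_result() holds the per-round metric for each eval_set entry.
results = model.evals_result()
pyplot.plot(results['validation_0']['logloss'], label='train')
pyplot.plot(results['validation_1']['logloss'], label='test')
pyplot.xlabel('boosting round')
pyplot.ylabel('logloss')
pyplot.legend()
pyplot.show()
```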
Next, we use GridSearchCV, which makes this kind of tuning more convenient.
Parameter combinations worth searching include:
- number and size of trees (n_estimators and max_depth)
- learning rate and number of trees (learning_rate and n_estimators)
- row and column subsampling rates (subsample, colsample_bytree, and colsample_bylevel)
Import the packages for tuning:
```python
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
```

Create the model and the parameter search space:
```python
model_GS = XGBClassifier()
learning_rate = [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3]
max_depth = [1, 2, 3, 4, 5]
param_grid = dict(learning_rate=learning_rate, max_depth=max_depth)
```

Set up stratified k-fold cross-validation and create the search object:
```python
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
grid_search = GridSearchCV(model_GS, param_grid=param_grid, scoring='neg_log_loss',
                           n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, y)

y_pred = grid_result.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
# Accuracy: 81.10%
grid_result.best_score_, grid_result.best_params_
# (-0.47171179660714796, {'learning_rate': 0.2, 'max_depth': 1})
```

Note that the search here is fit on all of X, which includes X_test, so the 81.10% accuracy is optimistic; for an unbiased estimate, run the search on X_train only.

Store Sales Prediction
Description
Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can vary widely.
In their first Kaggle competition, Rossmann is challenging you to predict 6 weeks of daily sales for 1,115 stores located across Germany. Reliable sales forecasts enable store managers to create effective staff schedules that increase productivity and motivation. By helping Rossmann create a robust prediction model, you will help store managers stay focused on what's most important to them: their customers and their teams!
我們?yōu)槟峁┝?115家羅斯曼商店的歷史銷售數(shù)據(jù)。任務(wù)是預(yù)測測試集的“銷售”列。請注意,數(shù)據(jù)集中的一些商店因翻新而暫時關(guān)閉。
Evaluation
Submissions are evaluated on the Root Mean Square Percentage Error (RMSPE). The RMSPE is calculated as
$$\operatorname{RMSPE}=\sqrt{\frac{1}{n} \sum_{i=1}^{n}\left(\frac{y_{i}-\hat{y}_{i}}{y_{i}}\right)^{2}}$$
where $y_i$ denotes the sales of a single store on a single day and $\hat{y}_i$ denotes the corresponding prediction. Any day and store with 0 sales is ignored in scoring.
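For concreteness, here is a tiny worked example of this metric (the numbers are invented):

```python
import numpy as np

# Made-up sales for three store-days; the zero-sales day is excluded.
y    = np.array([100.0, 200.0, 0.0])
yhat = np.array([110.0, 180.0, 50.0])
mask = y != 0
rmspe = np.sqrt(np.mean(((y[mask] - yhat[mask]) / y[mask]) ** 2))
print(rmspe)  # 0.1 -- both kept days are off by 10%
```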
Writing a custom loss function for this metric is quite involved, because XGBoost requires the first- and second-order derivatives of the loss, not just the loss value itself. Instead, we can complete this prediction with a custom evaluation metric.
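To see the difference concretely, compare the two callback shapes accepted by xgboost's legacy `xgb.train` API. The function names below are made up for illustration, and the squared-error derivatives stand in for a real custom loss:

```python
import numpy as np
import xgboost as xgb

def squared_error_obj(preds, dtrain):
    """Custom objective: must return grad and hess per sample."""
    labels = dtrain.get_label()
    grad = preds - labels          # d/dpred of 0.5 * (pred - label)^2
    hess = np.ones_like(preds)     # second derivative is a constant 1
    return grad, hess

def mae_metric(preds, dtrain):
    """Custom eval metric: only the score itself is needed."""
    labels = dtrain.get_label()
    return 'mae', float(np.mean(np.abs(preds - labels)))

# bst = xgb.train(params, dtrain, num_boost_round=10,
#                 obj=squared_error_obj, feval=mae_metric)
```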
Files
- train.csv - historical data including Sales
- test.csv - historical data excluding Sales
- sample_submission.csv - a sample submission file in the correct format
- store.csv - supplemental information about the stores
Data fields
Most of the fields are self-explanatory. The following are descriptions for those that aren't.
- Id - an Id that represents a (Store, Date) duple within the test set
- Store - a unique Id for each store
- Sales - the turnover for any given day (this is what you are predicting)
- Customers - the number of customers on a given day
- Open - an indicator for whether the store was open: 0 = closed, 1 = open
- StateHoliday - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None
- SchoolHoliday - indicates if the (Store, Date) was affected by the closure of public schools
- StoreType - differentiates between 4 different store models: a, b, c, d
- Assortment - describes an assortment level: a = basic, b = extra, c = extended
- CompetitionDistance - distance in meters to the nearest competitor store
- CompetitionOpenSince[Month/Year] - gives the approximate year and month of the time the nearest competitor was opened
- Promo - indicates whether a store is running a promo on that day
- Promo2 - Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating
- Promo2Since[Year/Week] - describes the year and calendar week when the store started participating in Promo2
- PromoInterval - describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store
A quick gloss of the promotion fields:

- Promo: whether the store is running a promotion on that day
- Promo2: whether the store participates in the longer-cycle recurring promotion
- Promo2Since[Year/Week]: the year and week in which that long-cycle promotion started
- PromoInterval: the months in which each round of the promotion starts
Import the required libraries
```python
import pandas as pd
import datetime
import csv
import numpy as np
import os
import scipy as sp
import xgboost as xgb
import itertools
import operator
import warnings
warnings.filterwarnings("ignore")
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.base import TransformerMixin
from matplotlib import pylab as plt

plot = True
goal = 'Sales'
myid = 'Id'
```

When your eval metric and your loss function are not the same:
Early stopping
Keep optimizing with the original loss function, growing and adding trees one at a time, but watch the eval metric on the validation set; when the metric stops improving there, stop growing the ensemble.
Labeled data (training set) + unlabeled data that needs predictions (test set).
Training set = the actual training set + a validation set (used for model selection and parameter tuning).
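A minimal sketch of this setup in xgboost's native API, assuming the merged `train` frame and `features` list produced by `load_data` below, the `rmspe_xg` metric defined in the next section, and illustrative split and parameter values (the full version appears in `XGB_native` later):

```python
from sklearn.model_selection import train_test_split

# Carve a validation set out of the labeled data (fractions are illustrative).
fit_df, valid_df = train_test_split(train, test_size=0.1, random_state=42)
dfit = xgb.DMatrix(fit_df[features], np.log(fit_df['Sales'] + 1))
dvalid = xgb.DMatrix(valid_df[features], np.log(valid_df['Sales'] + 1))
# Trees are grown against the built-in loss, but training stops once the
# custom RMSPE metric on the validation split stalls.
bst = xgb.train({'objective': 'reg:linear', 'eta': 0.01, 'max_depth': 6},
                dfit, num_boost_round=8000,
                evals=[(dvalid, 'eval')], feval=rmspe_xg,
                early_stopping_rounds=100)
```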
Define some transformations and evaluation criteria
Pay particular attention to this step when you use a different evaluation function.
```python
def ToWeight(y):
    # y is an np.array; zero-sales entries get zero weight, as the metric requires
    w = np.zeros(y.shape, dtype=float)
    ind = y != 0
    w[ind] = 1. / (y[ind] ** 2)
    return w

def rmspe(yhat, y):
    w = ToWeight(y)
    rmspe = np.sqrt(np.mean(w * (y - yhat) ** 2))
    return rmspe

def rmspe_xg(yhat, y):
    # y = y.values
    y = y.get_label()
    y = np.exp(y) - 1          # undo the log(y + 1) target transform
    yhat = np.exp(yhat) - 1
    w = ToWeight(y)
    rmspe = np.sqrt(np.mean(w * (y - yhat) ** 2))
    return "rmspe", rmspe

store = pd.read_csv('store.csv')
store.head()
train_df = pd.read_csv('train.csv')
train_df.head()
test_df = pd.read_csv('test.csv')
test_df.head()
```

Load the data
```python
def load_data():
    """Load the data and mark which columns are numeric and which are not."""
    store = pd.read_csv('store.csv')
    train_org = pd.read_csv('train.csv', dtype={'StateHoliday': pd.np.string_})
    test_org = pd.read_csv('test.csv', dtype={'StateHoliday': pd.np.string_})
    train = pd.merge(train_org, store, on='Store', how='left')
    test = pd.merge(test_org, store, on='Store', how='left')
    features = test.columns.tolist()
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    features_numeric = test.select_dtypes(include=numerics).columns.tolist()
    features_non_numeric = [f for f in features if f not in features_numeric]
    return (train, test, features, features_non_numeric)
```

Data and feature processing
```python
def process_data(train, test, features, features_non_numeric):
    """Feature engineering and selection."""
    # FEATURE ENGINEERING
    train = train[train['Sales'] > 0]
    for data in [train, test]:
        # Split the 'YYYY-MM-DD' Date string into year / month / day
        data['year'] = data.Date.apply(lambda x: x.split('-')[0])
        data['year'] = data['year'].astype(float)
        data['month'] = data.Date.apply(lambda x: x.split('-')[1])
        data['month'] = data['month'].astype(float)
        data['day'] = data.Date.apply(lambda x: x.split('-')[2])
        data['day'] = data['day'].astype(float)
        # PromoInterval looks like "Jan,Apr,Jul,Oct", but is NaN (a float) for
        # stores outside Promo2; without the isinstance(x, float) guard the
        # "in" test raises "TypeError: argument of type 'float' is not iterable".
        data['promojan'] = data.PromoInterval.apply(lambda x: 0 if isinstance(x, float) else 1 if "Jan" in x else 0)
        data['promofeb'] = data.PromoInterval.apply(lambda x: 0 if isinstance(x, float) else 1 if "Feb" in x else 0)
        data['promomar'] = data.PromoInterval.apply(lambda x: 0 if isinstance(x, float) else 1 if "Mar" in x else 0)
        data['promoapr'] = data.PromoInterval.apply(lambda x: 0 if isinstance(x, float) else 1 if "Apr" in x else 0)
        data['promomay'] = data.PromoInterval.apply(lambda x: 0 if isinstance(x, float) else 1 if "May" in x else 0)
        data['promojun'] = data.PromoInterval.apply(lambda x: 0 if isinstance(x, float) else 1 if "Jun" in x else 0)
        data['promojul'] = data.PromoInterval.apply(lambda x: 0 if isinstance(x, float) else 1 if "Jul" in x else 0)
        data['promoaug'] = data.PromoInterval.apply(lambda x: 0 if isinstance(x, float) else 1 if "Aug" in x else 0)
        data['promosep'] = data.PromoInterval.apply(lambda x: 0 if isinstance(x, float) else 1 if "Sep" in x else 0)
        data['promooct'] = data.PromoInterval.apply(lambda x: 0 if isinstance(x, float) else 1 if "Oct" in x else 0)
        data['promonov'] = data.PromoInterval.apply(lambda x: 0 if isinstance(x, float) else 1 if "Nov" in x else 0)
        data['promodec'] = data.PromoInterval.apply(lambda x: 0 if isinstance(x, float) else 1 if "Dec" in x else 0)

    # Feature set
    noisy_features = [myid, 'Date']
    features = [c for c in features if c not in noisy_features]
    features_non_numeric = [c for c in features_non_numeric if c not in noisy_features]
    features.extend(['year', 'month', 'day'])

    # Fill NA
    class DataFrameImputer(TransformerMixin):
        # http://stackoverflow.com/questions/25239958/impute-categorical-missing-values-in-scikit-learn
        def __init__(self):
            """Impute missing values.

            Columns of dtype object are imputed with the most frequent value
            in the column; columns of other types with the column mean.
            """
        def fit(self, X, y=None):
            self.fill = pd.Series([X[c].value_counts().index[0]  # mode
                                   if X[c].dtype == np.dtype('O') else X[c].mean()  # mean
                                   for c in X], index=X.columns)
            return self
        def transform(self, X, y=None):
            return X.fillna(self.fill)

    train = DataFrameImputer().fit_transform(train)
    test = DataFrameImputer().fit_transform(test)

    # Pre-process non-numeric values
    le = LabelEncoder()
    for col in features_non_numeric:
        le.fit(list(train[col]) + list(test[col]))
        train[col] = le.transform(train[col])
        test[col] = le.transform(test[col])

    # Models such as LR and neural networks are extremely sensitive to the
    # scale of their inputs, so standardize first.
    scaler = StandardScaler()
    for col in set(features) - set(features_non_numeric) - \
            set([]):  # TODO: add what not to scale
        scaler.fit(list(train[col]) + list(test[col]))
        train[col] = scaler.transform(train[col])
        test[col] = scaler.transform(test[col])
    return (train, test, features, features_non_numeric)
```

Training and analysis
The model is trained on a log-transformed target: it predicts log(y + 1), and sales are recovered as y = e^(prediction) − 1.

```python
def XGB_native(train, test, features, features_non_numeric):
    depth = 6
    eta = 0.01
    ntrees = 8000
    mcw = 3
    params = {"objective": "reg:linear",
              "booster": "gbtree",
              "eta": eta,
              "max_depth": depth,
              "min_child_weight": mcw,
              "subsample": 0.7,
              "colsample_bytree": 0.7,
              "silent": 1}
    print("Running with params: " + str(params))
    print("Running with ntrees: " + str(ntrees))
    print("Running with features: " + str(features))

    # Train model with a local split
    tsize = 0.05
    X_train, X_test = train_test_split(train, test_size=tsize)
    dtrain = xgb.DMatrix(X_train[features], np.log(X_train[goal] + 1))
    dvalid = xgb.DMatrix(X_test[features], np.log(X_test[goal] + 1))
    # NB: xgboost early-stops on the *last* entry of `evals`, so with this
    # order it watches train-rmspe, as the log below confirms.
    watchlist = [(dvalid, 'eval'), (dtrain, 'train')]
    gbm = xgb.train(params, dtrain, ntrees, evals=watchlist,
                    early_stopping_rounds=100, feval=rmspe_xg, verbose_eval=True)
    train_probs = gbm.predict(xgb.DMatrix(X_test[features]))
    indices = train_probs < 0
    train_probs[indices] = 0
    error = rmspe(np.exp(train_probs) - 1, X_test[goal].values)
    print(error)

    # Predict and export
    test_probs = gbm.predict(xgb.DMatrix(test[features]))
    indices = test_probs < 0
    test_probs[indices] = 0
    submission = pd.DataFrame({myid: test[myid], goal: np.exp(test_probs) - 1})
    if not os.path.exists('result/'):
        os.makedirs('result/')
    submission.to_csv("./result/dat-xgb_d%s_eta%s_ntree%s_mcw%s_tsize%s.csv" % (
        str(depth), str(eta), str(ntrees), str(mcw), str(tsize)), index=False)

    # Feature importance
    if plot:
        outfile = open('xgb.fmap', 'w')
        i = 0
        for feat in features:
            outfile.write('{0}\t{1}\tq\n'.format(i, feat))
            i = i + 1
        outfile.close()
        importance = gbm.get_fscore(fmap='xgb.fmap')
        importance = sorted(importance.items(), key=operator.itemgetter(1))
        df = pd.DataFrame(importance, columns=['feature', 'fscore'])
        df['fscore'] = df['fscore'] / df['fscore'].sum()
        # Plot it up
        plt.figure()
        df.plot()
        df.plot(kind='barh', x='feature', y='fscore', legend=False, figsize=(25, 15))
        plt.title('XGBoost Feature Importance')
        plt.xlabel('relative importance')
        plt.gcf().savefig('Feature_Importance_xgb_d%s_eta%s_ntree%s_mcw%s_tsize%s.png' % (
            str(depth), str(eta), str(ntrees), str(mcw), str(tsize)))

print("=> Loading data...")
train, test, features, features_non_numeric = load_data()
print("=> Processing data and engineering features...")
train, test, features, features_non_numeric = process_data(train, test, features, features_non_numeric)
print("=> Training with XGBoost...")
XGB_native(train, test, features, features_non_numeric)
train.head()
# => Loading data...
# => Processing data and engineering features...
# => Training with XGBoost...
# Running with params: {'subsample': 0.7, 'eta': 0.01, 'colsample_bytree': 0.7, 'silent': 1, 'objective': 'reg:linear', 'max_depth': 6, 'min_child_weight': 3, 'booster': 'gbtree'}
# Running with ntrees: 8000
# Running with features: ['Store', 'DayOfWeek', 'Open', 'Promo', 'StateHoliday', 'SchoolHoliday', 'StoreType', 'Assortment', 'CompetitionDistance', 'CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear', 'Promo2', 'Promo2SinceWeek', 'Promo2SinceYear', 'PromoInterval', 'year', 'month', 'day']
# [0]    eval-rmspe:0.999864  train-rmspe:0.999864
# Multiple eval metrics have been passed: 'train-rmspe' will be used for early stopping.
# Will train until train-rmspe hasn't improved in 100 rounds.
# [1]    eval-rmspe:0.999838  train-rmspe:0.999837
# [2]    eval-rmspe:0.99981   train-rmspe:0.999809
# [3]    eval-rmspe:0.999779  train-rmspe:0.999779
# ...
# [503]  eval-rmspe:0.314933  train-rmspe:0.342737
# [504]  eval-rmspe:0.315016  train-rmspe:0.342834
# [505]  eval-rmspe:0.31512   train-rmspe:0.342928
# Stopping. Best iteration:
# [405]  eval-rmspe:0.312829  train-rmspe:0.33589
# 0.315119522982
```
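One aside on the log(y + 1) / e^p − 1 transform used throughout: numpy's `log1p` and `expm1` are numerically safer drop-in equivalents. A quick check with made-up values:

```python
import numpy as np

y = np.array([0.0, 5263.0, 6064.0])   # made-up daily sales
p = np.log1p(y)                       # identical to np.log(y + 1)
recovered = np.expm1(p)               # identical to np.exp(p) - 1
assert np.allclose(recovered, y)
```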