ML | XGBoost: An Excellent Translation on XGBoost Parameter Tuning: "Complete Guide to Parameter Tuning in XGBoost with codes in Python" (Part 2)
Table of Contents
2. XGBoost Parameters
General Parameters
Booster Parameters
Learning Task Parameters
Original title: "Complete Guide to Parameter Tuning in XGBoost with codes in Python"
Original URL: https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/
All rights belong to the original article; this post is only a translation.
Related articles
- ML | XGBoost: A detailed guide to the XGBoost algorithm and model (with illustrations, including XGBoost parallel processing): key ideas, code implementation (objective/evaluation functions), installation, usage, and case studies
- ML | XGBoost: A detailed guide to XGBoost, the Kaggle favorite: introduction (resources), installation, usage, and case studies
- ML | XGBoost: Translation of "Complete Guide to Parameter Tuning in XGBoost with codes in Python" (Part 1)
- ML | XGBoost: Translation of "Complete Guide to Parameter Tuning in XGBoost with codes in Python" (Part 2)
- ML | XGBoost: Translation of "Complete Guide to Parameter Tuning in XGBoost with codes in Python" (Part 3)
- ML | XGBoost: Translation of "Complete Guide to Parameter Tuning in XGBoost with codes in Python" (Part 4)
2. XGBoost Parameters
The overall parameters have been divided into 3 categories by the XGBoost authors:
- General Parameters: guide the overall functioning
- Booster Parameters: guide the individual booster (tree/regression) at each step
- Learning Task Parameters: guide the optimization performed
I will give analogies to GBM here and highly recommend reading this article to learn from the very basics.
General Parameters
These define the overall functionality of XGBoost.
booster [default=gbtree]
- Select the type of model to run at each iteration. It has 2 options:
  - gbtree: tree-based models
  - gblinear: linear models
silent [default=0]
- Silent mode is activated if set to 1, i.e. no running messages will be printed.
- It's generally good to keep it 0, as the messages might help in understanding the model.
nthread [default to maximum number of threads available if not set]
- This is used for parallel processing, and the number of cores in the system should be entered.
- If you wish to run on all cores, no value should be entered and the algorithm will detect it automatically.
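As a quick illustration, here is a minimal sketch of how these general parameters might be collected for the native Python API. The values are illustrative placeholders, and note that recent xgboost releases rename silent to verbosity:

```python
import xgboost as xgb

# General parameters only; every value here is an illustrative starting point.
general_params = {
    'booster': 'gbtree',   # 'gblinear' would select the linear booster instead
    'silent': 0,           # keep 0 so progress messages help you inspect the model
                           # (recent xgboost releases use 'verbosity' instead)
    'nthread': 4,          # drop this key entirely to run on all available cores
}
```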
There are 2 more parameters which are set automatically by XGBoost, and you need not worry about them. Let's move on to the Booster parameters.
Booster Parameters
Though there are 2 types of boosters, I'll consider only the tree booster here because it always outperforms the linear booster, and thus the latter is rarely used.
eta [default=0.3]
- Analogous to the learning rate in GBM.
- Makes the model more robust by shrinking the weights at each step.
- Typical final values to be used: 0.01-0.2.
min_child_weight [default=1]
- Defines the minimum sum of weights of all observations required in a child.
- This is similar to min_child_leaf in GBM, but not exactly: it refers to the minimum "sum of weights" of observations, while GBM uses the minimum "number of observations".
- Used to control over-fitting. Higher values prevent a model from learning relations which might be highly specific to the particular sample selected for a tree.
- Too-high values can lead to under-fitting; hence, it should be tuned using CV, as the sketch below shows.
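A minimal sketch of tuning min_child_weight with xgboost's built-in cross-validation; the data is synthetic (make_classification is a stand-in for your own training set) and the candidate values are arbitrary:

```python
import xgboost as xgb
from sklearn.datasets import make_classification

# Synthetic data stands in for your own training set.
X, y = make_classification(n_samples=1000, n_features=10, random_state=27)
dtrain = xgb.DMatrix(X, label=y)

# Try a few candidate values and compare cross-validated AUC.
for mcw in [1, 3, 5]:
    cv = xgb.cv({'objective': 'binary:logistic',
                 'max_depth': 5,
                 'min_child_weight': mcw},
                dtrain, num_boost_round=100, nfold=5,
                metrics='auc', early_stopping_rounds=20, seed=27)
    print(mcw, cv['test-auc-mean'].max())
```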
max_depth [default=6]
- The maximum depth of a tree, same as in GBM.
- Used to control over-fitting, as a higher depth will allow the model to learn relations very specific to a particular sample.
- Should be tuned using CV.
- Typical values: 3-10.
max_leaf_nodes
- The maximum number of terminal nodes or leaves in a tree.
- Can be defined in place of max_depth. Since binary trees are created, a depth of n would produce a maximum of 2^n leaves.
- If this is defined, GBM will ignore max_depth.
gamma [default=0]
- A node is split only when the resulting split gives a positive reduction in the loss function. Gamma specifies the minimum loss reduction required to make a split.
- Makes the algorithm conservative. The values can vary depending on the loss function and should be tuned.
max_delta_step [default=0]
- The maximum delta step we allow each tree's weight estimation to be. If the value is set to 0, it means there is no constraint. If it is set to a positive value, it can help make the update step more conservative.
- Usually this parameter is not needed, but it might help in logistic regression when the classes are extremely imbalanced.
- This is generally not used, but you can explore it further if you wish.
subsample [default=1]
- Same as the subsample of GBM. Denotes the fraction of observations to be randomly sampled for each tree.
- Lower values make the algorithm more conservative and prevent over-fitting, but too-small values might lead to under-fitting.
- Typical values: 0.5-1.
colsample_bytree [default=1]
- Similar to max_features in GBM. Denotes the fraction of columns to be randomly sampled for each tree.
- Typical values: 0.5-1.
colsample_bylevel [default=1]
- Denotes the subsample ratio of columns for each split, at each level.
- I don't use this often because subsample and colsample_bytree will do the job for you, but you can explore it further if you feel so inclined.
lambda [default=1]
- L2 regularization term on weights (analogous to Ridge regression).
- This is used to handle the regularization part of XGBoost. Though many data scientists don't use it often, it should be explored to reduce over-fitting.
alpha [default=0]
- L1 regularization term on weights (analogous to Lasso regression).
- Can be used in cases of very high dimensionality so that the algorithm runs faster.
scale_pos_weight [default=1]
- A value greater than 0 should be used in case of high class imbalance, as it helps in faster convergence (a common choice is the ratio of negative to positive instances).
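Putting the tree-booster knobs together, a hedged starting configuration might look like the sketch below. Every value is a common starting point taken from the ranges above, not a recommendation; tune them with CV on your own data:

```python
# Common starting points for the tree booster, to be tuned with CV.
booster_params = {
    'eta': 0.1,               # learning rate; final values usually 0.01-0.2
    'min_child_weight': 1,    # raise to make the model more conservative
    'max_depth': 5,           # typical range 3-10
    'gamma': 0,               # minimum loss reduction required to split
    'max_delta_step': 0,      # leave 0 unless classes are extremely imbalanced
    'subsample': 0.8,         # row fraction per tree, typically 0.5-1
    'colsample_bytree': 0.8,  # column fraction per tree, typically 0.5-1
    'lambda': 1,              # L2 regularization (Ridge-like)
    'alpha': 0,               # L1 regularization (Lasso-like)
    'scale_pos_weight': 1,    # for imbalance, often set near n_negative / n_positive
}
```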
Learning Task Parameters
These parameters are used to define the optimization objective and the metric to be calculated at each step.
objective [default=reg:linear]
- This defines the loss function to be minimized. The most commonly used values are:
  - binary:logistic – logistic regression for binary classification; returns the predicted probability (not the class).
  - multi:softmax – multiclass classification using the softmax objective; returns the predicted class (not probabilities). You also need to set an additional num_class (number of classes) parameter defining the number of unique classes.
  - multi:softprob – same as softmax, but returns the predicted probability of each data point belonging to each class.
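For example, a multiclass setup must pair multi:softmax with num_class. The sketch below uses invented three-class data purely for illustration:

```python
import numpy as np
import xgboost as xgb

# Invented three-class data, purely for illustration.
X = np.random.rand(300, 4)
y = np.random.randint(0, 3, size=300)
dtrain = xgb.DMatrix(X, label=y)

# multi:softmax requires num_class and predicts class labels...
model = xgb.train({'objective': 'multi:softmax', 'num_class': 3},
                  dtrain, num_boost_round=10)
print(model.predict(dtrain)[:5])   # class labels, not probabilities

# ...while 'multi:softprob' would return an (n_samples, num_class)
# array of per-class probabilities instead.
```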
eval_metric [default according to objective]
- The metric to be used for validation data.
- The default values are rmse for regression and error for classification.
- Typical values are:
  - rmse – root mean square error
  - mae – mean absolute error
  - logloss – negative log-likelihood
  - error – binary classification error rate (0.5 threshold)
  - merror – multiclass classification error rate
  - mlogloss – multiclass logloss
  - auc – area under the curve
seed [default=0]
- The random number seed.
- Can be used for generating reproducible results and also for parameter tuning.
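Bringing the learning task parameters together, here is a minimal sketch with synthetic data (the 400/100 split and all values are illustrative):

```python
import numpy as np
import xgboost as xgb

# Synthetic binary data; the 400/100 split is arbitrary.
X = np.random.rand(500, 6)
y = np.random.randint(0, 2, size=500)
dtrain = xgb.DMatrix(X[:400], label=y[:400])
dvalid = xgb.DMatrix(X[400:], label=y[400:])

params = {
    'objective': 'binary:logistic',  # loss function to minimize
    'eval_metric': 'auc',            # overrides the default 'error' for classification
    'seed': 27,                      # fixed seed keeps tuning runs reproducible
}
model = xgb.train(params, dtrain, num_boost_round=50,
                  evals=[(dvalid, 'valid')])  # reports auc on the validation set
```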
If you've been using Scikit-Learn till now, these parameter names might not look familiar. The good news is that the xgboost module in Python has an sklearn wrapper called XGBClassifier, which uses the sklearn-style naming convention. The parameter names that change are:
- eta -> learning_rate
- lambda -> reg_lambda
- alpha -> reg_alpha
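A short sketch of the renamed parameters in the sklearn wrapper (the other values shown are the illustrative starting points from earlier):

```python
from xgboost import XGBClassifier

# The same booster settings expressed with sklearn-style names:
#   eta -> learning_rate, lambda -> reg_lambda, alpha -> reg_alpha
clf = XGBClassifier(learning_rate=0.1,   # was 'eta'
                    reg_lambda=1,        # was 'lambda'
                    reg_alpha=0,         # was 'alpha'
                    max_depth=5,
                    n_estimators=100)
```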
You must be wondering why we have defined everything except something similar to the n_estimators parameter in GBM. It does exist as a parameter of XGBClassifier; in the standard xgboost implementation, however, it has to be passed as the num_boost_round argument when calling xgb.train, as contrasted below.
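A sketch contrasting the two APIs on the same synthetic data:

```python
import numpy as np
import xgboost as xgb
from xgboost import XGBClassifier

X = np.random.rand(200, 4)
y = np.random.randint(0, 2, size=200)

# sklearn wrapper: the number of boosting rounds is n_estimators.
XGBClassifier(n_estimators=100).fit(X, y)

# Native API: the same knob is the num_boost_round argument of xgb.train.
dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train({'objective': 'binary:logistic'}, dtrain,
                    num_boost_round=100)
```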
I recommend you go through the following parts of the xgboost guide to better understand the parameters and codes: the official guide to XGBoost parameters, the XGBoost demo codes, and the Python API reference.