Regression Trees and Rule-Based Models (Part 2): Simple Regression Trees
Study notes, for reference only; corrections are welcome.
Regression Trees and Rule-Based Models
Simple Regression Trees
A simple regression tree partitions the data into groups whose members have similar (homogeneous) values of the outcome. To achieve this partitioning, the tree must determine:
- the predictor variable(s) to split on and the corresponding split point(s)
- the depth or complexity of the tree
- the prediction equation used in the terminal nodes
Here we first focus on models in which the terminal-node prediction is a constant.
There are many techniques for constructing regression trees; the most widely used is the classification and regression tree (CART) methodology of Breiman et al.
For regression, the model begins with the entire data set S and searches every distinct value of every predictor to split the data into two groups, $S_1$ and $S_2$, chosen so that the overall sum of squared errors is minimized:

$$SSE = \sum_{i \in S_1}(y_i - \bar{y}_1)^2 + \sum_{i \in S_2}(y_i - \bar{y}_2)^2$$

where $\bar{y}_1$ and $\bar{y}_2$ are the averages of the training-set outcomes within $S_1$ and $S_2$, respectively.
Then, within each of $S_1$ and $S_2$, the model again searches the predictors for the split point that gives the largest reduction in SSE. Because this process amounts to recursive splitting, the approach is commonly called recursive partitioning.
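The split search can be made concrete with a short sketch. The function below is my own illustrative code (the name `best_split` and the toy data are assumptions, not from the source); it scans every distinct value of every predictor and returns the split that minimizes the two-group SSE, which recursive partitioning would then apply again within each resulting group.

```python
import numpy as np

def best_split(X, y):
    """Exhaustively search every predictor and every distinct value
    for the split that minimizes SSE = SSE(S1) + SSE(S2)."""
    best = (None, None, np.inf)  # (predictor index, split value, SSE)
    for j in range(X.shape[1]):
        for v in np.unique(X[:, j]):
            left, right = y[X[:, j] <= v], y[X[:, j] > v]
            if len(left) == 0 or len(right) == 0:
                continue
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best[2]:
                best = (j, v, sse)
    return best

# Tiny example: recursive partitioning would call best_split again
# within each of the two resulting groups S1 and S2.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = np.where(X[:, 0] > 0, 5.0, 1.0) + rng.normal(scale=0.1, size=50)
print(best_split(X, y))  # should pick predictor 0 with a split near 0
```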
When the predictor is continuous, the process for finding the optimal split point is straightforward, since the data can be ordered in a natural way.
Binary (0/1) predictors are also easy to split, because there is only one possible split point.
A tree grown to its full depth can become very large and will tend to overfit the training set. The tree is therefore usually pruned back to a smaller depth. The pruning procedure used here is called cost-complexity tuning.
The goal of this process is to find a "right-sized tree" that minimizes the error. To do this, we penalize the error rate using the size of the tree:

$$SSE_{c_p} = SSE + c_p \times (\text{number of terminal nodes})$$

where $c_p$ is called the complexity parameter. For a specific value of the complexity parameter, we find the smallest pruned tree that has the lowest penalized error rate.
As with other regularization methods discussed previously, smaller penalties tend to produce more complex models, which here means larger trees.
Larger values of the complexity parameter may result in a tree with a single split (a stump) or even a tree with no splits at all.
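As a hedged illustration (not part of the original notes): scikit-learn exposes cost-complexity pruning for its CART-style trees through the `ccp_alpha` parameter, which plays the role of $c_p$ above. The sketch below, on simulated data of my own, shows how larger penalties produce progressively smaller trees.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.3, size=200)

# Candidate penalties: the effective alphas at which nodes get pruned away.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

for alpha in path.ccp_alphas[::5]:
    tree = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha).fit(X, y)
    # Larger penalties give smaller trees; a big enough alpha yields a stump or no splits.
    print(f"alpha={alpha:.4f}  leaves={tree.get_n_leaves()}")
```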
To find the optimal pruned tree, the data are evaluated across a sequence of $c_p$ values, and an SSE is computed for each. However, the SSE would change if a different sample were drawn. To capture this variation in SSE at each value of $c_p$, Breiman et al. suggested a procedure similar to cross-validation. They also proposed the one-standard-error rule as the selection criterion for the simplest tree: choose the simplest tree whose estimated error is within one standard error of the smallest error.
Alternatively, the model can be tuned by choosing the value of the complexity parameter associated with the smallest possible RMSE value.
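A minimal sketch of this second tuning strategy, assuming scikit-learn and simulated data of my own: cross-validate over a grid of `ccp_alpha` values (the analogue of $c_p$) and keep the value with the smallest RMSE. The one-standard-error rule could instead be applied to the resulting `cv_results_`.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4))
y = X[:, 0] - X[:, 2] + rng.normal(scale=0.5, size=300)

# Use the pruning path to get a sensible grid of candidate penalties.
alphas = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y).ccp_alphas
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"ccp_alpha": alphas},
    scoring="neg_root_mean_squared_error",  # GridSearchCV maximizes, hence the sign
    cv=10,
)
search.fit(X, y)
print(search.best_params_, -search.best_score_)  # chosen penalty and its CV RMSE
```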
This particular tree methodology can also handle missing data. When building the tree, missing data are ignored, and for each split a variety of alternatives are evaluated.
These alternatives are called surrogate splits.
A surrogate split is one whose results are similar to those of the original split actually used in the tree. If a surrogate split approximates the original split well, it can be used when the predictor data associated with the original split are not available.
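scikit-learn does not implement surrogate splits (rpart in R does), so the sketch below is only a conceptual illustration under my own naming: it measures how often a candidate surrogate split routes the training samples to the same side as the primary split, which is the kind of agreement a CART-style implementation would use to rank surrogates.

```python
import numpy as np

def surrogate_agreement(x_primary, t_primary, x_candidate, t_candidate):
    """Fraction of samples routed to the same child by the candidate
    split (x_candidate <= t_candidate) as by the primary split."""
    primary_left = x_primary <= t_primary
    candidate_left = x_candidate <= t_candidate
    return float(np.mean(primary_left == candidate_left))

# Example: a correlated predictor makes a good surrogate; at prediction
# time it can route samples whose primary predictor value is missing.
rng = np.random.default_rng(3)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.2, size=500)   # highly correlated with x1
x3 = rng.normal(size=500)                    # unrelated

print(surrogate_agreement(x1, 0.0, x2, 0.0))  # high agreement
print(surrogate_agreement(x1, 0.0, x3, 0.0))  # near 0.5 (chance level)
```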
Once the tree has been finalized, we can assess the relative importance of the predictors to the outcome. One way to compute an aggregate measure of importance is to keep track of the overall reduction in the optimization criterion for each predictor. If SSE is the optimization criterion, then the reduction in SSE over the training set is aggregated for each predictor. Intuitively, predictors that appear higher in the tree (earlier splits) or that appear multiple times will be more important than predictors that occur lower in the tree or not at all.
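For squared-error trees in scikit-learn, the fitted attribute `feature_importances_` is this kind of aggregate: the normalized total reduction in the splitting criterion attributed to each predictor. A short sketch on simulated data of my own:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 5))
# Only the first two predictors carry signal; the rest are noise.
y = 3 * X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=500)

tree = DecisionTreeRegressor(random_state=0, ccp_alpha=0.01).fit(X, y)
for j, imp in enumerate(tree.feature_importances_):
    print(f"predictor {j}: importance {imp:.3f}")  # predictor 0 should dominate
```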
An advantage of tree-based models is that, when the tree is not large, the model is simple and interpretable. Also, this type of tree can be computed quickly (despite using multiple exhaustive searches). Tree models intrinsically perform feature selection: if a predictor is never used in a split, the prediction equation is independent of that predictor. This advantage is weakened when there are highly correlated predictors; if two predictors are extremely correlated, the choice of which one to split on is essentially random.
While trees are highly interpretable and easy to compute, they do have some noteworthy disadvantages. First, single regression trees are more likely to have sub-optimal predictive performance compared to other modeling approaches. This is partly due to the simplicity of the model: by construction, tree models partition the data into rectangular regions of the predictor space. If the relationship between the predictors and the outcome is not adequately described by these rectangles, the predictive performance of the tree will not be optimal. Also, the number of possible predicted values from a tree is finite and is determined by the number of terminal nodes.
An additional disadvantage is that an individual tree tends to be unstable: if the data are slightly altered, a completely different set of splits might be found.
Finally, these trees suffer from selection bias: predictors with a higher number of distinct values are favored over more granular predictors.
The danger occurs when a data set consists of a mix of informative and noise variables, and the noise variables have many more splits than the informative variables. Then there is a high probability that the noise variables will be chosen to split the top nodes of the tree. Pruning will produce either a tree with misleading structure or no tree at all.
Also, as the number of missing values increases, the selection of predictors becomes more biased.
Several unbiased regression tree techniques do exist in the literature. Loh proposed the generalized, unbiased, interaction detection and estimation (GUIDE) algorithm.
GUIDE solves the problem by decoupling the process of selecting the split variable from that of selecting the split value. The algorithm ranks the predictors using statistical hypothesis tests and then finds the appropriate split value for the most important predictor.
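The sketch below is not the GUIDE algorithm itself (GUIDE uses its own chi-square tests on residual patterns); it only illustrates the decoupling idea with a simpler, assumed test: first rank predictors by a per-predictor hypothesis test against the outcome, then search for a split point only on the winner.

```python
import numpy as np
from scipy import stats

def decoupled_split(X, y):
    """Step 1: rank predictors by a per-predictor hypothesis test
    (here a Pearson correlation test; GUIDE uses different tests).
    Step 2: exhaustive split-point search on the selected predictor only."""
    p_values = [stats.pearsonr(X[:, j], y)[1] for j in range(X.shape[1])]
    j_best = int(np.argmin(p_values))           # most "significant" predictor
    best_t, best_sse = None, np.inf
    for t in np.unique(X[:, j_best]):
        left, right = y[X[:, j_best] <= t], y[X[:, j_best] > t]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_t, best_sse = t, sse
    return j_best, best_t

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 3))
y = np.where(X[:, 1] > 0.5, 2.0, 0.0) + rng.normal(scale=0.2, size=200)
print(decoupled_split(X, y))  # should select predictor 1 with a split near 0.5
```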
Hothorn et al. proposed conditional inference trees. In this model, statistical hypothesis tests are used in an exhaustive search across the predictors and their possible split points. For a candidate split, a statistical test evaluates the difference between the means of the two groups created by the split, and a p-value can be computed for the test.
By default, this algorithm does not use pruning; as the data set is split further, the decrease in the number of samples reduces the power of the hypothesis tests. This results in higher p-values and a lower likelihood of a new split (and of overfitting). However, statistical hypothesis tests are not directly related to predictive performance, and because of this it is still advisable to choose the complexity of the tree on the basis of performance (via resampling).
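Conditional inference trees are implemented in the R packages party/partykit; the Python sketch below is only a simplified stand-in for the idea, using an ordinary two-sample t-test (rather than the permutation tests ctree actually uses) to attach a p-value to a candidate split and stop when it is not significant.

```python
import numpy as np
from scipy import stats

def split_p_value(x, y, threshold):
    """p-value for the difference in mean outcome between the two
    groups created by the candidate split x <= threshold."""
    left, right = y[x <= threshold], y[x > threshold]
    return stats.ttest_ind(left, right, equal_var=False).pvalue

rng = np.random.default_rng(6)
x = rng.normal(size=100)
y_signal = np.where(x > 0, 1.0, 0.0) + rng.normal(scale=0.5, size=100)
y_noise = rng.normal(size=100)

alpha = 0.05  # significance level used as the stopping rule
print(split_p_value(x, y_signal, 0.0) < alpha)  # True: keep the split
print(split_p_value(x, y_noise, 0.0) < alpha)   # usually False: stop splitting
```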
Summary