Handling Outliers: OLS vs. Robust Regression
When it comes to regression analysis, outliers (values that lie far from the bulk of the data for a particular variable) can cause issues.
Background
Let’s consider this issue further using the Pima Indians Diabetes dataset.
Here is a boxplot of BMI values across patients. According to the boxplot, several outliers are present that lie well above the upper fence implied by the interquartile range.
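The “1.5 × IQR” rule that boxplots use to flag outliers can be sketched as follows. This is an illustrative Python implementation (the article’s own code is in R), and the sample values are made up to mimic a positively skewed BMI-like variable — they are not taken from the dataset:

```python
# Boxplot outlier rule: points beyond Q3 + 1.5*IQR (or below Q1 - 1.5*IQR)
# are flagged as outliers. Demonstrated on a small, positively skewed sample.
def iqr_outliers(values):
    xs = sorted(values)
    n = len(xs)

    def quantile(p):
        # simple linear-interpolation quantile (R's default "type 7")
        idx = p * (n - 1)
        lo, hi = int(idx), min(int(idx) + 1, n - 1)
        frac = idx - lo
        return xs[lo] * (1 - frac) + xs[hi] * frac

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < lower or x > upper]

# hypothetical BMI-like sample with two extreme values on the right tail
sample = [22, 25, 27, 28, 30, 31, 33, 35, 36, 38, 60, 67]
print(iqr_outliers(sample))  # → [60, 67]
```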
Furthermore, we also have a visual indication of a positively skewed distribution, where several large positive values “push” the distribution out to the right:
Source: RStudio

Outliers can cause issues when conducting regression analysis. A key component of this model is the line of best fit: the regression line that minimises the (squared) distance between itself and the individual observations.
Clearly, if outliers are present, they weaken the predictive power of the line of best fit. They may also violate the assumption that the residuals are normally distributed.
With this in mind, both an OLS regression model and robust regression models (using Huber and bisquare weights) are run in order to predict BMI values across the test set, with a view to measuring whether accuracy is significantly improved by the robust models.
Here is a quick overview of the data and the correlations between each feature:
Source: RStudio

Ordinary Least Squares (OLS)
The correlation plot above is used to check that the independent variables in the regression model are not strongly correlated with each other. The regression model is defined as follows:
reg1 <- lm(BMI ~ Outcome + Age + Insulin + SkinThickness, data=trainset)

Note that Outcome is a binary categorical variable (0 = not diabetic, 1 = diabetic).
The data is split into a training set and a test set (the latter serving as unseen data for the model).
Of the training set, 80% is used to train the regression model, while 20% is used as a validation set to assess the results.
# Training and Validation Data
trainset <- diabetes1[1:479, ]
valset <- diabetes1[480:599, ]
Here are the OLS results:
> # OLS Regression
> summary(ols <- lm(BMI ~ Outcome + Age + Insulin + SkinThickness, data=trainset))

Call:
lm(formula = BMI ~ Outcome + Age + Insulin + SkinThickness, data = trainset)

Residuals:
     Min       1Q   Median       3Q      Max
-12.0813  -4.2762  -0.8733   3.4031  28.2196

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)   28.0498978  0.9740705  28.797  < 2e-16 ***
Outcome        4.1290646  0.6171707   6.690 6.30e-11 ***
Age           -0.0101171  0.0248626  -0.407    0.684
Insulin        0.0000262  0.0027077   0.010    0.992
SkinThickness  0.1513285  0.0195945   7.723 6.81e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.176 on 474 degrees of freedom
Multiple R-squared:  0.2135, Adjusted R-squared:  0.2069
F-statistic: 32.17 on 4 and 474 DF,  p-value: < 2.2e-16
Outcome and SkinThickness are identified as significant variables at the 5% level. While the R-squared of 21.35% is quite low, this is to be expected, since many other variables that can influence BMI have not been included in the model.
Let’s drop the Age and Insulin variables from the OLS model and run it once again.
> # OLS Regression
> summary(ols <- lm(BMI ~ Outcome + SkinThickness, data=trainset))

Call:
lm(formula = BMI ~ Outcome + SkinThickness, data = trainset)

Residuals:
     Min       1Q   Median       3Q      Max
-12.1740  -4.2115  -0.8532   3.3852  28.3072

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   27.70940    0.49723  55.728   <2e-16 ***
Outcome        4.06953    0.59223   6.872    2e-11 ***
SkinThickness  0.15247    0.01774   8.595   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.165 on 476 degrees of freedom
Multiple R-squared:  0.2132, Adjusted R-squared:  0.2099
F-statistic: 64.51 on 2 and 476 DF,  p-value: < 2.2e-16
Robust Regressions
A modified version of the above regression, known as a robust regression, is now run. We refer to the regression as “robust” because such models are less sensitive to violations of the OLS assumptions, including the presence of outliers in the data. The following presentation gives more information on how a robust regression works.
In this example, we will use two different types of weighting to run this type of regression: Huber and bisquare weights.
The same regressions are run once again, but this time using the above weightings.
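The two weighting schemes can be sketched as simple functions of the (scaled) residual. The sketch below is in Python for illustration (the article itself uses R’s `rlm`); the tuning constants 1.345 and 4.685 are the conventional defaults, chosen for 95% efficiency when the errors really are normal:

```python
# Huber weights: residuals inside the band |e| <= k keep full weight,
# larger residuals are down-weighted proportionally to 1/|e|.
def huber_weight(e, k=1.345):
    return 1.0 if abs(e) <= k else k / abs(e)

# Bisquare (Tukey biweight): smooth down-weighting inside the band,
# and residuals beyond k are ignored entirely (weight zero).
def bisquare_weight(e, k=4.685):
    if abs(e) > k:
        return 0.0
    return (1 - (e / k) ** 2) ** 2

# Small residuals keep (near-)full weight; large ones are shrunk,
# with bisquare cutting extreme residuals off completely.
print(huber_weight(0.5), huber_weight(5.0))      # → 1.0 0.269
print(bisquare_weight(0.5), bisquare_weight(10.0))
```

In an iteratively reweighted least squares fit, these weights are recomputed from the residuals at each iteration, which is what makes the resulting line less sensitive to outliers than OLS.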
Huber Weights
> # Huber Weights
> rr.huber <- rlm(BMI ~ Outcome + SkinThickness, data=trainset)
> summary(rr.huber)

Call: rlm(formula = BMI ~ Outcome + SkinThickness, data = trainset)

Residuals:
     Min       1Q   Median       3Q      Max
-12.4130  -3.6492  -0.3479   3.7717  28.7081

Coefficients:
              Value    Std. Error t value
(Intercept)   27.0596   0.4685    57.7581
Outcome        3.7631   0.5580     6.7438
SkinThickness  0.1645   0.0167     9.8445

Residual standard error: 5.47 on 476 degrees of freedom
Bisquare Weights
> # Bisquare weighting
> rr.bisquare <- rlm(BMI ~ Outcome + SkinThickness, data=trainset, psi = psi.bisquare)
> summary(rr.bisquare)

Call: rlm(formula = BMI ~ Outcome + SkinThickness, data = trainset,
    psi = psi.bisquare)

Residuals:
     Min       1Q   Median       3Q      Max
-12.1991  -3.6106  -0.3015   3.8074  28.8724

Coefficients:
              Value    Std. Error t value
(Intercept)   27.0524   0.4793    56.4472
Outcome        3.6491   0.5708     6.3927
SkinThickness  0.1636   0.0171     9.5689

Residual standard error: 5.483 on 476 degrees of freedom
Comparison
Here is the performance of the regression models in predicting the test-set values, on both a root mean squared error (RMSE) and a mean absolute percentage error (MAPE) basis:
RMSE
- OLS: 5.81
- Huber: 5.86
- Bisquare: 5.87
MAPE
- OLS: 0.139
- Huber: 0.137
- Bisquare: 0.137
We can see that the errors increased slightly on an RMSE basis (contrary to our expectations), while there was only a marginal decrease on a MAPE basis.
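For reference, the two metrics can be computed as below. This is a generic Python sketch with made-up toy values, not the article’s R code or its actual predictions:

```python
import math

# Root mean squared error: penalises large errors more heavily.
def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Mean absolute percentage error, expressed as a fraction (0.139 = 13.9%).
def mape(actual, predicted):
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

# hypothetical BMI values and predictions, for illustration only
actual = [30.0, 25.0, 40.0]
predicted = [28.0, 26.0, 37.0]
print(round(rmse(actual, predicted), 3))  # → 2.16
print(round(mape(actual, predicted), 3))  # → 0.061
```

Because RMSE squares the errors, a model that occasionally misses badly (as one might near outliers) is punished more on RMSE than on MAPE, which helps explain why the two metrics can rank the models differently.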
Are the outliers “influential”?
Using a robust regression to account for the outliers did not yield the significant accuracy improvements that might have been expected.
However, simply because outliers are present in a dataset does not necessarily mean that those outliers are influential.
By influential, we mean that the outlier has a substantial effect on the fitted regression model.
This can be determined using Cook’s distance.
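To make the idea concrete, here is Cook’s distance computed from first principles for a simple linear regression, on synthetic toy data with one deliberately injected outlier. This is an illustrative Python sketch (in R one would simply call `cooks.distance()` on the fitted model):

```python
# Cook's distance for simple linear regression y = a + b*x, using the
# closed-form leverage h_i = 1/n + (x_i - mean(x))^2 / Sxx.
def cooks_distance(x, y):
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    p = 2                                        # parameters: intercept + slope
    s2 = sum(e ** 2 for e in resid) / (n - p)    # residual variance estimate
    h = [1 / n + (xi - mx) ** 2 / sxx for xi in x]
    # D_i combines residual size and leverage into one influence measure
    return [(e ** 2 / (p * s2)) * (hi / (1 - hi) ** 2)
            for e, hi in zip(resid, h)]

# synthetic data on a perfect line, with one injected outlier
x = list(range(10))
y = [2 + 0.5 * xi for xi in x]
y[4] += 5
d = cooks_distance(x, y)
print(d.index(max(d)))  # → 4, the injected outlier dominates
```

The key point is that Cook’s distance weighs both how large a residual is and how much leverage the observation has; an outlier with low leverage can still fall below the usual thresholds, which is exactly the situation described next.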
Source: RStudio

We can see that while outliers are indicated as present in the dataset, they still do not approach the Cook’s distance threshold outlined at the top right-hand corner of the graph.
In this regard, it is now evident why the robust regression did not show superior performance to OLS from an accuracy standpoint: the outliers are not influential enough to warrant using a robust regression.
Conclusion
Robust regressions are useful when it comes to modelling datasets that contain outliers, and there have been cases where they produce superior results to OLS.
However, those outliers must be influential, and one must exercise caution when using robust regressions in a situation such as this one, where outliers are present but do not particularly influence the response variable.
Hope you enjoyed this article, and any questions or feedback are greatly welcomed. You can find the code and datasets for this example at the MGCodesandStats GitHub repository.
Disclaimer: This article is written on an “as is” basis and without warranty. It was written with the intention of providing an overview of data science concepts, and should not be interpreted as professional advice in any way.
Translated from: https://towardsdatascience.com/working-with-outliers-ols-vs-robust-regressions-5cf861168ac4