Multiple Linear Regression Exercise: Predicting House Prices
Objective:
Review the feature descriptions for the dataset, then use the other variables to build the best model for predicting the median home price.
Dataset description:
The dataset contains 506 cases in total.
Each case has 14 attributes:
| Attribute | Description |
|---|---|
| MedianHomePrice | Median home price (the response variable; a copy of MEDV below) |
| CRIM | Per capita crime rate by town |
| ZN | Proportion of residential land zoned for lots over 25,000 sq. ft. |
| INDUS | Proportion of non-retail business acres per town |
| CHAS | Charles River dummy variable (1 if the tract bounds the river; 0 otherwise) |
| NOX | Nitric oxides concentration (parts per 10 million) |
| RM | Average number of rooms per dwelling |
| AGE | Proportion of owner-occupied units built prior to 1940 |
| DIS | Weighted distances to five Boston employment centres |
| RAD | Index of accessibility to radial highways |
| TAX | Full-value property-tax rate per $10,000 |
| PTRATIO | Pupil-teacher ratio by town |
| B | 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town |
| LSTAT | Percentage of the population with lower socioeconomic status |
| MEDV | Median value of owner-occupied homes (in units of $1,000) |
Set up the libraries and the data.
```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.datasets import load_boston  # note: load_boston was removed in scikit-learn 1.2; this notebook predates that
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score
from patsy import dmatrices
import matplotlib.pyplot as plt
%matplotlib inline

np.random.seed(42)

# Load the built-in Boston housing dataset
boston_data = load_boston()
df = pd.DataFrame()
df['MedianHomePrice'] = boston_data.target
df2 = pd.DataFrame(boston_data.data)
df2.columns = boston_data.feature_names
df = df.join(df2)
df.head()
```

| MedianHomePrice | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 24.0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 |
| 21.6 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 |
| 34.7 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 33.4 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 36.2 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 |
1. Summarize each feature in the dataset
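Before looking at correlations, a quick per-feature summary is useful. The original notebook does not show this step; a minimal sketch with pandas' built-in `describe`:

```python
# Count, mean, std, min, quartiles, and max for every column
df.describe()
```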
Use the `corr` method to compute the correlations between the variables and check for multicollinearity.
```python
# Plot a correlation heatmap
import seaborn as sns
plt.subplots(figsize=(10, 10))  # enlarge the figure
sns.heatmap(df.corr(), annot=True, vmax=1, square=True, cmap='RdPu')
```
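Reading pairs off the heatmap by eye works, but a short sketch can list the strongly correlated feature pairs directly. The 0.7 threshold below is an arbitrary choice, not from the original notebook:

```python
# List feature pairs whose absolute correlation exceeds a chosen threshold
corr = df.drop('MedianHomePrice', axis=1).corr()
pairs = (corr.abs()
             .where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # keep upper triangle only
             .stack()
             .sort_values(ascending=False))
print(pairs[pairs > 0.7])
```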
2. Split the dataset

Create a training set and a test set, with 20% of the data in the test set. Store the results in X_train, X_test, y_train, and y_test.
```python
X = df.drop('MedianHomePrice', axis=1, inplace=False)
y = df['MedianHomePrice']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
3. Standardization

Use StandardScaler to scale all of the x variables. Store the result in X_scaled_train.
```python
# Reset y_train's index to start from 0: its original index no longer matches
# the index of training_data below, so joining them would misalign rows
y_train = pd.Series(y_train.values)

# Use StandardScaler to scale all the x variables; store the result in X_scaled_train
X_scaled_train = StandardScaler()

# Build a pandas DataFrame holding the scaled x variables plus y_train; call it training_data
training_data = X_scaled_train.fit_transform(X_train)
training_data = pd.DataFrame(training_data, columns=X_train.columns)
training_data['MedianHomePrice'] = y_train
training_data.head()
```

| CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MedianHomePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.287702 | -0.500320 | 1.033237 | -0.278089 | 0.489252 | -1.428069 | 1.028015 | -0.802173 | 1.706891 | 1.578434 | 0.845343 | -0.074337 | 1.753505 | 12.0 |
| -0.336384 | -0.500320 | -0.413160 | -0.278089 | -0.157233 | -0.680087 | -0.431199 | 0.324349 | -0.624360 | -0.584648 | 1.204741 | 0.430184 | -0.561474 | 19.9 |
| -0.403253 | 1.013271 | -0.715218 | -0.278089 | -1.008723 | -0.402063 | -1.618599 | 1.330697 | -0.974048 | -0.602724 | -0.637176 | 0.065297 | -0.651595 | 19.4 |
| 0.388230 | -0.500320 | 1.033237 | -0.278089 | 0.489252 | -0.300450 | 0.591681 | -0.839240 | 1.706891 | 1.578434 | 0.845343 | -3.868193 | 1.525387 | 13.4 |
| -0.325282 | -0.500320 | -0.413160 | -0.278089 | -0.157233 | -0.831094 | 0.033747 | -0.005494 | -0.624360 | -0.584648 | 1.204741 | 0.379119 | -0.165787 | 18.2 |
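Note that `X_scaled_train` here is actually the fitted `StandardScaler` object, not the scaled data itself. The notebook never scales the test set (step 8 scores models on the unscaled features), but if you wanted test predictions on the same scale, the sketch below reuses the fitted scaler; `test_data` is a name introduced here for illustration, not from the original:

```python
# Transform (do not re-fit!) the test features with the scaler fitted on X_train,
# so train and test are standardized identically
test_data = pd.DataFrame(X_scaled_train.transform(X_test), columns=X_test.columns)
test_data.head()
```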
4. Model 1: all features

Fit a linear model on training_data and check the p-values for significance.
```python
# Fit a linear model using all the scaled features to predict the response
# (median home price). Don't forget to add an intercept.
training_data['intercept'] = 1
X_train1 = training_data.drop('MedianHomePrice', axis=1, inplace=False)
lm = sm.OLS(training_data['MedianHomePrice'], X_train1)
result = lm.fit()
result.summary()
```

|  |  |  |  |
|---|---|---|---|
| Dep. Variable: | MedianHomePrice | R-squared: | 0.751 |
| Model: | OLS | Adj. R-squared: | 0.743 |
| Method: | Least Squares | F-statistic: | 90.43 |
| Date: | Sun, 10 May 2020 | Prob (F-statistic): | 6.21e-109 |
| Time: | 20:22:27 | Log-Likelihood: | -1194.3 |
| No. Observations: | 404 | AIC: | 2417. |
| Df Residuals: | 390 | BIC: | 2473. |
| Df Model: | 13 |  |  |
| Covariance Type: | nonrobust |  |  |

|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| CRIM | -1.0021 | 0.308 | -3.250 | 0.001 | -1.608 | -0.396 |
| ZN | 0.6963 | 0.370 | 1.882 | 0.061 | -0.031 | 1.423 |
| INDUS | 0.2781 | 0.464 | 0.599 | 0.549 | -0.634 | 1.190 |
| CHAS | 0.7187 | 0.247 | 2.914 | 0.004 | 0.234 | 1.204 |
| NOX | -2.0223 | 0.498 | -4.061 | 0.000 | -3.001 | -1.043 |
| RM | 3.1452 | 0.329 | 9.567 | 0.000 | 2.499 | 3.792 |
| AGE | -0.1760 | 0.407 | -0.432 | 0.666 | -0.977 | 0.625 |
| DIS | -3.0819 | 0.481 | -6.408 | 0.000 | -4.027 | -2.136 |
| RAD | 2.2514 | 0.652 | 3.454 | 0.001 | 0.970 | 3.533 |
| TAX | -1.7670 | 0.704 | -2.508 | 0.013 | -3.152 | -0.382 |
| PTRATIO | -2.0378 | 0.321 | -6.357 | 0.000 | -2.668 | -1.408 |
| B | 1.1296 | 0.271 | 4.166 | 0.000 | 0.596 | 1.663 |
| LSTAT | -3.6117 | 0.395 | -9.133 | 0.000 | -4.389 | -2.834 |
| intercept | 22.7965 | 0.236 | 96.774 | 0.000 | 22.333 | 23.260 |

|  |  |  |  |
|---|---|---|---|
| Omnibus: | 133.052 | Durbin-Watson: | 2.114 |
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 579.817 |
| Skew: | 1.379 | Prob(JB): | 1.24e-126 |
| Kurtosis: | 8.181 | Cond. No. | 9.74 |

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
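Rather than scanning the table by eye, a small sketch can pull the non-significant terms out of the fitted result (`result.pvalues` is a pandas Series in statsmodels):

```python
# Terms whose p-value exceeds 0.05: expect AGE and INDUS (ZN is borderline at 0.061)
print(result.pvalues[result.pvalues > 0.05])
```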
5. Check whether the explanatory variables are correlated with one another

Compute the VIF of each variable on the training set.
```python
# Compute the VIF for each x variable in the dataset
def vif_calculator(df, response):
    '''
    INPUT:
    df - a dataframe holding the x and y variables
    response - the column name of the response variable (string)
    OUTPUT:
    vif - a dataframe of the vifs
    '''
    df2 = df.drop(response, axis=1, inplace=False)  # drop the response column
    features = "+".join(df2.columns)
    y, X = dmatrices(response + ' ~' + features, df, return_type='dataframe')
    vif = pd.DataFrame()
    vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    vif["features"] = X.columns
    vif = vif.round(1)
    return vif

vif = vif_calculator(training_data, 'MedianHomePrice')
vif
```

```
C:\ProgramData\Anaconda3\lib\site-packages\statsmodels\regression\linear_model.py:1685: RuntimeWarning: divide by zero encountered in double_scalars
  return 1 - self.ssr/self.centered_tss
```

(The RuntimeWarning and the two zero VIFs are artifacts of having two constant columns, the manually added `intercept` plus the `Intercept` that patsy adds; they can be ignored.)

| VIF Factor | features |
|---|---|
| 0.0 | Intercept |
| 1.7 | CRIM |
| 2.5 | ZN |
| 3.9 | INDUS |
| 1.1 | CHAS |
| 4.5 | NOX |
| 1.9 | RM |
| 3.0 | AGE |
| 4.2 | DIS |
| 7.7 | RAD |
| 8.9 | TAX |
| 1.9 | PTRATIO |
| 1.3 | B |
| 2.8 | LSTAT |
| 0.0 | intercept |
Combining the VIFs, correlations, and p-values, decide which variables to drop:

- Keep VIFs below 4. INDUS, RAD, TAX, and NOX have high VIFs.
- TAX and RAD are strongly correlated, as are INDUS and NOX, so dropping just one variable from each highly correlated pair is enough to bring the other's VIF down (the quick check below confirms the pairwise correlations).
- Keep p-values below 0.05. AGE and INDUS have high p-values.

Based on the p-values and VIFs, if we choose to keep RAD and INDUS, then AGE, NOX, and TAX should be dropped. After removing these features, fit a new linear model with the remaining ones.
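A quick sketch to verify the two claimed correlations on the training data:

```python
# Pairwise correlations behind the drop decision: TAX vs. RAD and INDUS vs. NOX
print(training_data['TAX'].corr(training_data['RAD']))
print(training_data['INDUS'].corr(training_data['NOX']))
```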
6. Model 2: drop AGE, NOX, and TAX
```python
X_train1 = training_data.drop(['AGE', 'NOX', 'TAX', 'MedianHomePrice'], axis=1, inplace=False)
lm1 = sm.OLS(training_data['MedianHomePrice'], X_train1)
result1 = lm1.fit()
result1.summary()
```

|  |  |  |  |
|---|---|---|---|
| Dep. Variable: | MedianHomePrice | R-squared: | 0.733 |
| Model: | OLS | Adj. R-squared: | 0.727 |
| Method: | Least Squares | F-statistic: | 108.1 |
| Date: | Sun, 10 May 2020 | Prob (F-statistic): | 2.77e-106 |
| Time: | 21:02:41 | Log-Likelihood: | -1208.0 |
| No. Observations: | 404 | AIC: | 2438. |
| Df Residuals: | 393 | BIC: | 2482. |
| Df Model: | 10 |  |  |
| Covariance Type: | nonrobust |  |  |

|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| CRIM | -0.9116 | 0.317 | -2.876 | 0.004 | -1.535 | -0.289 |
| ZN | 0.5622 | 0.363 | 1.548 | 0.123 | -0.152 | 1.276 |
| INDUS | -0.8746 | 0.411 | -2.128 | 0.034 | -1.683 | -0.067 |
| CHAS | 0.6896 | 0.252 | 2.738 | 0.006 | 0.194 | 1.185 |
| RM | 3.2406 | 0.330 | 9.818 | 0.000 | 2.592 | 3.889 |
| DIS | -2.1728 | 0.434 | -5.010 | 0.000 | -3.025 | -1.320 |
| RAD | 0.4380 | 0.389 | 1.126 | 0.261 | -0.327 | 1.202 |
| PTRATIO | -1.6369 | 0.310 | -5.288 | 0.000 | -2.246 | -1.028 |
| B | 1.2106 | 0.279 | 4.345 | 0.000 | 0.663 | 1.758 |
| LSTAT | -3.9851 | 0.381 | -10.470 | 0.000 | -4.733 | -3.237 |
| intercept | 22.7965 | 0.243 | 93.916 | 0.000 | 22.319 | 23.274 |

|  |  |  |  |
|---|---|---|---|
| Omnibus: | 126.568 | Durbin-Watson: | 2.033 |
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 542.197 |
| Skew: | 1.310 | Prob(JB): | 1.83e-118 |
| Kurtosis: | 8.034 | Cond. No. | 4.66 |

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Based on the p-values, RAD (p = 0.261) should also be dropped; the other variables are kept.
7. Model 3: drop AGE, NOX, TAX, and RAD
```python
X_train2 = training_data.drop(['AGE', 'NOX', 'TAX', 'RAD', 'MedianHomePrice'], axis=1, inplace=False)
lm2 = sm.OLS(training_data['MedianHomePrice'], X_train2)
result2 = lm2.fit()
result2.summary()
```

|  |  |  |  |
|---|---|---|---|
| Dep. Variable: | MedianHomePrice | R-squared: | 0.733 |
| Model: | OLS | Adj. R-squared: | 0.726 |
| Method: | Least Squares | F-statistic: | 119.9 |
| Date: | Sun, 10 May 2020 | Prob (F-statistic): | 4.60e-107 |
| Time: | 21:02:09 | Log-Likelihood: | -1208.6 |
| No. Observations: | 404 | AIC: | 2437. |
| Df Residuals: | 394 | BIC: | 2477. |
| Df Model: | 9 |  |  |
| Covariance Type: | nonrobust |  |  |

|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| CRIM | -0.7616 | 0.288 | -2.647 | 0.008 | -1.327 | -0.196 |
| ZN | 0.6151 | 0.360 | 1.707 | 0.089 | -0.093 | 1.323 |
| INDUS | -0.7544 | 0.397 | -1.900 | 0.058 | -1.535 | 0.026 |
| CHAS | 0.7067 | 0.252 | 2.810 | 0.005 | 0.212 | 1.201 |
| RM | 3.3022 | 0.326 | 10.142 | 0.000 | 2.662 | 3.942 |
| DIS | -2.2235 | 0.432 | -5.153 | 0.000 | -3.072 | -1.375 |
| PTRATIO | -1.5090 | 0.288 | -5.239 | 0.000 | -2.075 | -0.943 |
| B | 1.1502 | 0.273 | 4.206 | 0.000 | 0.613 | 1.688 |
| LSTAT | -3.9413 | 0.379 | -10.406 | 0.000 | -4.686 | -3.197 |
| intercept | 22.7965 | 0.243 | 93.884 | 0.000 | 22.319 | 23.274 |

|  |  |  |  |
|---|---|---|---|
| Omnibus: | 134.948 | Durbin-Watson: | 2.028 |
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 619.161 |
| Skew: | 1.381 | Prob(JB): | 3.56e-135 |
| Kurtosis: | 8.399 | Cond. No. | 4.36 |

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Double-check that all the VIFs are now below 4. Compared with the previous model, the R-squared value is unchanged (0.733 in both; see the comparison sketch after the table below).
```python
training_data2 = training_data.drop(['AGE', 'NOX', 'TAX', 'RAD'], axis=1, inplace=False)
vif = vif_calculator(training_data2, 'MedianHomePrice')
vif
```

| VIF Factor | features |
|---|---|
| 0.0 | Intercept |
| 1.4 | CRIM |
| 2.2 | ZN |
| 2.7 | INDUS |
| 1.1 | CHAS |
| 1.8 | RM |
| 3.2 | DIS |
| 1.4 | PTRATIO |
| 1.3 | B |
| 2.4 | LSTAT |
| 0.0 | intercept |
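To make the "R-squared unchanged" claim concrete, a short sketch comparing the three statsmodels fits side by side:

```python
# R-squared and adjusted R-squared for the three OLS models
for name, res in [('all features', result),
                  ('drop AGE/NOX/TAX', result1),
                  ('drop AGE/NOX/TAX/RAD', result2)]:
    print(name, round(res.rsquared, 3), round(res.rsquared_adj, 3))
```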
8. Model evaluation

Score how closely each model's predictions on the test set match the actual test values.
```python
# Model with all the variables
lm_full = LinearRegression()
lm_full.fit(X_train, y_train)
lm_full.score(X_test, y_test)  # score
```

```
0.66848257539715972
```

```python
# Drop AGE, NOX, TAX
X_train_red = X_train.drop(['AGE', 'NOX', 'TAX'], axis=1, inplace=False)
X_test_red = X_test.drop(['AGE', 'NOX', 'TAX'], axis=1, inplace=False)

# Drop AGE, NOX, TAX, RAD
X_train_red2 = X_train.drop(['AGE', 'NOX', 'TAX', 'RAD'], axis=1, inplace=False)
X_test_red2 = X_test.drop(['AGE', 'NOX', 'TAX', 'RAD'], axis=1, inplace=False)

lm_red = LinearRegression()  # model without AGE, NOX, TAX
lm_red.fit(X_train_red, y_train)
print(lm_red.score(X_test_red, y_test))  # score

lm_red2 = LinearRegression()  # model without AGE, NOX, TAX, RAD
lm_red2.fit(X_train_red2, y_train)
print(lm_red2.score(X_test_red2, y_test))  # score
```

```
0.639421781821
0.63441065636
```
Summary

The scores show that, on this test set, the model with all the variables performs best (0.668 vs. 0.639 and 0.634). A follow-up would be cross-validation, i.e. repeating this fit-and-score procedure over multiple train/test splits, to check whether the model's performance is stable.
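As a sketch of that follow-up, 5-fold cross-validation on the full-feature model; the fold count and R² scoring are assumptions, not from the original notebook:

```python
from sklearn.model_selection import cross_val_score

# Score the full model on 5 different train/test partitions
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')
print(scores)
print('mean: %.3f, std: %.3f' % (scores.mean(), scores.std()))
```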