Multiple Linear Regression Exercise: Predicting House Prices
Objective:
Review the feature descriptions for the dataset, then use the other variables to build the best model for predicting the median home price.
Dataset description:
The dataset contains 506 cases in total.
Each case has 14 attributes:
| Attribute | Description |
|---|---|
| MedianHomePrice | Median home price (the response variable; a copy of MEDV below) |
| CRIM | Per capita crime rate by town |
| ZN | Proportion of residential land zoned for lots over 25,000 sq. ft. |
| INDUS | Proportion of non-retail business acres per town |
| CHAS | Charles River dummy variable (1 if the tract bounds the river; 0 otherwise) |
| NOX | Nitric oxides concentration (parts per 10 million) |
| RM | Average number of rooms per dwelling |
| AGE | Proportion of owner-occupied units built prior to 1940 |
| DIS | Weighted distances to five Boston employment centres |
| RAD | Index of accessibility to radial highways |
| TAX | Full-value property-tax rate per $10,000 |
| PTRATIO | Pupil-teacher ratio by town |
| B | 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town |
| LSTAT | Percentage of the population with lower socioeconomic status |
| MEDV | Median value of owner-occupied homes (in units of $1,000) |
Set up the libraries and the data.
```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.datasets import load_boston  # note: load_boston was removed in scikit-learn 1.2; this notebook predates that
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score
from patsy import dmatrices
import matplotlib.pyplot as plt
%matplotlib inline

np.random.seed(42)

# Load the built-in Boston housing dataset
boston_data = load_boston()
df = pd.DataFrame()
df['MedianHomePrice'] = boston_data.target
df2 = pd.DataFrame(boston_data.data)
df2.columns = boston_data.feature_names
df = df.join(df2)
df.head()
```

| MedianHomePrice | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 24.0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 |
| 21.6 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 |
| 34.7 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 33.4 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 36.2 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 |
1. Summarize each feature in the dataset
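Before looking at correlations, a quick per-feature summary is useful. The original notebook does not show this step; a minimal sketch with pandas' built-in `describe`:

```python
# Count, mean, std, min, quartiles, and max for every column
df.describe()
```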
Use the `corr` method to compute the correlations between the variables and check for multicollinearity.
```python
# Plot a correlation heatmap
import seaborn as sns
plt.subplots(figsize=(10, 10))  # enlarge the figure
sns.heatmap(df.corr(), annot=True, vmax=1, square=True, cmap='RdPu')
```
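Reading pairs off the heatmap by eye works, but a short sketch can list the strongly correlated feature pairs directly. The 0.7 threshold below is an arbitrary choice, not from the original notebook:

```python
# List feature pairs whose absolute correlation exceeds a chosen threshold
corr = df.drop('MedianHomePrice', axis=1).corr()
pairs = (corr.abs()
             .where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # keep upper triangle only
             .stack()
             .sort_values(ascending=False))
print(pairs[pairs > 0.7])
```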
2. Split the dataset

Create a training set and a test set, with 20% of the data in the test set. Store the results in X_train, X_test, y_train, and y_test.
```python
X = df.drop('MedianHomePrice', axis=1, inplace=False)
y = df['MedianHomePrice']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
3. Standardization

Use StandardScaler to scale all of the x variables. Store the result in X_scaled_train.
```python
# Reset y_train's index to start from 0: its original index no longer matches
# the index of training_data below, so joining them would misalign rows
y_train = pd.Series(y_train.values)

# Use StandardScaler to scale all the x variables; store the result in X_scaled_train
X_scaled_train = StandardScaler()

# Build a pandas DataFrame holding the scaled x variables plus y_train; call it training_data
training_data = X_scaled_train.fit_transform(X_train)
training_data = pd.DataFrame(training_data, columns=X_train.columns)
training_data['MedianHomePrice'] = y_train
training_data.head()
```

| CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MedianHomePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.287702 | -0.500320 | 1.033237 | -0.278089 | 0.489252 | -1.428069 | 1.028015 | -0.802173 | 1.706891 | 1.578434 | 0.845343 | -0.074337 | 1.753505 | 12.0 |
| -0.336384 | -0.500320 | -0.413160 | -0.278089 | -0.157233 | -0.680087 | -0.431199 | 0.324349 | -0.624360 | -0.584648 | 1.204741 | 0.430184 | -0.561474 | 19.9 |
| -0.403253 | 1.013271 | -0.715218 | -0.278089 | -1.008723 | -0.402063 | -1.618599 | 1.330697 | -0.974048 | -0.602724 | -0.637176 | 0.065297 | -0.651595 | 19.4 |
| 0.388230 | -0.500320 | 1.033237 | -0.278089 | 0.489252 | -0.300450 | 0.591681 | -0.839240 | 1.706891 | 1.578434 | 0.845343 | -3.868193 | 1.525387 | 13.4 |
| -0.325282 | -0.500320 | -0.413160 | -0.278089 | -0.157233 | -0.831094 | 0.033747 | -0.005494 | -0.624360 | -0.584648 | 1.204741 | 0.379119 | -0.165787 | 18.2 |
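Note that `X_scaled_train` here is actually the fitted `StandardScaler` object, not the scaled data itself. The notebook never scales the test set (step 8 scores models on the unscaled features), but if you wanted test predictions on the same scale, the sketch below reuses the fitted scaler; `test_data` is a name introduced here for illustration, not from the original:

```python
# Transform (do not re-fit!) the test features with the scaler fitted on X_train,
# so train and test are standardized identically
test_data = pd.DataFrame(X_scaled_train.transform(X_test), columns=X_test.columns)
test_data.head()
```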
4. Model 1: all features

Fit a linear model on training_data and check the p-values for significance.
```python
# Fit a linear model using all the scaled features to predict the response
# (median home price). Don't forget to add an intercept.
training_data['intercept'] = 1
X_train1 = training_data.drop('MedianHomePrice', axis=1, inplace=False)
lm = sm.OLS(training_data['MedianHomePrice'], X_train1)
result = lm.fit()
result.summary()
```

|  |  |  |  |
|---|---|---|---|
| Dep. Variable: | MedianHomePrice | R-squared: | 0.751 |
| Model: | OLS | Adj. R-squared: | 0.743 |
| Method: | Least Squares | F-statistic: | 90.43 |
| Date: | Sun, 10 May 2020 | Prob (F-statistic): | 6.21e-109 |
| Time: | 20:22:27 | Log-Likelihood: | -1194.3 |
| No. Observations: | 404 | AIC: | 2417. |
| Df Residuals: | 390 | BIC: | 2473. |
| Df Model: | 13 |  |  |
| Covariance Type: | nonrobust |  |  |

|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| CRIM | -1.0021 | 0.308 | -3.250 | 0.001 | -1.608 | -0.396 |
| ZN | 0.6963 | 0.370 | 1.882 | 0.061 | -0.031 | 1.423 |
| INDUS | 0.2781 | 0.464 | 0.599 | 0.549 | -0.634 | 1.190 |
| CHAS | 0.7187 | 0.247 | 2.914 | 0.004 | 0.234 | 1.204 |
| NOX | -2.0223 | 0.498 | -4.061 | 0.000 | -3.001 | -1.043 |
| RM | 3.1452 | 0.329 | 9.567 | 0.000 | 2.499 | 3.792 |
| AGE | -0.1760 | 0.407 | -0.432 | 0.666 | -0.977 | 0.625 |
| DIS | -3.0819 | 0.481 | -6.408 | 0.000 | -4.027 | -2.136 |
| RAD | 2.2514 | 0.652 | 3.454 | 0.001 | 0.970 | 3.533 |
| TAX | -1.7670 | 0.704 | -2.508 | 0.013 | -3.152 | -0.382 |
| PTRATIO | -2.0378 | 0.321 | -6.357 | 0.000 | -2.668 | -1.408 |
| B | 1.1296 | 0.271 | 4.166 | 0.000 | 0.596 | 1.663 |
| LSTAT | -3.6117 | 0.395 | -9.133 | 0.000 | -4.389 | -2.834 |
| intercept | 22.7965 | 0.236 | 96.774 | 0.000 | 22.333 | 23.260 |

|  |  |  |  |
|---|---|---|---|
| Omnibus: | 133.052 | Durbin-Watson: | 2.114 |
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 579.817 |
| Skew: | 1.379 | Prob(JB): | 1.24e-126 |
| Kurtosis: | 8.181 | Cond. No. | 9.74 |

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
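Rather than scanning the table by eye, a small sketch can pull the non-significant terms out of the fitted result (`result.pvalues` is a pandas Series in statsmodels):

```python
# Terms whose p-value exceeds 0.05: expect AGE and INDUS (ZN is borderline at 0.061)
print(result.pvalues[result.pvalues > 0.05])
```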
5. Check whether the explanatory variables are correlated with one another

Compute the VIF of each variable on the training set.
```python
# Compute the VIF for each x variable in the dataset
def vif_calculator(df, response):
    '''
    INPUT:
    df - a dataframe holding the x and y variables
    response - the column name of the response variable (string)
    OUTPUT:
    vif - a dataframe of the vifs
    '''
    df2 = df.drop(response, axis=1, inplace=False)  # drop the response column
    features = "+".join(df2.columns)
    y, X = dmatrices(response + ' ~' + features, df, return_type='dataframe')
    vif = pd.DataFrame()
    vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    vif["features"] = X.columns
    vif = vif.round(1)
    return vif

vif = vif_calculator(training_data, 'MedianHomePrice')
vif
```

```
C:\ProgramData\Anaconda3\lib\site-packages\statsmodels\regression\linear_model.py:1685: RuntimeWarning: divide by zero encountered in double_scalars
  return 1 - self.ssr/self.centered_tss
```

(The RuntimeWarning and the two zero VIFs are artifacts of having two constant columns, the manually added `intercept` plus the `Intercept` that patsy adds; they can be ignored.)

| VIF Factor | features |
|---|---|
| 0.0 | Intercept |
| 1.7 | CRIM |
| 2.5 | ZN |
| 3.9 | INDUS |
| 1.1 | CHAS |
| 4.5 | NOX |
| 1.9 | RM |
| 3.0 | AGE |
| 4.2 | DIS |
| 7.7 | RAD |
| 8.9 | TAX |
| 1.9 | PTRATIO |
| 1.3 | B |
| 2.8 | LSTAT |
| 0.0 | intercept |
Combining the VIFs, correlations, and p-values, decide which variables to drop:

- Keep VIFs below 4. INDUS, RAD, TAX, and NOX have high VIFs.
- TAX and RAD are strongly correlated, as are INDUS and NOX, so dropping just one variable from each highly correlated pair is enough to bring the other's VIF down (the quick check below confirms the pairwise correlations).
- Keep p-values below 0.05. AGE and INDUS have high p-values.

Based on the p-values and VIFs, if we choose to keep RAD and INDUS, then AGE, NOX, and TAX should be dropped. After removing these features, fit a new linear model with the remaining ones.
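A quick sketch to verify the two claimed correlations on the training data:

```python
# Pairwise correlations behind the drop decision: TAX vs. RAD and INDUS vs. NOX
print(training_data['TAX'].corr(training_data['RAD']))
print(training_data['INDUS'].corr(training_data['NOX']))
```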
6. Model 2: drop AGE, NOX, and TAX
```python
X_train1 = training_data.drop(['AGE', 'NOX', 'TAX', 'MedianHomePrice'], axis=1, inplace=False)
lm1 = sm.OLS(training_data['MedianHomePrice'], X_train1)
result1 = lm1.fit()
result1.summary()
```

|  |  |  |  |
|---|---|---|---|
| Dep. Variable: | MedianHomePrice | R-squared: | 0.733 |
| Model: | OLS | Adj. R-squared: | 0.727 |
| Method: | Least Squares | F-statistic: | 108.1 |
| Date: | Sun, 10 May 2020 | Prob (F-statistic): | 2.77e-106 |
| Time: | 21:02:41 | Log-Likelihood: | -1208.0 |
| No. Observations: | 404 | AIC: | 2438. |
| Df Residuals: | 393 | BIC: | 2482. |
| Df Model: | 10 |  |  |
| Covariance Type: | nonrobust |  |  |

|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| CRIM | -0.9116 | 0.317 | -2.876 | 0.004 | -1.535 | -0.289 |
| ZN | 0.5622 | 0.363 | 1.548 | 0.123 | -0.152 | 1.276 |
| INDUS | -0.8746 | 0.411 | -2.128 | 0.034 | -1.683 | -0.067 |
| CHAS | 0.6896 | 0.252 | 2.738 | 0.006 | 0.194 | 1.185 |
| RM | 3.2406 | 0.330 | 9.818 | 0.000 | 2.592 | 3.889 |
| DIS | -2.1728 | 0.434 | -5.010 | 0.000 | -3.025 | -1.320 |
| RAD | 0.4380 | 0.389 | 1.126 | 0.261 | -0.327 | 1.202 |
| PTRATIO | -1.6369 | 0.310 | -5.288 | 0.000 | -2.246 | -1.028 |
| B | 1.2106 | 0.279 | 4.345 | 0.000 | 0.663 | 1.758 |
| LSTAT | -3.9851 | 0.381 | -10.470 | 0.000 | -4.733 | -3.237 |
| intercept | 22.7965 | 0.243 | 93.916 | 0.000 | 22.319 | 23.274 |

|  |  |  |  |
|---|---|---|---|
| Omnibus: | 126.568 | Durbin-Watson: | 2.033 |
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 542.197 |
| Skew: | 1.310 | Prob(JB): | 1.83e-118 |
| Kurtosis: | 8.034 | Cond. No. | 4.66 |

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Based on the p-values, RAD (p = 0.261) should also be dropped; the other variables are kept.
7. Model 3: drop AGE, NOX, TAX, and RAD
```python
X_train2 = training_data.drop(['AGE', 'NOX', 'TAX', 'RAD', 'MedianHomePrice'], axis=1, inplace=False)
lm2 = sm.OLS(training_data['MedianHomePrice'], X_train2)
result2 = lm2.fit()
result2.summary()
```

|  |  |  |  |
|---|---|---|---|
| Dep. Variable: | MedianHomePrice | R-squared: | 0.733 |
| Model: | OLS | Adj. R-squared: | 0.726 |
| Method: | Least Squares | F-statistic: | 119.9 |
| Date: | Sun, 10 May 2020 | Prob (F-statistic): | 4.60e-107 |
| Time: | 21:02:09 | Log-Likelihood: | -1208.6 |
| No. Observations: | 404 | AIC: | 2437. |
| Df Residuals: | 394 | BIC: | 2477. |
| Df Model: | 9 |  |  |
| Covariance Type: | nonrobust |  |  |

|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| CRIM | -0.7616 | 0.288 | -2.647 | 0.008 | -1.327 | -0.196 |
| ZN | 0.6151 | 0.360 | 1.707 | 0.089 | -0.093 | 1.323 |
| INDUS | -0.7544 | 0.397 | -1.900 | 0.058 | -1.535 | 0.026 |
| CHAS | 0.7067 | 0.252 | 2.810 | 0.005 | 0.212 | 1.201 |
| RM | 3.3022 | 0.326 | 10.142 | 0.000 | 2.662 | 3.942 |
| DIS | -2.2235 | 0.432 | -5.153 | 0.000 | -3.072 | -1.375 |
| PTRATIO | -1.5090 | 0.288 | -5.239 | 0.000 | -2.075 | -0.943 |
| B | 1.1502 | 0.273 | 4.206 | 0.000 | 0.613 | 1.688 |
| LSTAT | -3.9413 | 0.379 | -10.406 | 0.000 | -4.686 | -3.197 |
| intercept | 22.7965 | 0.243 | 93.884 | 0.000 | 22.319 | 23.274 |

|  |  |  |  |
|---|---|---|---|
| Omnibus: | 134.948 | Durbin-Watson: | 2.028 |
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 619.161 |
| Skew: | 1.381 | Prob(JB): | 3.56e-135 |
| Kurtosis: | 8.399 | Cond. No. | 4.36 |

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Double-check that all the VIFs are now below 4. Compared with the previous model, the R-squared value is unchanged (0.733 in both; see the comparison sketch after the table below).
```python
training_data2 = training_data.drop(['AGE', 'NOX', 'TAX', 'RAD'], axis=1, inplace=False)
vif = vif_calculator(training_data2, 'MedianHomePrice')
vif
```

| VIF Factor | features |
|---|---|
| 0.0 | Intercept |
| 1.4 | CRIM |
| 2.2 | ZN |
| 2.7 | INDUS |
| 1.1 | CHAS |
| 1.8 | RM |
| 3.2 | DIS |
| 1.4 | PTRATIO |
| 1.3 | B |
| 2.4 | LSTAT |
| 0.0 | intercept |
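To make the "R-squared unchanged" claim concrete, a short sketch comparing the three statsmodels fits side by side:

```python
# R-squared and adjusted R-squared for the three OLS models
for name, res in [('all features', result),
                  ('drop AGE/NOX/TAX', result1),
                  ('drop AGE/NOX/TAX/RAD', result2)]:
    print(name, round(res.rsquared, 3), round(res.rsquared_adj, 3))
```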
8. Model evaluation

Score how closely each model's predictions on the test set match the actual test values.
```python
# Model with all the variables
lm_full = LinearRegression()
lm_full.fit(X_train, y_train)
lm_full.score(X_test, y_test)  # score
```

```
0.66848257539715972
```

```python
# Drop AGE, NOX, TAX
X_train_red = X_train.drop(['AGE', 'NOX', 'TAX'], axis=1, inplace=False)
X_test_red = X_test.drop(['AGE', 'NOX', 'TAX'], axis=1, inplace=False)

# Drop AGE, NOX, TAX, RAD
X_train_red2 = X_train.drop(['AGE', 'NOX', 'TAX', 'RAD'], axis=1, inplace=False)
X_test_red2 = X_test.drop(['AGE', 'NOX', 'TAX', 'RAD'], axis=1, inplace=False)

lm_red = LinearRegression()  # model without AGE, NOX, TAX
lm_red.fit(X_train_red, y_train)
print(lm_red.score(X_test_red, y_test))  # score

lm_red2 = LinearRegression()  # model without AGE, NOX, TAX, RAD
lm_red2.fit(X_train_red2, y_train)
print(lm_red2.score(X_test_red2, y_test))  # score
```

```
0.639421781821
0.63441065636
```
Summary

The scores show that, on this test set, the model with all the variables performs best (0.668 vs. 0.639 and 0.634). A follow-up would be cross-validation, i.e. repeating this fit-and-score procedure over multiple train/test splits, to check whether the model's performance is stable.
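As a sketch of that follow-up, 5-fold cross-validation on the full-feature model; the fold count and R² scoring are assumptions, not from the original notebook:

```python
from sklearn.model_selection import cross_val_score

# Score the full model on 5 different train/test partitions
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')
print(scores)
print('mean: %.3f, std: %.3f' % (scores.mean(), scores.std()))
```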