Machine Learning: A Linear Regression Experiment on California Housing Prices

Published: 2023/12/8

1. Principle of the closed-form solution for linear regression

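Let $X$ be an m×(n+1) matrix, $Y$ an m×1 matrix, and $\theta$ an (n+1)×1 matrix. The hypothesis defined earlier can then be written as $h_\theta(X) = X\theta$, and the cost function as $J(\theta) = \frac{1}{2}(X\theta - Y)^T (X\theta - Y)$. $J(\theta)$ is a convex function, so to minimize it we only need to differentiate it and set the derivative to zero. Differentiating gives $X^T X\theta - X^T Y$; setting this equal to zero yields $\theta = (X^T X)^{-1} X^T Y$.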
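The derivation above can be checked numerically. The following is a minimal sketch on synthetic data (the data, the seed, and names such as `true_theta` are assumptions for illustration, not part of the experiment). It evaluates the closed-form expression as written, compares it with a linear solve of the same system, and confirms that the gradient vanishes at the solution.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the housing data)
rng = np.random.default_rng(0)
m, n = 100, 3
features = rng.normal(size=(m, n))
X = np.hstack([np.ones((m, 1)), features])  # m x (n+1); first column is the intercept
true_theta = np.array([[1.0], [2.0], [-3.0], [0.5]])
Y = X @ true_theta + rng.normal(scale=0.01, size=(m, 1))

# theta = (X^T X)^{-1} X^T Y, written exactly as in the derivation
theta_inv = np.linalg.inv(X.T @ X) @ X.T @ Y

# Same system solved without forming the inverse explicitly
theta_solve = np.linalg.solve(X.T @ X, X.T @ Y)

# The gradient X^T X theta - X^T Y should be (numerically) zero at the solution
grad_at_solution = X.T @ X @ theta_solve - X.T @ Y

print(theta_solve.ravel())
```

Using `np.linalg.solve` instead of an explicit inverse is a common choice for numerical stability; both give the same θ here.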

2. Principle of the gradient descent solution for linear regression

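We have constructed the hypothesis $h_\theta(x)$ and obtained the loss function $J(\theta)$, and we want the $\theta$ that minimizes $J(\theta)$. The idea is still based on the partial derivatives, but instead of solving directly for the point where they are zero, we move toward it step by step. For one sample, the partial derivative of the loss with respect to $\theta_j$ is $(h_\theta(x) - y)\,x_j$, which gives the update rule for $\theta_j$:

$\theta_j := \theta_j - \eta\,(h_\theta(x) - y)\,x_j$

where $\eta$ is the learning rate (eta in the program).

Because the dataset is fairly large, stochastic gradient descent was adopted, although its accuracy is lower than that of batch gradient descent. In my program the data is stored in matrix form, so the update can be written as

$\theta := \theta - \frac{\eta}{m} X^T (X\theta - Y)$

where $\theta$ is (n+1)×1, $X$ is m×(n+1), and $Y$ is m×1. Note that this matrix form, which matches the gradient function in the program listing, uses all m samples in every step. A zero gradient is detected in practice by the change in the loss between two consecutive iterations falling below 1e-18. In general such a point is only guaranteed to be a local minimum, not necessarily the global one; because $J(\theta)$ here is convex, it is also the global minimum.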
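As a small illustration of the matrix-form update and the loss-difference stopping rule, here is a sketch on synthetic data. The function name `batch_gradient_descent`, the tolerance, the learning rate, and the `max_iter` guard are assumptions chosen for this example, not the values used in the experiment. It compares the result with the closed-form solution.

```python
import numpy as np

def cost(X, theta, Y):
    """Half the mean squared residual, J(theta) = sum((X theta - Y)^2) / (2m)."""
    residual = X @ theta - Y
    return float(np.sum(residual ** 2)) / (2 * len(Y))

def batch_gradient_descent(X, Y, eta=0.1, tol=1e-12, max_iter=100_000):
    """Repeat theta := theta - eta/m * X^T (X theta - Y) until the loss stops changing."""
    theta = np.zeros((X.shape[1], 1))
    previous = cost(X, theta, Y)
    for _ in range(max_iter):  # guard against a loop that never meets the tolerance
        theta = theta - eta * (X.T @ (X @ theta - Y)) / len(Y)
        current = cost(X, theta, Y)
        if abs(previous - current) < tol:
            break
        previous = current
    return theta

# Synthetic data (illustrative only)
rng = np.random.default_rng(1)
m = 200
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, 2))])
Y = X @ np.array([[0.5], [1.5], [-2.0]]) + rng.normal(scale=0.05, size=(m, 1))

theta_gd = batch_gradient_descent(X, Y)
theta_closed = np.linalg.solve(X.T @ X, X.T @ Y)
print(np.abs(theta_gd - theta_closed).max())
```

Because the loss is convex, the iterates approach the same θ that the normal equation gives; the remaining gap depends on the tolerance.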

3. Related files

4. Program listing

Required packages:

import os
import time

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

(1) Reading the data:

# Load the data
HOUSE_PATH = './'

def load_housing_data(housing_path=HOUSE_PATH):
    csv_path = os.path.join(housing_path, 'housing.csv')
    return pd.read_csv(csv_path)

housing = load_housing_data()

(2) Data processing:

# Fill missing values with the median
median = housing["total_bedrooms"].median()
housing["total_bedrooms"] = housing["total_bedrooms"].fillna(median)

# One-hot encode the categorical column
housing_category = housing[["ocean_proximity"]]
cat_encoder = OneHotEncoder()
housing_category_onehot = cat_encoder.fit_transform(housing_category)
housing = housing.drop("ocean_proximity", axis=1)
housing_values = np.c_[housing.values, housing_category_onehot.toarray()]
housing_fixed = pd.DataFrame(
    housing_values,
    columns=list(housing.columns) + ['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
    index=housing.index,
)
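The one-hot column names above are written out by hand, in the order the encoder produces them. As a sketch of an alternative, the fitted encoder exposes the categories it found through its `categories_` attribute, so the names can be read from there. The toy DataFrame below is an assumption for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the ocean_proximity column (illustrative values only)
toy = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "<1H OCEAN"]})

encoder = OneHotEncoder()
onehot = encoder.fit_transform(toy[["ocean_proximity"]])  # sparse matrix by default

# categories_ holds one array per encoded column, sorted in the encoder's own order
column_names = list(encoder.categories_[0])
encoded = pd.DataFrame(onehot.toarray(), columns=column_names, index=toy.index)
print(encoded)
```

Reading the names from the encoder keeps the labels aligned with the encoded columns even if the set of categories changes.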

(3) Analyzing feature correlations

# Analyze feature correlations
corr_matrix = housing_fixed.corr()  # pairwise correlation coefficients between features
# Correlation of every feature with the median house value of the block
correlation = corr_matrix["median_house_value"].sort_values(ascending=False)
print(correlation)

(4) Splitting the dataset

# Split into training and test sets
train_set, test_set = train_test_split(housing_fixed, test_size=0.3, random_state=42)
feature_cols = ['longitude', 'latitude', 'housing_median_age', 'total_rooms',
                'total_bedrooms', 'population', 'households', 'median_income',
                '<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN']
X_train = train_set[feature_cols]
y_train = train_set[["median_house_value"]]
X_test = test_set[feature_cols]
y_test = test_set[["median_house_value"]]

(5) Feature standardization

# Standardize features and target; each scaler is fitted on the training set only
stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)
X_test_std = stdsc.transform(X_test)
stdsc_y = StandardScaler()
y_train_std = stdsc_y.fit_transform(y_train)
y_test_std = stdsc_y.transform(y_test)

# Prepend a column of ones for the intercept term
X = np.hstack([np.ones((len(X_train_std), 1)), X_train_std])  # training set X
Y = np.array(y_train_std)  # training set Y
x = np.hstack([np.ones((len(X_test_std), 1)), X_test_std])  # test set x
y = np.array(y_test_std)  # test set y
y_var = np.var(y)  # variance of the test targets
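Since the target is standardized as well, predictions and errors come out in standardized units rather than dollars. The sketch below, on made-up numbers with a hypothetical `y_scaler`, shows one way to fit a scaler on the training targets only, reuse it for the test targets, and map values back to the original scale with `inverse_transform`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up target values (illustrative only)
y_train = np.array([[100000.0], [200000.0], [300000.0]])
y_test = np.array([[150000.0], [250000.0]])

y_scaler = StandardScaler()
y_train_std = y_scaler.fit_transform(y_train)  # learns mean and std from the training targets
y_test_std = y_scaler.transform(y_test)        # reuses the training mean and std

# Values in standardized units can be mapped back to the original scale
y_back = y_scaler.inverse_transform(y_test_std)
print(y_back.ravel())
```

Keeping a separate scaler for the target means the feature scaler and the target scaler can each be inverted independently.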

(6) Initializing θ

theta = np.zeros((14, 1))  # initialize theta: 13 features plus the intercept

(7) Normal equation

# Cost function
def cost_function(x, theta, y):
    cost = np.sum((np.dot(x, theta) - y) ** 2)
    return cost / (2 * len(y))

# Normal equation
def nomal(X, Y):
    theta = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(Y)
    return theta

# Gradient of the cost function
def gradient(X, theta, Y):
    return X.T.dot(X.dot(theta) - Y) / len(Y)

# Gradient descent
def gradient_descent(X, theta, Y, eta):
    while True:
        last_theta = theta
        grad = gradient(X, theta, Y)
        theta = theta - eta * grad
        # Stop once the cost no longer changes between two iterations
        if abs(cost_function(X, last_theta, Y) - cost_function(X, theta, Y)) < 1e-18:
            break
    return theta

start = time.time()  # start timing
# theta = nomal(X, Y)  # closed-form solution
theta = gradient_descent(X, theta, Y, 0.001)  # gradient descent

# Evaluation metric (R2)
def evaluation(x, theta, y, y_var):
    return 1 - (np.sum((np.dot(x, theta) - y) ** 2) / (y_var * len(y)))

MSE = np.sum(np.power(np.dot(x, theta) - y, 2)) / len(y)
cost = cost_function(x, theta, y)  # cost on the test set
R2 = evaluation(x, theta, y, y_var)  # R2 on the test set
end = time.time()

# print("The normal equations:")
print("Gradient descent:")
print("theta=")
print(theta)
print("MSE=", MSE)
print("cost=", cost)
print("R2=", R2)
print('Running time: %s Seconds' % (end - start))
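The MSE and R2 expressions in the listing can be cross-checked against scikit-learn's `mean_squared_error` and `r2_score`. This is a minimal sketch on random data (the arrays `y_true` and `y_pred` are assumptions standing in for the test targets and the model predictions).

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Random stand-ins for test targets and predictions (illustrative only)
rng = np.random.default_rng(2)
y_true = rng.normal(size=(50, 1))
y_pred = y_true + rng.normal(scale=0.3, size=(50, 1))

# Same formulas as in the listing
mse_manual = np.sum((y_pred - y_true) ** 2) / len(y_true)
cost_manual = mse_manual / 2  # the cost function is half the MSE
r2_manual = 1 - np.sum((y_pred - y_true) ** 2) / (np.var(y_true) * len(y_true))

print(mse_manual, mean_squared_error(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))
```

The two R2 values agree because `np.var(y) * len(y)` equals the total sum of squares around the mean, which is the denominator `r2_score` uses.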

Experimental results:
(1) Normal equation solution

(2) Gradient descent solution

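The two methods give very similar results. On the test set, the loss is 0.18 for both and R2 is slightly above 0.6 for both, with gradient descent fitting marginally better than the normal equation. During the run, the normal equation is clearly much faster than gradient descent, because gradient descent has to update $\theta$ over many iterations before it reaches a good solution. Gradient descent, however, still works well when the number of features n is large, whereas the normal equation has to compute $(X^T X)^{-1}$, which becomes expensive with many features because inverting the matrix has time complexity $O(n^3)$. The normal equation also applies only to linear models and cannot be used for other models such as logistic regression. In this model the number of features is small, so solving with the normal equation is the more reasonable choice.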
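One practical caveat with the $(X^T X)^{-1}$ step: when every one-hot column of a category is kept and then standardized, those columns are linearly dependent (weighting each column by its standard deviation and summing gives zero), so $X^T X$ is singular in exact arithmetic. The sketch below uses toy data with hypothetical names to show the rank deficiency, and uses `np.linalg.lstsq`, which returns a minimum-norm least-squares solution, as one possible alternative to an explicit inverse. It is an illustration under these assumptions, not a statement about what the original experiment observed.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200

# Toy one-hot block: each row has exactly one 1 among three categories
cats = rng.integers(0, 3, size=m)
onehot = np.eye(3)[cats]
onehot_std = (onehot - onehot.mean(axis=0)) / onehot.std(axis=0)  # standardize each column

extra = rng.normal(size=(m, 1))  # one ordinary numeric feature
X = np.hstack([np.ones((m, 1)), onehot_std, extra])  # 5 columns including the intercept
Y = 2.0 * extra + 0.5 * onehot[:, [0]] + rng.normal(scale=0.1, size=(m, 1))

# The standardized one-hot columns are linearly dependent, so X loses one rank
rank = np.linalg.matrix_rank(X)

# Least-squares solve that tolerates the rank deficiency
theta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
theta_pinv = np.linalg.pinv(X) @ Y  # pseudo-inverse gives the same fitted values

print("columns:", X.shape[1], "rank:", rank)
```

Dropping one of the one-hot columns before fitting is another common way to remove the dependence.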
