House Price Prediction: A Regression Problem

Published: 2025/4/16

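Another common type of machine learning problem is regression, which predicts a continuous value rather than a discrete label: for example, predicting tomorrow's temperature from meteorological data, or predicting how long a software project will take from its specification.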

Data overview

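First, a look at the data. The goal is to predict the median price of homes in the Boston area in the 1970s, given data points such as the crime rate and the property tax rate at the time. The dataset is relatively small: only 506 data points, split into 404 training samples and 102 test samples. Each feature in the input data has its own range of values. Some features are proportions that take values between 0 and 1, others range from 1 to 12, others from 0 to 100, and so on.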

from keras.datasets import boston_housing

(train_data, train_targets), (test_data, test_targets) = boston_housing.load_data()

train_data.shape
test_data.shape

The results are, respectively:
(404, 13)
(102, 13)

Each sample has 13 features, such as the crime rate, the average number of rooms per dwelling, and accessibility to highways.
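To see how different the feature ranges are, it helps to look at the minimum and maximum of each column. The sketch below uses a small synthetic array (`toy_data`, a hypothetical stand-in for `train_data`) so that it runs without downloading the dataset; the same two calls should work unchanged on the real array.

```python
import numpy as np

# Synthetic stand-in for train_data: three columns with very different scales
# (a proportion in 0-1, a value in 1-12, a percentage in 0-100).
rng = np.random.default_rng(0)
toy_data = np.column_stack([
    rng.uniform(0.0, 1.0, size=50),
    rng.uniform(1.0, 12.0, size=50),
    rng.uniform(0.0, 100.0, size=50),
])

# axis=0 reduces over the rows, giving one value per feature column.
feature_min = toy_data.min(axis=0)
feature_max = toy_data.max(axis=0)

for j, (lo, hi) in enumerate(zip(feature_min, feature_max)):
    print(f"feature {j}: min={lo:.2f}, max={hi:.2f}")
```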

The targets (that is, the values we want the model to predict) are the median house prices, in thousands of dollars.

train_targets

array([ 15.2, 42.3, 50. , 21.1, 17.7, 18.5, 11.3, 15.6, 15.6,
14.4, 12.1, 17.9, 23.1, 19.9, 15.7, 8.8, 50. , 22.5,
24.1, 27.5, 10.9, 30.8, 32.9, 24. , 18.5, 13.3, 22.9,…

Preparing the data

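Feeding values with very different ranges straight into a neural network is problematic. The network may be able to adapt to such heterogeneous data automatically, but learning becomes much harder, because features with large ranges have a disproportionate effect on training. So we first standardize each feature: subtract the feature's mean and divide by its standard deviation, so that every feature is centered on 0 with a standard deviation of 1. Note that the mean and standard deviation applied to the test data are the ones computed on the training data.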

mean = train_data.mean(axis=0)  # axis=0 reduces over the rows, i.e. computes the mean of each column
train_data -= mean
std = train_data.std(axis=0)
train_data /= std

test_data -= mean
test_data /= std
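The effect of this normalization can be checked on a small example. The following is a minimal sketch on synthetic arrays (`toy_train` and `toy_test` are hypothetical stand-ins, not the housing data): after the transform, the training features have mean 0 and standard deviation 1, while the test features, which reuse the training statistics, are only approximately centered.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_train = rng.uniform(0.0, 100.0, size=(200, 3))  # synthetic training features
toy_test = rng.uniform(0.0, 100.0, size=(50, 3))    # synthetic test features

# Statistics come from the training data only.
mean = toy_train.mean(axis=0)
std = toy_train.std(axis=0)

toy_train_norm = (toy_train - mean) / std
toy_test_norm = (toy_test - mean) / std  # same mean/std, never recomputed on test data

print("train mean per feature:", toy_train_norm.mean(axis=0))
print("train std per feature: ", toy_train_norm.std(axis=0))
print("test mean per feature: ", toy_test_norm.mean(axis=0))
```

Reusing the training statistics keeps information from the test set out of the preprocessing step.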

Building the network

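Because so few samples are available, we use a very small network with two hidden layers of 64 units each. In general, the less training data you have, the worse overfitting will be, and using a small network is one way to mitigate it.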

from keras import models
from keras import layers

def build_model():
    # Because we will need to instantiate
    # the same model multiple times,
    # we use a function to construct it.
    model = models.Sequential()
    model.add(layers.Dense(64, activation='relu',
                           input_shape=(train_data.shape[1],)))
    model.add(layers.Dense(64, activation='relu'))
    model.add(layers.Dense(1))
    model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])
    return model

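The network ends with a single unit and no activation, so the last layer is linear. This is the typical setup for scalar regression (regression that predicts a single continuous value). Applying an activation function would constrain the range of the output. For example, with a sigmoid activation on the last layer, the network could only learn to predict values between 0 and 1. Because the last layer here is purely linear, the network is free to learn to predict values in any range.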
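The range restriction is easy to see numerically. This is a small NumPy sketch, not part of the model code: `sigmoid` is written by hand and `pre_activations` is a made-up set of values standing in for what the last layer computes before any activation.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes any real number into the interval from 0 to 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical pre-activation values of the final unit.
pre_activations = np.array([-10.0, -1.0, 0.0, 1.0, 10.0, 50.0])

linear_out = pre_activations            # no activation: values pass through unchanged
sigmoid_out = sigmoid(pre_activations)  # sigmoid: every value ends up between 0 and 1

print("linear :", linear_out)
print("sigmoid:", sigmoid_out)
```

Since the targets here go up to 50 (in thousands of dollars), an output that can never exceed 1 could not represent them.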

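The network is compiled with the mse loss function (mean squared error), a widely used loss for regression problems. The metric monitored during training is mae (mean absolute error).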
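To make the two quantities concrete, here is a small worked example in NumPy. The four targets are taken from the start of `train_targets`; the predictions in `y_pred` are invented for illustration and do not come from any trained model.

```python
import numpy as np

y_true = np.array([15.2, 42.3, 50.0, 21.1])  # first few targets, in thousands of dollars
y_pred = np.array([17.2, 40.3, 46.0, 21.1])  # hypothetical predictions

errors = y_pred - y_true

mse = np.mean(errors ** 2)     # mean squared error: penalizes large errors more heavily
mae = np.mean(np.abs(errors))  # mean absolute error: average miss, in the targets' units

print("MSE:", mse)
print("MAE:", mae)
```

Because MAE is expressed in the same unit as the targets, an MAE of 2.0 here would mean the predictions miss by $2,000 on average.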

K-fold cross-validation

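Because we have so few data points, the validation set would end up very small (about 100 samples, for instance). As a result, the validation score could vary a lot depending on which data points land in the validation set and which in the training set, so a single split would not give a reliable evaluation. We therefore use K-fold cross-validation: split the data into K partitions, train K identical models, each on K-1 partitions while evaluating on the remaining one, and average the K validation scores.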
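The partitioning itself can be sketched on plain indices before involving any model. The helper `k_fold_slices` below is a hypothetical name introduced for illustration; it uses contiguous folds of equal size, on the assumption that the data is already in an arbitrary order.

```python
import numpy as np

def k_fold_slices(num_samples, k):
    """Yield (val_indices, train_indices) for k contiguous, equally sized folds."""
    fold_size = num_samples // k
    indices = np.arange(num_samples)
    for i in range(k):
        val_idx = indices[i * fold_size:(i + 1) * fold_size]
        # Everything before and after the validation slice is used for training.
        train_idx = np.concatenate([indices[:i * fold_size],
                                    indices[(i + 1) * fold_size:]])
        yield val_idx, train_idx

# 404 training samples and k = 4, as in this article.
folds = list(k_fold_slices(404, 4))
for i, (val_idx, train_idx) in enumerate(folds):
    print(f"fold {i}: {len(val_idx)} validation samples, {len(train_idx)} training samples")
```

With 404 samples and 4 folds, each fold validates on 101 samples and trains on the other 303.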

import numpy as np

k = 4
num_val_samples = len(train_data) // k
num_epochs = 100
all_scores = []
for i in range(k):
    print('processing fold #', i)
    # Prepare the validation data: data from partition #i
    val_data = train_data[i * num_val_samples: (i + 1) * num_val_samples]
    val_targets = train_targets[i * num_val_samples: (i + 1) * num_val_samples]

    # Prepare the training data: data from all other partitions
    partial_train_data = np.concatenate(
        [train_data[:i * num_val_samples],
         train_data[(i + 1) * num_val_samples:]],
        axis=0)
    # np.concatenate joins the two arrays; axis=0 stacks them row-wise (vertically)
    partial_train_targets = np.concatenate(
        [train_targets[:i * num_val_samples],
         train_targets[(i + 1) * num_val_samples:]],
        axis=0)

    # Build the Keras model (already compiled)
    model = build_model()
    # Train the model (in silent mode, verbose=0)
    model.fit(partial_train_data, partial_train_targets,
              epochs=num_epochs, batch_size=1, verbose=0)
    # Evaluate the model on the validation data
    val_mse, val_mae = model.evaluate(val_data, val_targets, verbose=0)
    all_scores.append(val_mae)

Here the number of epochs is set to 100. Running this gives:

all_scores

[2.0750808349930412, 2.117215852926273, 2.9140411863232605, 2.4288365227161068]

np.mean(all_scores)

2.3837935992396706
After 4-fold cross-validation, the predictions are off from the true house prices by about $2,400 on average.
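The four scores also show why averaging matters. The short sketch below recomputes the mean from the values printed above and measures the gap between the best and worst fold; the variable names `mean_mae` and `spread` are introduced here for illustration.

```python
import numpy as np

# Validation MAE of each fold, copied from the run above (thousands of dollars).
all_scores = [2.0750808349930412, 2.117215852926273,
              2.9140411863232605, 2.4288365227161068]

mean_mae = np.mean(all_scores)
spread = max(all_scores) - min(all_scores)

print(f"mean MAE: {mean_mae:.3f} (about ${mean_mae * 1000:,.0f})")
print(f"gap between best and worst fold: {spread:.3f}")
```

The individual folds differ by more than 0.8, so any single split could have given a noticeably optimistic or pessimistic picture.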

Next we train for 500 epochs and modify the last part of the code so that it records the validation score at every epoch:

from keras import backend as K

# Some memory clean-up
K.clear_session()

num_epochs = 500
all_mae_histories = []
for i in range(k):
    print('processing fold #', i)
    # Prepare the validation data: data from partition #i
    val_data = train_data[i * num_val_samples: (i + 1) * num_val_samples]
    val_targets = train_targets[i * num_val_samples: (i + 1) * num_val_samples]

    # Prepare the training data: data from all other partitions
    partial_train_data = np.concatenate(
        [train_data[:i * num_val_samples],
         train_data[(i + 1) * num_val_samples:]],
        axis=0)
    partial_train_targets = np.concatenate(
        [train_targets[:i * num_val_samples],
         train_targets[(i + 1) * num_val_samples:]],
        axis=0)

    # Build the Keras model (already compiled)
    model = build_model()
    # Train the model (in silent mode, verbose=0)
    history = model.fit(partial_train_data, partial_train_targets,
                        validation_data=(val_data, val_targets),
                        epochs=num_epochs, batch_size=1, verbose=0)
    # Note: newer Keras versions store this metric under the key 'val_mae'.
    mae_history = history.history['val_mean_absolute_error']
    all_mae_histories.append(mae_history)

We can then compute, for each epoch, the average of the MAE scores over all folds.

average_mae_history = [np.mean([x[i] for x in all_mae_histories]) for i in range(num_epochs)]
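The list comprehension above is equivalent to averaging along the fold axis of a 2D array. Here is a minimal sketch with a made-up history of 2 folds and 3 epochs (`toy_histories` is hypothetical data, chosen so the result is easy to verify by hand):

```python
import numpy as np

# Hypothetical per-fold validation MAE histories: 2 folds, 3 epochs each.
toy_histories = [
    [3.0, 2.0, 1.0],  # fold 0
    [5.0, 4.0, 3.0],  # fold 1
]
toy_epochs = 3

# Same pattern as in the article: average epoch i across all folds.
avg_loop = [np.mean([x[i] for x in toy_histories]) for i in range(toy_epochs)]

# Vectorized form: rows are folds, columns are epochs, so average over axis 0.
avg_vectorized = np.mean(np.array(toy_histories), axis=0)

print(avg_loop)
print(avg_vectorized)
```

Both forms give one value per epoch; the vectorized one assumes every fold was trained for the same number of epochs.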

Plotting

import matplotlib.pyplot as plt

plt.plot(range(1, len(average_mae_history) + 1), average_mae_history)
plt.xlabel('Epochs')
plt.ylabel('Validation MAE')
plt.show()

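This plot is hard to read: the range of the vertical axis is large and the curve has relatively high variance, so the trend is not obvious. So we: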

Omit the first 10 data points
Replace each point with an exponential moving average of the previous points, to obtain a smooth curve

def smooth_curve(points, factor=0.9):
    smoothed_points = []
    for point in points:
        if smoothed_points:
            # Blend the previous smoothed value with the new point
            previous = smoothed_points[-1]
            smoothed_points.append(previous * factor + point * (1 - factor))
        else:
            # The first point has no predecessor, so keep it as is
            smoothed_points.append(point)
    return smoothed_points

smooth_mae_history = smooth_curve(average_mae_history[10:])

plt.plot(range(1, len(smooth_mae_history) + 1), smooth_mae_history)
plt.xlabel('Epochs')
plt.ylabel('Validation MAE')
plt.show()
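The smoothing follows the recurrence s_t = factor * s_(t-1) + (1 - factor) * x_t, with the first smoothed value equal to the first point. A tiny worked example makes this easy to check by hand; it redefines the function so that it runs on its own, and uses factor=0.5 and three made-up points rather than the real MAE history.

```python
def smooth_curve(points, factor=0.9):
    """Exponential moving average: each output blends the previous output with the new point."""
    smoothed_points = []
    for point in points:
        if smoothed_points:
            previous = smoothed_points[-1]
            smoothed_points.append(previous * factor + point * (1 - factor))
        else:
            smoothed_points.append(point)
    return smoothed_points

# Hypothetical input, chosen so the arithmetic is exact:
#   s1 = 1.0
#   s2 = 0.5 * 1.0 + 0.5 * 2.0 = 1.5
#   s3 = 0.5 * 1.5 + 0.5 * 3.0 = 2.25
toy_smoothed = smooth_curve([1.0, 2.0, 3.0], factor=0.5)
print(toy_smoothed)
```

A larger factor gives more weight to the previous smoothed value, so the curve reacts more slowly and looks smoother.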

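According to this plot, validation MAE stops improving significantly after about 80 epochs; past that point the model starts to overfit.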
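Instead of reading the value off the plot, the epoch with the lowest validation MAE can also be located programmatically. The sketch below is one possible way to do it, run on a synthetic curve (`toy_curve`, constructed to have its minimum at epoch 80) rather than on the real history. It also accounts for the points dropped at the start, which shift the index relative to the true epoch number.

```python
import numpy as np

# Synthetic validation curve over 500 epochs with its minimum at epoch 80.
epochs = np.arange(1, 501)
toy_curve = 2.4 + 1e-5 * (epochs - 80) ** 2

num_dropped = 10                   # first points omitted, as in the article
trimmed = toy_curve[num_dropped:]  # index 0 now corresponds to epoch 11

best_index = int(np.argmin(trimmed))
# Undo the offset from dropping points, plus 1 because epochs are counted from 1.
best_epoch = best_index + num_dropped + 1

print("best epoch:", best_epoch)
```

On a real, noisy curve the exact minimum is less meaningful than the region where the curve flattens, so a value found this way is best treated as a rough guide.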

Training the final model

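This gives us a value for the number-of-epochs hyperparameter, roughly 80. We now train a model on the entire training set and evaluate it on the test data: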

# Get a fresh, compiled model.
model = build_model()
# Train it on the entirety of the data.
model.fit(train_data, train_targets,
          epochs=80, batch_size=16, verbose=0)
test_mse_score, test_mae_score = model.evaluate(test_data, test_targets)

The result is:

test_mae_score

2.5532484335057877

The predicted prices are still off from the actual values by about $2,550.

