Regression: Predict fuel efficiency
In a regression problem, we aim to predict the output of a continuous value, such as a price or a probability. Contrast this with a classification problem, where we aim to select a class from a list of classes (for example, identifying which fruit is in a picture when the picture contains an apple or an orange).
This notebook uses the classic [auto-mpg](https://archive.ics.uci.edu/ml/datasets/auto+mpg) dataset and builds a model to predict the fuel efficiency of late-1970s and early-1980s automobiles. To do this, we'll provide the model with descriptions of many automobiles from that period. These descriptions include attributes such as cylinders, displacement, horsepower, and weight.
This example uses the `tf.keras` API; see [this guide](https://www.tensorflow.org/guide/keras) for details.
```python
import pathlib

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

import keras
from keras import layers

%matplotlib inline
```

The Auto MPG dataset
The dataset is available from the UCI Machine Learning Repository.
Get the data
First download the dataset.
```python
dataset_path = keras.utils.get_file("auto-mpg.data",
    "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data")
dataset_path
```

```
'C:\\Users\\YIUYE\\.keras\\datasets\\auto-mpg.data'
```

Import it using pandas:
```python
column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
                'Acceleration', 'Model Year', 'Origin']
raw_dataset = pd.read_csv(dataset_path, names=column_names,
                          na_values="?", comment='\t',
                          sep=" ", skipinitialspace=True)

dataset = raw_dataset.copy()
dataset.tail()
```

| MPG | Cylinders | Displacement | Horsepower | Weight | Acceleration | Model Year | Origin |
|------|---|-------|------|--------|------|----|---|
| 27.0 | 4 | 140.0 | 86.0 | 2790.0 | 15.6 | 82 | 1 |
| 44.0 | 4 | 97.0 | 52.0 | 2130.0 | 24.6 | 82 | 2 |
| 32.0 | 4 | 135.0 | 84.0 | 2295.0 | 11.6 | 82 | 1 |
| 28.0 | 4 | 120.0 | 79.0 | 2625.0 | 18.6 | 82 | 1 |
| 31.0 | 4 | 119.0 | 82.0 | 2720.0 | 19.4 | 82 | 1 |
Clean the data
The dataset contains a few unknown values.
```python
dataset.isnull().sum()
```

```
MPG             0
Cylinders       0
Displacement    0
Horsepower      6
Weight          0
Acceleration    0
Model Year      0
Origin          0
dtype: int64
```

To keep this initial tutorial simple, drop those rows.
```python
dataset = dataset.dropna()
```
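Dropping rows is the simplest option. If you would rather keep those six cars, a common alternative is to fill the missing `Horsepower` values with the column median. This is a hedged sketch of that approach, not part of the original tutorial:

```python
# Alternative (not used below): impute missing horsepower with the
# column median instead of dropping the rows entirely.
median_hp = raw_dataset['Horsepower'].median()
imputed = raw_dataset.copy()
imputed['Horsepower'] = imputed['Horsepower'].fillna(median_hp)
```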
The "Origin" column is really categorical, not numeric. So convert it to a one-hot encoding:

```python
origin = dataset.pop('Origin')

dataset['USA'] = (origin == 1) * 1.0
dataset['Europe'] = (origin == 2) * 1.0
dataset['Japan'] = (origin == 3) * 1.0
dataset.tail()
```

| MPG | Cylinders | Displacement | Horsepower | Weight | Acceleration | Model Year | USA | Europe | Japan |
|------|---|-------|------|--------|------|----|-----|-----|-----|
| 27.0 | 4 | 140.0 | 86.0 | 2790.0 | 15.6 | 82 | 1.0 | 0.0 | 0.0 |
| 44.0 | 4 | 97.0 | 52.0 | 2130.0 | 24.6 | 82 | 0.0 | 1.0 | 0.0 |
| 32.0 | 4 | 135.0 | 84.0 | 2295.0 | 11.6 | 82 | 1.0 | 0.0 | 0.0 |
| 28.0 | 4 | 120.0 | 79.0 | 2625.0 | 18.6 | 82 | 1.0 | 0.0 | 0.0 |
| 31.0 | 4 | 119.0 | 82.0 | 2720.0 | 19.4 | 82 | 1.0 | 0.0 | 0.0 |
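The manual columns above make the mapping from `Origin` codes to countries explicit. As an aside (my addition, not from the original tutorial), pandas can produce the same one-hot encoding in one call:

```python
# Equivalent one-hot encoding using pandas built-ins (sketch; the
# tutorial keeps the manual version above).
origin_names = raw_dataset['Origin'].map({1: 'USA', 2: 'Europe', 3: 'Japan'})
one_hot = pd.get_dummies(origin_names).astype(float)
```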
Now split the dataset into a training set and a test set.

We will use the test set in the final evaluation of the model.
```python
train_dataset = dataset.sample(frac=0.8, random_state=0)
test_dataset = dataset.drop(train_dataset.index)

sns.pairplot(train_dataset[["Cylinders", "Displacement", "Weight"]], diag_kind="kde")
sns.set()
```

Also look at the overall statistics:
```python
train_stats = train_dataset.describe()
train_stats.pop("MPG")
train_stats = train_stats.transpose()
train_stats
```

|              | count | mean        | std        | min    | 25%     | 50%    | 75%     | max    |
|--------------|-------|-------------|------------|--------|---------|--------|---------|--------|
| Cylinders    | 314.0 | 5.477707    | 1.699788   | 3.0    | 4.00    | 4.0    | 8.00    | 8.0    |
| Displacement | 314.0 | 195.318471  | 104.331589 | 68.0   | 105.50  | 151.0  | 265.75  | 455.0  |
| Horsepower   | 314.0 | 104.869427  | 38.096214  | 46.0   | 76.25   | 94.5   | 128.00  | 225.0  |
| Weight       | 314.0 | 2990.251592 | 843.898596 | 1649.0 | 2256.50 | 2822.5 | 3608.00 | 5140.0 |
| Acceleration | 314.0 | 15.559236   | 2.789230   | 8.0    | 13.80   | 15.5   | 17.20   | 24.8   |
| Model Year   | 314.0 | 75.898089   | 3.675642   | 70.0   | 73.00   | 76.0   | 79.00   | 82.0   |
| USA          | 314.0 | 0.624204    | 0.485101   | 0.0    | 0.00    | 1.0    | 1.00    | 1.0    |
| Europe       | 314.0 | 0.178344    | 0.383413   | 0.0    | 0.00    | 0.0    | 0.00    | 1.0    |
| Japan        | 314.0 | 0.197452    | 0.398712   | 0.0    | 0.00    | 0.0    | 0.00    | 1.0    |
Split features from labels
Separate the target value, or “label”, from the features. This label is the value that you will train the model to predict.
```python
train_labels = train_dataset.pop('MPG')
test_labels = test_dataset.pop('MPG')
```

Normalize the data
Look again at the train_stats block above and note how different the ranges of each feature are.
It is good practice to normalize features that use different scales and ranges. Although the model might converge without feature normalization, normalization makes training easier and makes the resulting model less dependent on the choice of units used in the input.
Note: although we intentionally generate these statistics from the training dataset only, they will also be used to normalize the test dataset. We need to do this to project the test dataset into the same distribution that the model was trained on.
```python
def norm(x):
    return (x - train_stats['mean']) / train_stats['std']

normed_train_data = norm(train_dataset)
normed_test_data = norm(test_dataset)

def build_model():
    model = keras.Sequential([
        layers.Dense(64, activation='relu', input_shape=[len(train_dataset.keys())]),
        layers.Dense(64, activation='relu'),
        layers.Dense(1)
    ])

    optimizer = keras.optimizers.RMSprop(0.001)

    model.compile(loss='mean_squared_error',
                  optimizer=optimizer,
                  metrics=['mean_absolute_error', 'mean_squared_error'])
    return model

model = build_model()
model.summary()
```

```
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_10 (Dense)             (None, 64)                640
_________________________________________________________________
dense_11 (Dense)             (None, 64)                4160
_________________________________________________________________
dense_12 (Dense)             (None, 1)                 65
=================================================================
Total params: 4,865
Trainable params: 4,865
Non-trainable params: 0
_________________________________________________________________
```

Now try out the model. Take a batch of 10 examples from the training data and call `model.predict` on it.
```python
example_batch = normed_train_data[:10]
example_result = model.predict(example_batch)
example_result
```

```
array([[-0.03468257],
       [-0.01342154],
       [-0.15384783],
       [-0.18010283],
       [ 0.03922582],
       [-0.12172151],
       [ 0.10603201],
       [ 0.2442987 ],
       [ 0.00099315],
       [ 0.18530795]], dtype=float32)
```

It seems to be working, and it produces a result of the expected shape and type.
Train the model
Train the model for 1000 epochs, and record the training and validation metrics in the `history` object.
```python
# Display training progress by printing a single dot for each completed epoch
class PrintDot(keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs):
        if epoch % 100 == 0:
            print('')
        print('.', end='')

EPOCHS = 1000

history = model.fit(normed_train_data, train_labels,
                    epochs=EPOCHS, validation_split=0.2, verbose=0,
                    callbacks=[PrintDot()])
```

```
....................................................................................................
....................................................................................................
....................................................................................................
....................................................................................................
....................................................................................................
....................................................................................................
....................................................................................................
....................................................................................................
....................................................................................................
....................................................................................................
```

```python
hist = pd.DataFrame(history.history)
hist['epoch'] = history.epoch
hist.tail()
```

| loss | mean_absolute_error | mean_squared_error | val_loss | val_mean_absolute_error | val_mean_squared_error | epoch |
|------|------|------|------|------|------|-----|
| 2.075518 | 0.940943 | 2.075518 | 8.913726 | 2.351839 | 8.913726 | 995 |
| 2.130111 | 0.953561 | 2.130111 | 9.769884 | 2.438282 | 9.769884 | 996 |
| 2.221040 | 0.951258 | 2.221040 | 9.664708 | 2.382888 | 9.664708 | 997 |
| 2.301870 | 0.980407 | 2.301870 | 9.934311 | 2.425505 | 9.934311 | 998 |
| 2.002580 | 0.887644 | 2.002580 | 9.484982 | 2.414742 | 9.484982 | 999 |
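The `plot_history` helper referenced below is never defined in this post. Here is a minimal sketch of such a helper, assuming (as in the original TensorFlow tutorial) that it plots the training and validation MAE and MSE curves from the `history` object:

```python
def plot_history(history):
    # Build a DataFrame from the Keras History object and plot the
    # train/validation error curves against epoch.
    hist = pd.DataFrame(history.history)
    hist['epoch'] = history.epoch

    plt.figure()
    plt.xlabel('Epoch')
    plt.ylabel('Mean Abs Error [MPG]')
    plt.plot(hist['epoch'], hist['mean_absolute_error'], label='Train Error')
    plt.plot(hist['epoch'], hist['val_mean_absolute_error'], label='Val Error')
    plt.legend()

    plt.figure()
    plt.xlabel('Epoch')
    plt.ylabel('Mean Square Error [$MPG^2$]')
    plt.plot(hist['epoch'], hist['mean_squared_error'], label='Train Error')
    plt.plot(hist['epoch'], hist['val_mean_squared_error'], label='Val Error')
    plt.legend()
    plt.show()

plot_history(history)
```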
This graph shows that after about 100 epochs the validation error barely improves, and even degrades. Let's update the `model.fit` call to automatically stop training when the validation score stops improving. We'll use an `EarlyStopping` callback that tests the training condition at the end of every epoch. If a set number of epochs elapses without showing improvement, training stops automatically.
```python
model = build_model()

# The patience parameter is the number of epochs to wait for improvement
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=10)

history = model.fit(normed_train_data, train_labels, epochs=EPOCHS,
                    validation_split=0.2, verbose=0,
                    callbacks=[early_stop, PrintDot()])

plot_history(history)
```

```
.................................................
```

```python
loss, mae, mse = model.evaluate(normed_test_data, test_labels, verbose=0)

print("Testing set Mean Abs Error: {:5.2f} MPG".format(mae))
```

```
Testing set Mean Abs Error:  1.79 MPG
```

Make predictions
Finally, predict MPG values using data in the testing set:
```python
test_predictions = model.predict(normed_test_data).flatten()

plt.scatter(test_labels, test_predictions)
plt.xlabel('True Values [MPG]')
plt.ylabel('Predictions [MPG]')
plt.axis('equal')
plt.axis('square')
plt.xlim([0, plt.xlim()[1]])
plt.ylim([0, plt.ylim()[1]])
_ = plt.plot([-100, 100], [-100, 100])
```

```python
error = test_predictions - test_labels
plt.hist(error, bins=25)
plt.xlabel("Prediction Error [MPG]")
_ = plt.ylabel("Count")
```

The error distribution is not quite Gaussian, but that is to be expected given how small the number of samples is.
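To put numbers on that impression (a sketch I'm adding, not part of the original notebook), you can summarize the error distribution directly:

```python
# Quick numeric summary of the prediction errors (sketch).
print("Mean error: {:5.2f} MPG".format(error.mean()))
print("Std dev:    {:5.2f} MPG".format(error.std()))
```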
Conclusion

This notebook introduced a few techniques for handling a regression problem:

- Mean squared error (MSE) is a common loss function for regression problems (different from the loss functions used for classification).
- Similarly, the evaluation metrics differ from classification; a common regression metric is mean absolute error (MAE).
- When numeric input features have values with different ranges, each feature should be scaled independently to the same range.
- Early stopping is a useful technique to prevent overfitting.