Gradient Boosted Tree Regression — Spark and Python
This story demonstrates the implementation of a “gradient boosted tree regression” model using Python & Spark machine learning. The dataset is bike-rental info from 2011–2012 in the Capital Bikeshare system. Our goal is to predict the count of bike rentals.
1. Load the data
The data in store is a CSV file. We create a Spark DataFrame containing the bike dataset and cache it so that we read it from disk only once.
```python
# Load the dataset and cache it
df = spark.read.csv("/databricks-datasets/bikeSharing/data-001/hour.csv", header="true", inferSchema="true")
df.cache()

# View the imported dataset
display(df)
```
Output: a preview of the first rows of the imported dataset.
2. Pre-process the data
Fields such as “weekday” are indexed, and all the other fields except the date “dteday” are numerical. The count is our target “label”. The “cnt” column we aim to predict equals the sum of the “casual” & “registered” columns.
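As a quick sanity check, the sketch below (assuming the column names above) confirms that relationship before we drop any columns:

```python
from pyspark.sql.functions import col

# Count rows where cnt differs from casual + registered; we expect 0
mismatches = df.filter(col("cnt") != col("casual") + col("registered")).count()
print("Rows where cnt != casual + registered: %d" % mismatches)
```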
The next step is to remove the “casual” and “registered” columns from the dataset, to make sure we do not use them when predicting “cnt”. We also discard “dteday” and keep columns such as “season”, “yr”, “mnth”, and “weekday”.
```python
# Drop the features mentioned above
df = df.drop("instant").drop("dteday").drop("casual").drop("registered")

# Print the schema of our dataset to see the type of each column
df.printSchema()
```
3. Cast data types
The DataFrame uses string categories, but we know the columns are numerical in nature, so we cast them before proceeding.
```python
# Cast all columns to a numeric type
from pyspark.sql.functions import col  # for indicating a column using a string
df = df.select([col(c).cast("double").alias(c) for c in df.columns])
df.printSchema()
```

4. Train & Test Sets
The data-prep step splits the dataset into train and test sets. We train and tune the model on the training set and measure performance on the held-out test set.
```python
# Split 70% for training and 30% for testing
train, test = df.randomSplit([0.7, 0.3])
print("We have %d training examples and %d test examples." % (train.count(), test.count()))
```

There are 12160 training samples & 5219 test samples.
5. Machine Learning Pipeline
Now that the data is prepared, let’s train an ML model to predict future rentals.
For every row in the data, the feature vector should describe what we know, such as the weather and the day of the week, and the label is what we aim to predict, in this case “cnt”.
We then put together a Pipeline with the following stages (a sketch of the feature stages follows the list):
- VectorAssembler: assembles the feature columns into a single feature vector.
- VectorIndexer: heuristically identifies columns that are meant to be categorical, treating any column with a small number of distinct values as categorical.
- GBTRegressor: uses the gradient-boosted tree (GBT) algorithm to learn to predict rental counts from feature vectors.
- CrossValidator: tunes the GBT algorithm and its parameters to improve the accuracy of our models.
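The two feature stages, and the featuresCols list used later, can be built with the standard MLlib API. This is a minimal sketch: the intermediate column names “rawFeatures” and “features” and the maxCategories threshold of 4 are assumptions, not spelled out in this story.

```python
from pyspark.ml.feature import VectorAssembler, VectorIndexer

# Use every column except the label "cnt" as a feature
featuresCols = df.columns
featuresCols.remove("cnt")

# Assemble all feature columns into a single vector column "rawFeatures"
vectorAssembler = VectorAssembler(inputCols=featuresCols, outputCol="rawFeatures")

# Treat columns with at most 4 distinct values as categorical; the output
# column "features" is what GBTRegressor reads by default
vectorIndexer = VectorIndexer(inputCol="rawFeatures", outputCol="features", maxCategories=4)
```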
Next, we define the training stage of the Pipeline. GBTRegressor takes feature vectors and labels as input and learns to predict the target labels of new samples.
```python
from pyspark.ml.regression import GBTRegressor

# Takes the "features" column and learns to predict "cnt"
gbt = GBTRegressor(labelCol="cnt")
```
We then use cross-validation to tune the parameters and achieve the best results. It trains multiple models and chooses the best one by minimizing a metric; our metric is Root Mean Squared Error (RMSE).
```python
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator

# Define a grid of hyperparameters to test:
#  - maxDepth: max depth of each decision tree in the GBT ensemble
#  - maxIter: iterations, i.e., number of trees in each GBT ensemble
# In this example notebook, we keep these values small. In practice, to get
# the highest accuracy, you would likely want to try deeper trees (10 or
# higher) and more trees in the ensemble (>100).
paramGrid = ParamGridBuilder()\
    .addGrid(gbt.maxDepth, [2, 5])\
    .addGrid(gbt.maxIter, [10, 100])\
    .build()

# Define an evaluation metric. This tells CrossValidator how well we are
# doing by comparing the true labels with predictions.
evaluator = RegressionEvaluator(metricName="rmse", labelCol=gbt.getLabelCol(), predictionCol=gbt.getPredictionCol())

# Declare the CrossValidator, which runs model tuning for us.
cv = CrossValidator(estimator=gbt, evaluator=evaluator, estimatorParamMaps=paramGrid)
```
Lastly, we tie our feature processing and model training together into one Pipeline.

```python
from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[vectorAssembler, vectorIndexer, cv])
```
6. Train & Test the Pipeline
```python
pipelineModel = pipeline.fit(train)
```

MLlib logs the tuning trials to MLflow. After the tuning fit() call is done, the MLflow UI can be accessed to view the logged runs.
```python
predictions = pipelineModel.transform(test)
display(predictions.select("cnt", "prediction", *featuresCols))
```

The result may not be the best yet, but that’s where model tuning kicks in.
The RMSE mentioned above tells us how well our model predicts on new samples: the lower the RMSE, the better.
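For reference, RMSE is the square root of the mean squared difference between the true and predicted counts:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$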
```python
rmse = evaluator.evaluate(predictions)
print("RMSE on our test set: %g" % rmse)
```

RMSE of the test set: 44.6918
7. Tips on improving the model
There are several ways we could further improve our model:
- Expert knowledge
- Better tuning
- Feature engineering
Trying different combinations of the hyperparameters is how we find the best solution, for example with a wider search grid like the sketch below.
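The values here are illustrative assumptions, not tuned recommendations, and every extra grid value multiplies the number of models CrossValidator must train:

```python
# Illustrative wider grid: deeper trees, more trees, and the learning rate
# (stepSize) added as a third tuning dimension
paramGrid = ParamGridBuilder()\
    .addGrid(gbt.maxDepth, [2, 5, 10])\
    .addGrid(gbt.maxIter, [10, 100, 200])\
    .addGrid(gbt.stepSize, [0.05, 0.1])\
    .build()
```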
Connect on LinkedIn and check out my GitHub for the complete notebook.
Translated from: https://towardsdatascience.com/gradient-boosted-tree-regression-spark-dd5ac316a252