Used Car Price Prediction Using Machine Learning
You can find all the Python scripts related to this project on my GitHub page. If you are interested, you can also find the scripts used for data cleaning and data visualization for this study in the same repository. The project is also deployed with Django on Heroku. View Deployment
內(nèi)容 (Content)
Why is the price feature scaled by log transformation?
In a regression model, for any fixed value of X, Y should be normally distributed. In this problem, the target value (price) is not normally distributed; it is right-skewed.
To solve this, a log transformation is applied to the skewed target variable, and an inverse transformation must then be applied to the predicted values to recover the actual predicted target.
Because of this, the RMSLE is calculated to measure the error, and the R2 score is calculated to evaluate the accuracy of the model.
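A minimal sketch of this train-on-log, invert-on-predict pattern (the synthetic data and the LinearRegression stand-in are illustrative, not the project's actual pipeline):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_log_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic right-skewed target standing in for car prices.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = np.expm1(X @ np.array([0.5, 0.3, 0.2, 0.1]) + 8 + rng.normal(0, 0.2, size=1000))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)

model = LinearRegression()
model.fit(X_train, np.log1p(y_train))    # fit on log(1 + price)

pred = np.expm1(model.predict(X_test))   # invert to get actual prices back
rmsle = np.sqrt(mean_squared_log_error(y_test, pred))
print(f"RMSLE: {rmsle:.4f}  R2: {r2_score(y_test, pred):.4f}")
```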
一些關(guān)鍵概念: (Some Key Concepts:)
Learning Rate: The learning rate is a hyperparameter that controls how much we adjust the weights of our network with respect to the loss gradient. The lower the value, the slower we travel along the downward slope. While a low learning rate might be a good idea in terms of making sure we do not miss any local minima, it can also mean we take a long time to converge, especially if we get stuck on a plateau region.
n_estimators: The number of trees to build before taking the maximum vote or average of the predictions. A higher number of trees gives better performance but makes your code slower.
R2 Score: A statistical measure of how close the data are to the fitted regression line, also known as the coefficient of determination (or, for multiple regression, the coefficient of multiple determination). 0% indicates that the model explains none of the variability of the response data around its mean.
1.數(shù)據(jù): (1. The Data:)
The dataset used in this project was downloaded from Kaggle.
2.數(shù)據(jù)清理: (2. Data Cleaning:)
The first step is to remove irrelevant/useless features like ‘URL’, ’region_url’, ’vin’, ’image_url’, ’description’, ’county’, ’state’ from the dataset.
第一步是從數(shù)據(jù)集中刪除不相關(guān)/無用的功能,例如“ URL”,“ region_url”,“ vin”,“ image_url”,“ description”,“ county”,“ state”。
Next, check for missing values in each feature.
Figure: missing values per feature (image by Panwar Abhash Anil)

Next, the missing values were filled in using a suitable method.
To fill the missing values, the IterativeImputer method is used: several estimators are tried, and the MSE of each is computed with cross_val_score, as in the sketch below.
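A sketch of that comparison, assuming a synthetic feature matrix and a Ridge model downstream of the imputer (the estimator list and hyperparameters are illustrative):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, enables IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.linear_model import BayesianRidge, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic feature matrix with roughly 10% of entries missing.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = X.sum(axis=1) + rng.normal(0, 0.1, size=500)
X[rng.random(X.shape) < 0.1] = np.nan

estimators = {
    "BayesianRidge": BayesianRidge(),
    "ExtraTrees": ExtraTreesRegressor(n_estimators=10, random_state=0),
    "RandomForest": RandomForestRegressor(n_estimators=10, random_state=0),
}

# Impute with each estimator, then score a downstream model by cross-validated MSE.
for name, est in estimators.items():
    pipe = make_pipeline(IterativeImputer(estimator=est, random_state=0), Ridge())
    mse = -cross_val_score(pipe, X, y, scoring="neg_mean_squared_error", cv=3).mean()
    print(f"{name}: MSE = {mse:.4f}")
```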
From this comparison, we can conclude that the ExtraTreesRegressor estimator is the best choice for imputing the missing values.
Figure: null counts after imputation (image by Panwar Abhash Anil)

Finally, after dealing with the missing values, there are zero nulls left.
Outliers: The InterQuartile Range (IQR) method is used to remove outliers from the data, as in the sketch below.
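A minimal sketch of the IQR rule with the standard 1.5×IQR fences (the column names in the usage comment are hypothetical):

```python
import pandas as pd

def remove_outliers_iqr(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Keep only rows whose `col` value lies within 1.5 * IQR of the quartiles."""
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    return df[df[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Hypothetical usage on the cleaned dataframe:
# for col in ["price_log", "odometer", "year"]:
#     df = remove_outliers_iqr(df, col)
```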
Figures 1–3: outlier plots for log price, odometer, and year (images by Panwar Abhash Anil)
- From figure 1, prices whose log is below 6.55 or above 11.55 are outliers.
- From figure 2, nothing can be concluded visually, so the IQR is calculated to find the odometer outliers.
- From figure 3, years below 1995 or above 2020 are outliers.
Finally, the shape of the dataset before processing was (435849, 25) and after processing (374136, 18): in total, 61,713 rows and 7 columns were removed.
3.數(shù)據(jù)預處理: (3. Data preprocessing:)
Label Encoder: In our dataset, 12 features are categorical and 4 are numerical (excluding the price column). To apply ML models, these categorical variables must be transformed into numerical ones; sklearn's LabelEncoder is used for this, as in the sketch below.
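A minimal sketch of encoding the categorical columns (the toy dataframe is illustrative):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "fuel": ["gas", "diesel", "electric", "gas"],
    "transmission": ["manual", "automatic", "other", "manual"],
    "odometer": [42000, 130500, 8900, 76000],
})

# Replace each categorical (object-typed) column with integer codes.
for col in df.select_dtypes(include="object").columns:
    df[col] = LabelEncoder().fit_transform(df[col])
print(df)
```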
Normalization: The dataset is not normally distributed, and the features have very different ranges. Without normalization, the ML model would effectively disregard coefficients of low-valued features, because their impact is tiny compared to the large values. sklearn's MinMaxScaler is therefore used (see the split-and-scale sketch below).
Split the data: 90% is used as training data and 10% is held out as test data, as in the sketch below.
4. ML Models:
In this section, different machine learning algorithms are used to predict the price (the target variable).
Since this is a supervised problem, the models are applied in the following order:
Linear Regression
Ridge Regression
Lasso Regression
K-Neighbors Regressor
Random Forest Regressor
Bagging Regressor
Adaboost Regressor
XGBoost
(1) Linear Regression:
In statistics, linear regression is a linear approach to modelling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables). In linear regression, the relationships are modelled using linear predictor functions whose unknown model parameters are estimated from the data. Such models are called linear models. More Details
在統(tǒng)計中,線性回歸是對標量響應(或因變量)與一個或多個解釋變量(或自變量)之間的關(guān)系進行建模的線性方法。 在線性回歸中,使用線性預測函數(shù)對關(guān)系進行建模,這些函數(shù)的未知模型參數(shù)可從數(shù)據(jù)中估算出來。 這種模型稱為線性模型。 更多細節(jié)
Coefficients: The sign of each coefficient indicates the direction of the relationship between a predictor variable and the response variable.
- A positive sign indicates that as the predictor variable increases, the response variable also increases.
- A negative sign indicates that as the predictor variable increases, the response variable decreases.
Considering this figure, linear regression suggests that year, cylinders, transmission, fuel, and odometer are the five most important variables; the sketch below shows how such a ranking can be read off the fitted coefficients.
Figure: linear regression feature importances (image by Panwar Abhash Anil)

(2) Ridge Regression:
Ridge Regression is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates are unbiased, but their variances are large so they may be far from the true value.
To find the best alpha value for ridge regression, the yellowbrick AlphaSelection visualizer was applied.
Figure: graph showing the best value of alpha

From the figure, the best alpha value for this dataset is 20.336.
Note: the alpha value is not constant; it varies from run to run.
Using this alpha, the Ridge regressor is fitted, as in the sketch below.
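A sketch of the alpha search and the final fit (the alpha grid is an assumption; AlphaSelection plots error against alpha and marks the minimum):

```python
import numpy as np
from sklearn.linear_model import Ridge, RidgeCV
from yellowbrick.regressor import AlphaSelection

alphas = np.logspace(-2, 3, 100)             # candidate alpha values
viz = AlphaSelection(RidgeCV(alphas=alphas))
viz.fit(X_train, y_train)
viz.show()                                   # plot error vs. alpha

ridge = Ridge(alpha=20.336).fit(X_train, y_train)  # value read off the plot
print(ridge.score(X_test, y_test))                 # R2 on the test set
```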
Figure: graph showing important features

Considering this figure, ridge regression suggests that year, cylinders, transmission, fuel, and odometer are the five most important variables.
Figure: ridge regression performance (image by Panwar Abhash Anil)

The performance of ridge regression is almost the same as linear regression.
(3) Lasso Regression:
Lasso regression is a type of linear regression that uses shrinkage, where data values are shrunk towards a central point such as the mean. The lasso procedure encourages simple, sparse models (i.e. models with fewer parameters).
Why is lasso regression used?
The goal of lasso regression is to obtain the subset of predictors that minimizes prediction error for a quantitative response variable. The lasso does this by imposing a constraint on the model parameters that causes some regression coefficients to shrink toward zero.
Figure: lasso regression performance (image by Panwar Abhash Anil)

For this dataset, however, there is no need for lasso regression, as the error barely differs.
(4) KNeighbors Regressor: regression based on k-nearest neighbors.
The target is predicted by local interpolation of the targets associated with the nearest neighbours in the training set.
k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until function evaluation. Read More
Figure: KNN error for different values of k (image by Panwar Abhash Anil)

From the figure, KNN gives the least error at k = 5, so the model is trained with n_neighbors=5 and metric='euclidean', as in the sketch below.
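A minimal sketch of that configuration:

```python
from sklearn.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))  # R2 on the held-out 10%
```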
Figure: KNN performance (image by Panwar Abhash Anil)

KNN performs better: the error decreases and the accuracy increases.
(5) Random Forest:
The random forest is a classification algorithm consisting of many decision trees. It uses bagging and feature randomness when building each tree to try to create an uncorrelated forest of trees whose prediction by committee is more accurate than that of any individual tree. Read More
In our model, 180 decision trees are built with max_features=0.5, as in the sketch below.
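A sketch with those hyperparameters (n_jobs and random_state are additions for speed and reproducibility):

```python
from sklearn.ensemble import RandomForestRegressor

# 180 trees, each split considering half of the features at random.
rf = RandomForestRegressor(n_estimators=180, max_features=0.5,
                           n_jobs=-1, random_state=0)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))
```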
Figure: performance of random forest (true vs. predicted values)

This simple bar plot illustrates that year is the most important feature of a car, followed by odometer and the others.
Figure: random forest feature importances (image by Panwar Abhash Anil)

The random forest performs better, improving accuracy by roughly 10%, which is good. Since random forests use bagging when building each tree, a Bagging regressor is tried next.
(6) Bagging Regressor:
A Bagging regressor is an ensemble meta-estimator that fits base regressors each on random subsets of the original dataset and then aggregates their predictions (either by voting or by averaging) to form a final prediction. Such a meta-estimator can typically be used as a way to reduce the variance of a black-box estimator (e.g., a decision tree), by introducing randomization into its construction procedure and then making an ensemble out of it. Read More
In our model, a DecisionTreeRegressor with max_depth=20 is used as the base estimator, creating 50 decision trees; the results are shown below.
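A sketch of that setup (`estimator` is the parameter name in scikit-learn 1.2 and later; older versions call it `base_estimator`):

```python
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# 50 depth-limited trees, each fit on a bootstrap sample of the training data.
bag = BaggingRegressor(estimator=DecisionTreeRegressor(max_depth=20),
                       n_estimators=50, n_jobs=-1, random_state=0)
bag.fit(X_train, y_train)
print(bag.score(X_test, y_test))
```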
Figure: bagging regressor performance (image by Panwar Abhash Anil)

The random forest performs much better than the Bagging regressor.
The key difference between random forest and bagging: in a random forest, only a random subset of features is considered when choosing the best split at each node of a tree, whereas in bagging all features are considered when splitting a node.
(7) AdaBoost Regressor:
AdaBoost can be used to boost the performance of any machine learning algorithm. AdaBoost helps you combine multiple "weak classifiers" into a single "strong classifier". Library used: AdaBoostRegressor. Read More
This simple bar plot illustrates that year is the most important feature of a car, followed by odometer, then model, and so on.
In our model, a DecisionTreeRegressor with max_depth=24 is used as the base estimator; 200 trees are built with learning_rate=0.6, as in the sketch below, and the results follow.
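A sketch of that configuration (the same `estimator`/`base_estimator` naming caveat applies):

```python
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

# 200 boosted trees of depth 24, each reweighting the errors of the last.
ada = AdaBoostRegressor(estimator=DecisionTreeRegressor(max_depth=24),
                        n_estimators=200, learning_rate=0.6, random_state=0)
ada.fit(X_train, y_train)
print(ada.score(X_test, y_test))
```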
Figure: AdaBoost performance (image by Panwar Abhash Anil)

(8) XGBoost: XGBoost stands for eXtreme Gradient Boosting
XGBoost is an ensemble learning method: an implementation of gradient-boosted decision trees designed for speed and performance. The beauty of this powerful algorithm lies in its scalability, which drives fast learning through parallel and distributed computing and offers efficient memory usage. Read More
This simple bar plot, sorted in descending order of importance, illustrates which features of a car matter most.
According to XGBoost, odometer is the most important feature, whereas the previous models ranked year first.
In this model, 200 decision trees of max depth 24 are created, and the model learns with a learning rate of 0.4, as in the sketch below.
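A sketch with those hyperparameters (assuming the xgboost Python package; n_jobs is an addition):

```python
from xgboost import XGBRegressor

xgb = XGBRegressor(n_estimators=200, max_depth=24,
                   learning_rate=0.4, n_jobs=-1)
xgb.fit(X_train, y_train)
print(xgb.score(X_test, y_test))  # the article reports 89.662% accuracy
```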
Figure: XGBoost performance (image by Panwar Abhash Anil)

5. Comparison of the performance of the models:
Figures: comparison of model errors and accuracies (images by Panwar Abhash Anil)

From the figures above, we can conclude that the XGBoost regressor, with 89.662% accuracy, performs better than the other models.
5)來自數(shù)據(jù)集的一些見解: (5) Some insights from the dataset:)
1. From the pair plot, we can't conclude anything; there is no correlation between the variables.
Figure: pair plot to find correlation

2. From the distplot, we can conclude that the price initially increases rapidly, but after a certain point it starts decreasing.
Figure: price distribution (image by Panwar Abhash Anil)

3. From figure 1, we see that diesel cars have the highest prices, followed by electric cars; hybrid cars have the lowest prices.
Figure: bar plot showing the price of each fuel type

4. From figure 2, the price for each fuel type also depends on the condition of the car.
Figure: bar plot of fuel vs. price, hued by condition

5. From figure 3, car prices increase year over year after 1995; from figure 4, the number of cars also increases each year, levelling off around 2012.
Figure: graph showing how the price varies per year

6. From figure 5, the price of a car also depends on its condition, and from figure 6, price varies with both condition and size.
Figure: bar plot showing price by car condition

7. From figures 7–8, car prices also vary by transmission: buyers favour cars with "other" transmissions, while cars with manual transmissions are the cheapest.
Figure: price by transmission (image by Panwar Abhash Anil)

8. Below are similar graphs with the same insights for other features.
結(jié)論: (Conclusion:)
By applying different ML models, we aim for the lowest error and the highest accuracy. Our goal was to predict the price of used cars from a dataset with 25 predictors and 509,577 entries.
First, data cleaning was performed to remove null values and outliers from the dataset; then ML models were implemented to predict car prices.
Next, the features were explored in depth with the help of data visualization, and the relationships between them were examined.
From the table below, it can be concluded that XGBoost is the best model for predicting used-car prices: as a regression model, it gave the best MSLE and RMSLE values.
Figure: final comparison table (image by Panwar Abhash Anil)

Source: https://towardsdatascience.com/used-car-price-prediction-using-machine-learning-e3be02d977b2