Linear Regression in Machine Learning
There are two types of supervised machine learning tasks: regression and classification.
• Classification — Classification is the process of categorizing a given set of data into classes; it can be performed on both structured and unstructured data. The process starts with predicting the class of given data points. The classes are often referred to as targets, labels, or categories.
Example — spam detection, checking whether a loan applicant will default.
• Regression — Regression predicts a continuous value on the basis of historical data. Example — number of corona patients in July, car sales in 2021.
The linear equation from algebra is the function behind this algorithm: we try to find a linear relationship between two or more variables. If we draw this relationship (between two variables) in a two-dimensional space, we get a straight line.
It predicts the continuous variable Y based on a given independent variable X. If we plot the independent variable (x) on the x-axis and the dependent variable (y) on the y-axis, this algorithm gives us a straight line.
So, our equation looks like the one below.
y = θ0 + θ1·x
where y is the predicted value,
x is the input (a feature or column), and
θ0 and θ1 are the model's parameters: the first is the intercept (bias) and the second is the slope (weight).
You can see the equation looks like y = mx + c, which we studied in our school curriculum, where m is the slope and c is the intercept.
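To make this concrete, here is a minimal sketch (not from the original post) that fits y = θ0 + θ1·x on synthetic data with scikit-learn; the data and numbers are invented purely for illustration.

```python
# A minimal sketch (not from the original post): fitting y = theta0 + theta1*x
# on synthetic data with scikit-learn. All numbers here are made up for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))              # one input feature (column)
y = 3.0 + 2.0 * X[:, 0] + rng.normal(0, 1, 100)    # true intercept 3, slope 2, plus noise

model = LinearRegression()
model.fit(X, y)

print("intercept (theta0, i.e. c):", model.intercept_)
print("slope     (theta1, i.e. m):", model.coef_[0])
print("prediction at x = 5:", model.predict([[5.0]])[0])
```

The fitted intercept and slope should land close to the true values used to generate the data, which is exactly the θ0 and θ1 the equation above describes.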
More generally, a linear model makes a prediction by simply computing a weighted sum of the input features, plus a constant called the bias term (also called the intercept term). We can write the equation for n independent variables (features or columns):
ŷ = θ0·x0 + θ1·x1 + θ2·x2 + ⋯ + θn·xn
• ŷ is the predicted value.
• n is the number of features.
• xi is the i-th feature value (x0 is always equal to 1).
• θj is the j-th model parameter (including the bias term θ0 and the feature weights θ1, θ2, …, θn).
The same equation can be written in vector form.
Vector form of the prediction (ŷ, or y-hat): ŷ = hθ(x) = θ · x
• θ is the model's parameter vector, containing the bias term θ0 and the feature weights θ1 to θn.
• x is the instance's feature vector, containing x0 to xn, with x0 always equal to 1.
• θ · x is the dot product of the vectors θ and x, which is equal to θ0x0 + θ1x1 + θ2x2 + ⋯ + θnxn.
• hθ is the hypothesis function, using the model parameters θ.
In machine learning, vectors are often represented as column vectors, which are 2D arrays with a single column. If θ and x are column vectors, then the prediction is ŷ = θᵀ·x, where θᵀ is the transpose of θ (rows swapped with columns) and θᵀ·x is the matrix multiplication of θᵀ and x. It is of course the same prediction, except that it is now represented as a single-cell matrix rather than a scalar value.
Transpose — swapping the rows and columns of a matrix.
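The following is a small NumPy sketch (my own illustration, not from the original post) of the vectorized prediction just described; the parameter and feature values are arbitrary.

```python
# A small NumPy sketch of the vector form described above:
# y_hat = theta^T · x, with x0 = 1 carrying the bias term theta0. Numbers are arbitrary.
import numpy as np

theta = np.array([[4.0], [3.0], [2.0]])   # column vector [theta0, theta1, theta2]
x     = np.array([[1.0], [5.0], [7.0]])   # column vector [x0 = 1, x1, x2]

# theta^T · x : (1x3) @ (3x1) -> a 1x1 matrix holding the prediction
y_hat_matrix = theta.T @ x
print(y_hat_matrix)                        # [[33.]]  (4 + 3*5 + 2*7)

# The same prediction as a plain scalar via the dot product theta · x
y_hat_scalar = float(np.dot(theta.ravel(), x.ravel()))
print(y_hat_scalar)                        # 33.0
```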
A is a matrix and Aᵀ is the transpose of matrix A.

Preparing Data for Linear Regression
For any algorithm, data is the most crucial part. There is a saying, "garbage in, garbage out": if you feed in garbage (irrelevant or noisy data), the output of your model will be garbage (inaccurate). Hence, we need the data to be in its best form before training and testing the model.
Below are a few techniques to clean, scale, and modify data for linear regression. Linear regression makes several assumptions about the data, such as low noise and a linear relationship between the independent and dependent variables.
1-> Linear Assumption: Linear regression assumes that the relationship between your independent variable X (input) and dependent variable Y (output) is linear, or performs better when it is close to linear. If the relationship is not linear, a transformation can be applied to the data.
2-> Remove Collinearity: If your independent variables are correlated with each other, keep only the one most correlated with Y and remove the rest. Example: if your data has both DOB and Age, we can remove one of them.
3-> Gaussian Distributions: Linear regression will make more reliable predictions if your input and output variables have a Gaussian (bell-shaped) distribution. You may get some benefit from applying transforms to your variables to make their distributions more Gaussian-looking.
4-> Remove Noise: Linear regression assumes that your input and output variables are not noisy. We need to clean the data before feeding it to the model. This is most important for the output variable, and you should remove outliers in the dependent variable Y if possible.
5-> Rescale Inputs: Linear regression will often make more reliable predictions if you rescale input variables using standardization or normalization (a small sketch follows after this list).
Medium story on StandardScaler vs MinMaxScaler (normalization) — https://medium.com/@amitupadhyay6/standardscaler-and-normalization-with-code-and-graph-ba220025c054
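Below is a rough sketch (not from the original notebook) of two of the preparation steps above, a quick collinearity check and input standardization; the DataFrame and its column names (age, dob_year, income) are invented purely for illustration.

```python
# A rough sketch of two data-preparation steps: a collinearity check and rescaling.
# The DataFrame and column names here are made up for illustration only.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "age":      [23, 35, 52, 41, 29, 60],
    "dob_year": [2000, 1988, 1971, 1982, 1994, 1963],   # redundant with age
    "income":   [30000, 52000, 80000, 61000, 40000, 90000],
})

# 2-> Remove collinearity: age and dob_year are (almost) perfectly correlated,
# so keep one of them and drop the other.
print(df.corr().round(2))
df = df.drop(columns=["dob_year"])

# 5-> Rescale inputs: standardize each remaining feature to mean 0 and std 1.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)
print(X_scaled.round(2))
```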
Select a Performance Measure
Our model is ready to predict values, but before putting it into production we need to check its performance. For this purpose, we first need a measure of how well (or poorly) the model fits the training data. The most common performance measure of a regression model is the Root Mean Square Error (RMSE). It gives an idea of how much error the system typically makes in its predictions, with a higher weight given to large errors. Therefore, to train a linear regression model, you need to find the value of θ that minimizes the RMSE. In practice, it is simpler to minimize the Mean Squared Error (MSE) than the RMSE, and it leads to the same result.
Root Mean Square Error:
The RMSE is the square root of the variance of the residuals. It indicates the absolute fit of the model to the data: how close the observed data points are to the model's predicted values. Whereas R-squared is a relative measure of fit, RMSE is an absolute measure of fit.
RMSE equation: RMSE = √( (1/m) · Σᵢ (ŷᵢ − yᵢ)² )

Mean Squared Error:
The mean squared error tells you how close a regression line is to a set of points. It does this by taking the distances from the points to the regression line (these distances are the "errors") and squaring them. It is called the mean squared error because you are finding the average of a set of squared errors.
MSE equation: MSE = (1/m) · Σᵢ (ŷᵢ − yᵢ)²

Mean Absolute Error:
Even though the RMSE is generally the preferred performance measure for regression tasks, in some contexts you may prefer to use another function. For example, suppose that there are many outliers in your data. In that case, you may consider using the Mean Absolute Error.
Mean Absolute Error (MAE) is another loss function used for regression models. MAE is the mean of the absolute differences between the target and predicted values, so it measures the average magnitude of errors in a set of predictions without considering their direction.
MAE equation: MAE = (1/m) · Σᵢ |ŷᵢ − yᵢ|

Figure: error in prediction by linear regression.

• Computing the root of a sum of squares (RMSE) corresponds to the Euclidean norm, also called the ℓ2 norm, noted ∥·∥2 (or just ∥·∥).
• Computing the sum of absolutes (MAE) corresponds to the ℓ1 norm, noted ∥·∥1. It is also called the Manhattan norm because it measures the distance between two points when you can only travel along orthogonal city blocks.
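As a quick illustration (not from the original post), the sketch below computes MSE, RMSE, and MAE with scikit-learn on invented toy arrays.

```python
# A quick illustration of the three error measures above, on toy data.
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

y_true = np.array([3.0, 5.0, 7.5, 10.0])   # observed values (toy data)
y_pred = np.array([2.5, 5.5, 7.0, 11.0])   # model predictions (toy data)

mse  = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                         # RMSE is just the square root of MSE
mae  = mean_absolute_error(y_true, y_pred)

print(f"MSE:  {mse:.3f}")
print(f"RMSE: {rmse:.3f}")
print(f"MAE:  {mae:.3f}")
```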
Figure: Euclidean and Manhattan distance.

R-Squared:
The most common interpretation of R-squared is how well the regression model fits the observed data. For example, an R-squared of 60% indicates that 60% of the variation in the data is explained by the regression model. Generally, a higher R-squared indicates a better fit for the model.
R-squared (R²) expresses the degree to which your input variables explain the variation of your output (predicted) variable. So, if R-squared is 0.8, it means 80% of the variation in the output variable is explained by the input variables. In simple terms, the higher the R-squared, the more variation is explained by your input variables and hence the better your model.
R² equation: R² = 1 − SSres / SStot
where SSres is the residual sum of squared errors of our regression model and SStot is the total sum of squared errors.
Adjusted R Square:
However, the problem with R-squared is that it will either stay the same or increase when more variables are added, even if they have no relationship with the output variable. This is where Adjusted R-squared helps: it penalizes you for adding variables that do not improve the existing model.
Hence, if you are building a linear regression on multiple variables, it is always suggested that you use Adjusted R-squared to judge the goodness of the model. If you only have one input variable, R-squared and Adjusted R-squared will be exactly the same.
Typically, the more non-significant variables you add to the model, the larger the gap between R-squared and Adjusted R-squared becomes.
Adjusted R² equation: Adjusted R² = 1 − (1 − R²)(n − 1) / (n − p − 1), where n is the number of samples and p is the number of input features.
As we know, whenever we add a new feature (input variable), the R² value stays the same or increases. Normally a higher R² means better model performance, but R² also stays the same or increases when the added variable is insignificant, which should not happen. That is why we use Adjusted R². Let's look at both cases.
1-> Adding a relevant feature to the data set — if you add a relevant feature, R² increases, so (1 − R²) becomes a small value. Multiplying a smaller value by (n − 1)/(n − p − 1) gives a smaller product (with the same multiplier, 10 × 2 = 20 but 0.5 × 2 = 1), so 1 − {(1 − R²)(n − 1)/(n − p − 1)} becomes larger, because we are subtracting a smaller number from 1. Adjusted R² therefore increases.
2-> Adding an irrelevant feature to the data set — if you add an irrelevant feature, R² stays the same or increases only slightly, so (1 − R²) barely shrinks. However, the denominator (n − p − 1) decreases because p has increased, and dividing (1 − R²)(n − 1) by a smaller value gives a larger result (10/2 = 5, but 10/0.5 = 20). Subtracting this larger number from 1 reduces the resulting Adjusted R², which is exactly what should happen when an irrelevant feature is added to the data set.
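Here is a minimal sketch (my own illustration, not from the original notebook) computing R² with scikit-learn and Adjusted R² from the formula above; n, p, and the arrays are toy values chosen for demonstration.

```python
# A minimal sketch of R-squared and Adjusted R-squared, following the formula above.
# n, p, and the arrays are toy values chosen for demonstration.
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.5, 10.0, 12.0, 15.0])
y_pred = np.array([2.8, 5.4, 7.1, 10.5, 11.6, 15.2])

n = len(y_true)   # number of samples
p = 2             # assume the model used 2 input features

r2 = r2_score(y_true, y_pred)
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(f"R-squared:          {r2:.4f}")
print(f"Adjusted R-squared: {adjusted_r2:.4f}")
```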
- YouTube link: Linear Regression → https://www.youtube.com/watch?v=w8S0uTLTaGA
- Performance measures → https://www.youtube.com/watch?v=eNphW-kjT2I&list=PLbbnl6egUbNhJqmLfwX2eN7XmuqRS8rv8&index=15
- GitHub code → https://github.com/amitupadhyay6/My-Python/blob/master/Linear%20Regression%20on%20Boston.ipynb
Translated from: https://medium.com/analytics-vidhya/linear-regression-in-machine-learning-eeee4dbc8bae