Machine Learning Basics: Logistic Regression
In the previous stories, I explained programs that implement various Regression models. As we move on to Classification, it may seem surprising that this algorithm still carries the name Regression. Let us understand how Logistic Regression works and learn to build a classification model with an example.
Overview of Logistic Regression
Logistic Regression is a classification model used when the dependent variable (output) is binary, such as 0 (False) or 1 (True). Examples include predicting whether a tumor is present (1) or not (0) and whether an email is spam (1) or not (0).
The logistic function, also called the sigmoid function, was initially used by statisticians to describe properties of population growth in ecology. It is a mathematical function that maps predicted values to probabilities. The sigmoid function has an S-shaped curve and takes values between 0 and 1 but never exactly at those limits. Its formula is 1 / (1 + e^-value).
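As a minimal sketch of the formula above, the sigmoid can be written in a few lines of NumPy. The function name sigmoid and the sample inputs are chosen here only for illustration.

```python
import numpy as np

def sigmoid(value):
    """Map any real number (or array of numbers) into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-value))

# A few illustrative inputs, from strongly negative to strongly positive
values = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
probabilities = sigmoid(values)
print(probabilities)
```

Large negative inputs land close to 0, large positive inputs land close to 1, and an input of 0 maps to exactly 0.5, which is what gives the curve its S shape.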
Logistic Regression is an extension of the Linear Regression model. Let us understand this with a simple example. Suppose we want to classify whether an email is spam or not. A Linear Regression model would produce only continuous values such as 0.4 or 0.7, and nothing keeps those values between 0 and 1. Logistic Regression extends the linear model by passing its output through the sigmoid function, which yields a probability between 0 and 1, and then applying a threshold of 0.5: the data point is classified as spam if the output is greater than 0.5 and as not spam if it is less than 0.5.
In this way, we can apply Logistic Regression to classification problems and obtain a class prediction for each data point.
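The idea of turning a linear score into a class label can be sketched as follows. The weights, bias and feature values are hypothetical numbers picked for illustration, not values learned from any dataset in this story.

```python
import numpy as np

# Hypothetical weights and bias, chosen only for illustration
weights = np.array([0.8, -0.5])
bias = 0.1

def predict_probability(features):
    # Linear part, exactly as in Linear Regression
    linear_score = features @ weights + bias
    # Sigmoid squeezes the score into a probability between 0 and 1
    return 1.0 / (1.0 + np.exp(-linear_score))

def classify(features, threshold=0.5):
    # 1 (spam) when the probability reaches the threshold, else 0 (not spam)
    return (predict_probability(features) >= threshold).astype(int)

# Two made-up emails described by two numeric features each
emails = np.array([[3.0, 1.0],
                   [0.5, 4.0]])
print(predict_probability(emails))
print(classify(emails))
```

The first email has a linear score of 2.0 and the second a score of -1.5, so they fall on opposite sides of the 0.5 threshold.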
Problem Analysis
To apply the Logistic Regression model in practice, let us consider a DMV Test dataset with three columns. The first two columns hold the scores of two DMV written tests (DMV_Test_1 and DMV_Test_2), which are the independent variables. The last column holds the dependent variable, Results, which denotes whether the driver got the license (1) or not (0).
We have to build a Logistic Regression model on this data to predict, from the marks obtained in the two DMV written tests, whether a driver will get the license or not.
Step 1: Importing the Libraries
As always, the first step is to import the libraries: NumPy, Pandas and Matplotlib.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Step 2: Importing the dataset
In this step, we load the dataset "DMVWrittenTests.csv" from my GitHub repository. The variable X stores the two DMV test scores and the variable y stores the final output, "Results". dataset.head(5) displays the first 5 rows of the data.
dataset = pd.read_csv('https://raw.githubusercontent.com/mk-gurucharan/Classification/master/DMVWrittenTests.csv')
X = dataset.iloc[:, [0, 1]].values
y = dataset.iloc[:, 2].values
dataset.head(5)
>>
DMV_Test_1 DMV_Test_2 Results
34.623660 78.024693 0
30.286711 43.894998 0
35.847409 72.902198 0
60.182599 86.308552 1
79.032736 75.344376 1
Step 3: Splitting the dataset into the Training set and Test set
In this step, we split the dataset into the Training set, on which the Logistic Regression model is trained, and the Test set, on which the trained model is applied to classify the results. Here test_size=0.25 means that 25% of the data is kept as the Test set and the remaining 75% is used as the Training set.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
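To see what the 75/25 split does to the shapes, here is a small sketch on synthetic data. The 100 random rows are a stand-in and an assumption (consistent with the 25 test samples reported later), not the actual DMV scores.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the DMV data: 100 rows, two score columns
rng = np.random.default_rng(0)
X_demo = rng.uniform(30, 100, size=(100, 2))
y_demo = rng.integers(0, 2, size=100)

# Same arguments as in the step above
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

print(X_tr.shape, X_te.shape)  # (75, 2) (25, 2)
```

Fixing random_state makes the shuffle reproducible, so the same rows end up in the Test set on every run.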
Step 4: Feature Scaling
This is an additional step that standardizes the data so that each feature has zero mean and unit variance. It also aids in speeding up the calculations. As the raw scores vary widely, we use StandardScaler to bring them onto a common, small scale (most values here fall roughly between -2 and 2). For example, the score 62.0730638 is scaled to -0.21231162 and the score 96.51142588 is scaled to 1.55187648. The scaler is fitted on X_train only and then applied to both X_train and X_test.
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
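The scaling itself is just (score - mean) / standard deviation, with the mean and standard deviation taken from the training data. The following sketch checks this against StandardScaler on a handful of made-up scores; the arrays are illustrative and not the actual train/test split.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up scores for illustration only
train_scores = np.array([[34.6, 78.0],
                         [60.2, 86.3],
                         [79.0, 75.3],
                         [30.3, 43.9]])
test_scores = np.array([[62.1, 96.5]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_scores)  # learn mean/std, then scale
test_scaled = scaler.transform(test_scores)        # reuse the training mean/std

# The same computation by hand (StandardScaler uses the population std, ddof=0)
mean = train_scores.mean(axis=0)
std = train_scores.std(axis=0)
manual_test_scaled = (test_scores - mean) / std

print(np.allclose(test_scaled, manual_test_scaled))
```

Calling transform (not fit_transform) on the test data keeps information about the test scores out of the scaling parameters.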
Step 5: Training the Logistic Regression model on the Training Set
In this step, the class LogisticRegression is imported and an instance of it is assigned to the variable "classifier". The classifier.fit() function is called with X_train and y_train, on which the model is trained.
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
Step 6: Predicting the Test set results
In this step, the classifier.predict() function is used to predict the classes for the Test set, and the predictions are stored in the variable y_pred.
y_pred = classifier.predict(X_test)
y_pred
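Behind predict() sit the class probabilities, which scikit-learn exposes through predict_proba(). The sketch below, on synthetic data that is an assumption and not the DMV set, shows that thresholding the probability of class 1 at 0.5 reproduces the predicted labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data, used only to illustrate the API
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

model = LogisticRegression()
model.fit(X_demo[:150], y_demo[:150])

# Column 1 of predict_proba holds the probability of class 1
probabilities = model.predict_proba(X_demo[150:])[:, 1]
labels_from_threshold = (probabilities > 0.5).astype(int)
labels_from_predict = model.predict(X_demo[150:])

print(np.array_equal(labels_from_threshold, labels_from_predict))
```

Working with the probabilities directly is useful when a threshold other than 0.5 suits the problem better.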
Step 7: Confusion Matrix and Accuracy
This step is common to most classification techniques. Here we compute the Accuracy of the trained model and print the confusion matrix.
The confusion matrix is a table that shows the number of correct and incorrect predictions on a classification problem when the real values of the Test set are known. In the form returned by scikit-learn, each row corresponds to an actual class and each column to a predicted class, so a binary problem has four cells: True Negative, False Positive, False Negative and True Positive. The True values are the numbers of correct predictions made.
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
from sklearn.metrics import accuracy_score
print ("Accuracy : ", accuracy_score(y_test, y_pred))
cm
>>Accuracy : 0.88
>>array([[11, 0],
       [ 3, 11]])
From the above confusion matrix, we infer that, out of 25 test set samples, 22 were correctly classified and 3 were incorrectly classified. Pretty good for a start, isn't it?
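The accuracy can be recomputed by hand from the four cells of the matrix printed above. This sketch also derives precision and recall, two metrics the story does not otherwise use, as an optional extra.

```python
import numpy as np

# The confusion matrix reported above (rows: actual class, columns: predicted class)
cm = np.array([[11, 0],
               [3, 11]])

# For a binary problem, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # correct predictions / all predictions
precision = tp / (tp + fp)        # how many predicted 1s were really 1
recall = tp / (tp + fn)           # how many real 1s were found

print("Accuracy :", accuracy)     # 22 / 25 = 0.88
print("Precision:", precision)
print("Recall   :", recall)
```

All three errors are False Negatives: drivers who actually got the license but were predicted not to.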
Step 8: Comparing the Real Values with Predicted Values
In this step, a Pandas DataFrame is created to compare the classified values of both the original Test set (y_test) and the predicted results (y_pred).
df = pd.DataFrame({'Real Values':y_test, 'Predicted Values':y_pred})
df
>>
Real Values Predicted Values
1 1
0 0
0 0
0 0
1 1
1 1
1 0
1 1
0 0
1 1
0 0
0 0
0 0
1 1
1 0
1 1
0 0
1 1
1 0
1 1
0 0
0 0
1 1
1 1
0 0
Though this table may not be as useful as it was with Regression, it shows that the model classifies the test set values with a decent accuracy of 88%, as calculated above.
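A table like this becomes more useful once it is filtered down to the rows where the two columns disagree. The sketch below retypes the 25 rows shown above into plain lists so that it runs on its own; the names comparison and misclassified are illustrative.

```python
import pandas as pd

# The 25 rows of the comparison table above, retyped as lists
real = [1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0,
        1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0]
predicted = [1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0,
             1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0]

comparison = pd.DataFrame({'Real Values': real, 'Predicted Values': predicted})

# Keep only the rows where the prediction differs from the real value
misclassified = comparison[comparison['Real Values'] != comparison['Predicted Values']]

print(misclassified)
print(len(misclassified), "of", len(comparison), "rows misclassified")
```

The three rows that remain are the same three errors counted in the confusion matrix.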
Step 9: Visualising the Results
In this last step, we visualize the results of the Logistic Regression model on a graph that shows the test points together with the two predicted regions.
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
# Build a fine grid that covers the (scaled) test points
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
# Colour every grid point by its predicted class to show the two regions
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
# Overlay the test points, one colour per class (0 in red, 1 in green)
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Logistic Regression')
plt.xlabel('DMV_Test_1')
plt.ylabel('DMV_Test_2')
plt.legend()
plt.show()
In this graph, the value 0 (i.e., No) is plotted in "Red" and the value 1 (i.e., Yes) is plotted in "Green", following the colour order passed to ListedColormap in the code. The Logistic Regression decision boundary separates the two regions. Thus, any data point with the two scores (DMV_Test_1 and DMV_Test_2) given can be plotted on the graph and, depending on which region it falls in, the result (getting the driver's license) can be classified as Yes or No.
As calculated above, three values in the test set are wrongly classified as "No" because they lie on the wrong side of the boundary.
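The straight line separating the two regions is the set of points where the predicted probability is exactly 0.5, that is, where w1 * x1 + w2 * x2 + b = 0. A minimal sketch, fitted on synthetic data (an assumption, not the DMV set), recovers that line from the model's coef_ and intercept_ attributes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data, used only to illustrate the idea
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)

model = LogisticRegression().fit(X_demo, y_demo)
w1, w2 = model.coef_[0]
b = model.intercept_[0]

# Solve w1 * x1 + w2 * x2 + b = 0 for x2 at a few x1 positions
x1 = np.linspace(-2, 2, 5)
x2 = -(w1 * x1 + b) / w2

# Points on the boundary should get a probability of 0.5 for class 1
boundary_points = np.column_stack([x1, x2])
boundary_probabilities = model.predict_proba(boundary_points)[:, 1]
print(boundary_probabilities)
```

Points on one side of this line get a probability above 0.5 and are classified as 1, while points on the other side are classified as 0.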
Conclusion
Thus, in this story, we have built a Logistic Regression model that predicts from the written examination scores whether a person will get the driving license, and we have visualized the results.
I am also attaching the link to my GitHub repository where you can download this Google Colab notebook and the data files for your reference.
You can also find the explanation of the program for other Classification models below:
- Logistic Regression
- K-Nearest Neighbors (KNN) Classification (Coming Soon)
- Support Vector Machine (SVM) Classification (Coming Soon)
- Naive Bayes Classification (Coming Soon)
- Random Forest Classification (Coming Soon)
We will come across the more complex models of Regression, Classification and Clustering in the upcoming articles. Till then, Happy Machine Learning!
Source: https://towardsdatascience.com/machine-learning-basics-logistic-regression-890ef5e3a272