How to Evaluate the Performance of a Machine Learning Model
Table of contents:
- Why evaluation is necessary?
- Confusion Matrix
- Accuracy
- Precision & Recall
- ROC-AUC
- Log Loss
- Coefficient of Determination (R-Squared)
- Summary
為什么需要評估? (Why evaluation is necessary?)
Let me start with a very simple example.
Robin and Sam both started preparing for an entrance exam for engineering college. They shared a room and put in an equal amount of hard work while solving numerical problems. They both studied almost the same hours for the entire year and appeared for the final exam. Surprisingly, Robin cleared the exam but Sam did not. When asked, we got to know that there was one difference in their preparation strategy: a test series. Robin had joined a test series, and he used to test his knowledge and understanding by taking those tests and then evaluating where he was lagging. But Sam was confident, and he just kept training himself.
In the same fashion as discussed above, a machine learning model can be trained extensively with many parameters and new techniques, but as long as you skip its evaluation, you cannot trust it.
How to read a Confusion Matrix?
A confusion matrix is a summary of the relationship between the predictions of a model and the actual class labels of the data points.
Let's say you are building a model which detects whether a person has diabetes or not. After the train-test split you got a test set of length 100, out of which 70 data points are labelled positive (1) and 30 data points are labelled negative (0). Now let me draw the matrix for your test predictions:
Out of 70 actual positive data points, your model predicted 64 points as positive and 6 as negative. Out of 30 actual negative points, it predicted 3 as positive and 27 as negative.
Note: In the notations True Positive, True Negative, False Positive and False Negative, notice that the second term (Positive or Negative) denotes your prediction, while the first term (True or False) denotes whether that prediction was right or wrong.
Based on the above matrix we can define some very important ratios:
TPR (True Positive Rate) = (True Positives / Actual Positives)
TNR (True Negative Rate) = (True Negatives / Actual Negatives)
FPR (False Positive Rate) = (False Positives / Actual Negatives)
FNR (False Negative Rate) = (False Negatives / Actual Positives)
For our diabetes detection model we can calculate these ratios:
TPR = 64/70 = 91.4%
TNR = 27/30 = 90%
FPR = 3/30 = 10%
FNR = 6/70 = 8.6%
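To make these numbers reproducible, here is a minimal Python sketch (mine, not from the original article) computing the four ratios directly from the confusion-matrix counts stated above:

```python
# Confusion-matrix counts from the diabetes example above
tp, fn = 64, 6    # out of 70 actual positives
fp, tn = 3, 27    # out of 30 actual negatives

tpr = tp / (tp + fn)   # True Positive Rate  = 64/70 ≈ 0.914
tnr = tn / (tn + fp)   # True Negative Rate  = 27/30 = 0.900
fpr = fp / (fp + tn)   # False Positive Rate =  3/30 = 0.100
fnr = fn / (fn + tp)   # False Negative Rate =  6/70 ≈ 0.086

print(f"TPR={tpr:.1%}  TNR={tnr:.1%}  FPR={fpr:.1%}  FNR={fnr:.1%}")
```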
If you want your model to be smart, then it has to predict correctly. This means your True Positives and True Negatives should be as high as possible, and at the same time you need to minimize your mistakes, so your False Positives and False Negatives should be as low as possible. In terms of ratios, your TPR and TNR should be very high, whereas FPR and FNR should be very low.
A smart model: TPR ↑, TNR ↑, FPR ↓, FNR ↓
A dumb model: any other combination of TPR, TNR, FPR, FNR
One may argue that it is not possible to take care of all four ratios equally, because at the end of the day no model is perfect. Then what should we do?
Yes, that is true. That is why we build a model keeping the domain in mind. There are certain domains that demand we keep a specific ratio as the main priority, even at the cost of other ratios being poor. For example, in cancer diagnosis we cannot afford to miss any positive patient. So we are supposed to keep TPR at its maximum and FNR close to 0. Even if we predict a healthy patient as positive, it is still acceptable, as they can go for further check-ups.
Accuracy
Accuracy is what its literal meaning says: a measure of how accurate your model is.
Accuracy = Correct Predictions / Total Predictions
By using the confusion matrix, Accuracy = (TP + TN) / (TP + TN + FP + FN)
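For the diabetes example above, this works out to Accuracy = (64 + 27) / (64 + 27 + 3 + 6) = 91/100 = 91%.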
Accuracy is one of the simplest performance metrics we can use. But let me warn you: accuracy can sometimes lead you to false illusions about your model, so you should first understand your data set and the algorithm used, and only then decide whether accuracy is the right metric.
Before going to the failure cases of accuracy, let me introduce two types of data sets:
Balanced: A data set which contains almost equal entries for all labels/classes. E.g., out of 1000 data points, 600 are positive and 400 are negative.
Imbalanced: A data set in which the distribution of entries is heavily biased towards a particular label/class. E.g., out of 1000 entries, 990 belong to the positive class and 10 to the negative class.
Very Important: Never use accuracy as a measure when dealing with an imbalanced test set.
Why?
Suppose you have an imbalanced test set of 1000 entries, with 990 positive (+ve) and 10 negative (-ve). And somehow you ended up creating a poor model which always predicts "+ve" due to the imbalanced training set. Now when you predict your test set labels, it will always predict "+ve". So out of 1000 test set points, you get 1000 "+ve" predictions. Then your accuracy comes out to:
990/1000 = 99%
Whoa! Amazing! You are happy to see such an awesome accuracy score.
But you should know that your model is really poor, because it always predicts the "+ve" label.
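A tiny sketch (my own illustration of the scenario above) that makes this failure mode concrete:

```python
import numpy as np

y_true = np.array([1] * 990 + [0] * 10)   # imbalanced test set: 990 (+ve), 10 (-ve)
y_pred = np.ones_like(y_true)             # the poor model: always predicts "+ve"

accuracy = np.mean(y_true == y_pred)
print(accuracy)   # 0.99 -- looks impressive, yet the model never detects a negative
```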
Very Important: Also, we cannot compare two models which return probability scores and have the same accuracy.
There are certain models, such as Logistic Regression, which give the probability of each data point belonging to a particular class. Let us take this case:
[Table 1: probability scores P(Y=1) assigned by two models, M1 and M2, to the same test points]
As you can see, if P(Y=1) > 0.5 the model predicts class 1. When we calculate accuracy for both M1 and M2, it comes out the same, but looking at the probability scores it is quite evident that M1 is a much better model than M2.
This issue is dealt with beautifully by Log Loss, which I will explain later in this post.
Precision & Recall
Precision: The ratio of True Positives (TP) to the total number of positive predictions (TP + FP). Basically, it tells us how often your positive predictions were actually positive.
Recall: It is nothing but TPR (the True Positive Rate explained above). It tells us, out of all the actually positive points, how many were predicted as positive.
F-Measure: Harmonic mean of precision and recall.
To understand this, let's take an example: when you ask a query in Google, it returns 40 pages, but only 30 of them are relevant. However, your friend who is an employee at Google tells you that there were 100 relevant pages in total for that query. So the precision is 30/40 = 3/4 = 75%, while the recall is 30/100 = 30%. In this case, precision is "how useful the search results are", and recall is "how complete the results are".
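Going back to the diabetes confusion matrix from earlier, here is a short sketch (mine) computing precision, recall, and the F-measure from those counts:

```python
tp, fp, fn = 64, 3, 6   # counts from the diabetes confusion matrix above

precision = tp / (tp + fp)                                  # 64/67 ≈ 0.955
recall    = tp / (tp + fn)                                  # 64/70 ≈ 0.914 (identical to TPR)
f_measure = 2 * precision * recall / (precision + recall)   # harmonic mean ≈ 0.934

print(f"precision={precision:.3f}  recall={recall:.3f}  F={f_measure:.3f}")
```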
ROC & AUC
Receiver Operating Characteristic Curve (ROC):
It is a plot of TPR (True Positive Rate) against FPR (False Positive Rate), calculated by taking multiple threshold values from the reverse-sorted list of probability scores given by a model.
[Figure: a typical ROC curve]
Now, how do we plot ROC?
To answer this, let me take you back to Table 1 above. Consider just the M1 model. For every x value we have a probability score, and in that table we assigned each data point with a score of more than 0.5 to class 1. Now sort all the values in descending order of probability score and, one by one, take threshold values equal to each of those probability scores. That gives the threshold values [0.96, 0.94, 0.92, 0.14, 0.11, 0.08]. For each threshold value, predict the classes and calculate TPR and FPR. You will get 6 pairs of TPR and FPR; plot them and you get the ROC curve.
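Here is a rough sketch (mine, not from the article) of that threshold sweep. The six probability scores are the ones listed above; the true labels are hypothetical, since Table 1 did not survive in this copy:

```python
import numpy as np

scores = np.array([0.96, 0.94, 0.92, 0.14, 0.11, 0.08])  # reverse-sorted scores from above
labels = np.array([1,    1,    0,    1,    0,    0])      # hypothetical true labels

tpr_list, fpr_list = [], []
for thresh in scores:                         # one threshold per probability score
    pred = (scores >= thresh).astype(int)     # predict class 1 at or above the threshold
    tp = np.sum((pred == 1) & (labels == 1))
    fp = np.sum((pred == 1) & (labels == 0))
    fn = np.sum((pred == 0) & (labels == 1))
    tn = np.sum((pred == 0) & (labels == 0))
    tpr_list.append(tp / (tp + fn))
    fpr_list.append(fp / (fp + tn))

# Plotting the (FPR, TPR) pairs gives the ROC curve, e.g. with matplotlib:
# import matplotlib.pyplot as plt; plt.plot(fpr_list, tpr_list, marker="o")
```

In practice, libraries such as scikit-learn (roc_curve, roc_auc_score) perform this sweep for you.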
Note: Since the maximum TPR and FPR value is 1, the area under the curve (AUC) of the ROC lies between 0 and 1.
The area under the blue dashed (diagonal) line is 0.5. AUC = 0 means a very poor model, while AUC = 1 means a perfect model. As long as your model's AUC score is more than 0.5, your model is making sense, because even a random model scores 0.5 AUC.
Very Important: You can get a very high AUC even in the case of a dumb model generated from an imbalanced data set, so always be careful while dealing with imbalanced data sets.
Note: AUC has nothing to do with the numerical values of the probability scores as long as the order is maintained. The AUC will be the same for all models that produce the same ordering of data points after sorting by probability score.
Log Loss
This performance metric checks how far the probability score of each data point deviates from its actual class label and assigns a penalty proportional to that deviation.
For each data point in a binary classification, we calculate its log loss using the formula below:
Log Loss = −( y·log(p) + (1 − y)·log(1 − p) )
Here p is the probability of the data point belonging to class 1, and y is the class label (0 or 1).
Suppose p_1 for some point x_1 is 0.95 and p_2 for some point x_2 is 0.55, and the cut-off probability for qualifying as class 1 is 0.5. Then both qualify for class 1, but the log loss of p_2 will be much higher than the log loss of p_1.
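A quick numeric check of that claim (my own sketch, applying the binary formula above):

```python
import math

def binary_log_loss(y, p):
    # y: true class label (0 or 1); p: predicted probability of class 1
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(binary_log_loss(1, 0.95))   # ≈ 0.051 -> small penalty for a confident, correct score
print(binary_log_loss(1, 0.55))   # ≈ 0.598 -> much larger penalty for a barely-correct score
```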
The range of log loss is [0, infinity).
For each data point in a multi-class classification, we calculate its log loss using the formula below:
Log Loss = −Σ_c y(o,c)·log(p(o,c)), summed over all classes c
Here y(o,c) = 1 if observation o belongs to class c (and 0 otherwise), and p(o,c) is the predicted probability that observation o belongs to class c. The rest of the concept is the same.
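And a matching sketch (again my own) for the multi-class case, for a single observation over three classes:

```python
import numpy as np

y = np.array([0, 1, 0])          # one-hot true label: the observation belongs to class 2
p = np.array([0.2, 0.7, 0.1])    # predicted probabilities for the three classes

log_loss = -np.sum(y * np.log(p))   # only the true class's probability contributes
print(log_loss)                      # -log(0.7) ≈ 0.357
```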
Coefficient of Determination (R-Squared)
It is denoted by R². While predicting the target values of the test set, we encounter some errors (e_i), which are the differences between the predicted values and the actual values.
Let's say we have a test set with n entries. Every data point has a target value, say [y1, y2, y3, ..., yn]. Let the predicted values of the test data be [f1, f2, f3, ..., fn].
Calculate the Residual Sum of Squares, which is the sum of all the squared errors (e_i), using this formula, where fi is the target value predicted by the model for the i-th data point:
SS_R = Σ_i (yi − fi)²
Take the mean of all the actual target values:
ȳ = (y1 + y2 + ... + yn) / n
Then calculate the Total Sum of Squares, which is proportional to the variance of the test set target values:
SS_T = Σ_i (yi − ȳ)²
If you look at both sum-of-squares formulas, you can see that the only difference is the second term, i.e. ȳ (y_bar) versus fi. The total sum of squares gives us the intuition that it is just the residual sum of squares of a model whose predictions are [ȳ, ȳ, ȳ, ..., ȳ] (n times). Yes, your intuition is right: think of a very simple "mean model" which always predicts the average of the target values, irrespective of the input data.
Now we formulate R² as:
R² = 1 − (SS_R / SS_T)
As you can now see, R² is a metric that compares your model with the very simple mean model, which returns the average of the target values every time, irrespective of the input data. The comparison has 4 cases:
case 1: SS_R = 0
(R² = 1) Perfect model with no errors at all.
case 2: SS_R > SS_T
(R² < 0) Model is even worse than the simple mean model.
case 3: SS_R = SS_T
(R² = 0) Model is the same as the simple mean model.
case 4: SS_R < SS_T
(0 < R² < 1) Model is okay.
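To make the four cases concrete, here is a minimal sketch (mine, with made-up target values) that computes R² exactly as formulated above and compares the model against the simple mean model:

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0, 9.0])   # actual target values (made up for illustration)
f = np.array([2.8, 5.3, 6.6, 9.4])   # model predictions (made up)

ss_r = np.sum((y - f) ** 2)          # Residual Sum of Squares
ss_t = np.sum((y - y.mean()) ** 2)   # Total Sum of Squares (errors of the mean model)
r2 = 1 - ss_r / ss_t

print(r2)   # ≈ 0.98: falls in case 4, much better than just predicting the mean
```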
Summary
So, in a nutshell, you should know your data set and problem very well; then you can always create a confusion matrix and check its accuracy, precision, and recall, plot the ROC curve, and find the AUC as per your needs. But if your data set is imbalanced, never use accuracy as a measure. If you want to evaluate your model even more deeply, so that the probability scores are also given weight, then go for Log Loss.
Remember, always evaluate your training!
Translated from: https://medium.com/swlh/how-to-evaluate-the-performance-of-your-machine-learning-model-40769784d654