Build Your First Random Forest Classifier
In this post, I will guide you through building a simple classifier using Random Forest from the scikit-learn library.
We will start by downloading the data set from Kaggle; after that, we will do some basic data cleaning, and finally we will fit the model and evaluate it. Along the way, we will also create a baseline model to be used for evaluation.
This article is suitable for beginner Data Scientists who would like to see the basic workflow of a Machine Learning project and build their first classifier.
Downloading and loading the data set
We will be working with the Heart Disease Data set, which can be downloaded from Kaggle using this link.
This data set consists of almost 300 patients who either have or do not have heart issues. This is what we will be predicting.
In order to do this, we will use thirteen different features:
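These are the thirteen columns used throughout this post; the short descriptions below follow the standard UCI Heart Disease documentation for this data set:

- age: age in years
- sex: sex (1 = male, 0 = female)
- cp: chest pain type
- trestbps: resting blood pressure
- chol: serum cholesterol
- fbs: whether fasting blood sugar is above 120 mg/dl
- restecg: resting electrocardiographic results
- thalach: maximum heart rate achieved
- exang: exercise-induced angina
- oldpeak: ST depression induced by exercise relative to rest
- slope: the slope of the peak exercise ST segment
- ca: number of major vessels colored by fluoroscopy
- thal: thalassemia test result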
Take time to familiarize yourself with these descriptions now so you have an understanding of what each column represents.
Once you have downloaded the data set and placed it in the same folder as your Jupyter notebook file, you can use the following commands to load the data set.
import pandas as pd

df = pd.read_csv('data.csv')
df.head()
This is the head of the data frame that you will be working with.
Data Cleaning
Did you spot the question marks in the data frame above? It looks like the author of this data set has used them to indicate null values. Let’s replace them with real None values.
df.replace({'?': None}, inplace=True)

Now that we have done that, we can inspect how many null values there are in our data set. We can do this with the info() function.
df.info()

We can see here that columns 10, 11, and 12 have a lot of nulls. 'ca' and 'thal' are actually almost empty, and 'slope' has only 104 entries. These are too many missing values to fill in, so let’s drop those three columns.
df.drop(columns=['slope', 'thal', 'ca'], inplace=True)

The rest of the columns have no or only a few missing values. For simplicity, I suggest dropping the entries that do have them. We should not lose too much data.
df.dropna(inplace=True)

Another piece of information we can read from the result of the info() function is that most of the columns are objects, even though they seem to hold numeric values.
My suspicion is that this was caused by the question marks in the initial data set. Now that we have removed them, we should be able to change the objects to numeric values.
In order to do this, we will use the pd.to_numeric() function on the whole data frame. The object values should become numbers, and it should not affect the values that are already numbers.
df = df.apply(pd.to_numeric)
df.info()
As you can see, we are now left with only floats and integers. The info() function also confirms that the columns 'ca', 'thal', and 'slope' were dropped.
Also, the rows with null values were removed, and as a result, we have a data set with 261 entries, all numeric.
There is one more thing we need to do before we can proceed. I have noticed that the last column, 'num', has some trailing spaces in its name (you cannot see this with the naked eye), so let’s have a look at the list of column names.
df.columns

You should see the trailing spaces in the last column, 'num'. Let’s remove them by applying the strip() function.
df.columns = [column.strip() for column in df.columns]

Done!
Exploratory Data Analysis
Let’s do some basic data analysis. We are going to look at the distribution of variables using histograms first.
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 10))
df.hist()
plt.tight_layout()
What we can notice straight away is that some variables are not continuous. Actually, only five features are continuous: 'age', 'chol', 'oldpeak', 'thalach', and 'trestbps', whereas the others are categorical variables.
Because we want to treat them differently in our exploration, we should divide them into two groups.
continous_features = ['age', 'chol', 'oldpeak', 'thalach', 'trestbps']
non_continous_features = list(set(df.columns) - set(continous_features + ['num']))

After doing this, you can check their values by typing the variable names into a Jupyter notebook cell.
完成此操作后,您可以通過在Jupyter筆記本單元格中鍵入變量名稱來檢查其值。
continous_features
non_continous_features

Now we would like to inspect how the continuous features differ across the target variable. We will do this with a scatterplot.
import seaborn as sns

df.num = df.num.map({0: 'no', 1: 'yes'})
sns.pairplot(df[continous_features + ['num']], hue='num')
* Note that we had to turn the 'num' variable into a string in order to use it as the hue parameter. We did this by mapping 0 to 'no', meaning healthy patients, and 1 to 'yes', meaning patients with heart disease.
If you look at the scatterplots and KDEs, you can see that there are distinct patterns for patients with heart disease in comparison to patients who are healthy.
In order to explore the categorical variables, we will look at the distinct values they can take by using the describe() function.
df[non_continous_features].applymap(str).describe()

We can see that 'exang', 'fbs', and 'sex' are binary (they take only two distinct values), whereas 'cp' and 'restecg' take four and three distinct values, respectively.
The last two are ordered categorical variables, as encoded by the data set authors. I am not sure if we should treat them that way or change them to dummy encodings. This would need further investigation, and we could change the approach in the future. For now, we will leave them as ordered.
最后兩個是由數據集作者編碼的有序分類變量。 我不確定是否應該這樣對待它們或將其更改為虛擬編碼。 這需要進一步的調查,我們將來可能會改變方法。 目前,我們將讓他們下訂單。
Last but not least, we are going to explore the target variable.
最后但并非最不重要的一點是,我們將探索目標變量。
df.num.value_counts()

We have 163 healthy patients and 98 patients with heart problems. It is not an ideally balanced data set, but it should be fine for our purposes.
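It can also help to see this balance as proportions, since it will come up again with the baseline model (a quick sketch):

# Class proportions: roughly 62% healthy vs. 38% with heart disease (163 / 261 ≈ 0.62)
df.num.value_counts(normalize=True)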
Creating a baseline model
After a quick exploratory data analysis, we are ready to build an initial classifier. We are going to start by dividing the data set into features and the target variable.
X = df.drop(columns='num')
y = df.num.map({'no': 0, 'yes': 1})
* Note that I had to reverse the mapping applied while creating the seaborn graph, hence the need for the map() function when creating the y variable.
We have also used all the features the data set offers, since, judging by our quick EDA, they all seemed relevant.
Now we will divide the X and y variables further into their train and test counterparts using the train_test_split() function.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

As a result of the above operations, we should now have four different variables, X_train, X_test, y_train, and y_test, whose dimensions are printed above.
Now we will build a baseline using a DummyClassifier.
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

dc = DummyClassifier(strategy='most_frequent')
dc.fit(X, y)
dc_preds = dc.predict(X)
accuracy_score(y, dc_preds)
As you can see, the baseline classifier gives us 62% accuracy on the data it was fitted on. The strategy for our baseline is to predict the most frequent class.
Let’s see if we can beat it with Random Forest.
Random Forest Classifier
The code below sets up a Random Forest Classifier and uses cross-validation to see how well it performs on different folds.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rfc = RandomForestClassifier(n_estimators=100, random_state=1)
cross_val_score(rfc, X, y, cv=5)
As you can see, these accuracies are in general much higher than our dummy baseline. Only the last fold has lower accuracy; it looks like this last fold contains examples that are hard to recognize.
Nevertheless, if we take the average of those five scores, we get an accuracy of around 74%, which is much higher than the 62% baseline.
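If you want to reproduce that average yourself, it is a one-liner (a small sketch):

# Mean accuracy across the five cross-validation folds (around 0.74)
cross_val_score(rfc, X, y, cv=5).mean()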
Normally this is the stage where we would further tune the model parameters using, for example, GridSearchCV, but that is not part of this tutorial.
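For the curious, such a tuning step could look roughly like the sketch below. The parameter grid is purely illustrative, an assumption rather than a recommendation.

from sklearn.model_selection import GridSearchCV

# Illustrative grid; the values are assumptions, not tuned recommendations
param_grid = {'n_estimators': [100, 200, 500], 'max_depth': [None, 5, 10]}
grid = GridSearchCV(RandomForestClassifier(random_state=1), param_grid, cv=5)
grid.fit(X_train, y_train)
grid.best_params_, grid.best_score_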
Let’s see how well the model performs on the test set now. If you have been paying attention, you will have noticed that we have not done anything with the test set so far; it has been left alone until now.
Evaluating the model
We will start by checking model performance in terms of accuracy.
First, we will fit the model using the whole training data, and then we will call the accuracy_score() function on the test set.
rfc.fit(X_train, y_train)
accuracy_score(rfc.predict(X_test), y_test)
We are getting 75% accuracy on the test set, similar to our average cross-validation accuracy, which was 74%.
Let’s see how well the Dummy classifier does on the test set.
accuracy_score(dc.predict(X_test), y_test)

Accuracy for the baseline classifier is around 51%. This is much worse than the accuracy of our random forest model.
However, we should not look only at accuracy when evaluating a classifier. Let’s have a look at the confusion matrices for both the random forest and the baseline model.
We will start by computing the confusion matrix for Random Forest using a scikit-learn function.
from sklearn.metrics import plot_confusion_matrix

plot_confusion_matrix(rfc, X_test, y_test)
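A quick note if you are on a newer scikit-learn: plot_confusion_matrix was removed in scikit-learn 1.2, and the equivalent call is shown below.

from sklearn.metrics import ConfusionMatrixDisplay

# Equivalent of plot_confusion_matrix on scikit-learn >= 1.0
ConfusionMatrixDisplay.from_estimator(rfc, X_test, y_test)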
Actually, we are not doing badly at all. We have only five False Positives and eight False Negatives. Additionally, we have predicted heart disease for eighteen of the twenty-six people who had heart problems.
Not great but not that bad. Note that we did not even tune the model!
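If you would rather see these numbers summarized than read them off the plot, a classification report is a handy sketch (precision and recall per class, computed on the test set):

from sklearn.metrics import classification_report

# Precision, recall, and F1 score for each class on the test set
print(classification_report(y_test, rfc.predict(X_test)))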
Let’s compare this confusion matrix with the one calculated for the baseline model.
plot_confusion_matrix(dc, X_test, y_test)
Have a closer look at the graph above. Can you see that we always predict label 0? This means we predict that all patients are healthy!
That is right, we have set our Dummy Classifier to predict the majority class. Note that this would be a terrible model for our purposes, as we would not discover any patients with heart issues.
Random Forest did much better! We actually discovered 18 of the 26 people with heart problems in the test set.
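That 18-out-of-26 figure is the recall for the positive class, and it can be computed directly (a small sketch):

from sklearn.metrics import recall_score

# Fraction of actual heart-disease patients the model found: 18 / 26 ≈ 0.69
recall_score(y_test, rfc.predict(X_test))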
Summary
In this post, you have learned how to build a basic classifier using Random Forest.
It was rather an overview of the main techniques used when building a model on a data set, without going into too many details.
This was intentional, so that the article does not get too long and can serve as a starting point for someone who wants to build their first classifier.
Happy Learning!
Originally published at https://www.aboutdatablog.com on August 13, 2020.
PS: I write articles that explain basic Data Science concepts in a simple and comprehensible way on aboutdatablog.com. If you liked this article, there are some other ones you may enjoy:
Translated from: https://towardsdatascience.com/build-your-first-random-forest-classifier-cbc63a956158