Diabetes Classification Using Decision Trees in R — onezero blog
Article Outline
- What is a decision tree?
- Why use them?
- Data Background
- Descriptive Statistics
- Decision Tree Training and Evaluation
- Decision Tree Pruning
- Hyperparameter Tuning
What is a decision tree?
A decision tree is a flowchart-like representation of a model. The classification and regression tree (a.k.a. decision tree) algorithm is usually attributed to Breiman et al. (1984), but that certainly was not the earliest work. Wei-Yin Loh of the University of Wisconsin has written about the history of decision trees; you can read it in “Fifty Years of Classification and Regression Trees”.
In a decision tree, the top node is called the “root node” and the bottom nodes “terminal nodes” (leaves). The other nodes are called “internal nodes”; each internal node contains a binary split condition, while each leaf node carries an associated class label.
Photo by Saed Sayad on saedsayad.com
A classification tree uses split conditions to predict a class label from the provided input variables. The splitting process starts at the top node (the root node); at each internal node, the supplied input values are routed recursively to the left or right branch according to that node’s split condition, which is chosen during training using a purity criterion (Gini impurity or information gain). The process terminates when a leaf (terminal) node is reached.
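For reference, both criteria measure how mixed the class labels are at a node. In a standard formulation (added here for context, not spelled out in the original post), for a node $t$ with class proportions $p_k$:

$$\text{Gini}(t) = 1 - \sum_{k} p_k^{2}, \qquad \text{Entropy}(t) = -\sum_{k} p_k \log_2 p_k$$

A split is chosen to minimize the weighted Gini impurity of the child nodes, or equivalently to maximize information gain — the drop in entropy from the parent node to its children.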
Why use them?
A single decision-tree-based model is easy to build, plot, and interpret, which is what makes this algorithm so popular. You can use it for classification as well as regression tasks.
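As a quick illustration of the regression case, the same rpart() call works with method = "anova" (a minimal sketch using the built-in mtcars data set, which is not part of this article’s data):

# Regression tree sketch: predict fuel efficiency (mpg) from all other columns
library(rpart)
reg_tree <- rpart(formula = mpg ~ ., data = mtcars, method = "anova")
predict(reg_tree, newdata = mtcars[1:3, ])  # predicted mpg for the first three cars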
Data Background
In this example, we are going to use the Pima Indian Diabetes 2 data set obtained from the UCI Repository of machine learning databases (Newman et al. 1998).
This data set is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the data set is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the data set. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.
The Pima Indian Diabetes 2 data set is the refined version (all missing values were assigned as NA) of the Pima Indian diabetes data. The data set contains the following independent and dependent variables.
Independent variables (symbol: I)
I1: pregnant: Number of times pregnant
I2: glucose: Plasma glucose concentration (glucose tolerance test)
I3: pressure: Diastolic blood pressure (mm Hg)
I4: triceps: Triceps skinfold thickness (mm)
I5: insulin: 2-hour serum insulin (mu U/ml)
I6: mass: Body mass index (weight in kg / (height in m)^2)
I7: pedigree: Diabetes pedigree function
I8: age: Age (years)
Dependent variable (symbol: D)
D1: diabetes: diabetes case (pos/neg)
Aim of the Modelling
- fitting a decision tree classification model that accurately predicts whether or not the patients in the data set have diabetes
- pruning the decision tree to reduce overfitting
- tuning the decision tree's hyperparameters
Loading relevant libraries
The first step of data analysis starts with loading relevant libraries.
library(mlbench)     # Diabetes dataset
library(rpart)       # Decision tree
library(rpart.plot)  # Plotting decision tree
library(caret)       # Accuracy estimation
library(Metrics)     # For different model evaluation metrics
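If any of these packages are missing, they can be installed once beforehand (a setup step assumed here, not shown in the original post):

# One-time installation of the required packages
install.packages(c("mlbench", "rpart", "rpart.plot", "caret", "Metrics"))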
Loading the dataset
The very next step is to load the data into the R environment. Since the data set ships with the mlbench package, it can be loaded by calling data().
# Load the diabetes dataset
data(PimaIndiansDiabetes2)
Data Preprocessing
The next step is to perform exploratory analysis. First, we remove the missing values using the na.omit() function, then print the data types with the glimpse() function from the dplyr library. You can see that all variables except the dependent variable (diabetes: categorical/factor) are of type double.
Diabetes <- na.omit(PimaIndiansDiabetes2)  # Data for modeling
dplyr::glimpse(Diabetes)
Data Types (figure)
Train and Test Split
The next step is to split the dataset into 80% train and 20% test. Here, we use the sample() method to randomly draw a 1-or-2 label for each observation (with replacement, with probabilities 0.8 and 0.2). Next, based on this index, we split out the train and test data.
set.seed(123)
index <- sample(2, nrow(Diabetes), prob = c(0.8, 0.2), replace = TRUE)
Diabetes_train <- Diabetes[index == 1, ]  # Train data
Diabetes_test <- Diabetes[index == 2, ]   # Test data
The train data includes 318 observations and the test data 74 observations; both contain 9 variables.
print(dim(Diabetes_train))
print(dim(Diabetes_test))
Train and Test Dimensions (output)
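The outline above also promises descriptive statistics; a minimal way to get them, assuming the objects created so far, is the base summary() function (not shown in the original post):

# Per-variable descriptive statistics (min, quartiles, mean, max) of the training data
summary(Diabetes_train)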
Model Training
The next step is model training and evaluation of model performance.
Training a Decision Tree
For decision tree training, we will use the rpart() function from the rpart library. The arguments include: the model formula, the data, and the method.
formula = diabetes ~ . means diabetes is predicted by all other (independent) variables.
Here, method should be specified as "class" for the classification task.
# Train a decision tree model
Diabetes_model <- rpart(formula = diabetes ~ .,
                        data = Diabetes_train,
                        method = "class")
Model Plotting
The main advantage of a tree-based model is that you can plot the tree structure and figure out the decision mechanism from it.
# type = 0: draw a split label at each split and a node label at each leaf
# yesno = 2: label each split with "yes"/"no"
# extra = 0: no extra information at the nodes
rpart.plot(x = Diabetes_model, yesno = 2, type = 0, extra = 0)
Diabetes_model Tree Structure (figure)
Model Performance Evaluation
The next step is to see how our trained model performs on the test (unseen) dataset. To predict the test data classes, we supply the model object, the test dataset, and type = "class" to the predict() function.
# Class prediction
class_predicted <- predict(object = Diabetes_model,
                           newdata = Diabetes_test,
                           type = "class")
(a) Confusion matrix
To evaluate test performance, we use confusionMatrix() from the caret library. Out of 74 test observations, the model predicts 17 incorrectly; it achieves about 77.03% accuracy using a single decision tree.
# Generate a confusion matrix for the test data
confusionMatrix(data = class_predicted,
                reference = Diabetes_test$diabetes)
Diabetes_model Test Evaluation Statistics (output)
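The reported accuracy is simply the fraction of correct predictions; a quick manual cross-check using the objects above:

# (74 - 17) / 74 ≈ 0.7703, matching the confusion matrix output
mean(class_predicted == Diabetes_test$diabetes)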
(b) Test accuracy
We can also supply the predicted class labels and the original test labels to the accuracy() function (from Metrics) to estimate model accuracy.
accuracy(actual = Diabetes_test$diabetes,
         predicted = class_predicted)
Diabetes_model Test Accuracy (output)
Splitting-Criteria-Based Model Comparison
While building the model, the decision tree algorithm uses a splitting criterion. Two popular splitting criteria are used in decision trees: one is called "gini" (Gini impurity) and the other "information" (information gain). Here, we compare model performance on the test set after training with each criterion. The splitting criterion is supplied as a list via the parms argument.
# Model training based on Gini-based splitting criteria
Diabetes_model1 <- rpart(formula = diabetes ~ .,
                         data = Diabetes_train,
                         method = "class",
                         parms = list(split = "gini"))

# Model training based on information-gain-based splitting criteria
Diabetes_model2 <- rpart(formula = diabetes ~ .,
                         data = Diabetes_train,
                         method = "class",
                         parms = list(split = "information"))
Model Evaluation on Test Data
After model training, the next step is to predict the class labels of the test dataset.
# Generate class predictions on the test data using Gini-based splitting criteria
pred1 <- predict(object = Diabetes_model1,
                 newdata = Diabetes_test,
                 type = "class")

# Generate class predictions on the test data using information-gain-based splitting criteria
pred2 <- predict(object = Diabetes_model2,
                 newdata = Diabetes_test,
                 type = "class")
Prediction Accuracy Comparison
Next, we compare the accuracy of the models. Here, we can observe that "gini"-based splitting yields a more accurate model than "information"-based splitting.
# Compare classification accuracy on the test data
accuracy(actual = Diabetes_test$diabetes,
         predicted = pred1)
accuracy(actual = Diabetes_test$diabetes,
         predicted = pred2)
Diabetes_model1 Test Accuracy (output)
Diabetes_model2 Test Accuracy (output)
The initial model (Diabetes_model) and the "gini"-based model (Diabetes_model1) provide the same accuracy, as rpart uses "gini" as its default splitting criterion.
Decision Tree Pruning
The initial model's (Diabetes_model) plot shows that the tree structure is deep and fragile, which makes the decision process harder to interpret. Here we explore ways to make the tree more interpretable without losing performance. One way of doing this is by pruning the fragile parts of the tree (the parts that contribute to overfitting).
(a) Plotting error vs. the complexity parameter
The decision tree has a parameter called the complexity parameter (cp), which controls the size of the tree. If the cost of adding another split from the current node is above the value of cp, tree building does not continue. We can generate the cp vs. error plot with the plotcp() function.
# Plot the complexity parameter (CP) table
plotcp(Diabetes_model1)
Error vs. CP Plot (figure)
(b) Generating the complexity parameter table
We can also generate the cp table by calling model$cptable. Here, you can observe that the cross-validated error (xerror) reaches its minimum at a CP value of 0.025.
# Print the complexity parameter (CP) table
print(Diabetes_model1$cptable)
(c) Obtaining an optimal pruned model
We can retrieve the optimal CP value by identifying the index of the minimum xerror and using it to index into the CP table.
# Retrieve the optimal cp value based on cross-validated error
index <- which.min(Diabetes_model1$cptable[, "xerror"])
cp_optimal <- Diabetes_model1$cptable[index, "CP"]
The next step is to prune the tree using the prune() function, supplying the optimal CP value. If we plot the optimal pruned tree, we can now observe that it is very simple and easy to interpret.
A person with a glucose level above 128 and an age greater than 25 is designated diabetes-positive; otherwise, negative.
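Spelled out as code, the pruned tree's logic reduces to a single hand-written rule (an illustration of the statement above, not the model itself; the actual pruning code follows):

# Hand-coded version of the pruned tree's decision rule (illustration only)
rule_pred <- ifelse(Diabetes_test$glucose > 128 & Diabetes_test$age > 25,
                    "pos", "neg")
head(rule_pred)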
# Prune the tree based on the optimal CP value
Diabetes_model1_opt <- prune(tree = Diabetes_model1, cp = cp_optimal)
rpart.plot(x = Diabetes_model1_opt, yesno = 2, type = 0, extra = 0)
(d) Pruned tree performance
The next step is to check whether the pruned tree performs similarly or whether performance has been compromised. The check shows that the pruned tree is as capable as the earlier fragile tree, but now it is simple and easy to interpret.
pred3 <- predict(object = Diabetes_model1_opt,
                 newdata = Diabetes_test,
                 type = "class")
accuracy(actual = Diabetes_test$diabetes,
         predicted = pred3)
Decision Tree Hyperparameter Tuning
Next, we try to increase the performance of the decision tree model by tuning its hyperparameters. rpart() offers several hyperparameters, but here we tune two important ones: minsplit and maxdepth.
minsplit: the minimum number of observations that must exist in a node for a split to be attempted.
maxdepth: the maximum depth of any node of the final tree.
(a) Generating the hyperparameter grid
First, we generate a sequence from 1 to 20 for both minsplit and maxdepth. Then we build a grid of all parameter combinations using the expand.grid() function.
##############################
# Hyperparameter Grid Search
##############################
# minsplit: the minimum number of observations that must exist in a node for a split to be attempted
# maxdepth: the maximum depth of any node of the final tree
minsplit <- seq(1, 20, 1)
maxdepth <- seq(1, 20, 1)

# Generate a search grid (20 x 20 = 400 combinations)
hyperparam_grid <- expand.grid(minsplit = minsplit, maxdepth = maxdepth)
(b) Training grid-based models
The next step is to train a model for each hyperparameter combination in the grid. This is done through the following steps:
- using a for loop to iterate over each hyperparameter combination in the grid and supplying it to the rpart() function for model training
- storing each model in an empty list (diabetes_models)
num_models <- nrow(hyperparam_grid)

# Create an empty list
diabetes_models <- list()

# Loop over the rows of hyperparam_grid to train the grid of models
for (i in 1:num_models) {
  minsplit <- hyperparam_grid$minsplit[i]
  maxdepth <- hyperparam_grid$maxdepth[i]
  # Train a model and store it in the list
  diabetes_models[[i]] <- rpart(formula = diabetes ~ .,
                                data = Diabetes_train,
                                method = "class",
                                minsplit = minsplit,
                                maxdepth = maxdepth)
}
(c) Computing test accuracy
The next step is to evaluate each model's performance on the test data and retrieve the best model. This is done through the following steps:
- using a for loop to iterate over each model in the list, predicting on the test data and computing accuracy
- storing each model's accuracy in an empty vector (accuracy_values)
num_models <- length(diabetes_models)

# Create an empty vector to store accuracy values
accuracy_values <- c()

# Loop over the models to estimate test accuracy
for (i in 1:num_models) {
  # Retrieve model i from the list
  model <- diabetes_models[[i]]
  # Generate predictions on the test data
  pred <- predict(object = model,
                  newdata = Diabetes_test,
                  type = "class")
  # Compute test accuracy and store it in accuracy_values
  accuracy_values[i] <- accuracy(actual = Diabetes_test$diabetes,
                                 predicted = pred)
}
(d) Identifying the best model
The next step is to retrieve the best-performing model (maximum accuracy) and print its hyperparameters using model$control. We can observe that with a minsplit of 17 and a maxdepth of 6, the model provides the most accurate results when evaluated on the unseen test dataset.
# Identify the model with maximum accuracy
best_model <- diabetes_models[[which.max(accuracy_values)]]

# Print the hyperparameters of the best model
best_model$control
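Since best_model$control prints the full control list, the two tuned values can also be pulled out directly (a small convenience sketch):

# Extract just the tuned hyperparameters from the fitted model
best_model$control$minsplit
best_model$control$maxdepth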
(e) Best model evaluation on test data
After identifying the best-performing model, the next step is to see how accurate it is. With the best hyperparameters, the model achieves an accuracy of 81.08%, a clear improvement over the initial 77.03%.
# Best model accuracy on the test data
pred <- predict(object = best_model,
                newdata = Diabetes_test,
                type = "class")
accuracy(actual = Diabetes_test$diabetes,
         predicted = pred)
(f) Best model plot
Now it is time to plot the best model.
rpart.plot(x = best_model, yesno = 2, type = 0, extra = 0)
Best Model's Layout (figure)
Even though the above plot is for the best-performing model, it still looks a little fragile. So your next task would be to prune it and see whether you get a more interpretable decision tree, as sketched below.
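A sketch of that follow-up, reusing the pruning recipe from earlier (the result depends on the cross-validated errors stored in best_model's cp table):

# Prune the tuned model at its cross-validation-optimal cp, as before
index_best <- which.min(best_model$cptable[, "xerror"])
cp_best <- best_model$cptable[index_best, "CP"]
best_model_pruned <- prune(tree = best_model, cp = cp_best)
rpart.plot(x = best_model_pruned, yesno = 2, type = 0, extra = 0)

# Check that test accuracy holds up after pruning
pred_pruned <- predict(object = best_model_pruned,
                       newdata = Diabetes_test,
                       type = "class")
accuracy(actual = Diabetes_test$diabetes,
         predicted = pred_pruned)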
I hope you learned something new. See you next time!
Note
This article was first published on onezero.blog, a data science, machine learning, and research blogging platform maintained by me.
Read more by visiting my personal blog website: https://onezero.blog/
If you learned something new and liked this article, say 👋 / follow me on onezero.blog (my personal blogging website), Twitter, LinkedIn, YouTube and Github.
[1] Breiman, L., Friedman, J., Stone, C. J. and Olshen, R. A. (1984). Classification and Regression Trees. CRC Press.
[2] Loh, W. (2014). Fifty Years of Classification and Regression Trees.
[3] Newman, C. B. D. & Merz, C. (1998). UCI Repository of Machine Learning Databases. Technical report, University of California, Irvine, Dept. of Information and Computer Sciences.
Originally published at https://onezero.blog on August 2, 2020.