Model Comparison for Predicting Diabetes Outcomes
Subject of the Project
The dataset is primarily used for predicting the onset of diabetes within five years in females of Pima Indian heritage over the age of 21 given medical details about their bodies. The dataset is meant to correspond with a binary (2-class) classification machine learning problem.
We have a dependent variable that indicates whether a person has diabetes. Our goal is to model the relationship between the other variables and this outcome.
Given a person's various features, we want to build a machine learning model that predicts whether that person will develop diabetes. This is a classification problem.
Dataset Information
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.
We have 9 columns and 768 instances (rows). The column names are provided as follows:
- Pregnancies: Number of times pregnant
- Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- BloodPressure: Diastolic blood pressure (mm Hg)
- SkinThickness: Triceps skinfold thickness (mm)
- Insulin: 2-hour serum insulin measurement (μU/ml)
- BMI: Body mass index (weight in kg / (height in m)²)
- DiabetesPedigreeFunction: Diabetes pedigree function
- Age: Age (years)
- Outcome: Class variable (0 or 1, 0 = non-diabetic, 1 = diabetic)
Data Understanding
# installation of libraries
import numpy as np
import pandas as pd
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import confusion_matrix, accuracy_score, mean_squared_error, r2_score, roc_auc_score, roc_curve, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from lightgbm import LGBMClassifier
from sklearn.model_selection import KFold

# any warnings that do not significantly impact the project are ignored
import warnings
warnings.simplefilter(action = "ignore")

# reading the dataset
df = pd.read_csv("diabetes.csv")
# selection of the first 5 observations
df.head()

# return a random sample of items from an axis of the object
df.sample(3)

# random selection from the dataset at the given fraction
df.sample(frac = 0.01)

# size information
df.shape
(768, 9)

# the dataframe's index dtype, column dtypes, non-null counts and memory usage
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pregnancies 768 non-null int64
1 Glucose 768 non-null int64
2 BloodPressure 768 non-null int64
3 SkinThickness 768 non-null int64
4 Insulin 768 non-null int64
5 BMI 768 non-null float64
6 DiabetesPedigreeFunction 768 non-null float64
7 Age 768 non-null int64
8 Outcome 768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

# descriptive statistics at the specified percentiles, transposed (.T) for easier reading
df.describe([0.10, 0.25, 0.50, 0.75, 0.90, 0.95, 0.99]).T

# correlation between variables
df.corr()
Our eventual goal is to exploit patterns in our data in order to predict the onset of diabetes. Let's visualize some of the differences between those who developed diabetes and those who did not.
# get a histogram of the Glucose column for both classes
col = 'Glucose'
plt.hist(df[df['Outcome']==0][col], 10, alpha=0.5, label='non-diabetes')
plt.hist(df[df['Outcome']==1][col], 10, alpha=0.5, label='diabetes')
plt.legend(loc='upper right')
plt.xlabel(col)
plt.ylabel('Frequency')
plt.title('Histogram of {}'.format(col))
plt.show()
This histogram shows a pronounced difference in Glucose between the two prediction classes.
for col in ['BMI', 'BloodPressure']:
    plt.hist(df[df['Outcome']==0][col], 10, alpha=0.5, label='non-diabetes')
    plt.hist(df[df['Outcome']==1][col], 10, alpha=0.5, label='diabetes')
    plt.legend(loc='upper right')
    plt.xlabel(col)
    plt.ylabel('Frequency')
    plt.title('Histogram of {}'.format(col))
    plt.show()
These histograms show the distributions of 'BMI', 'BloodPressure', and 'Glucose' for the two classes (non-diabetes and diabetes).
There seems to be a large jump in 'Glucose' for those who will eventually develop diabetes. To solidify this, we can visualize the correlation matrix in an attempt to quantify the relationships between these variables.
def plot_corr(df, size = 9):
    # assign the correlation matrix to a variable
    corr = df.corr()
    # create the figure and axis; figsize determines the size of the chart
    fig, ax = plt.subplots(figsize = (size, size))
    # draw the correlation matrix as a square color map
    cax = ax.matshow(corr, interpolation = 'nearest')
    # add the color bar
    fig.colorbar(cax)
    # draw the xticks; rotation tilts the column names written above each column
    plt.xticks(range(len(corr.columns)), corr.columns, rotation = 65)
    # draw the yticks
    plt.yticks(range(len(corr.columns)), corr.columns)

# draw the dataframe's correlation matrix using the function
plot_corr(df)

# correlation matrix with the seaborn library
import seaborn as sb
sb.heatmap(df.corr());

# annot = True writes the correlation value in each cell
sb.heatmap(df.corr(), annot = True);
Conclusion: the variables most strongly correlated with Outcome are Glucose, BMI, Age, and Pregnancies.
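This can be checked directly by ranking each variable's correlation with Outcome; a small sketch (not in the original post):

# rank every variable's correlation with Outcome
print(df.corr()["Outcome"].sort_values(ascending = False))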
# proportions of classes 0 and 1 in Outcome
df["Outcome"].value_counts() * 100 / len(df)

0    65.104167
1    34.895833
Name: Outcome, dtype: float64

# counts of classes 0 and 1
df.Outcome.value_counts()

0    500
1    268
Name: Outcome, dtype: int64
# histogram of the Age variable
df["Age"].hist(edgecolor = "black");
# Age, Glucose and BMI means grouped by the Outcome variable
df.groupby("Outcome").agg({"Age": "mean", "Glucose": "mean", "BMI": "mean"})

Data Pre-Processing
Missing Data Analysis
# check for missing data in the dataset
df.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

# zeros in these variables actually mean NA, so NaN is assigned in place of 0
df[['Glucose','BloodPressure','SkinThickness','Insulin','BMI']] = df[['Glucose','BloodPressure','SkinThickness','Insulin','BMI']].replace(0, np.NaN)
At first glance there are no missing values in the dataset, but when the variables are examined, the zeros in Glucose, BloodPressure, SkinThickness, Insulin, and BMI actually represent missing values.
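A quick way to see the scale of the problem is to count the zeros per clinical column; this sketch (not in the original post) re-reads the raw file, since the zeros above have already been replaced:

# count the zeros in each clinical measurement on the raw data
zero_cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
print((pd.read_csv("diabetes.csv")[zero_cols] == 0).sum())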
# missing values after the replacement
df.isnull().sum()

Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64

def median_target(var):
    # select the non-null observations, ignoring the rows that are already filled
    temp = df[df[var].notnull()]
    # group by Outcome and take the median of the variable; reset_index repairs the indices
    temp = temp[[var, 'Outcome']].groupby(['Outcome'])[[var]].median().reset_index()
    return temp
The independent and dependent variables are selected from the dataframe, a groupby operation is applied on the dependent variable (Outcome), and the median of the independent variable is then taken within each group.
# median of Glucose according to Outcome values 0 and 1
median_target("Glucose")

# fill the incomplete observations with the class-specific medians
columns = df.columns
columns = columns.drop("Outcome")
for col in columns:
    # before the comma: filter rows where Outcome is 0 (or 1) and the variable is null;
    # after the comma: select the column to fill
    df.loc[(df['Outcome'] == 0) & (df[col].isnull()), col] = median_target(col)[col][0]
    df.loc[(df['Outcome'] == 1) & (df[col].isnull()), col] = median_target(col)[col][1]
Feature Engineering
# according to BMI, some ranges were determined and categorical labels assigned
NewBMI = pd.Series(["Underweight", "Normal", "Overweight", "Obesity 1", "Obesity 2", "Obesity 3"], dtype = "category")
df["NewBMI"] = NewBMI
df.loc[df["BMI"] < 18.5, "NewBMI"] = NewBMI[0]
df.loc[(df["BMI"] > 18.5) & (df["BMI"] <= 24.9), "NewBMI"] = NewBMI[1]
df.loc[(df["BMI"] > 24.9) & (df["BMI"] <= 29.9), "NewBMI"] = NewBMI[2]
df.loc[(df["BMI"] > 29.9) & (df["BMI"] <= 34.9), "NewBMI"] = NewBMI[3]
df.loc[(df["BMI"] > 34.9) & (df["BMI"] <= 39.9), "NewBMI"] = NewBMI[4]
df.loc[df["BMI"] > 39.9, "NewBMI"] = NewBMI[5]
df.head()

# categorical variable creation according to the insulin value
def set_insulin(row):
    if row["Insulin"] >= 16 and row["Insulin"] <= 166:
        return "Normal"
    else:
        return "Abnormal"

# NewInsulinScore variable added with set_insulin
df["NewInsulinScore"] = df.apply(set_insulin, axis = 1)
df.head()

# some intervals were determined according to the glucose variable and assigned categorical labels
# (note: the "High" category is defined but never assigned by the rules below)
NewGlucose = pd.Series(["Low", "Normal", "Overweight", "Secret", "High"], dtype = "category")
df["NewGlucose"] = NewGlucose
df.loc[df["Glucose"] <= 70, "NewGlucose"] = NewGlucose[0]
df.loc[(df["Glucose"] > 70) & (df["Glucose"] <= 99), "NewGlucose"] = NewGlucose[1]
df.loc[(df["Glucose"] > 99) & (df["Glucose"] <= 126), "NewGlucose"] = NewGlucose[2]
df.loc[df["Glucose"] > 126, "NewGlucose"] = NewGlucose[3]
df.head()
One-Hot Encoding
# categorical variables converted into numerical values with a one-hot encoding transform;
# drop_first = True also protects against the dummy variable trap
df = pd.get_dummies(df, columns = ["NewBMI", "NewInsulinScore", "NewGlucose"], drop_first = True)
df.head()
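To confirm which dummy columns get_dummies created, and that the first category of each variable was dropped, they can be listed (a sketch, not in the original post):

# list the new dummy columns produced by get_dummies
print([c for c in df.columns if c.startswith("New")])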
Variable Standardization
# keep the one-hot encoded categorical variables aside
categorical_df = df[['NewBMI_Obesity 1', 'NewBMI_Obesity 2', 'NewBMI_Obesity 3', 'NewBMI_Overweight', 'NewBMI_Underweight',
                     'NewInsulinScore_Normal', 'NewGlucose_Low', 'NewGlucose_Normal', 'NewGlucose_Overweight', 'NewGlucose_Secret']]

# categorical variables deleted from df
y = df["Outcome"]
X = df.drop(["Outcome", 'NewBMI_Obesity 1', 'NewBMI_Obesity 2', 'NewBMI_Obesity 3', 'NewBMI_Overweight', 'NewBMI_Underweight',
             'NewInsulinScore_Normal', 'NewGlucose_Low', 'NewGlucose_Normal', 'NewGlucose_Overweight', 'NewGlucose_Secret'], axis = 1)
cols = X.columns
index = X.index

y.head()

0    1
1    0
2    1
3    0
4    1
Name: Outcome, dtype: int64

X.head()

# by standardizing the variables in the dataset, the performance of the models is increased
from sklearn.preprocessing import RobustScaler
transformer = RobustScaler().fit(X)
X = transformer.transform(X)
X = pd.DataFrame(X, columns = cols, index = index)
X.head()

# combining non-categorical and categorical variables
X = pd.concat([X, categorical_df], axis = 1)
X.head()
Modelling
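The source does not show the code that produced the cross-validation scores below. A minimal sketch of such a comparison — assuming 10-fold cross-validated accuracy, reusing the classifiers imported above, with scikit-learn's GradientBoostingClassifier standing in under the XGB label — could look like this:

# compare the base models with 10-fold cross-validation (assumed sketch)
models = [("LR", LogisticRegression()),
          ("KNN", KNeighborsClassifier()),
          ("CART", DecisionTreeClassifier()),
          ("RF", RandomForestClassifier()),
          ("SVM", SVC()),
          ("XGB", GradientBoostingClassifier()),
          ("LightGBM", LGBMClassifier())]

for name, model in models:
    kfold = KFold(n_splits = 10, shuffle = True, random_state = 12345)
    cv_results = cross_val_score(model, X, y, cv = kfold, scoring = "accuracy")
    print("%s: %f (%f)" % (name, cv_results.mean(), cv_results.std()))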
LR: 0.847539 (0.032028)
KNN: 0.837235 (0.031427)
CART: 0.838602 (0.026456)
RF: 0.878947 (0.030074)
SVM: 0.848855 (0.035492)
XGB: 0.880297 (0.029243)
LightGBM: 0.885526 (0.035487)
RF, XGB, and LightGBM gave the best results, so we focused on optimizing these models.
Model Optimization

Model Tuning

Random Forests Tuning
rf_params = {"n_estimators": [100, 200, 500, 1000],
             "max_features": [3, 5, 7],
             "min_samples_split": [2, 5, 10, 30],
             "max_depth": [3, 5, 8, None]}

rf_model = RandomForestClassifier(random_state = 12345)

gs_cv = GridSearchCV(rf_model,
                     rf_params,
                     cv = 10,
                     n_jobs = -1,
                     verbose = 2).fit(X, y)

gs_cv.best_params_

{'max_depth': None,
 'max_features': 7,
 'min_samples_split': 5,
 'n_estimators': 500}
Fitting the Final Model
rf_tuned = RandomForestClassifier(**gs_cv.best_params_)
rf_tuned = rf_tuned.fit(X, y)
cross_val_score(rf_tuned, X, y, cv = 10).mean()

0.8867737525632261

feature_imp = pd.Series(rf_tuned.feature_importances_,
                        index = X.columns).sort_values(ascending = False)
sns.barplot(x = feature_imp, y = feature_imp.index, palette = "Blues_d")
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Feature Importance Levels")
plt.show()
XGBoost Tuning
# note: scikit-learn's GradientBoostingClassifier is used here under the XGB label
xgb = GradientBoostingClassifier(random_state = 12345)
xgb_params = {"learning_rate": [0.01, 0.1, 0.2, 1],
              "min_samples_split": np.linspace(0.1, 0.5, 3),
              "max_depth": [3, 5, 8],
              "subsample": [0.5, 0.9, 1.0],
              "n_estimators": [100, 500]}

xgb_cv = GridSearchCV(xgb, xgb_params, cv = 10, n_jobs = -1, verbose = 2).fit(X, y)

xgb_cv.best_params_

{'learning_rate': 0.1,
 'max_depth': 8,
 'min_samples_split': 0.1,
 'n_estimators': 100,
 'subsample': 0.9}
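For readers who want the actual XGBoost library rather than scikit-learn's gradient boosting, an equivalent setup might look like the following sketch (an assumption, not what the post used; XGBClassifier has no min_samples_split parameter, so it is omitted):

# alternative sketch using the xgboost library itself
from xgboost import XGBClassifier
xgb_real = XGBClassifier(learning_rate = 0.1, max_depth = 8, subsample = 0.9,
                         n_estimators = 100, random_state = 12345).fit(X, y)
cross_val_score(xgb_real, X, y, cv = 10).mean()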
Fitting the Final Model
xgb_tuned = GradientBoostingClassifier(**xgb_cv.best_params_).fit(X, y)
cross_val_score(xgb_tuned, X, y, cv = 10).mean()

0.8867737525632263

feature_imp = pd.Series(xgb_tuned.feature_importances_, index = X.columns).sort_values(ascending = False)
sns.barplot(x = feature_imp, y = feature_imp.index, palette = "Blues_d")
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Feature Importance Levels")
plt.show()
LightGBM Tuning
lgbm = LGBMClassifier(random_state = 12345)
lgbm_params = {"learning_rate": [0.01, 0.03, 0.05, 0.1, 0.5],
               "n_estimators": [500, 1000, 1500],
               "max_depth": [3, 5, 8]}

gs_cv = GridSearchCV(lgbm,
                     lgbm_params,
                     cv = 10,
                     n_jobs = -1,
                     verbose = 2).fit(X, y)

gs_cv.best_params_

{'learning_rate': 0.01, 'max_depth': 8, 'n_estimators': 500}
Fitting the Final Model
lgbm_tuned = LGBMClassifier(**gs_cv.best_params_).fit(X, y)
cross_val_score(lgbm_tuned, X, y, cv = 10).mean()

0.8959330143540669

feature_imp = pd.Series(lgbm_tuned.feature_importances_, index = X.columns).sort_values(ascending = False)
sns.barplot(x = feature_imp, y = feature_imp.index, palette = "Blues_d")
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Feature Importance Levels")
plt.show()
Comparison of Final Models
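The code behind this final comparison is likewise not shown in the source; a sketch that would produce output in this format, reusing the tuned models from the sections above, might be:

# compare the tuned models with 10-fold cross-validation (assumed sketch)
final_models = [("RF", rf_tuned), ("XGB", xgb_tuned), ("LightGBM", lgbm_tuned)]
for name, model in final_models:
    kfold = KFold(n_splits = 10, shuffle = True, random_state = 12345)
    cv_results = cross_val_score(model, X, y, cv = kfold, scoring = "accuracy")
    print("%s: %f (%f)" % (name, cv_results.mean(), cv_results.std()))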
RF: 0.886791 (0.028298)
XGB: 0.886757 (0.021597)
LightGBM: 0.892003 (0.033222)
Conclusion
- Machine learning models were established to predict whether people will have diabetes from a variety of variables.
- The 3 classification models that best describe the dataset were selected and compared according to their success rates: Random Forests, XGBoost, and LightGBM.
- As a result of this comparison, the model that best describes the data and gives the best results is LightGBM.
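As an illustration, a minimal sketch (assumed, not from the original post) of scoring a new observation with the winning model:

# predict for the first row of the preprocessed feature matrix
sample = X.iloc[[0]]
print(lgbm_tuned.predict(sample))        # predicted class, 0 or 1
print(lgbm_tuned.predict_proba(sample))  # class probabilities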
You can find the Kaggle link of this project here.
Resources
- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html
- https://www.udemy.com/course/python-egitimi/
- https://github.com/omarozt/MachineLearningWorkshop
- https://www.kaggle.com/ibrahimyildiz/pima-indians-diabetes-pred-0-9078-acc
- https://seaborn.pydata.org/examples/color_palettes.html
- https://www.jonobacon.com/2017/08/06/joining-data-world-advisory-board/
- https://towardsdatascience.com/data-preprocessing-concepts-fa946d11c825
- https://becominghuman.ai/data-preprocessing-a-basic-guideline-c0842b7883fa
- Feature Engineering Made Easy, Sinan Ozdemir and Divya Susarla
Source: https://medium.com/swlh/model-comparison-for-predicting-diabetes-outcomes-ddcd06384743