[Kaggle Beginner Problem 1] Titanic: Machine Learning from Disaster
Original problem:
Start here if...
You're new to data science and machine learning, or looking for a simple intro to the Kaggle prediction competitions.
Competition Description
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.
Practice Skills
- Binary classification
- Python and R basics
Training data:
Features in the training data:
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked

| Feature | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Meaning | Passenger ID | 0 = died, 1 = survived | Socio-economic class (1 = high, 2 = middle, 3 = low) | Passenger name | Sex | Age | Number of siblings/spouses aboard | Number of parents/children aboard | Ticket number | Fare | Cabin number | Port of embarkation |
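To make the columns concrete, here is a minimal sketch of how missing values in such a table might be counted and filled with pandas. The four-row `sample` frame and its values are made up for illustration and are not taken from train.csv; filling `Embarked` with `"S"` is simply one possible default.

```python
import pandas as pd

# Hypothetical four-row sample using a subset of the columns described above;
# the values are invented for illustration, not read from train.csv.
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, None, 38.0, 26.0],
    "Embarked": ["S", "C", None, "Q"],
})

# Count missing values per column.
missing_counts = sample.isna().sum()
print(missing_counts)

# Fill the numeric gap with the column median and the categorical gap
# with an assumed default port.
age_median = sample["Age"].median()
sample["Age"] = sample["Age"].fillna(age_median)
sample["Embarked"] = sample["Embarked"].fillna("S")
print(sample)
```

On the real data, `titanic.info()` or `titanic.isna().sum()` gives the same kind of overview.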
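Approach: load the samples -> compute the count, sum, mean and variance -> fill in missing values (the code below uses the median for Age) -> ... -> cross-validation (split the training data into three parts, train on two of them and test on the remaining one, without touching the original test set; rotate through the three parts and average the scores) -> linear regression -> logistic regression -> random forest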
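The three-way split described above can be sketched on a toy array. This is only an illustration of how scikit-learn's `KFold` assigns indices, assuming six made-up samples and no shuffling; it is not part of the original walkthrough.

```python
import numpy as np
from sklearn.model_selection import KFold

# Six hypothetical samples, chosen so the fold boundaries are easy to read.
X = np.arange(6).reshape(-1, 1)

# Without shuffling, each test fold is a contiguous block of indices.
kf = KFold(n_splits=3)
folds = [(train_idx.tolist(), test_idx.tolist())
         for train_idx, test_idx in kf.split(X)]

for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Each sample lands in the test fold exactly once, so averaging the three fold scores uses every training row for evaluation.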
    # coding=utf-8
    import os
    import warnings

    import numpy as np
    import pandas as pd
    from sklearn import model_selection
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LinearRegression, LogisticRegression

    file_root = os.path.realpath('titanic')
    file_name_test = os.path.join(file_root, "test.csv")
    file_name_train = os.path.join(file_root, "train.csv")

    # Show all columns when printing
    pd.set_option('display.max_columns', None)
    titanic = pd.read_csv(file_name_train)
    data = titanic.describe()  # the count row shows which columns have missing values
    titanic.info()

    # Fill missing Age values with the median
    titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())
    data = titanic.describe()
    print(data)

    # Inspect the values of Sex and encode them as integers
    print("Sex, original values:", titanic['Sex'].unique())
    titanic['Sex'] = titanic['Sex'].map({'male': 0, 'female': 1})
    print("Sex, encoded values:", titanic['Sex'].unique())

    # Inspect the values of Embarked, fill the gaps with 'S', and encode
    print("Embarked, original values:", titanic['Embarked'].unique())
    titanic['Embarked'] = titanic['Embarked'].fillna('S')
    titanic['Embarked'] = titanic['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})
    print("Embarked, encoded values:", titanic['Embarked'].unique())

    # --- Linear regression with manual cross-validation ---
    # Feature columns
    predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
    alg = LinearRegression()
    # titanic.shape is an (m, n) tuple; shape[0] is the number of samples
    print("titanic.shape[0]:", titanic.shape[0])
    # Old API: model_selection.KFold(titanic.shape[0], n_folds=3, random_state=1)
    # random_state only applies when shuffle=True, so it is omitted here
    kf = model_selection.KFold(n_splits=3, shuffle=False)
    predictions = []
    # Iterate over the three folds
    for train, test in kf.split(titanic['Survived']):
        # Training rows for this fold
        train_predictors = titanic[predictors].iloc[train, :]
        # Target values for the training rows
        train_target = titanic['Survived'].iloc[train]
        # Fit the linear regression model
        alg.fit(train_predictors, train_target)
        # Predict on the held-out fold
        test_predictions = alg.predict(titanic[predictors].iloc[test, :])
        predictions.append(test_predictions)

    # np.concatenate((a1, a2, ...), axis=0) joins several arrays in one call
    predictions = np.concatenate(predictions, axis=0)
    # Map the regression output to class labels with a 0.5 threshold
    predictions[predictions > .5] = 1
    predictions[predictions <= .5] = 0
    # Accuracy is the fraction of predictions that match the true labels
    accuracy = (predictions == titanic['Survived'].to_numpy()).mean()
    print("Linear regression: ", accuracy)  # reported output: 0.78...

    # --- Logistic regression ---
    warnings.filterwarnings("ignore")
    alg = LogisticRegression(random_state=1)
    # Score each of the three folds
    scores = model_selection.cross_val_score(alg, titanic[predictors], titanic['Survived'], cv=3)
    print("Logistic regression: ", scores.mean())

    # --- Random forest: build many decision trees and combine their votes ---
    # Each tree considers a random subset of the seven features
    # Parameters: random seed, number of trees, minimum samples needed to split
    # a node, minimum samples per leaf. The original passed min_impurity_split=4,
    # which was removed in scikit-learn 1.0; min_samples_split matches the
    # "minimum number of samples" the comment describes.
    alg = RandomForestClassifier(random_state=1, n_estimators=50, min_samples_split=4, min_samples_leaf=2)
    kf = model_selection.KFold(n_splits=3, shuffle=False)
    scores = model_selection.cross_val_score(alg, titanic[predictors], titanic['Survived'], cv=kf)
    print("Random forest: ", scores.mean())

Video: https://study.163.com/course/courseLearn.htm?courseId=1003551009#/learn/video?lessonId=1004052091&courseId=1003551009
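When a regression output is thresholded at 0.5, accuracy is the share of positions where the thresholded label equals the true label. The sketch below uses made-up scores and labels to show that computation, and contrasts it with summing the matched predictions, which only counts correctly predicted survivors and so understates accuracy.

```python
import numpy as np

# Hypothetical regression outputs and true labels, invented for illustration.
raw_scores = np.array([0.9, 0.2, 0.6, 0.4, 0.7])
true_labels = np.array([1, 0, 0, 0, 1])

# Threshold at 0.5 to turn continuous scores into class labels.
predicted = (raw_scores > 0.5).astype(int)

# Accuracy: fraction of positions where prediction and label agree.
accuracy = (predicted == true_labels).mean()

# Summing the matched predictions counts only the correct 1s,
# so correct 0s (correctly predicted deaths) are ignored.
survivors_only = predicted[predicted == true_labels].sum() / len(predicted)

print("accuracy:", accuracy)
print("sum of matched predictions / n:", survivors_only)
```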