Blog: https://blog.csdn.net/sunyaowu315
Blog index: https://blog.csdn.net/sunyaowu315/article/details/82905347
A House Price Prediction Model Based on the Random Forest Algorithm
# -*- coding: utf-8 -*-
"""
Created on Wed Jul 25 17:17:16 2018@author: yaowu
"""
#==============================================================================
# Import modules    : import the required packages
# Clean the data    : clean the data while studying the variables
# Select variables  : select features via correlation analysis and ANOVA
# Build the model   : fit the model on the processed data
# Evaluate the model: assess the final model quality
#==============================================================================
## 1.1 Import modules
# Import packages commonly used for data analysis and data mining
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from statsmodels.stats.outliers_influence import variance_inflation_factor
# statsmodels is a Python module for statistical models, statistical tests,
# and statistical data exploration (econometrics)
# variance_inflation_factor: the Variance Inflation Factor (VIF) is the ratio of a
# coefficient's variance when multicollinearity is present among the explanatory
# variables to its variance when it is absent
from sklearn.preprocessing import StandardScaler
# sklearn.preprocessing: data preprocessing module
# StandardScaler: removes the mean and scales to unit variance
from sklearn.decomposition import PCA
# PCA: matrix decomposition / dimensionality reduction
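# --- Illustration (not in the original post): the VIF definition above can be
# checked by hand. For feature i, VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from
# regressing feature i on the remaining features. statsmodels fits this regression
# without an intercept, so the columns are centered first to make the two
# computations match. vif_by_hand and _toy are names made up for this sketch.
from sklearn.linear_model import LinearRegression

def vif_by_hand(X, i):
    """VIF of column i of the 2-D array X: regress column i on the others, then 1 / (1 - R^2)."""
    others = [j for j in range(X.shape[1]) if j != i]
    r2 = LinearRegression(fit_intercept=False).fit(X[:, others], X[:, i]).score(X[:, others], X[:, i])
    return 1.0 / (1.0 - r2)

_rng = np.random.RandomState(0)
_toy = np.column_stack([np.arange(20.0),
                        np.arange(20.0) * 2 + _rng.rand(20),  # nearly collinear with column 0
                        _rng.rand(20)])
_toy = _toy - _toy.mean(axis=0)  # center the columns
print([round(vif_by_hand(_toy, i), 2) for i in range(3)])
print([round(variance_inflation_factor(_toy, i), 2) for i in range(3)])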
## 1.2 Examine the variables. After importing the packages, load the dataset,
# look at its basic structure, and then decide on the next processing steps.
# Load the data and print the first few rows to get a feel for it
os.chdir(r'C:\Users\A3\Desktop\2:項(xiàng)目\項(xiàng)目\項(xiàng)目19:基于隨機(jī)森林算法的房屋價(jià)格預(yù)測(cè)模型\房屋價(jià)格預(yù)測(cè)')
data_train = pd.read_csv('train.csv')
data_test = pd.read_csv('test.csv')
print(data_train.head())
print(data_test.head())
# Check the column names and each column's dtype, which helps with later processing
#data_train.columns
#data_train.info()
# Inspect the data types
data_train_dtypes = data_train.dtypes
#print(data_train_dtypes)
'''
#==============================================================================
# Analyze the dependent variable in detail: look at its summary statistics,
# including skewness and kurtosis.
# Kurtosis describes how steep or flat the distribution of a variable is,
# relative to the normal distribution (i.e., excess kurtosis):
#   Kurtosis = 0  as peaked as the normal distribution.
#   Kurtosis > 0  more peaked than the normal distribution (leptokurtic).
#   Kurtosis < 0  flatter than the normal distribution (platykurtic).
# Formula (excess kurtosis): β = M_4 / σ^4 - 3, where M_4 is the fourth central
# moment and σ is the standard deviation.
# Skewness describes the asymmetry of a variable's distribution:
#   Skewness = 0  as symmetric as the normal distribution.
#   Skewness > 0  right-skewed (positive skew); the long tail is on the right.
#   Skewness < 0  left-skewed (negative skew); the long tail is on the left.
# Formula (Pearson's mode skewness): S = (mean - mode) / σ.
# The larger |Skewness| is, the more the distribution deviates from symmetry.
#==============================================================================
'''
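# --- Worked example (not in the original post): the moment formulas above can be
# checked against pandas on a tiny series. Note that pandas' .skew() and .kurt()
# apply small-sample bias corrections, so the plain moment estimates below only
# agree with them approximately. _s is a name made up for this sketch.
_s = pd.Series([1.0, 2.0, 2.0, 3.0, 10.0])  # toy, right-skewed data
_mu, _sigma = _s.mean(), _s.std(ddof=0)
print('skewness (moments):', ((_s - _mu) ** 3).mean() / _sigma ** 3)
print('excess kurtosis (moments):', ((_s - _mu) ** 4).mean() / _sigma ** 4 - 3)
print('pandas:', _s.skew(), _s.kurt())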
# Look at the distribution of the dependent variable, SalePrice
sns.distplot(data_train['SalePrice'])
plt.show()
# Visualize the counts of the house-price values
sns.set(style="darkgrid")
price_counts = pd.DataFrame(data_train['SalePrice'].value_counts())
price_counts.columns = ['SalePrice_count']
ax = sns.countplot(x="SalePrice_count", data=price_counts)
plt.show()
# Based on the distribution above, check the skewness and kurtosis of SalePrice
print('SalePrice skewness: %f' % (data_train['SalePrice'].skew()))
print('SalePrice kurtosis: %f' % (data_train['SalePrice'].kurt()))
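# --- Side note (not in the original post): SalePrice is clearly right-skewed. A
# common remedy on this dataset is to log-transform the target before modeling:
print('skewness after log1p: %f' % np.log1p(data_train['SalePrice']).skew())
# The rest of this walkthrough keeps the raw target, as the original does.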
'''
#==============================================================================
# Analyze the missing values in the data. If a variable is missing in more than
# 15% of rows, drop it; fill the remaining variables according to whether they
# are categorical or numerical. The variables with missing values are:
#
# * PoolQC        Pool quality
# * MiscFeature   Miscellaneous feature not covered in other categories (e.g. tennis court, second garage)
# * Alley         Type of alley access to the property (gravel, paved, ...)
# * Fence         Fence quality
# * FireplaceQu   Fireplace quality
# * LotFrontage   Linear feet of street connected to the property
# * GarageFinish  Interior finish of the garage
# * GarageQual    Garage quality
# * GarageType    Garage location
# * GarageYrBlt   Year the garage was built
# * GarageCond    Garage condition
# * BsmtExposure  Refers to walkout or garden level walls
# * BsmtFinType2  Rating of basement finished area (if multiple types)
# * BsmtQual      Evaluates the height of the basement
# * BsmtCond      Evaluates the general condition of the basement
# * BsmtFinType1  Rating of basement finished area
# * MasVnrType    Masonry veneer type
# * MasVnrArea    Masonry veneer area in square feet
# * Electrical    Electrical system
#==============================================================================
'''
# Before any plotting, analyze the missing values in the data
miss_data = data_train.isnull().sum().sort_values(ascending=False)  # number of missing values
total = data_train.isnull().count()                                 # total number of rows
miss_data_tmp = (miss_data / total).sort_values(ascending=False)    # share of missing values
# Format as a percentage
def percent(X):
    return '%.2f%%' % (X * 100)
miss_percent = miss_data_tmp.map(percent)
# Sort by the share of missing values, descending
miss_data_percent = pd.concat([total, miss_percent, miss_data_tmp], axis=1, keys=['total', 'Percent', 'Percent_tmp']).sort_values(by='Percent_tmp', ascending=False)
# Print the variables that have missing values
print(miss_data_percent[miss_data_percent['Percent'] != '0.00%'])
# Drop every variable with more than 15% missing; fill the remaining numerical
# variables with the median and the categorical ones with 'None'.
drop_columns = miss_data_percent[miss_data_percent['Percent_tmp'] > 0.15].index
data_train = data_train.drop(drop_columns, axis=1)
data_test = data_test.drop(drop_columns, axis=1)
# Categorical variables
class_variable = [col for col in data_train.columns if data_train[col].dtypes == 'O']
# Numerical variables
numerical_variable = [col for col in data_train.columns if data_train[col].dtypes != 'O']  # capital letter O
print('Categorical variables: %s' % class_variable, 'Numerical variables: %s' % numerical_variable)
# Fill numerical variables with the median. SalePrice is the last entry of
# numerical_variable and does not exist in the test set, so it is excluded there.
from sklearn.preprocessing import Imputer
# Imputer: missing-value imputation
padding = Imputer(strategy='median')
data_train[numerical_variable] = padding.fit_transform(data_train[numerical_variable])
data_test[numerical_variable[:-1]] = padding.fit_transform(data_test[numerical_variable[:-1]])
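# --- Compatibility note (not in the original post): Imputer was removed in
# scikit-learn 0.22. On newer versions the equivalent sketch would be:
#   from sklearn.impute import SimpleImputer
#   padding = SimpleImputer(strategy='median')
# with the same fit_transform calls as above.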
# Fill categorical variables with 'None'
data_train[class_variable] = data_train[class_variable].fillna('None')
data_test[class_variable] = data_test[class_variable].fillna('None')
# Plot the data to get a rough sense of how each variable relates to the dependent
# variable; later we will use analysis of variance (ANOVA) to select variables.
# Plotting is only for a quick look and for building a simple model; it is not an
# accurate way to select variables. With many variables this approach is slow and
# imprecise, so below we only plot a few chosen variables against the dependent
# variable, and we do not rely on plots for variable selection later.
# CentralAir: central air conditioning
data = pd.concat([data_train['SalePrice'], data_train['CentralAir']], axis=1)
fig = sns.boxplot(x='CentralAir', y="SalePrice", data=data)
plt.title('CentralAir')
plt.show()
# Houses with central air conditioning sell for more

# MSSubClass: the type of dwelling
data = pd.concat([data_train['SalePrice'], data_train['MSSubClass']], axis=1)
fig = sns.boxplot(x='MSSubClass', y="SalePrice", data=data)
plt.title('MSSubClass')
plt.show()
# Dwelling type shows little association with sale price

# MSZoning: zoning classification (e.g. high-density vs. low-density residential);
# one might guess this variable is closely related to sale price
data = pd.concat([data_train['SalePrice'], data_train['MSZoning']], axis=1)
fig = sns.boxplot(x='MSZoning', y="SalePrice", data=data)
plt.title('MSZoning')
plt.show()
# In fact the plot shows little relationship with sale price

# LotArea: lot size; one might guess it relates directly to sale price
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(x=data_train['SalePrice'], y=data_train['LotArea'])
plt.xlabel('SalePrice')
plt.ylabel('LotArea')
plt.title('LotArea')
plt.show()
# Lot area is fairly strongly related to sale price

# Street: type of road access
data = pd.concat([data_train['SalePrice'], data_train['Street']], axis=1)
fig = sns.boxplot(x='Street', y="SalePrice", data=data)
plt.title('Street')
plt.show()
# Street type is fairly strongly related to sale price

# LotShape: general shape of the property
data = pd.concat([data_train['SalePrice'], data_train['LotShape']], axis=1)
fig = sns.boxplot(x='LotShape', y="SalePrice", data=data)
plt.title('LotShape')
plt.show()
# Lot shape shows no clear relationship with sale price

# LandContour: flatness of the property
data = pd.concat([data_train['SalePrice'], data_train['LandContour']], axis=1)
fig = sns.boxplot(x='LandContour', y="SalePrice", data=data)
plt.title('LandContour')
plt.show()
# Flatness of the lot is only weakly related to sale price

# Utilities: type of utilities available
data = pd.concat([data_train['SalePrice'], data_train['Utilities']], axis=1)
fig = sns.boxplot(x='Utilities', y="SalePrice", data=data)
plt.title('Utilities')
plt.show()
# Utilities are essentially unrelated to sale price

# LotConfig: lot configuration
data = pd.concat([data_train['SalePrice'], data_train['LotConfig']], axis=1)
fig = sns.boxplot(x='LotConfig', y="SalePrice", data=data)
plt.title('LotConfig')
plt.show()
'''
#==============================================================================
# Because there are many variables, we use a correlation matrix to examine how each
# variable relates to the dependent variable, using the Spearman coefficient. Why:
# the Pearson linear correlation coefficient is only one of many options; to use it,
# the data must be assumed to be drawn pairwise from normal distributions and to be
# at least interval-scaled. When these conditions do not hold, one alternative is
# the Spearman rank correlation coefficient. It is a nonparametric
# (distribution-free) rank statistic, proposed by Spearman in 1904, that measures
# the strength of the association between two variables (Lehmann and D'Abrera 1998).
# It can be used for significance testing and, when the distribution of the data
# makes the Pearson coefficient inapplicable or misleading, serves as a measure of
# the strength of a monotonic relationship between variables.
# Spearman makes no assumptions about the distribution of the raw variables, so it
# applies more broadly; in principle it can be used regardless of the population
# distributions or the sample size.
#==============================================================================
'''
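# --- Illustration (not in the original post): on a monotonic but nonlinear
# relationship, Spearman measures the association better than Pearson. _x and _y
# are toy data made up for this sketch.
from scipy.stats import pearsonr, spearmanr
_x = np.arange(1, 21, dtype=float)
_y = np.exp(_x / 4.0)  # monotonic but strongly nonlinear in _x
print('Pearson r:  %.3f' % pearsonr(_x, _y)[0])   # noticeably below 1
print('Spearman r: %.3f' % spearmanr(_x, _y)[0])  # exactly 1: perfectly monotonic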
#==============================================================================
# Variable processing:
#
# Handle the simpler numerical variables first, then the more complex categorical ones.
#
# For the numerical variables, first consider their correlation with the dependent
# variable, then the pairwise correlations between variables, and finally the
# multicollinearity among them.
#
# The categorical variables additionally need to be encoded before the
# correlation analysis.
#==============================================================================
# Draw a heatmap to inspect the relationships among the numerical variables
corrmat = data_train[numerical_variable].corr('spearman')
f, ax = plt.subplots(figsize=(12, 9))
ax.set_xticklabels(corrmat, rotation='horizontal')
sns.heatmap(np.fabs(corrmat), square=False, center=1)
label_y = ax.get_yticklabels()
plt.setp(label_y, rotation=360)
label_x = ax.get_xticklabels()
plt.setp(label_x, rotation=90)
plt.show()
# Compute the correlation of each numerical variable with SalePrice
numerical_variable_corr = data_train[numerical_variable].corr('spearman')
numerical_corr = numerical_variable_corr[numerical_variable_corr['SalePrice'] > 0.5]['SalePrice']
print(numerical_corr.sort_values(ascending=False))
index0 = numerical_corr.sort_values(ascending=False).index
# Also consider the pairwise correlations between the selected variables
print(data_train[index0].corr('spearman'))
#==============================================================================
# Based on the above, the following variables are selected:
# Variable        Correlation with SalePrice
# OverallQual     0.809829
# GrLivArea       0.731310
# GarageCars      0.690711
# YearBuilt       0.652682
# FullBath        0.635957
# TotalBsmtSF     0.602725
# YearRemodAdd    0.571159
# Fireplaces      0.519247
#==============================================================================
# On top of this, check for multicollinearity among these variables
new_numerical = ['OverallQual', 'GrLivArea', 'GarageCars', 'YearBuilt', 'FullBath', 'TotalBsmtSF', 'YearRemodAdd', 'Fireplaces']
X = np.matrix(data_train[new_numerical])
VIF_list = [variance_inflation_factor(X, i) for i in range(X.shape[1])]
print(VIF_list)
# The VIFs show strong multicollinearity, so standardize the data and reduce its
# dimensionality with PCA
Scaler = StandardScaler()
data_train_numerical = Scaler.fit_transform(data_train[new_numerical])
pca = PCA(n_components=7)
newData_train = pca.fit_transform(data_train_numerical)
Scaler = StandardScaler()
data_test_numerical = Scaler.fit_transform(data_test[new_numerical])
pca = PCA(n_components=7)
newData_test = pca.fit_transform(data_test_numerical)
newData_train = pd.DataFrame(newData_train)
# Recompute the VIFs on the transformed data
y = np.matrix(newData_train)
VIF_list = [variance_inflation_factor(y, i) for i in range(y.shape[1])]
print(newData_train, VIF_list)
# Standardization plus PCA has removed the multicollinearity.
# Next, handle the categorical variables.
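# --- Sanity check (not in the original post): n_components=7 is used above without
# justification. The explained variance ratio of the most recently fitted PCA shows
# how much variance each retained component carries, which helps choose the count:
print(pca.explained_variance_ratio_, pca.explained_variance_ratio_.sum())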
# One-way analysis of variance (ANOVA)
from statsmodels.formula.api import ols
# ols: ordinary least squares, used to fit the model behind the ANOVA
from statsmodels.stats.anova import anova_lm
# anova_lm: ANOVA table for a fitted linear model
a = '+'.join(class_variable)
formula = 'SalePrice~ %s' % a
anova_results = anova_lm(ols(formula, data_train).fit())
print(anova_results.sort_values(by='PR(>F)'))
# We want the effect of each single categorical variable on SalePrice, hence
# one-way ANOVA. The smaller the p-value (the PR(>F) column), the stronger the
# variable's effect on the target; we usually keep only variables with p < 0.05.
# Remove variables with p > 0.05 from the variable list and from the data
del_var = list(anova_results[anova_results['PR(>F)'] > 0.05].index)
print(del_var)
# Remove them from the list of categorical variables
for each in del_var:
    class_variable.remove(each)
# Drop their columns from the data
data_train = data_train.drop(del_var, axis=1)
data_test = data_test.drop(del_var, axis=1)
# Encode the categorical variables: rank each category by its mean SalePrice
def factor_encode(data):
    map_dict = {}
    for each in data.columns[:-1]:
        piv = pd.pivot_table(data, values='SalePrice', index=each, aggfunc='mean')
        piv = piv.sort_values(by='SalePrice')
        piv['rank'] = np.arange(1, piv.shape[0] + 1)
        map_dict[each] = piv['rank'].to_dict()
    return map_dict

# Call the function above to build the encoding for the nominal features
class_variable.append('SalePrice')
map_dict = factor_encode(data_train[class_variable])
for each_fea in class_variable[:-1]:
    data_train[each_fea] = data_train[each_fea].replace(map_dict[each_fea])
    data_test[each_fea] = data_test[each_fea].replace(map_dict[each_fea])
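# --- Illustration (not in the original post): the learned mapping can be inspected
# per feature. For example, assuming ExterQual survived the ANOVA filter above,
# this prints each of its categories with the rank of its mean SalePrice:
print(map_dict.get('ExterQual'))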
#因?yàn)樯厦嬉呀?jīng)完成編碼,這里我們?cè)俑鶕?jù)相關(guān)性判斷和選擇變量
class_coding_corr=data_train[class_variable].corr('spearman')['SalePrice'].sort_values(ascending=False)
print(class_coding_corr[class_coding_corr>0.5])
class_0=class_coding_corr[class_coding_corr>0.5].index
data_train[class_0].corr('spearman')
#==============================================================================
# SalePrice Neighborhood ExterQual BsmtQual KitchenQual GarageFinish GarageType Foundation
# SalePrice 1.000000 0.755779 0.684014 0.678026 0.672849 0.633974 0.598814 0.573580
# Neighborhood 0.755779 1.000000 0.641588 0.650639 0.576106 0.542172 0.520204 0.584784
# ExterQual 0.684014 0.641588 1.000000 0.645766 0.725266 0.536103 0.444759 0.609009
# BsmtQual 0.678026 0.650639 0.645766 1.000000 0.575112 0.555535 0.468710 0.669723
# KitchenQual 0.672849 0.576106 0.725266 0.575112 1.000000 0.480438 0.412784 0.546736
# GarageFinish 0.633974 0.542172 0.536103 0.555535 0.480438 1.000000 0.663870 0.516078
# GarageType 0.598814 0.520204 0.444759 0.468710 0.412784 0.663870 1.000000 0.445793
# Foundation 0.573580 0.584784 0.609009 0.669723 0.546736 0.516078 0.445793 1.000000
#==============================================================================
# After checking the pairwise collinearity, we keep the variables Neighborhood,
# ExterQual, BsmtQual, GarageFinish, GarageType, and Foundation (KitchenQual is
# dropped, presumably because of its high pairwise correlation, 0.73, with ExterQual);
# next, check the multicollinearity among them
class_variable = ['Neighborhood', 'ExterQual', 'BsmtQual', 'GarageFinish', 'GarageType', 'Foundation']
X = np.matrix(data_train[class_variable])
VIF_list = [variance_inflation_factor(X, i) for i in range(X.shape[1])]
print(VIF_list)
#==============================================================================
# [9.613821793463067,
# 31.39188664149662,
# 33.53637481741086,
# 22.788724064203514,
# 18.362389057630754,
# 15.022566626297733]
#==============================================================================
Scaler = StandardScaler()
data_train_class = Scaler.fit_transform(data_train[class_variable])
pca = PCA(n_components=3)
newData_train_class = pca.fit_transform(data_train_class)
print(newData_train_class)
#==============================================================================
# array([[-1.77187082, 0.39144961, -0.00666812],
# [-0.56662675, -0.71029462, -0.51616971],
# [-1.77187082, 0.39144961, -0.00666812],
# ...,
# [-1.53107491, 0.09189354, -1.97872919],
# [ 1.08133883, -0.73280347, -0.2622237 ],
# [-0.15886914, -1.39847287, -0.0633631 ]])
#==============================================================================
Scaler = StandardScaler()
data_test_class = Scaler.fit_transform(data_test[class_variable])
pca = PCA(n_components=3)
newData_test_class = pca.fit_transform(data_test_class)
print(newData_test_class)
#==============================================================================
# array([[ 1.02960083, -0.69269663, 0.29882836],
# [ 1.02960083, -0.69269663, 0.29882836],
# [-1.36691619, -0.77507848, -1.26800226],
# ...,
# [ 1.62092967, 0.49991231, 0.28578798],
# [ 1.25970992, 2.79139405, -1.00467437],
# [-1.16601113, -0.81964858, -1.47794992]])
#==============================================================================
newData_train_class = pd.DataFrame(newData_train_class)
y = np.matrix(newData_train_class)
VIF_list = [variance_inflation_factor(y, i) for i in range(y.shape[1])]
print(VIF_list)
#==============================================================================
# [1.0, 0.9999999999999993, 0.9999999999999996]
#==============================================================================
# Training set
newData_train_class = pd.DataFrame(newData_train_class)
newData_train_class.columns = ['cat_pca_A', 'cat_pca_B', 'cat_pca_C']
newData_train = pd.DataFrame(newData_train)
newData_train.columns = ['num_pca_A', 'num_pca_B', 'num_pca_C', 'num_pca_D', 'num_pca_E', 'num_pca_F', 'num_pca_G']
target = data_train['SalePrice']
target = pd.DataFrame(target)
train = pd.concat([newData_train_class, newData_train], axis=1, ignore_index=True)
# Test set
newData_test_class = pd.DataFrame(newData_test_class)
newData_test_class.columns = ['cat_pca_A', 'cat_pca_B', 'cat_pca_C']
newData_test = pd.DataFrame(newData_test)
newData_test.columns = ['num_pca_A', 'num_pca_B', 'num_pca_C', 'num_pca_D', 'num_pca_E', 'num_pca_F', 'num_pca_G']
test = pd.concat([newData_test_class, newData_test], axis=1, ignore_index=True)
from sklearn.model_selection import train_test_split
# train_test_split randomly splits a dataset into training and test subsets and
# returns the split samples and labels
from sklearn.ensemble import RandomForestRegressor
# random forest regressor
from sklearn.linear_model import LogisticRegression
# logistic regression (a classifier; used here only for comparison)
from sklearn import svm
# support vector machines
train_data, test_data, train_target, test_target = train_test_split(train, target, test_size=0.2, random_state=0)
# Train first with the default parameters
m = RandomForestRegressor()
m.fit(train_data, train_target)
from sklearn.metrics import r2_score
score = r2_score(test_target,m.predict(test_data))
print(score)
#==============================================================================
# 0.8407489786565608
# D:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:4: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
# after removing the cwd from sys.path.
#==============================================================================
lr = LogisticRegression(C=1000.0,random_state=0)
lr.fit(train_data, train_target)
from sklearn.metrics import r2_score
score = r2_score(test_target,lr.predict(test_data))
print(score)
#==============================================================================
# D:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py:578: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
# y = column_or_1d(y, warn=True)
# 0.6380328421527758
#==============================================================================
clf = svm.SVC(kernel = 'poly')
clf.fit(train_data, train_target)
score = r2_score(test_target,clf.predict(test_data))
print(score)
#==============================================================================
# D:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py:578: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
# y = column_or_1d(y, warn=True)
# 0.775319597309544
#==============================================================================
#==============================================================================
# Conclusion: logistic regression and the SVC perform poorly here, even after
# regularization, PCA, and removal of multicollinearity. That is to be expected:
# both are classifiers, so they treat the continuous SalePrice as class labels,
# which is not an appropriate formulation for this regression task. For Kaggle
# competitions like this one, gradient-boosted trees (XGBoost) are the usual
# first choice.
# Next, try grid search to see whether it improves the random forest.
#==============================================================================
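# --- Sketch (not in the original post): the XGBoost baseline alluded to above
# might look like the following, assuming the xgboost package is installed
# (left commented out so the script runs without it):
# from xgboost import XGBRegressor
# xgb = XGBRegressor(n_estimators=500, learning_rate=0.05, max_depth=4)
# xgb.fit(train_data, train_target.values.ravel())
# print(r2_score(test_target, xgb.predict(test_data)))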
from sklearn.grid_search import GridSearchCV
# grid search (deprecated location; on scikit-learn >= 0.20 use
# sklearn.model_selection.GridSearchCV instead)
from sklearn.pipeline import Pipeline
#==============================================================================
# D:\ProgramData\Anaconda3\lib\site-packages\sklearn\cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
# "This module will be removed in 0.20.", DeprecationWarning)
# D:\ProgramData\Anaconda3\lib\site-packages\sklearn\grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
# DeprecationWarning)
#==============================================================================
param_grid = {'n_estimators':[1,10,100,200,300,400,500,600,700,800,900,1000,1200],'max_features':('auto','sqrt','log2')}
m = GridSearchCV(RandomForestRegressor(),param_grid)
m=m.fit(train_data,train_target.values.ravel())
print(m.best_score_)
print(m.best_params_)
#==============================================================================
# 0.8642193810202421
# {'max_features': 'sqrt', 'n_estimators': 500}
#==============================================================================
# Plug the best parameters found by the grid search into the model; the model is done
m = RandomForestRegressor(n_estimators=500, max_features='sqrt')
m.fit(train_data, train_target.values.ravel())
predict = m.predict(test)
test_id = pd.read_csv('test.csv')['Id']
sub = pd.DataFrame()
sub['Id'] = test_id
sub['SalePrice'] = pd.Series(predict)
sub.to_csv('Predictions.csv', index=False)
print('finished!')