Feature Selection for Machine Learning with Python: Selecting and Visualizing Features

Published: 2023/12/4 python 25 豆豆


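Without doubt, the most important skill in solving a machine learning problem is the ability to choose appropriate features, and even to create new ones. These are known as feature selection and feature engineering. In my view, feature selection has two aspects:

1) Use algorithms already available in Python to select features.

2) Manually analyze the relationship between each feature and the target value, including with charts and other visual tools, and drop features that are meaningless or unimportant so that the model becomes leaner and more efficient.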

I. Tree-based algorithms in scikit-learn

from sklearn import metrics

from sklearn.ensemble import ExtraTreesClassifier

model = ExtraTreesClassifier()

model.fit(X, y)

# display the relative importance of each attribute

print(model.feature_importances_)
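The snippet above assumes that X and y already exist. As a minimal, self-contained sketch, the following uses synthetic data from make_classification (an assumption made for illustration, since the original does not say which dataset was used) and orders the features by importance. The sample sizes, n_estimators and random_state values are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in for X, y (hypothetical data, not from the original post)
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# feature_importances_ has one non-negative entry per column and sums to 1
importances = model.feature_importances_

# Feature indices ordered from most to least important
ranking = np.argsort(importances)[::-1]
for idx in ranking:
    print(f"feature {idx}: {importances[idx]:.3f}")
```

Features near the bottom of this ranking are candidates for removal, although the cut-off is a judgment call.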

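II. The RFE search algorithm

Another approach searches efficiently over subsets of features to find the best one, meaning the subset on which the trained model achieves the best quality. Recursive feature elimination (RFE) is one such search algorithm, and scikit-learn provides it as well.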
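To make the idea of recursive elimination concrete, here is a rough sketch of the loop that RFE automates: fit the model, drop the feature whose coefficient is closest to zero, and repeat until the desired number of features remains. The helper name manual_rfe and the synthetic data are assumptions made for illustration; in practice scikit-learn's own RFE class is the one to use.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def manual_rfe(X, y, n_keep):
    """Hypothetical helper: drop the weakest feature one at a time."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        model = LogisticRegression(max_iter=1000)
        model.fit(X[:, remaining], y)
        # For a binary problem coef_ has shape (1, n_remaining)
        weights = np.abs(model.coef_).ravel()
        # Remove the feature whose coefficient is closest to zero
        del remaining[int(np.argmin(weights))]
    return remaining


# Synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)
selected = manual_rfe(X, y, n_keep=3)
print("kept feature indices:", selected)
```

Because the model is refitted after every removal, the weight of a feature is always judged in the context of the features that are still present.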

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# X, y: feature matrix and target, assumed to be defined already
model = LogisticRegression()
# create the RFE model and select 3 attributes
# (n_features_to_select must be passed by keyword in recent scikit-learn versions)
rfe = RFE(model, n_features_to_select=3)
rfe = rfe.fit(X, y)
# summarize the selection of the attributes
print(rfe.support_)
print(rfe.ranking_)

III. Feature selection with LassoCV

import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns  # used by the plots later in the article
from sklearn.linear_model import LassoCV
from sklearn.model_selection import cross_val_score

# Load the train file into a DataFrame
train = pd.read_csv('train.csv', header=0)
df = pd.get_dummies(train.iloc[:, 1:-1])
df = df.fillna(df.mean())
X_train = df
y = train.price

def rmse_cv(model):
    # RMSE of each fold under 3-fold cross-validation
    rmse = np.sqrt(-cross_val_score(model, X_train, y,
                                    scoring="neg_mean_squared_error", cv=3))
    return rmse

# Fit LassoCV, which cross-validates over the candidate alphas.
# cv=3 is set explicitly: it was the default when this was written,
# while recent scikit-learn versions default to 5 folds.
model_lasso = LassoCV(alphas=[0.1, 1, 0.001, 0.0005], cv=3).fit(X_train, y)
# The regularization parameter alpha chosen by the model
print(model_lasso.alpha_)
# Coefficient (weight) of each feature column; 0 means the model dropped that feature
print(model_lasso.coef_)
# Report how many features the model kept and how many it eliminated
coef = pd.Series(model_lasso.coef_, index=X_train.columns)
print("Lasso picked " + str(sum(coef != 0)) + " variables and eliminated the other " +
      str(sum(coef == 0)) + " variables")
# Mean RMSE over the 3 folds at the chosen alpha
print(rmse_cv(model_lasso).mean())
# Plot the coefficients, taking the 3 most negative and the 3 most positive as an example
imp_coef = pd.concat([coef.sort_values().head(3),
                      coef.sort_values().tail(3)])

matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)

imp_coef.plot(kind = "barh")

plt.title("Coefficients in the Lasso Model")

plt.show()

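As the code above shows, a feature whose weight is 0 has been dropped by the model, which is how Lasso performs feature selection. The plot also shows at a glance which features matter most. Features with negative weights need further analysis.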
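The code above depends on a local train.csv. As a minimal sketch of the same idea that runs on its own, the following fits LassoCV to synthetic data from make_regression and separates kept features from dropped ones. The data, the column names f0 to f19 and the parameter values are all assumptions made for illustration. On negative weights: the sign only gives the direction of the effect (the prediction falls as the feature rises, with the other features held fixed), so ranking by absolute value is one way to compare features, provided they are on comparable scales.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic regression data: 20 features, only 5 of which drive the target
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)
columns = [f"f{i}" for i in range(X.shape[1])]  # hypothetical column names

model_lasso = LassoCV(cv=3, random_state=0).fit(X, y)
coef = pd.Series(model_lasso.coef_, index=columns)

kept = coef[coef != 0]
dropped = coef[coef == 0]
print("chosen alpha:", model_lasso.alpha_)
print(f"kept {len(kept)} features, dropped {len(dropped)}")
# Rank the surviving features by the size of their weight, ignoring sign
print(kept.abs().sort_values(ascending=False))
```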

LassoCV reference:

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html#sklearn.linear_model.LassoCV

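IV. Analyzing features and their relationships with charts

1) Examine the distribution of feature values. If there are abnormally large or small values, the extremes can be clipped.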

plt.figure(figsize=(8,6))

plt.scatter(range(train.shape[0]), np.sort(train.price_doc.values))

plt.xlabel('index', fontsize=12)

plt.ylabel('price', fontsize=12)

plt.show()

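The plot shows that the target price_doc has a few abnormally large values scattered away from the rest. Individual extreme values like these can interfere with model fitting, so they can be clipped. The same clipping can be applied to the other feature columns.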

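# Clip extreme values, which can often be treated as outliers that distort the model's parameters
ulimit = np.percentile(train.price_doc.values, 99)
llimit = np.percentile(train.price_doc.values, 1)
# Series.clip caps values above ulimit and below llimit; it replaces the
# original .ix assignments, since .ix has been removed from pandas
train['price_doc'] = train['price_doc'].clip(lower=llimit, upper=ulimit)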
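Since the same clipping can be applied to other feature columns, it may help to wrap it in a small function. The sketch below is one possible way to do that; the helper name clip_by_percentile, its default percentiles and the synthetic data are assumptions, not part of the original.

```python
import numpy as np
import pandas as pd


def clip_by_percentile(frame, columns, lower_pct=1, upper_pct=99):
    """Hypothetical helper: return a copy with each column clipped to its percentiles."""
    clipped = frame.copy()
    for col in columns:
        llimit, ulimit = np.percentile(frame[col].values, [lower_pct, upper_pct])
        clipped[col] = frame[col].clip(lower=llimit, upper=ulimit)
    return clipped


# Synthetic data with one obvious outlier per column (illustrative only)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "price_doc": np.append(rng.normal(100.0, 10.0, 999), 10_000.0),
    "full_sq": np.append(rng.normal(50.0, 5.0, 999), 5_000.0),
})
result = clip_by_percentile(demo, ["price_doc", "full_sq"])
print(demo.max())
print(result.max())
```

Returning a copy leaves the raw data untouched, which makes it easy to compare the distributions before and after clipping.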

2) Grouped analysis

# Group by LotFrontage and take the mean of MSSubClass within each group;
# reset_index() turns the grouped result back into a DataFrame
grouped_df = df.groupby('LotFrontage')['MSSubClass'].mean().reset_index()
plt.figure(figsize=(12,8))
# recent seaborn versions require x and y to be passed as keyword arguments
sns.barplot(x=grouped_df.LotFrontage.values, y=grouped_df.MSSubClass.values, alpha=0.9,
            color='red')

plt.ylabel('MSSubClass', fontsize=12)

plt.xlabel('LotFrontage', fontsize=12)

plt.xticks(rotation='vertical')

plt.show()

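This kind of plot shows how the target value changes across groups. For house prices, for example, you can group by year and plot the mean price per year, which shows roughly how time affects prices. If house prices in Beijing rise every year, that indicates time is an important feature.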
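The year-by-year idea can be sketched with a few made-up rows. The column names year and price and the values are hypothetical, chosen only to show the grouping and the year-over-year change.

```python
import pandas as pd

# Hypothetical sale records; the column names are made up for illustration
sales = pd.DataFrame({
    "year": [2015, 2015, 2016, 2016, 2017],
    "price": [100.0, 120.0, 130.0, 150.0, 170.0],
})

# Mean price per year, returned as a regular DataFrame
yearly = sales.groupby("year")["price"].mean().reset_index()
print(yearly)

# Year-over-year change in the mean; consistently positive values suggest a time trend
yearly["change"] = yearly["price"].diff()
print(yearly)
```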

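3) Count how often each data type occurs in the dataset

# List the dtype of each column of df and use reset_index() to get a DataFrame
# with two columns: 1) column name, 2) dtype
df_type = df.dtypes.reset_index()
# Rename the two columns
df_type.columns = ["Count", "Column Type"]
# Group by dtype and count how many columns have each type
df_type = df_type.groupby("Column Type").aggregate('count').reset_index()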
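A shorter route to the same counts, offered as an alternative and not as the original author's method, is to call value_counts() directly on dtypes. The small DataFrame here is made up for illustration.

```python
import pandas as pd

demo = pd.DataFrame({
    "area": [50.5, 60.0, 72.3],
    "price": [1.0, 2.0, 3.0],
    "rooms": pd.Series([2, 3, 3], dtype="int64"),
})

# One row per dtype, with the number of columns of that dtype
dtype_counts = demo.dtypes.value_counts()
print(dtype_counts)

# Plain dict keyed by dtype name, convenient for further checks
counts_by_name = {str(dtype): int(n) for dtype, n in dtype_counts.items()}
print(counts_by_name)
```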

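4) Show missing values as a chart

# Count missing values per column, giving two columns: 1) column name, 2) number of missing values
missing_df = df.isnull().sum(axis=0).reset_index()
# Assign new column names
missing_df.columns = ['column_name', 'missing_count']
# Keep only the columns that have missing values (.loc replaces the removed .ix indexer)
missing_df = missing_df.loc[missing_df['missing_count'] > 0]
# Sort by missing count
missing_df = missing_df.sort_values(by='missing_count', ascending=True)
# Plot the missing counts as a horizontal bar chart
ind = np.arange(missing_df.shape[0])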

width = 0.9

fig, ax = plt.subplots(figsize=(12,18))

rects = ax.barh(ind, missing_df.missing_count.values, color='y')

ax.set_yticks(ind)

ax.set_yticklabels(missing_df.column_name.values, rotation='horizontal')

ax.set_xlabel("Count of missing values")

ax.set_title("Number of missing values in each column")

plt.show()
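Raw counts are hard to compare across datasets of different sizes, so the share of missing values per column can be a useful companion to the chart. This is a small sketch on made-up data; the column names are hypothetical.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, 3.0, 4.0],
    "c": [np.nan, 2.0, 3.0, 4.0],
})

# isnull().mean() gives the fraction of missing entries per column
missing_ratio = demo.isnull().mean()
# Keep only columns with at least one missing value, largest share first
missing_ratio = missing_ratio[missing_ratio > 0].sort_values(ascending=False)
print(missing_ratio)
```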

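5) Use a joint distribution plot to analyze how each important feature relates to the target

# First clip the extreme values of the feature
col = "full_sq"
ulimit = np.percentile(train_df[col].values, 99.5)
llimit = np.percentile(train_df[col].values, 0.5)
# Series.clip replaces the original .ix assignments (.ix has been removed from pandas)
train_df[col] = train_df[col].clip(lower=llimit, upper=ulimit)

# Draw the joint distribution plot. jointplot creates its own figure, and
# recent seaborn versions use height instead of size.
g = sns.jointplot(x=np.log1p(train_df.full_sq.values),
                  y=np.log1p(train_df.price_doc.values), height=10)
g.set_axis_labels('Log of Total area in square metre', 'Log of Price', fontsize=12)
plt.show()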

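pearsonr is the Pearson correlation coefficient between the two variables; older seaborn versions print it on the joint plot.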
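If the coefficient is not shown on the plot, it can be computed directly. The sketch below applies the Pearson formula by hand and checks it against np.corrcoef; the synthetic area and price values are assumptions made for illustration.

```python
import numpy as np

# Synthetic area and price (illustrative only)
rng = np.random.default_rng(0)
area = rng.uniform(20.0, 120.0, size=500)
price = area * 1000.0 * rng.lognormal(mean=0.0, sigma=0.2, size=500)

x = np.log1p(area)
y = np.log1p(price)

# Pearson r: covariance of x and y divided by the product of their standard deviations
r_manual = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2))

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)
```

A value close to 1 or -1 indicates a strong linear relationship between the log-transformed variables, while a value near 0 indicates little linear relationship.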

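6) Use pointplot to show the relationship between variables

grouped_df = train_df.groupby('floor')['price_doc'].median().reset_index()
plt.figure(figsize=(12,8))
# color was not defined in the original snippet; the default seaborn palette is assumed here
color = sns.color_palette()
# recent seaborn versions require x and y to be passed as keyword arguments
sns.pointplot(x=grouped_df.floor.values, y=grouped_df.price_doc.values,
              color=color[2])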

plt.ylabel('Median Price', fontsize=12)

plt.xlabel('Floor number', fontsize=12)

plt.xticks(rotation='vertical')

plt.show()

This shows the overall effect of the floor number on price.

Source: python進行機器學習(二)之特征選擇 - 光彩照人 - 博客園, https://www.cnblogs.com/gczr/p/6802948.html

7) Use countplot to show how often each value of a feature occurs

plt.figure(figsize=(12,8))

sns.countplot(x="price", data=df)

plt.ylabel('Count', fontsize=12)

plt.xlabel('Price', fontsize=12)

plt.xticks(rotation='vertical')

plt.show()

This shows how many times each price occurs.
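countplot counts distinct values, so for a continuous quantity such as price most bars may end up with a height of 1. One option, offered as a suggestion and not something the original does, is to bin the values first with pd.cut and count per bin. The bin edges, labels and prices below are made up for illustration.

```python
import pandas as pd

prices = pd.Series([95, 120, 130, 180, 210, 250, 400], name="price")

# Bin edges are arbitrary and chosen only for this illustration
bins = [0, 100, 200, 300, 500]
labels = ["0-100", "100-200", "200-300", "300-500"]
price_band = pd.cut(prices, bins=bins, labels=labels)

# Number of observations per band, in bin order
band_counts = price_band.value_counts().sort_index()
print(band_counts)
```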

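8) Use boxplot to analyze how the maximum floor affects house prices. Look especially at the trend in the median price; this gives only a rough indication.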

plt.figure(figsize=(12,8))

sns.boxplot(x="max_floor", y="price_doc", data=train_df)

plt.ylabel('Median Price', fontsize=12)

plt.xlabel('Max Floor number', fontsize=12)

plt.xticks(rotation='vertical')

plt.show()
