A summary of common data preprocessing in scikit-learn machine learning
Data is king → data preprocessing and dataset construction
from IPython.display import Image
%matplotlib inline

# Added version check for recent scikit-learn (0.18+)
from distutils.version import LooseVersion as Version
from sklearn import __version__ as sklearn_version

1. Handling missing values
import pandas as pd
from io import StringIO

csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''

# If you are using Python 2.7, you need
# to convert the string to unicode first:
# csv_data = unicode(csv_data)

df = pd.read_csv(StringIO(csv_data))
df
df.isnull().sum()

2. Dropping samples or features with many missing values

df.dropna()        # drops rows (the default axis)
df.dropna(axis=1)  # drops columns

# only drop rows where all columns are NaN
df.dropna(how='all')

# drop rows that have fewer than 4 non-NaN values
df.dropna(thresh=4)

# only drop rows where NaN appears in specific columns (here: 'C')
df.dropna(subset=['C'])

3. Imputing missing values
from sklearn.preprocessing import Imputer

imr = Imputer(missing_values='NaN', strategy='mean', axis=0)
imr = imr.fit(df)
imputed_data = imr.transform(df.values)
imputed_data
df.values
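Note that Imputer was deprecated in scikit-learn 0.20 and later removed in favor of SimpleImputer. A minimal equivalent sketch, assuming the same df as above:

import numpy as np
from sklearn.impute import SimpleImputer

# mean-impute each column, as Imputer(strategy='mean', axis=0) did
imr = SimpleImputer(missing_values=np.nan, strategy='mean')
imputed_data = imr.fit_transform(df.values)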
4. Handling categorical data

import pandas as pd

df = pd.DataFrame([['green', 'M', 10.1, 'class1'],
                   ['red', 'L', 13.5, 'class2'],
                   ['blue', 'XL', 15.3, 'class1']])
df.columns = ['color', 'size', 'price', 'classlabel']
df

5. Mapping ordinal features
size_mapping = {'XL': 3, 'L': 2, 'M': 1}

df['size'] = df['size'].map(size_mapping)
df

inv_size_mapping = {v: k for k, v in size_mapping.items()}
df['size'].map(inv_size_mapping)

6. Encoding class labels
import numpy as np

class_mapping = {label: idx for idx, label
                 in enumerate(np.unique(df['classlabel']))}
class_mapping

df['classlabel'] = df['classlabel'].map(class_mapping)
df

inv_class_mapping = {v: k for k, v in class_mapping.items()}
df['classlabel'] = df['classlabel'].map(inv_class_mapping)
df

from sklearn.preprocessing import LabelEncoder

class_le = LabelEncoder()
y = class_le.fit_transform(df['classlabel'].values)
y
class_le.inverse_transform(y)

7. One-hot encoding for nominal features
X = df[['color', 'size', 'price']].values

color_le = LabelEncoder()
X[:, 0] = color_le.fit_transform(X[:, 0])
X

array([[1, 1, 10.1],
       [2, 2, 13.5],
       [0, 3, 15.3]], dtype=object)
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(categorical_features=[0])
ohe.fit_transform(X).toarray()

array([[  0. ,   1. ,   0. ,   1. ,  10.1],
       [  0. ,   0. ,   1. ,   2. ,  13.5],
       [  1. ,   0. ,   0. ,   3. ,  15.3]])
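The categorical_features argument was deprecated in scikit-learn 0.20 and removed in 0.22. On newer versions the same result can be obtained with a ColumnTransformer; a minimal sketch, assuming the X built above:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# one-hot encode column 0 ('color'), pass 'size' and 'price' through unchanged;
# sparse_threshold=0 forces a dense array, like .toarray() above
ct = ColumnTransformer([('onehot', OneHotEncoder(), [0])],
                       remainder='passthrough',
                       sparse_threshold=0)
ct.fit_transform(X)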
pd.get_dummies(df[['price', 'color', 'size']])
8. Scaling continuous features
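The scaling and feature-selection snippets below come from a notebook that works on the UCI Wine dataset, so they assume df_wine, X_train, X_test, y_train, y_test already exist. A minimal preparation sketch (the 70/30 split follows that notebook; the column names follow the UCI dataset description):

import pandas as pd
from sklearn.model_selection import train_test_split

df_wine = pd.read_csv('https://archive.ics.uci.edu/ml/'
                      'machine-learning-databases/wine/wine.data',
                      header=None)
df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
                   'Alcalinity of ash', 'Magnesium', 'Total phenols',
                   'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
                   'Color intensity', 'Hue',
                   'OD280/OD315 of diluted wines', 'Proline']

# the first column is the class label, the rest are features
X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)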
from sklearn.preprocessing import MinMaxScaler

mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)
X_test_norm = mms.transform(X_test)
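Min-max scaling squeezes every feature into [0, 1]; standardization (zero mean, unit variance) is often preferred because it is less sensitive to outliers. A minimal sketch with StandardScaler, reusing the same split:

from sklearn.preprocessing import StandardScaler

stdsc = StandardScaler()
# fit on the training data only, then apply the same parameters to the test data
X_train_std = stdsc.fit_transform(X_train)
X_test_std = stdsc.transform(X_test)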
 
ex = pd.DataFrame([0, 1, 2, 3, 4, 5])

# standardize
ex[1] = (ex[0] - ex[0].mean()) / ex[0].std(ddof=0)

# Please note that pandas uses ddof=1 (sample standard deviation)
# by default, whereas NumPy's std method and StandardScaler
# use ddof=0 (population standard deviation)

# normalize
ex[2] = (ex[0] - ex[0].min()) / (ex[0].max() - ex[0].min())

ex.columns = ['input', 'standardized', 'normalized']
ex
9. Feature selection
One option is selection via the truncation effect of L1 regularization: the weights of unimportant features are driven exactly to zero, so the coefficient matrix becomes sparse, as sketched below.
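A minimal sketch of L1-based selection with logistic regression, assuming the standardized Wine split from above (C=0.1 is an arbitrary regularization strength):

from sklearn.linear_model import LogisticRegression

# penalty='l1' requires a solver that supports it (liblinear or saga)
lr = LogisticRegression(penalty='l1', C=0.1, solver='liblinear')
lr.fit(X_train_std, y_train)
lr.coef_  # many coefficients are exactly zero, i.e. those features are dropped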
 
Another option is to rank features by random forest importances:

import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

feat_labels = df_wine.columns[1:]

forest = RandomForestClassifier(n_estimators=10000,
                                random_state=0,
                                n_jobs=-1)
forest.fit(X_train, y_train)
importances = forest.feature_importances_
indices = np.argsort(importances)[::-1]

for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30,
                            feat_labels[indices[f]],
                            importances[indices[f]]))

plt.title('Feature Importances')
plt.bar(range(X_train.shape[1]),
        importances[indices],
        color='lightblue',
        align='center')
plt.xticks(range(X_train.shape[1]),
           feat_labels[indices], rotation=90)
plt.xlim([-1, X_train.shape[1]])
plt.tight_layout()
#plt.savefig('./random_forest.png', dpi=300)
plt.show()

if Version(sklearn_version) < '0.18':
    X_selected = forest.transform(X_train, threshold=0.15)
else:
    from sklearn.feature_selection import SelectFromModel
    sfm = SelectFromModel(forest, threshold=0.15, prefit=True)
    X_selected = sfm.transform(X_train)

X_selected.shape

Now, let's print the 3 features that met the threshold criterion we set earlier (note that this snippet does not appear in the actual book but was added to the notebook later for illustrative purposes):
for f in range(X_selected.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30,
                            feat_labels[indices[f]],
                            importances[indices[f]]))
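This loop relies on the selected features also being the top-ranked ones. A more direct alternative, assuming the SelectFromModel branch above was taken, is to read the selection mask off the fitted selector:

# recover the selected column names directly
mask = sfm.get_support()
print(feat_labels[mask])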
Summary

The above walks through the data preprocessing steps most commonly needed before modeling: handling and imputing missing values, encoding ordinal and nominal categorical features, scaling continuous features, and selecting features via L1 regularization or random forest importances.
 
                            