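Data Analysis: Encoding Text Features During Data Cleaning
When preprocessing data, you will sometimes find features whose values are strings. These values must be encoded, and they fall into two main categories:
- The feature values have no relationship to one another, e.g. ['red', 'green', 'blue'].
- The feature values are related by an ordering, e.g. ['Excellent', 'Good', 'Normal', 'Bad'].
The sections below show how to encode each of these two types, using the House Prices data from Kaggle as the example.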
    import pandas as pd

    df = pd.read_csv('./data/train.csv')
    columns = ['MSZoning', 'ExterQual']
    df_used = df[columns]
    print(df_used)

The meaning of the two columns is given below. MSZoning is clearly a feature whose values have no relationship to one another, while ExterQual rates the quality of the exterior material and therefore has graded levels.

MSZoning: Identifies the general zoning classification of the sale.
    A    Agriculture
    C    Commercial
    FV   Floating Village Residential
    I    Industrial
    RH   Residential High Density
    RL   Residential Low Density
    RP   Residential Low Density Park
    RM   Residential Medium Density

ExterQual: Evaluates the quality of the material on the exterior
    Ex   Excellent
    Gd   Good
    TA   Average/Typical
    Fa   Fair
    Po   Poor

I. Feature values with no relationship
Four methods for handling this case follow.
1. pd.get_dummies()
Looking at the source (as quoted from the pandas version used in this article), its purpose is to convert categorical variables into indicator variables.

    def get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False,
                    columns=None, sparse=False, drop_first=False):
        """
        Convert categorical variable into dummy/indicator variables

        Parameters
        ----------
        data : array-like, Series, or DataFrame
        prefix : string, list of strings, or dict of strings, default None
            String to append DataFrame column names
            Pass a list with length equal to the number of columns
            when calling get_dummies on a DataFrame. Alternatively, `prefix`
            can be a dictionary mapping column names to prefixes.
        prefix_sep : string, default '_'
            If appending prefix, separator/delimiter to use. Or pass a
            list or dictionary as with `prefix.`
        dummy_na : bool, default False
            Add a column to indicate NaNs, if False NaNs are ignored.
        columns : list-like, default None
            Column names in the DataFrame to be encoded.
            If `columns` is None then all the columns with
            `object` or `category` dtype will be converted.
        sparse : bool, default False
            Whether the dummy columns should be sparse or not. Returns
            SparseDataFrame if `data` is a Series or if all columns are included.
            Otherwise returns a DataFrame with some SparseBlocks.

            .. versionadded:: 0.16.1
        drop_first : bool, default False
            Whether to get k-1 dummies out of n categorical levels by removing the
            first level.

            .. versionadded:: 0.18.0

        Returns
        -------
        dummies : DataFrame or SparseDataFrame
        """

    df_used = pd.get_dummies(df_used, columns=['MSZoning'])
    print(df_used.head())

      ExterQual  MSZoning_C (all)  MSZoning_FV  MSZoning_RH  MSZoning_RL  MSZoning_RM
    0        Gd               0.0          0.0          0.0          1.0          0.0
    1        TA               0.0          0.0          0.0          1.0          0.0
    2        Gd               0.0          0.0          0.0          1.0          0.0
    3        TA               0.0          0.0          0.0          1.0          0.0
    4        Gd               0.0          0.0          0.0          1.0          0.0

As the output shows, it creates a separate column for each distinct value of the feature and one-hot encodes them. In addition, when a DataFrame is converted directly, each new column name is prefixed with the original column name.
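The `drop_first` parameter from the docstring above can be illustrated with a minimal sketch. The toy DataFrame below is an assumption standing in for the real MSZoning column, and the dtype of the indicator columns (float, integer, or bool) depends on the pandas version.

```python
import pandas as pd

# Toy data standing in for the MSZoning column (values are illustrative only).
toy = pd.DataFrame({"MSZoning": ["RL", "RM", "RL", "FV"]})

# One indicator column per distinct value, each prefixed with "MSZoning_".
full = pd.get_dummies(toy, columns=["MSZoning"])

# drop_first=True keeps k-1 columns by dropping the first level ("FV" here).
reduced = pd.get_dummies(toy, columns=["MSZoning"], drop_first=True)

print(list(full.columns))     # ['MSZoning_FV', 'MSZoning_RL', 'MSZoning_RM']
print(list(reduced.columns))  # ['MSZoning_RL', 'MSZoning_RM']
```

Dropping one level is often useful for linear models, where the full set of indicator columns is perfectly collinear with an intercept.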
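2. sklearn.preprocessing.LabelEncoder
If you are familiar with sklearn, you have probably used sklearn.preprocessing.OneHotEncoder. In older scikit-learn releases (before 0.20), OneHotEncoder could only encode numeric input, whereas LabelEncoder can encode strings. Note that LabelEncoder assigns integers in sorted order of the labels, so the resulting numbers carry no real ranking.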
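To make the contrast concrete, here is a minimal sketch on toy values (an assumption, not the Kaggle data). It assumes scikit-learn 0.20 or later, where OneHotEncoder accepts string categories directly.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Toy values standing in for the MSZoning column.
values = np.array(["RL", "RM", "RL", "FV"])

# LabelEncoder works on a 1-D array and numbers the sorted classes from 0.
le = LabelEncoder()
codes = le.fit_transform(values)
print(le.classes_)                  # ['FV' 'RL' 'RM']
print(codes)                        # [1 2 1 0]
print(le.inverse_transform(codes))  # recovers the original strings

# OneHotEncoder expects 2-D input: one row per sample, one column per feature.
ohe = OneHotEncoder()
onehot = ohe.fit_transform(values.reshape(-1, 1)).toarray()
print(ohe.categories_)
print(onehot)
```

`inverse_transform` is handy for mapping integer codes back to the original labels when inspecting results.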
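    from sklearn.preprocessing import LabelEncoder

    le = LabelEncoder()
    result = le.fit_transform(df_used['MSZoning'])
    df_used['MSZoning'] = result
    print(df_used.head())

In the pandas version used here, this raises a copy warning (SettingWithCopyWarning), because df_used was taken as a slice of df: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
The warning does not affect the result here; creating the slice with df[columns].copy() avoids it. The output is as follows.

       MSZoning ExterQual
    0         3        Gd
    1         3        TA
    2         3        Gd
    3         3        TA
    4         3        Gd
    5         3        TA

3. Using the map function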
Use the built-in enumerate function to assign an index to every distinct value in the column, then replace each value with its index.
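The idea can be sketched on toy data as follows. This variant is an assumption rather than the article's exact code: it sorts the unique values first so that each category receives the same integer on every run, and it assumes the column has no missing values (sorting a mix of strings and NaN would fail).

```python
import pandas as pd

# Toy values standing in for the MSZoning column.
toy = pd.Series(["RL", "RM", "RL", "FV"], name="MSZoning")

# Sort the distinct values so the category-to-integer mapping is reproducible.
mapping = {key: idx for idx, key in enumerate(sorted(toy.unique()))}

# Replace each string with its integer index.
encoded = toy.map(mapping)

print(mapping)           # {'FV': 0, 'RL': 1, 'RM': 2}
print(encoded.tolist())  # [1, 2, 1, 0]
```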
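    map_MSZoning = {key: value for value, key in enumerate(set(df['MSZoning']))}
    df_used['MSZoning'] = df_used['MSZoning'].map(map_MSZoning)
    print(df_used.head())

       MSZoning ExterQual
    0         2        Gd
    1         2        TA
    2         2        Gd
    3         2        TA
    4         2        Gd

The integer assigned to each category depends on the iteration order of the set, so it can differ from one run to the next.

4. Using pd.factorize()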
Unlike pd.get_dummies(), pd.factorize() does not expand one feature into several; it only encodes the values within that feature as integers.
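A minimal sketch of this behaviour on toy data (the values are an assumption, not the Kaggle column). It shows the two return values and how missing values are handled by default.

```python
import numpy as np
import pandas as pd

# Toy values with one missing entry.
toy = pd.Series(["RL", "RM", np.nan, "RL", "FV"])

# By default, codes follow order of first appearance; missing values get -1.
codes, uniques = pd.factorize(toy)
print(codes)          # [ 0  1 -1  0  2]
print(list(uniques))  # ['RL', 'RM', 'FV']

# sort=True numbers the categories in sorted order instead.
codes_sorted, uniques_sorted = pd.factorize(toy, sort=True)
print(codes_sorted)          # [ 1  2 -1  1  0]
print(list(uniques_sorted))  # ['FV', 'RL', 'RM']
```

Keeping `uniques` around makes it possible to translate codes back to the original strings, for example with `uniques[codes[0]]`.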
    def factorize(values, sort=False, order=None, na_sentinel=-1, size_hint=None):
        """
        Encode input values as an enumerated type or categorical variable

        Parameters
        ----------
        values : ndarray (1-d)
            Sequence
        sort : boolean, default False
            Sort by values
        na_sentinel : int, default -1
            Value to mark "not found"
        size_hint : hint to the hashtable sizer

        Returns
        -------
        labels : the indexer to the original array
        uniques : ndarray (1-d) or Index
            the unique values. Index is returned when passed values is Index or
            Series

        note: an array of Periods will ignore sort as it returns an always sorted
        PeriodIndex
        """

    # Keep only the integer codes (element 0); element 1 holds the unique values.
    df['MSZoning'] = pd.factorize(df['MSZoning'])[0]
    print(df['MSZoning'])

II. Feature values with an ordinal relationship
Use the map function with an explicit dictionary, so that the integers preserve the ranking of the levels.

    map_ExterQual = {'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1}
    df_used['ExterQual'] = df_used['ExterQual'].map(map_ExterQual)
    print(df_used.head())

      MSZoning  ExterQual
    0       RL          4
    1       RL          3
    2       RL          4
    3       RL          3
    4       RL          4
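As an alternative to a hand-written dictionary, the same ranking can be expressed with an ordered pandas Categorical. This is a sketch on toy data, not the article's method; the `levels` list is an assumption that mirrors the Po-to-Ex scale described earlier.

```python
import pandas as pd

# Toy values standing in for the ExterQual column.
toy = pd.Series(["Gd", "TA", "Ex", "Fa"], name="ExterQual")

# Levels listed from worst to best; this order defines the ranking.
levels = ["Po", "Fa", "TA", "Gd", "Ex"]
cat = pd.Categorical(toy, categories=levels, ordered=True)

# Codes start at 0, so add 1 to get the same 1..5 scale as the dictionary.
# A value missing from `levels` would get code -1 and therefore rank 0.
rank = pd.Series(cat.codes + 1, name="ExterQual")

print(rank.tolist())  # [4, 3, 5, 2]
```

The ordered Categorical also supports comparisons such as `cat > "TA"`, which can be convenient for filtering by quality.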