
Data Analysis: Encoding Text Features During Data Cleaning

Published: 2023/12/29


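When preprocessing data, you will sometimes find that a feature's values are strings. Such features need to be encoded before use, and they fall into two main categories:

The feature values have no relationship to one another, e.g. ['red', 'green', 'blue'].
The feature values are related by an order, e.g. ['Excellent', 'Good', 'Normal', 'Bad'].

The sections below show how to encode each of these two kinds of data.

The House Price data from Kaggle is used as the example.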

    import pandas as pd

    df = pd.read_csv('./data/train.csv')
    columns = ['MSZoning', 'ExterQual']
    df_used = df[columns]
    print(df_used)

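The meanings of the two columns are listed below. The MSZoning values clearly have no inherent order, whereas ExterQual rates the quality of the exterior material and is therefore graded.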

MSZoning: Identifies the general zoning classification of the sale.

    A     Agriculture
    C     Commercial
    FV    Floating Village Residential
    I     Industrial
    RH    Residential High Density
    RL    Residential Low Density
    RP    Residential Low Density Park
    RM    Residential Medium Density

ExterQual: Evaluates the quality of the material on the exterior

    Ex    Excellent
    Gd    Good
    TA    Average/Typical
    Fa    Fair
    Po    Poor

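I. Feature values with no relationship to one another

Four methods for handling this case are shown below. Each snippet is assumed to start from the original, unencoded df_used.

1. pd.get_dummies()

According to its source, this function converts a categorical variable into dummy/indicator variables.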

    def get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False,
                    columns=None, sparse=False, drop_first=False):
        """
        Convert categorical variable into dummy/indicator variables

        Parameters
        ----------
        data : array-like, Series, or DataFrame
        prefix : string, list of strings, or dict of strings, default None
            String to append DataFrame column names
            Pass a list with length equal to the number of columns
            when calling get_dummies on a DataFrame. Alternatively, `prefix`
            can be a dictionary mapping column names to prefixes.
        prefix_sep : string, default '_'
            If appending prefix, separator/delimiter to use. Or pass a
            list or dictionary as with `prefix.`
        dummy_na : bool, default False
            Add a column to indicate NaNs, if False NaNs are ignored.
        columns : list-like, default None
            Column names in the DataFrame to be encoded.
            If `columns` is None then all the columns with
            `object` or `category` dtype will be converted.
        sparse : bool, default False
            Whether the dummy columns should be sparse or not. Returns
            SparseDataFrame if `data` is a Series or if all columns are included.
            Otherwise returns a DataFrame with some SparseBlocks.

            .. versionadded:: 0.16.1
        drop_first : bool, default False
            Whether to get k-1 dummies out of n categorical levels by removing the
            first level.

            .. versionadded:: 0.18.0

        Returns
        -------
        dummies : DataFrame or SparseDataFrame
        """

    df_used = pd.get_dummies(df_used, columns=['MSZoning'])
    print(df_used.head())

      ExterQual  MSZoning_C (all)  MSZoning_FV  MSZoning_RH  MSZoning_RL  \
    0        Gd               0.0          0.0          0.0          1.0
    1        TA               0.0          0.0          0.0          1.0
    2        Gd               0.0          0.0          0.0          1.0
    3        TA               0.0          0.0          0.0          1.0
    4        Gd               0.0          0.0          0.0          1.0

       MSZoning_RM
    0          0.0
    1          0.0
    2          0.0
    3          0.0
    4          0.0

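As the result shows, get_dummies creates one new column for each distinct value of the feature and one-hot encodes it. In addition, when a DataFrame is converted directly, each new column name is prefixed with the name of the original column.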
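The naming rule and the drop_first option can be seen on a small example. This is a minimal sketch that uses a hypothetical four-row frame (`toy`) instead of the Kaggle file, so the values are illustrative only.

```python
import pandas as pd

# Hypothetical toy data standing in for the two House Price columns.
toy = pd.DataFrame({
    "MSZoning": ["RL", "RM", "RL", "FV"],
    "ExterQual": ["Gd", "TA", "Gd", "Ex"],
})

# One indicator column per distinct MSZoning value, named "<column>_<value>".
encoded = pd.get_dummies(toy, columns=["MSZoning"])
encoded_columns = list(encoded.columns)

# drop_first=True keeps k-1 indicators by dropping the first level.
reduced = pd.get_dummies(toy, columns=["MSZoning"], drop_first=True)
reduced_columns = list(reduced.columns)

print(encoded_columns)
print(reduced_columns)
```

Dropping one level is often useful for linear models, where the full set of indicator columns is perfectly collinear with the intercept.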

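2. sklearn.preprocessing.LabelEncoder

If you are familiar with sklearn, you have probably used sklearn.preprocessing.OneHotEncoder. In scikit-learn versions before 0.20, OneHotEncoder could only encode numeric values, whereas LabelEncoder can encode strings directly. (Since 0.20, OneHotEncoder also accepts string categories.) Note that LabelEncoder produces a single column of integer codes rather than one-hot columns.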

    from sklearn.preprocessing import LabelEncoder

    le = LabelEncoder()
    # Replace each distinct string with an integer code
    result = le.fit_transform(df_used['MSZoning'])
    df_used['MSZoning'] = result
    print(df_used.head())

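This raises a SettingWithCopyWarning, because df_used was created by slicing df: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

The warning does not affect the result here, and it can be avoided by creating df_used with df[columns].copy(). The result is as follows.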

       MSZoning ExterQual
    0         3        Gd
    1         3        TA
    2         3        Gd
    3         3        TA
    4         3        Gd
    5         3        TA
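To see which string each integer stands for, the fitted encoder can be inspected. The sketch below assumes scikit-learn is installed and uses a hypothetical toy frame; it takes an explicit .copy() of the column subset so that the assignment does not trigger the warning.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data; the real article uses the Kaggle train.csv.
toy = pd.DataFrame({
    "MSZoning": ["RL", "RM", "RL", "FV"],
    "ExterQual": ["Gd", "TA", "Gd", "Ex"],
})

# An explicit copy makes the later assignment unambiguous to pandas.
subset = toy[["MSZoning"]].copy()

le = LabelEncoder()
subset["MSZoning"] = le.fit_transform(subset["MSZoning"])

codes = subset["MSZoning"].tolist()
classes = list(le.classes_)  # position in this list == integer code
restored = list(le.inverse_transform(subset["MSZoning"]))

print(codes, classes, restored)
```

LabelEncoder assigns codes in sorted order of the distinct values, so the numbers carry no meaning beyond identity.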

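3. Using map

Use the built-in enumerate function to assign an index to each distinct value in the column, then replace the original values with those indices. Because the iteration order of a set of strings is not stable between Python runs, the specific codes may differ from run to run.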

    # Build {value: index} from the distinct values of the column
    map_MSZoning = {key: value for value, key in enumerate(set(df['MSZoning']))}
    df_used['MSZoning'] = df_used['MSZoning'].map(map_MSZoning)
    print(df_used.head())

       MSZoning ExterQual
    0         2        Gd
    1         2        TA
    2         2        Gd
    3         2        TA
    4         2        Gd
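If reproducible codes matter, one option is to build the mapping from the sorted distinct values instead of a set. This is a minimal sketch on a hypothetical Series, not the article's original code.

```python
import pandas as pd

# Hypothetical stand-in for df['MSZoning'].
values = pd.Series(["RL", "RM", "RL", "FV"])

# Sorting the distinct values fixes the code assigned to each category.
mapping = {cat: code for code, cat in enumerate(sorted(values.dropna().unique()))}
codes = values.map(mapping).tolist()

print(mapping)
print(codes)
```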

4. Using pd.factorize()

Unlike pd.get_dummies(), pd.factorize() does not expand one feature into several. It only encodes the values within that feature as integers.

    def factorize(values, sort=False, order=None, na_sentinel=-1, size_hint=None):
        """
        Encode input values as an enumerated type or categorical variable

        Parameters
        ----------
        values : ndarray (1-d)
            Sequence
        sort : boolean, default False
            Sort by values
        na_sentinel : int, default -1
            Value to mark "not found"
        size_hint : hint to the hashtable sizer

        Returns
        -------
        labels : the indexer to the original array
        uniques : ndarray (1-d) or Index
            the unique values. Index is returned when passed values is Index or
            Series

        note: an array of Periods will ignore sort as it returns an always sorted
        PeriodIndex
        """

    # factorize returns (labels, uniques); keep only the integer labels
    df['MSZoning'] = pd.factorize(df['MSZoning'])[0]
    print(df['MSZoning'])
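The two return values, and the handling of missing values, are easier to see on a small input. The sketch below uses a hypothetical Series with one missing entry and relies only on the default arguments plus sort.

```python
import pandas as pd

# Hypothetical stand-in for df['MSZoning'], with one missing value.
values = pd.Series(["RL", "RM", None, "RL", "FV"])

# Default: codes follow order of first appearance; missing values get -1.
codes, uniques = pd.factorize(values)
codes_list = codes.tolist()
uniques_list = list(uniques)

# sort=True: codes follow the sorted order of the distinct values.
sorted_codes, sorted_uniques = pd.factorize(values, sort=True)
sorted_codes_list = sorted_codes.tolist()
sorted_uniques_list = list(sorted_uniques)

print(codes_list, uniques_list)
print(sorted_codes_list, sorted_uniques_list)
```

Keeping `uniques` around makes it possible to map the integer codes back to the original strings later.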

II. Feature values with a defined relationship

Map each value to its rank with the map function.

    # Higher number means better exterior quality
    map_ExterQual = {'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1}
    df_used['ExterQual'] = df_used['ExterQual'].map(map_ExterQual)
    print(df_used.head())

      MSZoning  ExterQual
    0       RL          4
    1       RL          3
    2       RL          4
    3       RL          3
    4       RL          4
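An alternative to a hand-written dictionary is an ordered pandas Categorical, where the order is declared once as a list. This is a sketch under the assumption that the five ExterQual levels listed earlier are the only ones present; the variable names are hypothetical.

```python
import pandas as pd

# Levels from worst to best, taken from the ExterQual description.
quality_order = ["Po", "Fa", "TA", "Gd", "Ex"]

# Hypothetical stand-in for df_used['ExterQual'].
values = pd.Series(["Gd", "TA", "Ex", "Fa"])

# .codes gives the zero-based position in quality_order; add 1 to match the map.
as_ordered = pd.Categorical(values, categories=quality_order, ordered=True)
ranks = (pd.Series(as_ordered.codes) + 1).tolist()

# The dictionary approach from the article gives the same ranks.
map_ExterQual = {"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "Po": 1}
mapped = values.map(map_ExterQual).tolist()

print(ranks, mapped)
```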
