Feature Engineering: What to Keep and What to Remove
The next step after exploring the patterns in the data is feature engineering. Any operation performed on the features/columns that helps us make a prediction from the data can be termed feature engineering. At a high level, this includes the following:
探索數(shù)據(jù)模式之后的下一步是要素工程。 對(duì)特征/列執(zhí)行的任何可幫助我們根據(jù)數(shù)據(jù)進(jìn)行預(yù)測(cè)的操作都可以稱為特征工程。 這將在高層包括以下內(nèi)容:
Adding new features
Suppose you want to predict sales of ice cream, gloves, or umbrellas. What do these items have in common? Their sales all depend on weather and location. Ice cream sells more in summer or in hotter areas, gloves sell more in colder weather (winter) or colder regions, and we definitely need an umbrella when it rains. So if you have historical sales data for these items, adding the weather and the selling area at each record level would help your model learn the patterns better.
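A minimal sketch of this idea with pandas, using made-up sales and weather tables (the column names and numbers are hypothetical): a left join attaches the external weather feature to every sales record.

```python
import pandas as pd

# Hypothetical daily ice-cream sales, keyed by date and city
sales = pd.DataFrame({
    "date": ["2019-07-01", "2019-07-01", "2019-07-02"],
    "city": ["Dallas", "Austin", "Dallas"],
    "units_sold": [120, 95, 150],
})

# External weather table with the same date/city keys (made-up numbers)
weather = pd.DataFrame({
    "date": ["2019-07-01", "2019-07-01", "2019-07-02"],
    "city": ["Dallas", "Austin", "Dallas"],
    "temp_f": [98, 94, 101],
})

# Left join: keep every sales row and attach the weather feature to it
enriched = sales.merge(weather, on=["date", "city"], how="left")
print(enriched.columns.tolist())  # → ['date', 'city', 'units_sold', 'temp_f']
```

A left join is used so that sales rows with no matching weather record are kept (with `NaN`) rather than silently dropped.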
Eliminating some of the features which tell the same story
For explanation purposes, I made up a sample dataset containing data on different phone brands, like the one below. Let us analyze this data and figure out why we should remove/eliminate some columns.
Image by Author

Now, if we look at this dataset carefully, there is a column for the brand name, a column for the model name, and another column called Phone, which basically contains both the brand and the model name. In this situation we don't need the Phone column, because its data is already present in the other columns, and the split data is better than the aggregated data in this case.
There is another column that adds no value to the dataset: Memory scale. All the memory values are in "GB", so there is no need to keep an additional column that shows no variation across the dataset; a constant column is not going to help our model learn different patterns.
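Both removals can be done in a few lines of pandas. This sketch uses a made-up version of the phone dataset; the redundant Phone column is dropped by name, and the constant Memory scale column is found automatically by counting unique values.

```python
import pandas as pd

# A made-up phone dataset mirroring the example above
df = pd.DataFrame({
    "Brand": ["Apple", "Apple", "Samsung"],
    "Model": ["iPhone 11", "iPhone 11 Pro", "Galaxy S10"],
    "Phone": ["Apple iPhone 11", "Apple iPhone 11 Pro", "Samsung Galaxy S10"],
    "Memory": [64, 256, 128],
    "Memory scale": ["GB", "GB", "GB"],
})

# "Phone" duplicates Brand + Model, so drop it by name
redundant = ["Phone"]
# Columns with a single unique value carry no variation at all
constant = [c for c in df.columns if df[c].nunique() == 1]

df = df.drop(columns=redundant + constant)
print(df.columns.tolist())  # → ['Brand', 'Model', 'Memory']
```

The `nunique() == 1` check generalizes: it will flag any zero-variation column, not just this one.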
Combining several features to create new features
This means we can use two or three features or rows to create a new feature that explains the data better. For example, in the dataset above, some of the features we could create are: the count of phones in each brand, the % share of each phone within its brand, the count of phones available in each memory size, the price per unit of memory, and so on. This will help the model understand the data at a granular level.
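A sketch of these combined features with pandas `groupby(...).transform(...)`, again on made-up phone data (the prices and memory sizes are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "Brand": ["Apple", "Apple", "Samsung", "Samsung"],
    "Model": ["iPhone 11", "iPhone 11 Pro", "Galaxy S10", "Galaxy A50"],
    "Memory": [64, 256, 128, 64],
    "Price": [699, 999, 749, 349],
})

# Count of phones in each brand, broadcast back to every row
df["brand_count"] = df.groupby("Brand")["Model"].transform("count")
# Share of each phone's price within its brand's total
df["price_share"] = df["Price"] / df.groupby("Brand")["Price"].transform("sum")
# Price per unit of memory
df["price_per_gb"] = df["Price"] / df["Memory"]
```

`transform` (rather than plain `agg`) is what keeps the result aligned row by row with the original frame, so the aggregate becomes a per-row feature.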
Breaking down a feature into multiple features
The most common examples in this category are dates and addresses. A date consists of a year, month, and day, say in the form '07/28/2019'. If we break the Date column down into 2019, 7 (July), and 28, it becomes easier to join the table to various other tables, and also easier to manipulate the data, because instead of a date format we now deal with plain numbers, which are a lot easier to work with.
For the same reasons of easier data manipulation and easier joins, we break the address data (721 Main St., Apt 24, Dallas, TX-75432) into street name (721 Main St.), apartment/house number (Apt 24), city (Dallas), state (TX/Texas), and zip code (75432).
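Both breakdowns can be sketched in pandas. The date part uses the `.dt` accessor; the address part assumes the comma-separated layout shown above (real addresses are messier and usually need a proper parser).

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["07/28/2019", "01/03/2020"],
    "Address": ["721 Main St., Apt 24, Dallas, TX-75432",
                "10 Elm Ave., Apt 3, Austin, TX-78701"],
})

# Date → numeric Year / Month / Day columns
dt = pd.to_datetime(df["Date"], format="%m/%d/%Y")
df["Year"], df["Month"], df["Day"] = dt.dt.year, dt.dt.month, dt.dt.day

# Address → components, assuming the exact "street, apt, city, ST-zip" layout
parts = df["Address"].str.split(", ", expand=True)
df["Street"] = parts[0]
df["Apt"] = parts[1]
df["City"] = parts[2]
df[["State", "Zip"]] = parts[3].str.split("-", expand=True)
```

Passing an explicit `format` to `to_datetime` avoids ambiguity between month-first and day-first dates.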
Now that we know what feature engineering is, let's go through some of the techniques we can use to do it. There are many feature-engineering methods out there, but I will discuss some of the most common techniques and practices that I use in my day-to-day problems.
Lags: creating columns that hold previous-timestamp records (sales one day back, sales one month back, etc., depending on the use case). This feature tells the model, for example, what iPhone sales were one day back or two days back. It matters because most machine learning algorithms look at the data row-wise; unless the previous days' records sit in the same row, the model will not be able to learn patterns between current and previous records efficiently.
滯后 -這意味著為以前的時(shí)間戳記錄創(chuàng)建列(根據(jù)用例,返回1天的銷售額,返回1個(gè)月的銷售額等)。 例如,此功能將幫助我們了解1天后,2天后iPhone的銷售情況。這很重要,因?yàn)榇蠖鄶?shù)機(jī)器學(xué)習(xí)算法都是按行查看數(shù)據(jù),除非我們沒(méi)有同一行中的前幾天記錄,該模型將無(wú)法有效地在當(dāng)前日期記錄和以前的日期記錄之間創(chuàng)建模式。
Count of categories: anything as simple as the count of phones in each brand, the count of people buying the iPhone 11 Pro, or the count of people in different age groups buying a Samsung Galaxy vs. an iPhone.
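Category counts are `value_counts` or `groupby(...).size()` in pandas, sketched here on a made-up purchases table:

```python
import pandas as pd

purchases = pd.DataFrame({
    "brand": ["Apple", "Apple", "Samsung", "Apple", "Samsung"],
    "model": ["iPhone 11 Pro", "iPhone 11", "Galaxy S10",
              "iPhone 11 Pro", "Galaxy A50"],
    "age_group": ["18-25", "26-35", "18-25", "36-45", "26-35"],
})

# Count of purchases per brand
brand_counts = purchases["brand"].value_counts()
# Count of buyers per (brand, age group) pair
age_counts = purchases.groupby(["brand", "age_group"]).size()
print(brand_counts["Apple"])  # → 3
```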
Sum / mean / median / cumulative sum / aggregate sum: of any numeric feature such as salary, sales, profit, age, weight, etc.
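These aggregates become row-level features via `transform` and `cumsum`, sketched on a made-up two-brand sales table:

```python
import pandas as pd

df = pd.DataFrame({
    "brand": ["Apple", "Apple", "Samsung", "Samsung"],
    "sales": [100, 200, 150, 50],
})

# Per-brand aggregates attached to every row of that brand
df["brand_total"] = df.groupby("brand")["sales"].transform("sum")
df["brand_mean"] = df.groupby("brand")["sales"].transform("mean")
df["brand_median"] = df.groupby("brand")["sales"].transform("median")
# Running (cumulative) sum within each brand
df["running_sales"] = df.groupby("brand")["sales"].cumsum()
```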
Categorical transformation techniques (replacing values, one-hot encoding, label encoding, etc.): these techniques convert categorical features into numerical encoded values, because some algorithms (like XGBoost) cannot work with categorical features directly. The right technique depends on the number of categories in each column, the number of categorical columns, and so on. To learn more about the different techniques, check this blog and this blog.
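The two most common encodings can be sketched in plain pandas, on a made-up brand column (no extra libraries needed):

```python
import pandas as pd

df = pd.DataFrame({"brand": ["Apple", "Samsung", "Apple", "OnePlus"]})

# One-hot encoding: one 0/1 column per category
one_hot = pd.get_dummies(df["brand"], prefix="brand")

# Label encoding: map each category to an integer code
# (codes follow the sorted category order: Apple=0, OnePlus=1, Samsung=2)
df["brand_code"] = df["brand"].astype("category").cat.codes
print(df["brand_code"].tolist())  # → [0, 2, 0, 1]
```

One-hot suits columns with few categories; label encoding keeps one column but imposes an arbitrary order, which tree-based models tolerate better than linear ones.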
Standardization/normalization techniques (min-max, standard scaler, etc.): some datasets have numerical features that sit on very different scales (kg, $, inches, sq. ft., etc.). For some machine learning methods, like clustering, it is important to bring all the numbers onto one scale (we will discuss clustering more in later blogs, but for now think of it as creating groups of data points in space based on similarity). To know more about this section, check out these blogs: Feature Scaling (Analytics Vidhya), Handling Numerical Data (O'Reilly), Standard Scaler/MinMax Scaler.
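Both scalings are one-liners in pandas (scikit-learn's `MinMaxScaler`/`StandardScaler` wrap the same arithmetic). A sketch on made-up weight and price columns:

```python
import pandas as pd

df = pd.DataFrame({
    "weight_kg": [50.0, 60.0, 70.0, 80.0],
    "price_usd": [100.0, 400.0, 250.0, 900.0],
})

# Min-max scaling: squeeze each column into [0, 1]
minmax = (df - df.min()) / (df.max() - df.min())

# Standard (z-score) scaling: zero mean, unit standard deviation
standard = (df - df.mean()) / df.std()
```

After this, a kilogram column and a dollar column contribute comparably to any distance computation, which is exactly what distance-based methods like clustering need.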
These are some very general methods of creating new features, but most feature engineering depends heavily on brainstorming over the dataset at hand. For example, feature engineering will be done in different ways for an employee dataset than for a dataset of general transactions.
We can create these columns manually using various pandas functions. Besides these, there is a package called FeatureTools, which is also worth exploring; it can create new columns by combining datasets at different levels.
Image by Author

This brings us (more or less) to the end of the data preprocessing stages. Once we have the data preprocessed, we need to start looking into different ML techniques for our problem statement. We will be discussing those in upcoming blogs. Hope y'all found this blog interesting and useful! :)
Translated from: https://medium.com/swlh/what-to-keep-and-what-to-remove-74ba1b3cb04