Abstractive Summarization for Data Augmentation
A Creative Solution to Imbalanced Class Distribution
Imbalanced class distribution is a common problem in Machine Learning. I was recently confronted with this issue when training a sentiment classification model. Certain categories were far more prevalent than others, and the predictive quality of the model suffered. The first technique I used to address this was random under-sampling, wherein I randomly sampled a subset of rows from each category up to a ceiling threshold. I selected a ceiling that reasonably balanced the top three classes. Although a small improvement was observed, the model was still far from optimal.
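To make the under-sampling step concrete, here is a minimal sketch of what a ceiling-based random under-sampler could look like for a multi-label DataFrame with one-hot encoded label columns; the column names, ceiling value, and random seed are illustrative assumptions rather than the exact code used for the model above.

import pandas as pd

def undersample(df, features, ceiling=1000):
    # Keep at most `ceiling` randomly chosen rows per over-represented feature
    keep_idx = set()
    for feature in features:
        idx = df.index[df[feature] == 1]
        if len(idx) > ceiling:
            idx = pd.Series(idx).sample(ceiling, random_state=42)
        keep_idx.update(idx)
    # Rows carrying several labels are kept only once
    return df.loc[sorted(keep_idx)]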
I needed a way to deal with the under-represented classes. I could not rely on traditional techniques used in multi-class classification such as sample and class weighting, as I was working with a multi-label dataset. It became evident that I would need to leverage oversampling in this situation.
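For contrast, in a plain multi-class setting the imbalance can often be addressed with class weighting alone. The short scikit-learn sketch below (toy data; the choice of vectorizer and classifier are assumptions for illustration) shows why that route is attractive when it applies, but it presumes exactly one label per row, which multi-label data does not satisfy.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["great product", "terrible service", "okay experience", "great value"]
labels = ["pos", "neg", "neu", "pos"]  # one label per row: multi-class, not multi-label

X = TfidfVectorizer().fit_transform(texts)
# class_weight='balanced' reweights classes inversely to their frequency
clf = LogisticRegression(class_weight='balanced', max_iter=1000)
clf.fit(X, labels)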
A technique such as SMOTE (Synthetic Minority Over-sampling Technique) can be effective for oversampling, although the problem again becomes a bit more difficult with multi-label datasets. MLSMOTE (Multi-Label Synthetic Minority Over-sampling Technique) has been proposed [1], but the high dimensional nature of the numerical vectors created from text can sometimes make other forms of data augmentation more appealing.
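For reference, classic SMOTE as implemented in the imbalanced-learn package interpolates new minority samples between nearest neighbors, but it expects a single target vector, which is part of why it does not carry over cleanly to multi-label text data. A toy sketch with synthetic data (not from the dataset discussed here):

import numpy as np
from imblearn.over_sampling import SMOTE

X = np.random.rand(100, 20)        # toy feature matrix
y = np.array([0] * 90 + [1] * 10)  # imbalanced single-label target

# Synthesize minority-class rows until the classes are balanced
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X, y)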
Photo by Christian Wagner on Unsplash
Transformers to the Rescue!
If you decided to read this article, it is safe to assume that you are aware of the latest advances in Natural Language Processing bequeathed by the mighty Transformers. The exceptional developers at Hugging Face in particular have opened the door to this world through their open source contributions. One of their more recent releases implements a breakthrough in Transfer Learning called the Text-to-Text Transfer Transformer, or T5 model, originally presented by Raffel et al. in their paper Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer [2].
T5 allows us to execute various NLP tasks by specifying prefixes to the input text. In my case, I was interested in Abstractive Summarization, so I made use of the summarize prefix.
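As a quick illustration of how prefixes steer the model, here is a minimal sketch using the Hugging Face transformers API; the t5-small checkpoint and the generation settings are arbitrary choices for the example and may differ from absum's defaults.

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')

# The prefix selects the task; other prefixes such as
# "translate English to German: " trigger other behaviors.
text = "summarize: The movie was long and the plot meandered, yet the ending made the whole journey worthwhile."
input_ids = tokenizer.encode(text, return_tensors='pt')
summary_ids = model.generate(input_ids, num_beams=4, max_length=30, early_stopping=True)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))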
Text-to-Text Transfer Transformer [2]
Abstractive Summarization
Put simply, Abstractive Summarization is a technique by which a chunk of text is fed to an NLP model and a novel summary of that text is returned. This should not be confused with Extractive Summarization, in which sentences are embedded and a clustering algorithm is executed to find those closest to the clusters' centroids; in other words, existing sentences are returned. Abstractive Summarization seemed particularly appealing as a Data Augmentation technique because of its ability to generate novel yet realistic sentences of text.
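To make the contrast concrete, a bare-bones extractive summarizer might embed sentences, cluster them, and return the existing sentences nearest each centroid. The sketch below uses TF-IDF vectors and k-means purely for illustration; the embedding choice and cluster count are assumptions.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

sentences = [
    "The battery lasts all day.",
    "Battery life is excellent even with heavy use.",
    "The screen is hard to read outdoors.",
    "Outdoor visibility of the display is poor.",
]

# Embed and cluster the sentences
X = TfidfVectorizer().fit_transform(sentences).toarray()
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

# Extractive summary: pick the existing sentence closest to each centroid
closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, X)
print([sentences[i] for i in closest])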
Algorithm
Here are the steps I took to use Abstractive Summarization for Data Augmentation, including code segments illustrating the solution.
I first needed to determine how many rows each under-represented class required. The number of rows to add for each feature is thus calculated with a ceiling threshold, and we refer to these as the append_counts. Features with counts above the ceiling are not appended. In particular, if a given feature has 1000 rows and the ceiling is 100, its append count will be 0. The following methods trivially achieve this in the situation where features have been one-hot encoded:
def get_feature_counts(self, df):
    # Count how many rows carry each one-hot encoded feature
    shape_array = {}
    for feature in self.features:
        shape_array[feature] = df[feature].sum()
    return shape_array

def get_append_counts(self, df):
    # Rows to append per feature: the shortfall below the ceiling threshold
    append_counts = {}
    feature_counts = self.get_feature_counts(df)
    for feature in self.features:
        if feature_counts[feature] >= self.threshold:
            count = 0
        else:
            count = self.threshold - feature_counts[feature]
        append_counts[feature] = count
    return append_counts
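As a sanity check of the calculation, with a hypothetical threshold of 100 a feature present in 120 rows gets an append count of 0, while one present in 40 rows gets 60. The same logic outside the class looks like this (column names and threshold are made up):

import pandas as pd

df = pd.DataFrame({
    'feature_a': [1] * 120 + [0] * 40,
    'feature_b': [0] * 120 + [1] * 40,
})
threshold = 100

append_counts = {
    feature: max(threshold - int(df[feature].sum()), 0)
    for feature in ['feature_a', 'feature_b']
}
print(append_counts)  # {'feature_a': 0, 'feature_b': 60}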
For each feature, a loop runs from the current append index up to that index plus the append count specified for the given feature. This append_index variable, along with a tasks array, is introduced to allow for multiprocessing, which we will discuss shortly.
counts = self.get_append_counts(self.df)

# Create append dataframe with length of all rows to be appended
self.df_append = pd.DataFrame(
    index=np.arange(sum(counts.values())),
    columns=self.df.columns
)

# Creating array of tasks for multiprocessing
tasks = []

# set all feature values to 0
for feature in self.features:
    self.df_append[feature] = 0

for feature in self.features:
    num_to_append = counts[feature]
    for num in range(
        self.append_index,
        self.append_index + num_to_append
    ):
        tasks.append(
            self.process_abstractive_summarization(feature, num)
        )

    # Updating index for insertion into shared appended dataframe
    # to preserve indexing for multiprocessing
    self.append_index += num_to_append
An Abstractive Summarization is calculated on a sample of specified size drawn from the rows that carry only the given feature, and the result is added to the append DataFrame with its respective feature one-hot encoded.
df_feature = self.df[
    (self.df[feature] == 1) &
    (self.df[self.features].sum(axis=1) == 1)
]
df_sample = df_feature.sample(self.num_samples, replace=True)
text_to_summarize = ' '.join(
    df_sample[:self.num_samples]['review_text'])
new_text = self.get_abstractive_summarization(text_to_summarize)
self.df_append.at[num, 'text'] = new_text
self.df_append.at[num, feature] = 1
The Abstractive Summarization itself is generated in the following way:
t5_prepared_text = "summarize: " + text_to_summarize

if self.device.type == 'cpu':
    tokenized_text = self.tokenizer.encode(
        t5_prepared_text,
        return_tensors=self.return_tensors).to(self.device)
else:
    tokenized_text = self.tokenizer.encode(
        t5_prepared_text,
        return_tensors=self.return_tensors)

summary_ids = self.model.generate(
    tokenized_text,
    num_beams=self.num_beams,
    no_repeat_ngram_size=self.no_repeat_ngram_size,
    min_length=self.min_length,
    max_length=self.max_length,
    early_stopping=self.early_stopping
)
output = self.tokenizer.decode(
    summary_ids[0],
    skip_special_tokens=self.skip_special_tokens
)
In initial tests the summarization calls to the T5 model were extremely time-consuming, reaching up to 25 seconds even on a GCP instance with an NVIDIA Tesla P100. Clearly this needed to be addressed to make this a feasible solution for data augmentation.
Photo by Brad Neathery on Unsplash
Multiprocessing
I introduced a multiprocessing option, whereby the calls to Abstractive Summarization are stored in a tasks array that is later passed to a sub-routine running the calls in parallel using the multiprocessing library. This resulted in a dramatic decrease in runtime. I must thank David Foster for his succinct Stack Overflow contribution [3]!
from multiprocessing import Process

running_tasks = [Process(target=task) for task in tasks]
for running_task in running_tasks:
    running_task.start()
for running_task in running_tasks:
    running_task.join()
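Since Process expects its target to be a callable, the general pattern from [3] can be reproduced in isolation by storing callables in the tasks list, for instance built with functools.partial. The worker function below is a stand-in for a per-row summarization call and is purely hypothetical, not absum's actual implementation.

import time
from functools import partial
from multiprocessing import Process

def worker(feature, num):
    # Placeholder for the real per-row summarization work
    time.sleep(0.1)
    print(f"processed row {num} for {feature}")

if __name__ == '__main__':
    tasks = [partial(worker, 'feature_a', n) for n in range(4)]
    running_tasks = [Process(target=task) for task in tasks]
    for running_task in running_tasks:
        running_task.start()
    for running_task in running_tasks:
        running_task.join()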
Simplified Solution
To make things easier for everybody, I packaged this into a library called absum. It can be installed through pip: pip install absum. One can also download it directly from the repository.
Running the code on your own dataset is then simply a matter of importing the library’s Augmentor class and running its abs_sum_augment method as follows:
import pandas as pd
from absum import Augmentor

csv = 'path_to_csv'
df = pd.read_csv(csv)
augmentor = Augmentor(df)
df_augmented = augmentor.abs_sum_augment()
df_augmented.to_csv(
    csv.replace('.csv', '-augmented.csv'),
    encoding='utf-8',
    index=False
)
absum uses the Hugging Face T5 model by default, but is designed in a modular way to allow you to use any pre-trained or out-of-the-box Transformer models capable of Abstractive Summarization. It is format agnostic, expecting only a DataFrame containing text and one-hot encoded features. If additional columns are present that you do not wish to be considered, you have the option to pass in specific one-hot encoded features as a comma-separated string to the features parameter.
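For instance, if the DataFrame also carries metadata columns that should not influence augmentation, the relevant one-hot columns can be named explicitly. The file name and column names below are hypothetical:

import pandas as pd
from absum import Augmentor

# Hypothetical CSV with 'text', 'joy', 'anger', plus extra metadata columns
df = pd.read_csv('reviews.csv')

# Restrict augmentation to the named one-hot features; other columns are ignored
augmentor = Augmentor(df, features="joy,anger")
df_augmented = augmentor.abs_sum_augment()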
Also of special note are the min_length and max_length parameters, which determine the size of the resulting summarizations. One trick I found useful is to find the average character count of the text data you are working with, then start a bit lower than that for the minimum length and pad it slightly for the maximum. All available parameters are detailed in the documentation.
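A concrete version of that heuristic, continuing with the df and Augmentor from the earlier snippet, might look like the sketch below; the column name 'text' and the offsets are assumptions for illustration.

# Average character count of the text column (column name assumed to be 'text')
avg_chars = int(df['text'].str.len().mean())

augmentor = Augmentor(
    df,
    min_length=max(avg_chars - 50, 10),  # start a bit below the average length
    max_length=avg_chars + 100           # pad the upper bound slightly
)
df_augmented = augmentor.abs_sum_augment()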
Feel free to add any suggestions for improvement in the comments, or better yet in a PR. Happy coding!
Translated from: https://towardsdatascience.com/abstractive-summarization-for-data-augmentation-1423d8ec079e