“Everything Can Be Seq2Seq” | A Faithful Hand-Written Translation of the T5 Paper
“Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer”
Abstract
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new “Colossal Clean Crawled Corpus”, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
Keywords: transfer learning, natural language processing, multi-task learning, attention-based models, deep learning
Transfer learning, in which a model is first pre-trained on a data-rich task and then fine-tuned for a downstream task, is a powerful technique in natural language processing. The effectiveness of transfer learning has given rise to a diversity of approaches, methods, and practices. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts every text-based language problem into a text-to-text format. We systematically compare pre-training objectives, architectures, unlabeled data sets, transfer methods, and other factors across dozens of language understanding tasks. By combining the insights from our exploration with scale and our new “Colossal Clean Crawled Corpus” (C4), we achieve state-of-the-art results on many benchmarks, including text summarization, question answering, text classification, and more. To promote further work on transfer learning for NLP, we release our data set, pre-trained models, and code.
章節1 介紹 /?Introduction
Training a machine learning model to perform natural language processing (NLP) tasks often requires that the model can process text in a way that is amenable to downstream learning. This can be loosely viewed as developing general-purpose knowledge that allows the model to “understand” text. This knowledge can range from low-level (e.g. the spelling or meaning of words) to high-level (e.g. that a tuba is too large to fit in most backpacks). In modern machine learning practice, providing this knowledge is rarely done explicitly; instead, it is often learned as part of an auxiliary task. For example, a historically common approach is to use word vectors (Mikolov et al., 2013b,a; Pennington et al., 2014) to map word identities to a continuous representation where, ideally, similar words map to similar vectors. These vectors are often learned through an objective that, for example, encourages co-occurring words to be positioned nearby in the continuous space (Mikolov et al., 2013b).
Training a machine learning model for natural language processing tasks often requires that the model can process text in a way that suits downstream learning. This can be loosely viewed as having the model acquire general-purpose knowledge that allows it to “understand” text. Such knowledge can range from low-level (e.g., the spelling or meaning of words) to high-level (e.g., that a tuba, a large low-pitched brass instrument, is too big to fit in most backpacks). In modern machine learning practice, this knowledge is rarely provided explicitly; instead, it is usually learned as part of an auxiliary task. For example, a historically common approach is to use word vectors (Mikolov et al., 2013b,a; Pennington et al., 2014) to map word identities to continuous representations where, ideally, similar words map to similar vectors. These word vectors are typically learned through an objective that, for example, encourages co-occurring words to be placed near each other in the continuous space (for word2vec, words that appear close together in text are mapped to vectors that are close in the embedding space) (Mikolov et al., 2013b).
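To make the word-vector idea above concrete, here is a minimal, illustrative sketch (not from the paper) of training skip-gram embeddings on a toy corpus. It assumes the gensim 4.x library; the corpus, hyperparameters, and word pairs are invented purely for illustration.

```python
# A minimal sketch of the word-vector idea described above, assuming gensim 4.x:
# words that co-occur in similar contexts end up with nearby vectors.
from gensim.models import Word2Vec

# Tiny toy corpus; a real setup would use billions of tokens.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["a", "tuba", "is", "a", "large", "brass", "instrument"],
]

# sg=1 selects the skip-gram objective (predict context words from a center word).
model = Word2Vec(sentences, vector_size=32, window=2, min_count=1, sg=1, epochs=50)

# Cosine similarity between word vectors; "cat" and "dog" share contexts in this
# toy corpus, so they should score higher than an unrelated pair after training.
print(model.wv.similarity("cat", "dog"))
print(model.wv.similarity("cat", "tuba"))
```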
Recently, it has become increasingly common to pre-train the entire model on a data-rich task. Ideally, this pre-training causes the model to develop general-purpose abilities and knowledge that can then be transferred to downstream tasks. In applications of transfer learning to computer vision (Oquab et al., 2014; Jia et al., 2014; Huh et al., 2016; Yosinski et al., 2014), pre-training is typically done via supervised learning on a large labeled data set like ImageNet (Russakovsky et al., 2015; Deng et al., 2009). In contrast, modern techniques for transfer learning in NLP often pre-train using unsupervised learning on unlabeled data. This approach has recently been used to obtain state-of-the-art results in many of the most common NLP benchmarks (Devlin et al., 2018; Yang et al., 2019; Dong et al., 2019; Liu et al., 2019c; Lan et al., 2019). Beyond its empirical strength, unsupervised pre-training for NLP is particularly attractive because unlabeled text data is available en masse thanks to the Internet; for example, the Common Crawl project produces about 20TB of text data extracted from web pages each month. This is a natural fit for neural networks, which have been shown to exhibit remarkable scalability, i.e. it is often possible to achieve better performance simply by training a larger model on a larger data set (Hestness et al., 2017; Shazeer et al., 2017; Jozefowicz et al., 2016; Mahajan et al., 2018; Radford et al., 2019; Shazeer et al., 2018; Huang et al., 2018b; Keskar et al., 2019a).
Recently, it has become increasingly common to pre-train an entire model on a data-rich task. Ideally, such pre-training lets the model develop general-purpose abilities and knowledge that can then be transferred to downstream tasks. When transfer learning is applied to computer vision (Oquab et al., 2014; Jia et al., 2014; Huh et al., 2016; Yosinski et al., 2014), pre-training is typically done via supervised learning on a large labeled data set such as ImageNet (Russakovsky et al., 2015; Deng et al., 2009). In contrast, the transfer learning techniques now used in NLP usually pre-train with unsupervised learning on unlabeled data. This approach has recently achieved state-of-the-art results on many of the most common NLP benchmarks (Devlin et al., 2018; Yang et al., 2019; Dong et al., 2019; Liu et al., 2019c; Lan et al., 2019). Beyond its empirical strength, unsupervised pre-training is particularly attractive for NLP because unlabeled text is available in vast quantities thanks to the Internet; for example, the Common Crawl project extracts roughly 20TB of text from web pages every month. This is a natural fit for neural networks, which have shown remarkable scalability: simply training a larger model on a larger data set is often enough to obtain better performance (Hestness et al., 2017; Shazeer et al., 2017; Jozefowicz et al., 2016; Mahajan et al., 2018; Radford et al., 2019; Shazeer et al., 2018; Huang et al., 2018b; Keskar et al., 2019a).
This synergy has resulted in a great deal of recent work developing transfer learning methodology for NLP, which has produced a wide landscape of pre-training objectives (Howard and Ruder, 2018; Devlin et al., 2018; Yang et al., 2019; Dong et al., 2019), unlabeled data sets (Yang et al., 2019; Liu et al., 2019c; Zellers et al., 2019), benchmarks (Wang et al., 2019b, 2018; Conneau and Kiela, 2018), fine-tuning methods (Howard and Ruder, 2018; Houlsby et al., 2019; Peters et al., 2019), and more. The rapid rate of progress and diversity of techniques in this burgeoning field can make it difficult to compare different algorithms, tease apart the effects of new contributions, and understand the space of existing methods for transfer learning. Motivated by a need for more rigorous understanding, we leverage a unified approach to transfer learning that allows us to systematically study different approaches and push the current limits of the field.
This synergy (a “1+1 > 2” effect) has led to a great deal of recent work on transfer learning for NLP, producing a wide landscape of pre-training objectives (Howard and Ruder, 2018; Devlin et al., 2018; Yang et al., 2019; Dong et al., 2019), unlabeled data sets (Yang et al., 2019; Liu et al., 2019c; Zellers et al., 2019), benchmarks (Wang et al., 2019b, 2018; Conneau and Kiela, 2018), fine-tuning methods (Howard and Ruder, 2018; Houlsby et al., 2019; Peters et al., 2019), and more. In such a rapidly developing field, the fast pace of progress and the diversity of techniques make it hard to compare different algorithms, to tease apart the contribution of new work, and to understand the space of existing transfer learning methods. Motivated by the need for a more rigorous understanding, we adopt a unified approach to transfer learning that allows us to study different methods systematically and to push the current limits of the field.
The basic idea underlying our work is to treat every text processing problem as a “text-to-text” problem, i.e. taking text as input and producing new text as output. This approach is inspired by previous unifying frameworks for NLP tasks, including casting all text problems as question answering (McCann et al., 2018), language modeling (Radford et al., 2019), or span extraction (Keskar et al., 2019b) tasks. Crucially, the text-to-text framework allows us to directly apply the same model, objective, training procedure, and decoding process to every task we consider. We leverage this flexibility by evaluating performance on a wide variety of English-based NLP problems, including question answering, document summarization, and sentiment classification, to name a few. With this unified approach, we can compare the effectiveness of different transfer learning objectives, unlabeled data sets, and other factors, while exploring the limits of transfer learning for NLP by scaling up models and data sets beyond what has previously been considered.
The basic idea underlying our work is to treat every text processing problem as a “text-to-text” problem, that is, taking text as input and producing new text as output (everything can be Seq2Seq). This approach is inspired by earlier unifying frameworks for NLP tasks, including casting all text problems as question answering (McCann et al., 2018), language modeling (Radford et al., 2019), or span extraction (Keskar et al., 2019b). Crucially, the text-to-text framework lets us apply the same model, objective, training procedure, and decoding process directly to every task we consider. We exploit this flexibility by evaluating performance on a wide variety of English NLP problems, including question answering, document summarization, and sentiment classification, among others. With this unified approach, we can compare the effectiveness of different transfer learning objectives, unlabeled data sets, and other factors, while exploring the limits of transfer learning for NLP by scaling models and data sets beyond what had previously been considered.
Figure 1: A diagram of our text-to-text framework. Every task we consider, including translation, question answering, and classification, is cast as feeding our model text as input and training it to generate some target text. This allows us to use the same model, loss function, hyperparameters, etc. across our diverse set of tasks. It also provides a standard testbed for the methods included in our empirical survey. “T5” refers to our model, which we dub the “Text-to-Text Transfer Transformer”.
圖1:我們的文本到文本框架圖。我們考慮的每個任務(包括翻譯,問題解答和分類)都將文本作為輸入喂入我們的模型,并對其進行訓練來生成一些目標文本。這使我們可以在各種任務中使用相同的模型,損失函數,超參數等。它還為我們調研中的方法提供了標準的測試方法。“Text-to-Text Transfer Transformer”是指我們的模型,我們將其稱為“T5”。
We emphasize that our goal is not to propose new methods but instead to provide a comprehensive perspective on where the field stands. As such, our work primarily comprises a survey, exploration, and empirical comparison of existing techniques. We also explore the limits of current approaches by scaling up the insights from our systematic study (training models up to 11 billion parameters) to obtain state-of-the-art results in many of the tasks we consider. In order to perform experiments at this scale, we introduce the “Colossal Clean Crawled Corpus” (C4), a data set consisting of hundreds of gigabytes of clean English text scraped from the web. Recognizing that the main utility of transfer learning is the possibility of leveraging pre-trained models in data-scarce settings, we release our code, data sets, and pre-trained models.
We emphasize that our goal is not to propose new methods, but to offer a comprehensive view of where the field currently stands. Accordingly, our work mainly consists of a survey, exploration, and empirical comparison of existing techniques. We also explore the limits of current approaches by scaling up the insights from our systematic study (training models with up to 11 billion parameters), obtaining state-of-the-art results on many of the tasks we consider. To run experiments at this scale, we introduce the “Colossal Clean Crawled Corpus” (C4), a data set consisting of hundreds of gigabytes of clean English text scraped from the web. Recognizing that the main value of transfer learning is the ability to leverage pre-trained models in data-scarce settings, we release our code, data sets, and pre-trained models.
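For readers who want to look at the released corpus, the following is a minimal sketch (not from the paper) of streaming a few C4 records. It assumes the Hugging Face datasets library and its hosted "allenai/c4" mirror of the English split; the field names shown are those of that mirror.

```python
# A minimal sketch for peeking at C4, assuming the Hugging Face `datasets`
# library and its hosted "allenai/c4" mirror (an assumption; the paper itself
# only states that the data set is released).
from datasets import load_dataset

# Streaming avoids downloading the full English split (hundreds of gigabytes).
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, example in enumerate(c4):
    # Each record contains cleaned web text plus its source URL and timestamp.
    print(example["url"])
    print(example["text"][:200], "...")
    if i >= 2:
        break
```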
The remainder of the paper is structured as follows: In the following section, we discuss our base model and its implementation, our procedure for formulating every text processing problem as a text-to-text task, and the suite of tasks we consider. In Section 3, we present a large set of experiments that explore the field of transfer learning for NLP. At the end of the section (Section 3.7), we combine insights from our systematic study to obtain state-of-the-art results on a wide variety of benchmarks. Finally, we provide a summary of our results and wrap up with a look towards the future in Section 4.
The remainder of this paper is structured as follows. In the next section, we discuss our base model and its implementation, our procedure for formulating every text processing problem as a text-to-text task, and the suite of tasks we consider. In Section 3, we present a large set of experiments exploring the field of transfer learning for NLP. At the end of that section (Section 3.7), we combine the insights from our systematic study to obtain state-of-the-art results on a wide variety of benchmarks. Finally, in Section 4 we summarize our results and close with an outlook on future work.