T5: Text-To-Text Transfer Transformer
With the burgeoning of Transfer Learning, Deep Learning has achieved many wonders. More specifically, in NLP, with the rise of the Transformer (Vaswani et al.), various approaches for ‘Language Modeling’ have arisen wherein we leverage transfer learning by pre-training the model on a very generic task and then fine-tuning it on specific downstream problems.
In this article, we’ll discuss Google’s state-of-the-art model, T5 — the Text-to-Text Transfer Transformer, which was proposed earlier this year in the paper “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer”. This paper is essentially a survey of modern transfer learning techniques used in language understanding, and hence proposes a unified framework that attempts to cast all language problems into a text-to-text format. We will discuss this approach in greater detail in the coming sections. Moreover, the authors have also open-sourced a new dataset (to facilitate their work) called C4 — the Colossal Clean Crawled Corpus.
T5 — Text-To-Text Transfer Transformer
As mentioned earlier, T5 attempts to combine all the downstream tasks into a text-to-text format.
The Text-to-Text Framework
(Figure: a unified framework for all downstream tasks. Source: Google AI Blog)
Consider the example of a BERT-style architecture that is pre-trained on a Masked LM and Next Sentence Prediction objective and then fine-tuned on downstream tasks (for example, predicting a class label in classification or the span of the input in QnA). Here, we separately fine-tune different instances of the pre-trained model on different downstream tasks.
The text-to-text framework, on the contrary, suggests using the same model, the same loss function, and the same hyperparameters on all NLP tasks. In this approach, the inputs are modeled in such a way that the model recognizes the task, and the output is simply the “text” version of the expected outcome. Refer to the figure above to get a clearer view of this.
Fun fact: We can even apply T5 to regression tasks by training it to output the string representation of the expected output.
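To make the framework concrete, here is a minimal sketch of text-to-text inference using the huggingface transformers API that is linked at the end of this article. The task prefixes follow the convention from the paper; the choice of the “t5-small” checkpoint and the generation settings are just assumptions for illustration.

```python
# Minimal sketch of the text-to-text interface, assuming the
# huggingface `transformers` package and the public "t5-small" checkpoint.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Every task is expressed as plain text with a task-specific prefix.
tasks = [
    "translate English to German: That is good.",
    "cola sentence: The course is jumping well.",  # linguistic acceptability
    "summarize: state authorities dispatched emergency crews tuesday to "
    "survey the damage after an onslaught of severe weather in mississippi.",
]

for text in tasks:
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_length=50)
    # Whatever the task is, the "answer" is always just decoded text.
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Note how translation, classification, and summarization all go through exactly the same generate-and-decode path; that is the whole point of the unified framework.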
C4 — Colossal Clean Crawled Corpus
It is standard practice to pre-train language models on huge unlabeled datasets. Common Crawl is one such dataset. It is obtained by scraping web pages and ignoring the markup from the HTML, and it produces around 20TB of scraped data each month. However, Common Crawl contains large amounts of gibberish text like menus, error messages, or duplicate text. Moreover, there is also an appreciable amount of text that is useless for our tasks, like offensive words, placeholder text, or source code.
For C4, the authors took the Common Crawl scrape from April 2019 and applied several cleansing filters to it:
Removing any page containing offensive words that appear on the “List of Dirty, Naughty, Obscene or Otherwise Bad Words”.
Finally, since the downstream tasks are mostly in English, langdetect is used to filter out any pages that are not classified as English with a probability of at least 0.99.
This resulted in a 750GB dataset which is not just reasonably larger than most pre-training datasets but also contains relatively clean text.
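The official C4 pipeline lives in TensorFlow Datasets (linked at the end of this article); the toy sketch below only illustrates the two filters described above, using the langdetect package mentioned in the text. The bad-word set, the page representation, and the helper name are made up for illustration.

```python
# Toy sketch of two of the C4 cleansing filters described above.
from langdetect import detect_langs

BAD_WORDS = {"badword1", "badword2"}  # stand-in for the real bad-words list

def keep_page(text: str) -> bool:
    lowered = text.lower()
    # Drop any page containing a word from the offensive-words list.
    if any(word in lowered for word in BAD_WORDS):
        return False
    # Keep only pages classified as English with probability >= 0.99.
    try:
        langs = detect_langs(text)
    except Exception:  # langdetect raises on empty/undetectable text
        return False
    return any(l.lang == "en" and l.prob >= 0.99 for l in langs)

pages = ["A perfectly ordinary English paragraph about transformers and transfer learning."]
clean_pages = [page for page in pages if keep_page(page)]
```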
Input and Output Representations
This is one of the key ideas of T5, as it is what makes the unified text-to-text approach possible. To use the same model for all the downstream tasks, a task-specific text prefix is added to the original input that is fed to the model. This text prefix is also treated as a hyperparameter.
As an example, to ask the model to translate the sentence “That is good.” from English to German, the model would be fed the sequence “translate English to German: That is good.” and would be trained to output “Das ist gut.”
— T5 Paper
Similarly, for classification tasks, the model predicts a single word corresponding to the target label.
For example, on the MNLI benchmark the goal is to predict whether a premise implies (“entailment”), contradicts (“contradiction”), or neither (“neutral”) a hypothesis. With our preprocessing, the input sequence becomes “mnli premise: I hate pigeons. hypothesis: My feelings towards pigeons are filled with animosity.” with the corresponding target word “entailment”.
— T5 Paper
There is a caveat here: what if the predicted word is something else, i.e. not “entailment”, “contradiction”, or “neutral”? Well, in that case, the model’s prediction is simply counted as wrong.
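A rough sketch of what this looks like in code, again assuming the huggingface T5 checkpoint and the MNLI prefix quoted above; the final check mirrors the rule just described, where anything other than the three label strings is counted as wrong.

```python
# Sketch of classification in the text-to-text format (MNLI example).
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

text = ("mnli premise: I hate pigeons. "
        "hypothesis: My feelings towards pigeons are filled with animosity.")
input_ids = tokenizer(text, return_tensors="pt").input_ids
output_ids = model.generate(input_ids, max_length=10)
prediction = tokenizer.decode(output_ids[0], skip_special_tokens=True)

# The target is literally the string "entailment"; any other output,
# even a near-miss, is counted as a wrong prediction.
VALID_LABELS = {"entailment", "contradiction", "neutral"}
print(prediction, prediction in VALID_LABELS)
```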
The Model
The proposed model is essentially an Encoder-Decoder Transformer (Vaswani et al.) with some architectural changes (like applying Layer Normalization before a sub-block and then adding the initial input to the sub-block output; also known as pre-norm). Moreover, the model configuration is similar to BERT base (Devlin et al.).
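Before moving on, the pre-norm detail mentioned above is easy to express in code. The tiny PyTorch module below is only a schematic of that residual pattern (normalize first, run the sub-layer, then add back the unnormalized input); it is not T5’s actual implementation, which among other things uses a simplified layer normalization without bias or mean subtraction.

```python
import torch
import torch.nn as nn

class PreNormResidual(nn.Module):
    """Schematic pre-norm wrapper: y = x + sublayer(LayerNorm(x))."""
    def __init__(self, d_model: int, sublayer: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.sublayer = sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize first, run the sub-layer, then add the original input.
        return x + self.sublayer(self.norm(x))

# Example: wrapping a feed-forward sub-layer of width 512.
ffn = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
block = PreNormResidual(512, ffn)
out = block(torch.randn(2, 10, 512))  # (batch, sequence, d_model)
```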
We’ll skip these architectures as they’re out of scope for this article. If you’re interested in knowing the specifications of these models in particular, I have already covered them in the following articles:
Transformers: https://towardsdatascience.com/transformers-explained-65454c0f3fa7
Transformers Implementation: https://medium.com/swlh/abstractive-text-summarization-using-transformers-3e774cc42453
BERT: https://medium.com/swlh/bert-pre-training-of-transformers-for-language-understanding-5214fba4a9af
Training Approach
(Figure: the architectural variants explored, from the T5 paper)
The paper is an exhaustive survey of many modern approaches for language understanding. Hence, many architectural specifications have been explored and compared. At an architectural level, there are several options in selecting the training approach:
Encoder-Decoder (Left): This is the standard encoder-decoder, seq2seq architecture wherein the encoder is trained in a BERT-style, fully visible manner (i.e. every token contributes to the attention calculation of every other token in the sequence), and the decoder is trained in a GPT-style causal manner (i.e. every token attends only to the tokens that occur before it in the sequence).
Language Model (Middle): This is essentially the causal attention mechanism that was discussed earlier. It is an autoregressive modeling approach.
Prefix LM (Right): This is a combination of the BERT-style and language-model approaches. For example, for the task of translating from English to German, the model can apply fully visible, BERT-style attention over the prefix “translate English to German: That is good. target:”, and the translation “Das ist gut.” is then predicted autoregressively.
With experimentation, the best results were obtained with the Encoder-Decoder approach.
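One way to build intuition for the three attention patterns above is to write down their masks directly. The helper below builds toy boolean masks (True meaning a position may be attended to) for a sequence whose first prefix_len tokens form the input prefix; it is purely illustrative and does not follow any particular library’s masking convention.

```python
import numpy as np

def fully_visible_mask(n: int) -> np.ndarray:
    # BERT-style: every token can attend to every other token.
    return np.ones((n, n), dtype=bool)

def causal_mask(n: int) -> np.ndarray:
    # Language-model style: token i attends only to tokens <= i.
    return np.tril(np.ones((n, n), dtype=bool))

def prefix_lm_mask(n: int, prefix_len: int) -> np.ndarray:
    # Prefix LM: full visibility within the prefix, causal afterwards.
    mask = causal_mask(n)
    mask[:, :prefix_len] = True  # every token can see the whole prefix
    return mask

print(prefix_lm_mask(6, prefix_len=3).astype(int))
```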
Unsupervised Objective
(Figure: the span-corruption objective, from the T5 paper)
With respect to the pre-training objective too, the authors have explored several approaches in practice:
Language Modeling: This approach mainly covers the causal prediction task, i.e. predicting the next word in the sentence given all the words preceding it.
Deshuffling: All the words in a sentence are shuffled and the model is trained to predict the original text.
Corrupting Spans: Masking a sequence of words from the sentence and training the model to predict these masked words as shown in the figure above. It is also known as a denoising objective.
After exploration, the denoising objective had the most promising results.
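A toy version of the span-corruption objective can be written in a few lines; the sketch below uses the example sentence from the paper’s figure and the sentinel notation (<extra_id_0>, <extra_id_1>, ...) of the released T5 vocabulary, but the spans here are chosen by hand, whereas the real objective samples them randomly (corrupting roughly 15% of tokens with a mean span length of 3).

```python
def corrupt_spans(words, spans):
    """Toy span corruption. `spans` maps a start index to a span length.
    Returns the (corrupted input, target) text pair used for denoising."""
    inputs, targets = [], []
    i, sentinel = 0, 0
    while i < len(words):
        if i in spans:
            inputs.append(f"<extra_id_{sentinel}>")
            targets.append(f"<extra_id_{sentinel}>")
            targets.extend(words[i:i + spans[i]])
            i += spans[i]
            sentinel += 1
        else:
            inputs.append(words[i])
            i += 1
    targets.append(f"<extra_id_{sentinel}>")  # final sentinel closes the target
    return " ".join(inputs), " ".join(targets)

words = "Thank you for inviting me to your party last week".split()
corrupted, target = corrupt_spans(words, spans={2: 2, 8: 1})
print(corrupted)  # Thank you <extra_id_0> me to your party <extra_id_1> week
print(target)     # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```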
(Figure: the flow of unsupervised objectives explored, from the T5 paper)
Results
First things first: T5 has achieved state-of-the-art results on many GLUE and SuperGLUE tasks, along with translation and summarization benchmarks.
The authors also evaluated T5 on closed-book question answering, where the model must answer trivia questions without access to any external context or knowledge source.
T5 is surprisingly good at this task. The full 11-billion parameter model produces the exact text of the answer 50.1%, 37.4%, and 34.5% of the time on TriviaQA, WebQuestions, and Natural Questions, respectively.
— Google AI Blog
To generate realistic text, T5 relies on a fill-in-the-blanks-type task, with which it is familiar thanks to its pre-training. So, the authors have created a new downstream task called sized fill-in-the-blank. For example, given the sentence “I like to eat peanut butter and _4_ sandwiches,”, the model is trained to predict approximately 4 words for the blank.
Fun fact: The model also adjusts its predictions based on the requested size of the missing text.
For a demonstration of the above, refer to the official blog.
Putting it All Together
(Figure: pre-training and fine-tuning of T5. Source: Google AI Blog)
- T5 is first pre-trained on the C4 dataset with the denoising (corrupting spans) objective, using an Encoder-Decoder architecture.
- It is then fine-tuned on the downstream tasks with a supervised objective, with appropriate input modeling for the text-to-text setting.
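A compressed sketch of the fine-tuning step, using the huggingface interface referenced below, might look like the following; the checkpoint name, the optimizer choice, and the single training example are placeholders, and a real run would of course iterate over a full downstream dataset.

```python
# Sketch of supervised fine-tuning in the text-to-text setting,
# starting from a checkpoint already pre-trained on C4.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# One (input, target) pair in the text-to-text format.
source = "translate English to German: That is good."
target = "Das ist gut."

inputs = tokenizer(source, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids

# The model computes the standard cross-entropy loss over the target tokens.
optimizer.zero_grad()
loss = model(input_ids=inputs.input_ids,
             attention_mask=inputs.attention_mask,
             labels=labels).loss
loss.backward()
optimizer.step()
```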
Conclusion
In this article, we dived deep into Google’s T5 model, one of the state-of-the-art models in language understanding, and looked at the new C4 dataset. The main takeaway from this article is the set of empirical results obtained by the T5 authors regarding training approaches, model architectures, and datasets. Moreover, it can also be observed that deep learning is moving ever closer to human-quality language understanding, in this case by generalizing to just one model for many NLP tasks.
Github repo: https://github.com/google-research/text-to-text-transfer-transformer
API for the model architecture and pre-trained weights by huggingface: https://huggingface.co/transformers/model_doc/t5.html
C4 Tensorflow datasets: https://www.tensorflow.org/datasets/catalog/c4
Original article: https://towardsdatascience.com/t5-text-to-text-transfer-transformer-643f89e8905e