Notes on "reStructured Pre-training"
This post records the parts of the paper I consider most important, mixed with my own understanding; if there are mistakes, please point them out directly. Due to formatting issues, I strongly recommend reading the complete version on the notion page. Thanks!
Abstract
In such a paradigm, the role of data will be re-emphasized, and model pre-training and fine-tuning of downstream tasks are viewed as a process of data storing and accessing.
A good storage mechanism should not only have the ability to cache a large amount of data but also consider the ease of access.
We achieve this by pre-training models over restructured data that consist of a variety of valuable information instead of raw data after overcoming several engineering challenges.
Hypothesis of NLP technique evolution
[Figure: hypothesis of NLP technique evolution]
1 Introduction
We argue that the ultimate goal of data storage is to better serve human life, and how data is accessed is as important as how it is stored. However, there are often differences in the way that data is stored and accessed.
The authors argue that the ultimate goal of storing data is to better serve people's lives, so how data is accessed is just as important as how it is stored.
Although prompting methods have narrowed the difference between data storage and access, it does not fundamentally eliminate the gap, as the way models store data in the pre-training stage is not transparent to diverse downstream tasks.
Although prompting methods have narrowed the difference between data storage and access, they do not fundamentally eliminate the gap, because the way a model stores data during pre-training is opaque to the diverse downstream tasks.
In other words, a downstream task does not know which method (i.e., which prompt) will best retrieve the desired data from the pre-trained model.
For example, in sentiment classification, to predict a sentence's sentiment with the help of a pre-trained model, we must choose a question format the model is familiar with; yet the system designer has no idea which format the model prefers, because the distribution and structure of the pre-training data are not interpretable. The figure below illustrates this example vividly:
[Figure: the same sentiment-classification question posed to a PLM in different prompt formats]
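As a concrete illustration (my own minimal sketch in Python, not code from the paper), the same sentiment-classification input can be posed to a PLM in several hypothetical prompt formats, and nothing tells the designer which one the model is most familiar with:

```python
# Hypothetical prompt templates for one sentiment-classification input.
# The same PLM may answer them differently, because the distribution of
# its pre-training data (and hence its preferred format) is opaque to us.
templates = [
    "Review: {text}\nIs this review positive or negative?",
    "{text}\nOverall, the sentiment of this sentence is",
    '"{text}" What emotion does the writer express?',
]

text = "I love this movie!"
for template in templates:
    prompt = template.format(text=text)
    print(prompt)
    # prediction = plm(prompt)  # each format may elicit a different answer
```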
Methodologically, we present a new way to look at data that contains various types of information, which could be regarded as pre-training signals that can instruct models for parameter optimization. We structurally represent data in the unit of signals and claim that a good PLM should mark various signals during pre-training in a way that expected information could be accessed efficiently by downstream tasks.
The authors treat the various kinds of information contained in the data as pre-training signals that guide the model's parameter optimization, and structurally represent the data in units of such signals.
A good PLM should mark the different kinds of signals during pre-training so that downstream tasks can efficiently access the data they need.
Just as when we store data in a database, we first structure it into a well-formed table so that we can later retrieve exactly the data we want with a structured query language such as SQL; the sketch below illustrates this analogy.
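As a loose sketch of this analogy (my own illustration; the field names and signal types are invented, not the paper's actual schema), signals can be stored like rows of a table with a unified schema, and downstream access becomes a typed query:

```python
# Signals stored under a unified schema, like rows in a database table.
signals = [
    {"type": "sentiment", "input": "I love this movie!", "output": "positive"},
    {"type": "entity", "input": "Mozart was born in ___.", "output": "Salzburg"},
    {"type": "summary", "input": "<long article>", "output": "<short summary>"},
]

def access(signal_type):
    """Roughly `SELECT * FROM signals WHERE type = ...` in SQL terms."""
    return [s for s in signals if s["type"] == signal_type]

print(access("sentiment"))
# -> [{'type': 'sentiment', 'input': 'I love this movie!', 'output': 'positive'}]
```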
Moreover, we argue that valuable signals are rich and exist everywhere from the data in the world instead of simply existing in the supervised datasets that are manually curated
Valuable signals are abundant and exist everywhere in the world's data, not only in manually curated supervised datasets.
and what we need to do is to (a) identify them, (b) restructure them in a unified language, (c) integrate and store them into the pre-trained language model. We call this learning paradigm reStructured Pre-training.
What we need to do is:
(a) identify the signals;
(b) restructure them into a unified language;
(c) integrate and store them into the pre-trained language model.
We call this learning paradigm reStructured Pre-training.
A good PLM should have a clear picture of the composition of the various signals in the data to provide accurate information for downstream tasks according to their different needs.
A good PLM should have a clear understanding of how the various kinds of signals in the data are composed, so that it can provide accurate information according to the different needs of downstream tasks.
2 reStructured Pre-training
2.1 Paradigm Shift in Modern NLP
[Figure: paradigm shift in modern NLP]
2.2 reStructured Pre-training
Unlike existing paradigms that mainly focus on model-centric design, we think more from the data perspective to maximize the utility of the already available data.
The focus is on maximizing the utility of already-available data.
Specifically, we take a data storing & accessing view where the pre-training stage is considered as a data storing process while downstream task training based on pre-trained models is regarded as data accessing process from pre-trained models, and claim that a good data storage mechanism should make the stored data more accessible.
We take a data storing & accessing view, in which the pre-training stage is regarded as a data-storing process, and downstream-task training based on the pre-trained model as a data-accessing process.
A good data storage mechanism should make the stored data easier to access.
To achieve this goal, we look at data as an object that consists of diverse signals and argue that a good pre-trained model should (1) cover as many types of signals as possible and (2) provide precise access mechanisms for these signals when required by downstream tasks. i.e., a shift from pre-training over plain texts to pre-training over structured signals. In general, there are three steps within this new paradigm.
To achieve this goal, we view data as an object composed of diverse signals, and argue that a good pre-trained model should (1) cover as many types of signals as possible and (2) provide precise access mechanisms for these signals when downstream tasks require them, i.e., shift from pre-training over plain text to pre-training over structured signals.
Overall, the new paradigm has three steps (a minimal sketch of the first step follows the list):
reStructure: since existing signals come in many different formats, they must be restructured into a unified format for model pre-training.
Pre-train: once all the training data has been restructured into the unified format, choose a pre-training architecture and train on the structured data.
Fine-tune: after pre-training, the model can be further fine-tuned on structured labeled data; another common scenario is to apply it directly to downstream tasks, usually via zero-shot prompting.
2.3 Evolutionary Process of Engineering Cycles
[Figure: evolutionary process of engineering cycles]
The core driving force behind machine-learning techniques:
the iteration of technology always moves along the direction that system developers can design a better and more general system by doing fewer things.
Technology always iterates in the direction that lets system developers build a better and more general system by doing fewer things.
2.4 Design Considerations
As the first step of restructured learning, we need to know which signals naturally exist in the world and can be collected and accessed.
A Data Mine is a set of data that contains many types of signals. Once Signal Definition is complete, the search for suitable Data Mines begins.
How to efficiently extract signals from a Data Mine is also important; a hypothetical extraction sketch is given after this list.
The next consideration is how to represent all types of signals in a unified format and narrow the gap between data storage and retrieval.
The last consideration is which pre-training architecture to use so that all the structured data can be represented effectively.
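As a hypothetical example of signal extraction (my own sketch; the paper mines many more signal types from many data mines), hyperlinks in wiki-style text can be read off as naturally occurring entity signals and turned into unified records:

```python
import re

# Treat markdown-style hyperlinks in wiki text as free entity signals.
# The regex and record format here are illustrative assumptions.
wiki_text = "Mozart was born in [Salzburg](wiki/Salzburg) in 1756."

def extract_entity_signals(text):
    signals = []
    for m in re.finditer(r"\[([^\]]+)\]\(([^)]+)\)", text):
        # Replace the link markup with its anchor text to recover plain context.
        context = text[:m.start()] + m.group(1) + text[m.end():]
        signals.append({"type": "entity", "context": context, "entity": m.group(1)})
    return signals

print(extract_entity_signals(wiki_text))
# -> [{'type': 'entity', 'context': 'Mozart was born in Salzburg in 1756.',
#      'entity': 'Salzburg'}]
```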
3 reStructuring Engineering