对比学习系列论文MoCo v1(二):Momentum Contrast for Unsupervised Visual Representation Learning
0.Abstract
0.1 Sentence-by-sentence notes
We present Momentum Contrast (MoCo) for unsupervised visual representation learning.
From a perspective on contrastive learning as dictionary look-up, we build a dynamic dictionary with a queue and a moving-averaged encoder.
This enables building a large and consistent dictionary on-the-fly that facilitates contrastive unsupervised learning.
MoCo provides competitive results under the common linear protocol on ImageNet classification.
More importantly, the representations learned by MoCo transfer well to downstream tasks.
MoCo can outperform its supervised pre-training counterpart in 7 detection/segmentation tasks on PASCAL VOC, COCO, and other datasets, sometimes surpassing it by large margins.
This suggests that the gap between unsupervised and supervised representation learning has been largely closed in many vision tasks.
0.2 Summary
In short, the abstract introduces MoCo:
- 1. It is an unsupervised, contrastive learning method, framed as dictionary look-up.
- 2. It is competitive under the common linear classification protocol on ImageNet, and on detection/segmentation transfer tasks (PASCAL VOC, COCO, etc.) it even surpasses some supervised pre-trained counterparts.
- 3. Because the results are this good, the authors argue the paper largely closes the gap between unsupervised and supervised representation learning.
1.Introduction
1.1 Sentence-by-sentence notes
Paragraph 1 (analogy with NLP: unsupervised learning in vision lags behind, because tokenized dictionaries are not readily available)
Unsupervised representation learning is highly successful in natural language processing, e.g., as shown by GPT [50, 51] and BERT [12].
But supervised pre-training is still dominant in computer vision, where unsupervised methods generally lag behind.
The reason may stem from differences in their respective signal spaces.
Language tasks have discrete signal spaces (words, sub-word units, etc.) for building tokenized dictionaries, on which unsupervised learning can be based.
Computer vision, in contrast, further concerns dictionary building [54, 9, 5], as the raw signal is in a continuous, high-dimensional space and is not structured for human communication (e.g., unlike words).
Paragraph 2 (building on prior work: these methods can all be seen as dynamic dictionaries)
Several recent studies [61, 46, 36, 66, 35, 56, 2] present promising results on unsupervised visual representation learning using approaches related to the contrastive loss [29].
Though driven by various motivations, these methods can be thought of as building dynamic dictionaries.
The "keys" (tokens) in the dictionary are sampled from data (e.g., images or patches) and are represented by an encoder network.
Unsupervised learning trains encoders to perform dictionary look-up: an encoded "query" should be similar to its matching key and dissimilar to others.
Learning is formulated as minimizing a contrastive loss [29].
Paragraph 3 (what a dynamic dictionary should be: large, and consistent)
From this perspective, we hypothesize that it is desirable to build dictionaries that are: (i) large and (ii) consistent as they evolve during training.
Intuitively, a larger dictionary may better sample the underlying continuous, high-dimensional visual space, while the keys in the dictionary should be represented by the same or similar encoder so that their comparisons to the query are consistent.
However, existing methods that use contrastive losses can be limited in one of these two aspects (discussed later in context).
Paragraph 4 (a brief preview of how this paper achieves both properties)
We present Momentum Contrast (MoCo) as a way of building large and consistent dictionaries for unsupervised learning with a contrastive loss (Figure 1).
We maintain the dictionary as a queue of data samples: the encoded representations of the current mini-batch are enqueued, and the oldest are dequeued.
The queue decouples the dictionary size from the mini-batch size, allowing it to be large.
Moreover, as the dictionary keys come from the preceding several mini-batches, a slowly progressing key encoder, implemented as a momentum-based moving average of the query encoder, is proposed to maintain consistency.
Paragraph 5 (what is the dictionary for, and how well does it work?)
MoCo is a mechanism for building dynamic dictionaries for contrastive learning, and can be used with various pretext tasks.
(If pretext tasks are unfamiliar, see the earlier post: 什么是pretext task?)
In this paper, we follow a simple instance discrimination task [61, 63, 2]: a query matches a key if they are encoded views (e.g., different crops) of the same image.
Using this pretext task, MoCo shows competitive results under the common protocol of linear classification in the ImageNet dataset [11].
(Roughly: not necessarily the single best number everywhere, but a very strong result.)
Paragraph 6 (MoCo pre-training transfers very well)
A main purpose of unsupervised learning is to pre-train representations (i.e., features) that can be transferred to downstream tasks by fine-tuning.
We show that in 7 downstream tasks related to detection or segmentation, MoCo unsupervised pre-training can surpass its ImageNet supervised counterpart, in some cases by nontrivial margins.
In these experiments, we explore MoCo pre-trained on ImageNet or on a one-billion Instagram image set, demonstrating that MoCo can work well in a more real-world, billion-image scale, and relatively uncurated scenario.
These results show that MoCo largely closes the gap between unsupervised and supervised representation learning in many computer vision tasks, and can serve as an alternative to ImageNet supervised pre-training in several applications.
1.2 Summary
- 1. In NLP, unsupervised learning works very well, and one of its main ingredients is building (tokenized) dictionaries; much prior work in vision can also be viewed as building dictionaries.
- 2. The authors argue a dictionary should have two properties, while prior methods achieve only one of them:
  Large: so it better samples the underlying continuous, high-dimensional visual space.
  Consistent: the keys should be encoded by the same or a similar encoder, so that comparisons against the query are fair.
- 3. How does this paper achieve both?
  Large: each new mini-batch of encoded keys is enqueued and the oldest one is dequeued, which removes the constraint that the dictionary equals the batch size and lets the dictionary grow.
  Consistent: all the keys come from the preceding mini-batches and the key encoder changes only slowly, so they stay comparable. (I did not fully follow this at first; the momentum update in Sec. 3.2 makes it clear.)
- 4. So what is this dictionary for? It is used to judge whether the current input and a given key come from the same image; the encoder trained this way serves as pre-training that improves downstream recognition.
- 5. And does it work? Very well: as pre-training, MoCo even surpasses some supervised networks.
2. Related Work
2.1 Sentence-by-sentence notes
Paragraph 1 (the paper's contribution is mainly on the loss-function side, so the discussion is organized around that)
Unsupervised/self-supervised learning methods generally involve two aspects: pretext tasks and loss functions.
The term "pretext" implies that the task being solved is not of genuine interest, but is solved only for the true purpose of learning a good data representation.
(That is, the pretext task is not the real task, but solving it improves the representation used for the real task.)
Loss functions can often be investigated independently of pretext tasks.
MoCo focuses on the loss function aspect.
Next we discuss related studies with respect to these two aspects.
Paragraph 2 (prior work on loss functions)
Loss functions. A common way of defining a loss function is to measure the difference between a model's prediction and a fixed target, such as reconstructing the input pixels (e.g., auto-encoders) by L1 or L2 losses, or classifying the input into pre-defined categories (e.g., eight positions [13], color bins [64]) by cross-entropy or margin-based losses.
Other alternatives, as described next, are also possible.
Paragraph 3 (contrastive losses, which sit at the core of recent unsupervised work)
Contrastive losses [29] measure the similarities of sample pairs in a representation space.
Instead of matching an input to a fixed target, in contrastive loss formulations the target can vary on-the-fly during training and can be defined in terms of the data representation computed by a network [29].
Contrastive learning is at the core of several recent works on unsupervised learning [61, 46, 36, 66, 35, 56, 2], which we elaborate on later in context (Sec. 3.1).
Paragraph 4 (adversarial losses, i.e., the losses used by GANs)
Adversarial losses [24] measure the difference between probability distributions.
It is a widely successful technique for unsupervised data generation.
Adversarial methods for representation learning are explored in [15, 16].
There are relations (see [24]) between generative adversarial networks and noise-contrastive estimation (NCE) [28].
(These related-work paragraphs are easy to get bogged down in; let's move straight to how MoCo itself works.)
3. Method
3.1. Contrastive Learning as Dictionary Look-up
3.1.1 Sentence-by-sentence notes
Paragraph 1 (recent work can be summarized as training an encoder for dictionary look-up)
Contrastive learning [29], and its recent developments, can be thought of as training an encoder for a dictionary look-up task, as described next.
Paragraph 2 (the contrastive loss used in this paper)
Consider an encoded query q and a set of encoded samples {k0, k1, k2, ...} that are the keys of a dictionary.
Assume that there is a single key (denoted as k+) in the dictionary that q matches.
(Note: the paper really does assume that each query matches exactly one key.)
A contrastive loss [29] is a function whose value is low when q is similar to its positive key k+ and dissimilar to all other keys (considered negative keys for q).
With similarity measured by dot product, a form of a contrastive loss function, called InfoNCE [46], is considered in this paper:
where τ is a temperature hyper-parameter per [61].
The sum is over one positive and K negative samples.
Intuitively, this loss is the log loss of a (K+1)-way softmax-based classifier that tries to classify q as k+.
Contrastive loss functions can also be based on other forms [29, 59, 61, 36], such as margin-based losses and variants of NCE losses.
Understanding the loss in Paragraph 2
The loss is built on dot products, so the more similar two representations are, the larger their dot product. The more similar q is to k+, the closer the softmax term exp(q·k+/τ) / Σi exp(q·ki/τ) gets to 1; conversely, if q is also similar to some of the other (negative) keys, their dot products grow and the term is pushed away from 1.
Since the loss is the negative log of this term, the closer the term is to 1 the smaller the loss, so the loss is small exactly when the query is matched to the right key. But isn't this just a cross-entropy loss?
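It is. A minimal PyTorch-style sketch of that observation (the function name and tensor shapes are mine for illustration; features are assumed L2-normalized): the positive logit is placed at index 0, and InfoNCE is literally F.cross_entropy over K+1 logits.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q, k_pos, k_negs, tau=0.07):
    """InfoNCE as a (K+1)-way cross-entropy.

    q:      (N, C) encoded queries
    k_pos:  (N, C) the positive key for each query
    k_negs: (K, C) negative keys (in MoCo, the queue)
    """
    l_pos = torch.einsum("nc,nc->n", q, k_pos).unsqueeze(1)  # (N, 1) positive logits
    l_neg = torch.einsum("nc,kc->nk", q, k_negs)             # (N, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / tau          # (N, K+1)
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)  # positive is class 0
    return F.cross_entropy(logits, labels)
```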
Paragraph 3 (the inputs and outputs of the encoder networks)
The contrastive loss serves as an unsupervised objective function for training the encoder networks that represent the queries and keys [29].
(My reading: the loss is what ties the two encoders together.)
In general, the query representation is q = fq(xq) where fq is an encoder network and xq is a query sample (likewise, k = fk(xk)).
Their instantiations depend on the specific pretext task.
The input xq and xk can be images [29, 61, 63], patches [46], or context consisting a set of patches [46].
The networks fq and fk can be identical [29, 59, 63], partially shared [46, 36, 2], or different [56].
3.1.2 Summary
This subsection sets up the contrastive loss used in the paper:
1. A loss function evaluates how well a model does its job, so it has to match the job. Here the job is dictionary look-up: given a query, retrieve the matching key.
2. That makes the setup resemble a classification task, and the loss used here is essentially a cross-entropy loss.
3. The difference is that what is being evaluated is the quality of the two encoders.
3.2. Momentum Contrast
3.2.1 Paragraph-by-paragraph notes
Paragraph 1 (the starting point of the design; mostly properties shared with prior work)
From the above perspective, contrastive learning is a way of building a discrete dictionary on high-dimensional continuous inputs such as images.
The dictionary is dynamic in the sense that the keys are randomly sampled, and that the key encoder evolves during training.
Our hypothesis is that good features can be learned by a large dictionary that covers a rich set of negative samples, while the encoder for the dictionary keys is kept as consistent as possible despite its evolution.
(This sentence is best read in two halves: the first half corresponds to building a large dictionary with a queue, the second to why the key encoder is updated with momentum. Another reason for the momentum update is that with such a large dictionary, back-propagating through every key is infeasible; my guess is that this infeasibility is what led the authors to the momentum idea in the first place.)
Based on this motivation, we present Momentum Contrast as described next.
Paragraph 2 (why a queue makes a good dictionary)
Dictionary as a queue.
At the core of our approach is maintaining the dictionary as a queue of data samples.
This allows us to reuse the encoded keys from the immediate preceding mini-batches.
The introduction of a queue decouples the dictionary size from the mini-batch size.
Our dictionary size can be much larger than a typical mini-batch size, and can be flexibly and independently set as a hyper-parameter.
Understanding Paragraph 2:
The queue plays the role of the keys that each query is compared against. For a given query there is exactly one positive key (the encoded view of the same image, produced in the current mini-batch), and the samples in the queue act as negatives; training pushes the query to be similar to that single positive and dissimilar to all the negatives.
Paragraph 3 (how the dictionary behaves in practice)
The samples in the dictionary are progressively replaced.
The current mini-batch is enqueued to the dictionary, and the oldest mini-batch in the queue is removed.
The dictionary always represents a sampled subset of all data, while the extra computation of maintaining this dictionary is manageable.
(The older keys in the dictionary were computed and stored earlier, so they cost nothing now.)
Moreover, removing the oldest mini-batch can be beneficial, because its encoded keys are the most outdated and thus the least consistent with the newest ones.
(Keep in mind that the older keys in the queue were produced by the key encoder as it was back then, not by the current, further-updated key encoder.)
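A minimal sketch of this FIFO dictionary (the names queue and ptr are mine, not from the paper): a fixed-size buffer in which the newest keys overwrite the oldest slots.

```python
import torch
import torch.nn.functional as F

feature_dim, K = 128, 65536                  # K = dictionary (queue) size, a free hyper-parameter
queue = F.normalize(torch.randn(feature_dim, K), dim=0)  # keys stored column-wise
ptr = 0                                      # points at the oldest slot

@torch.no_grad()
def dequeue_and_enqueue(keys):
    """Replace the oldest mini-batch of keys with the newest one (FIFO)."""
    global ptr
    batch_size = keys.shape[0]
    assert K % batch_size == 0               # keeps the pointer arithmetic simple
    queue[:, ptr:ptr + batch_size] = keys.T  # enqueue the current keys
    ptr = (ptr + batch_size) % K             # the overwritten slots were the oldest ones
```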
Paragraph 4 (why the momentum update is needed)
Momentum update.
Using a queue can make the dictionary large, but it also makes it intractable to update the key encoder by back-propagation (the gradient should propagate to all samples in the queue).
(Neither memory nor compute can afford that; I discussed this in detail in the previous post.)
A naive solution is to copy the key encoder fk from the query encoder fq, ignoring this gradient. But this solution yields poor results in experiments (Sec. 4.1).
(That is, skip the gradient for the keys entirely and simply copy the query encoder over as the key encoder.)
We hypothesize that such failure is caused by the rapidly changing encoder that reduces the key representations' consistency.
(The authors' hypothesis: the key encoder needs a certain stability for its keys to remain useful.)
We propose a momentum update to address this issue.
Paragraph 5 (the momentum update itself)
Formally, denoting the parameters of fk as θk and those of fq as θq, we update θk by:
θk ← m·θk + (1 − m)·θq.   (2)
Here m ∈ [0, 1) is a momentum coefficient. Only the parameters θq are updated by back-propagation.
The momentum update in Eqn. (2) makes θk evolve more smoothly than θq.
(The momentum dilutes each individual update.)
As a result, though the keys in the queue are encoded by different encoders (in different mini-batches), the difference among these encoders can be made small.
(Because the encoder changes so little, keys computed by its earlier versions can still be reused later.)
In experiments, a relatively large momentum (e.g., m = 0.999, our default) works much better than a smaller value (e.g., m = 0.9), suggesting that a slowly evolving key encoder is a core to making use of a queue.
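A sketch of Eqn. (2) in code (encoder_q and encoder_k are assumed to be two networks with identical architecture, e.g. torch.nn.Module instances):

```python
import torch

m = 0.999  # momentum coefficient, the paper's default

@torch.no_grad()
def momentum_update(encoder_q, encoder_k):
    """theta_k <- m * theta_k + (1 - m) * theta_q; no gradient ever flows into f_k."""
    for param_q, param_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        param_k.data.mul_(m).add_(param_q.data, alpha=1.0 - m)
```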
Transition (how MoCo relates to previous mechanisms; the detailed comparison follows)
Relations to previous mechanisms.
MoCo is a general mechanism for using contrastive losses.
We compare it with two existing general mechanisms in Figure 2.
They exhibit different properties on the dictionary size and consistency.
Paragraph 6 (Fig. 2a: the end-to-end mechanism, its update rule and its dictionary-size limit)
The end-to-end update by back-propagation is a natural mechanism (e.g., [29, 46, 36, 63, 2, 35], Figure 2a).
It uses samples in the current mini-batch as the dictionary, so the keys are consistently encoded (by the same set of encoder parameters).
(Every key in the batch is encoded by the same, current key encoder, hence the consistency.)
But the dictionary size is coupled with the mini-batch size, limited by the GPU memory size.
(As discussed in the previous post, a large batch size quickly runs into GPU memory limits.)
It is also challenged by large mini-batch optimization [25].
(Also from the previous post: large-batch optimization is itself difficult, and a bad step can hurt a lot.)
Some recent methods [46, 36, 2] are based on pretext tasks driven by local positions, where the dictionary size can be made larger by multiple positions.
(That is, each image contributes many local patches/positions, so one batch yields far more keys than images.)
But these pretext tasks may require special network designs such as patchifying the input [46] or customizing the receptive field size [2], which may complicate the transfer of these networks to downstream tasks.
Paragraph 7 (Fig. 2b: the memory bank approach, its consistency problem, and its earlier use of momentum)
Another mechanism is the memory bank approach proposed by [61] (Figure 2b).
A memory bank consists of the representations of all samples in the dataset.
The dictionary for each mini-batch is randomly sampled from the memory bank with no back-propagation, so it can support a large dictionary size.
(In effect there is no longer a separate key encoder at all.)
However, the representation of a sample in the memory bank was updated when it was last seen, so the sampled keys are essentially about the encoders at multiple different steps all over the past epoch and thus are less consistent.
(I have not read that paper, so I am not sure exactly how the update works, but as described here each sample's entry is refreshed whenever that sample is visited, so the bank as a whole is not very consistent.)
A momentum update is adopted on the memory bank in [61]. Its momentum update is on the representations of the same sample, not the encoder.
This momentum update is irrelevant to our method, because MoCo does not keep track of every sample.
Moreover, our method is more memory-efficient and can be trained on billion-scale data, which can be intractable for a memory bank.
3.2.2 Summary
- 1. Contrastive learning pushes an encoded input to be as similar as possible to its positive and as dissimilar as possible to the negatives, so that, ideally, it ends up similar only to its positive among everything else. For that to work, the set of negatives we compare against has to be large.
- 2. Enlarging the set of negatives is really a tug-of-war between two existing mechanisms:
  Mechanism 1: update the key encoder end-to-end at every step, which forces the dictionary to be no larger than the mini-batch.
  Mechanism 2: keep no key encoder at all and instead update stored representations sample by sample over the whole dataset (the memory bank); this makes the dictionary large, but no longer consistent, which limits how well learning can do.
- 3. MoCo is proposed to reconcile the two.
- 4. The momentum updates appearing in earlier related papers are not the same thing as MoCo's.
3.3. Pretext Task
3.3.1 Sentence-by-sentence notes
Paragraph 1 (which pretext task is used: an existing one, not a new design)
Contrastive learning can drive a variety of pretext tasks.
As the focus of this paper is not on designing a new pretext task, we use a simple one mainly following the instance discrimination task in [61], to which some recent works [63, 2] are related.
Paragraph 2 (how MoCo borrows from those works)
Following [61], we consider a query and a key as a positive pair if they originate from the same image, and otherwise as a negative sample pair.
Following [63, 2], we take two random "views" of the same image under random data augmentation to form a positive pair.
The queries and keys are respectively encoded by their encoders, fq and fk.
The encoder can be any convolutional neural network [39].
Paragraph 3 (Algorithm 1: where the positives and negatives come from)
Algorithm 1 provides the pseudo-code of MoCo for this pretext task.
(If the pseudo-code is hard to follow, see the earlier post: Moco Algorithm 1 解讀.)
For the current mini-batch, we encode the queries and their corresponding keys, which form the positive sample pairs.
The negative samples are from the queue.
(This last sentence becomes obvious once you have worked through Algorithm 1; see also the sketch below.)
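A condensed sketch of one training step in the spirit of Algorithm 1 (f_q, f_k, the optimizer, and an aug() augmentation function are assumed to exist, as are the momentum_update and dequeue_and_enqueue helpers sketched earlier; this is an illustration, not the authors' exact code):

```python
import torch
import torch.nn.functional as F

def moco_step(x, f_q, f_k, queue, optimizer, tau=0.07):
    """One MoCo iteration: positives from the current batch, negatives from the queue."""
    x_q, x_k = aug(x), aug(x)                            # two random views of the same images

    q = F.normalize(f_q(x_q), dim=1)                     # queries (N, C); gradients flow here
    with torch.no_grad():                                # no gradient to the key encoder
        k = F.normalize(f_k(x_k), dim=1)                 # keys (N, C)

    l_pos = torch.einsum("nc,nc->n", q, k).unsqueeze(1)  # (N, 1) positive logits
    l_neg = torch.einsum("nc,ck->nk", q, queue)          # (N, K) negative logits; queue is (C, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)

    loss = F.cross_entropy(logits, labels)               # InfoNCE, Eqn. (1)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                     # SGD update of f_q only

    momentum_update(f_q, f_k)                            # Eqn. (2)
    dequeue_and_enqueue(k)                               # update the dictionary
    return loss.item()
```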
Technical details
We adopt a ResNet [33] as the encoder, whose last fully connected layer (after global average pooling) has a fixed-dimensional output (128-D [61]).
This output vector is normalized by its L2-norm [61].
This is the representation of the query or key.
The temperature τ in Eqn. (1) is set as 0.07 [61].
(The same temperature value as in the earlier temperature-scaled cross-entropy work it cites.)
The data augmentation setting follows [61]: a 224×224-pixel crop is taken from a randomly resized image, and then undergoes random color jittering, random horizontal flip, and random grayscale conversion, all available in PyTorch's torchvision package.
(In other words, only standard torchvision augmentations are used; nothing exotic.)
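A sketch of such a pipeline with torchvision (the specific scale, jitter, and probability values are illustrative defaults, not quoted from the paper); applying it twice to the same image yields the two views of a positive pair.

```python
from torchvision import transforms

augmentation = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),  # 224x224 crop of a randomly resized image
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.4),           # random color jittering
    transforms.RandomGrayscale(p=0.2),                    # random grayscale conversion
    transforms.RandomHorizontalFlip(),                    # random horizontal flip
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],      # standard ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])
```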
Shuffling BN
Our encoders fq and fk both have Batch Normalization (BN) [37] as in the standard ResNet [33].
In experiments, we found that using BN prevents the model from learning good representations, as similarly reported in [35] (which avoids using BN).
The model appears to "cheat" the pretext task and easily finds a low-loss solution.
This is possibly because the intra-batch communication among samples (caused by BN) leaks information.
Next paragraph (how shuffling BN works)
We resolve this problem by shuffling BN.
We train with multiple GPUs and perform BN on the samples independently for each GPU (as done in common practice).
(Per-GPU BN with no interaction across GPUs; so far this is just the standard setup.)
For the key encoder fk, we shuffle the sample order in the current mini-batch before distributing it among GPUs (and shuffle back after encoding); the sample order of the mini-batch for the query encoder fq is not altered.
(That is, the order is shuffled only on the key-encoder side, and restored to the original order after encoding; the query side is left untouched.)
This ensures the batch statistics used to compute a query and its positive key come from two different subsets.
This effectively tackles the cheating issue and allows training to benefit from BN.
(The "cheating" is the model using BN to pick up information about the other samples in its batch. In an ordinary network this is harmless, but here those other samples serve as negatives, so sharing batch statistics between a query and its keys amounts to leaking the answer.)
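A single-process sketch of the idea (the real implementation shuffles across GPUs with distributed gather/scatter; here the "GPUs" would simply be chunks of the batch that f_k batch-normalizes separately):

```python
import torch

@torch.no_grad()
def key_forward_with_shuffle_bn(f_k, x_k):
    """Shuffle the sample order before f_k and restore it afterwards."""
    batch_size = x_k.size(0)
    idx_shuffle = torch.randperm(batch_size)    # random permutation of the batch
    idx_unshuffle = torch.argsort(idx_shuffle)  # its inverse permutation

    k = f_k(x_k[idx_shuffle])                   # keys computed in shuffled order
    return k[idx_unshuffle]                     # shuffle back: k[i] again matches q[i]
```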
Final paragraph (shuffling BN and the comparison baselines)
We use shuffled BN in both our method and its end-to-end ablation counterpart (Figure 2a).
(The end-to-end variant here serves as an ablation baseline; an ablation disables or swaps one component at a time to check which parts matter, essentially a controlled experiment.)
It is irrelevant to the memory bank counterpart (Figure 2b), which does not suffer from this issue because the positive keys are from different mini-batches in the past.
(The memory bank variant does not need shuffled BN: its positive keys were computed in earlier mini-batches, so there is no intra-batch information to leak.)
The experiments part of these notes will be filled in later.
4. Experiments
4.0 The opening part (which datasets are used)
4.0.1 Sentence-by-sentence notes
We study unsupervised training performed in:
ImageNet-1M (IN-1M): This is the ImageNet [11] training set that has ~1.28 million images in 1000 classes (often called ImageNet-1K; we count the image number instead, as classes are not exploited by unsupervised learning).
This dataset is well-balanced in its class distribution, and its images generally contain iconic view of objects.
Instagram-1B (IG-1B): Following [44], this is a dataset of ~1 billion (940M) public images from Instagram.
The images are from ~1500 hashtags [44] that are related to the ImageNet categories.
This dataset is relatively uncurated comparing to IN-1M, and has a long-tailed, unbalanced distribution of real-world data. This dataset contains both iconic objects and scene-level images.
Summary
The overall innovation, in one sentence: use a queue to keep the keys from previous mini-batches, which yields a large set of negatives.
The momentum update and shuffling BN are also contributions, but both follow from that initial idea.