Mask R-CNN Study Notes
Date: 2018-10-22
Author: 哪咔嗎
Source Link: http://arxiv.org/pdf/1703.06870v3.pdf

1. Introduction
In principle Mask R-CNN is an intuitive extension of Faster R-CNN, yet constructing the mask branch properly is critical for good results. Most importantly, Faster R-CNN was not designed for pixel-to-pixel alignment between network inputs and outputs. This is most evident in how RoIPool [18, 12], the de facto core operation for attending to instances, performs coarse spatial quantization for feature extraction. To fix the misalignment, we propose a simple, quantization-free layer, called RoIAlign, that faithfully preserves exact spatial locations. Despite being a seemingly minor change, RoIAlign has a large impact: it improves mask accuracy by relative 10% to 50%, showing bigger gains under stricter localization metrics. Second, we found it essential to decouple mask and class prediction: we predict a binary mask for each class independently, without competition among classes, and rely on the network's RoI classification branch to predict the category. In contrast, FCNs usually perform per-pixel multi-class categorization, which couples segmentation and classification, and based on our experiments works poorly for instance segmentation.

Footnote 1: Following common terminology, we use object detection to denote detection via bounding boxes, not masks, and semantic segmentation to denote per-pixel classification without differentiating instances. Yet we note that instance segmentation is both semantic and a form of detection.

Figure 2. Mask R-CNN results on the COCO test set. These results are based on ResNet-101 [19], achieving a mask AP of 35.7 and running at 5 fps. Masks are shown in color, and bounding box, category, and confidences are also shown.
Without bells and whistles, Mask R-CNN surpasses all previous state-of-the-art single-model results on the COCO instance segmentation task [28], including the heavily-engineered entries from the 2016 competition winner. As a by-product, our method also excels on the COCO object detection task. In ablation experiments, we evaluate multiple basic instantiations, which allows us to demonstrate its robustness and analyze the effects of core factors.

Our models can run at about 200ms per frame on a GPU, and training on COCO takes one to two days on a single 8-GPU machine. We believe the fast train and test speeds, together with the framework's flexibility and accuracy, will benefit and ease future research on instance segmentation.

Finally, we showcase the generality of our framework via the task of human pose estimation on the COCO keypoint dataset [28]. By viewing each keypoint as a one-hot binary mask, with minimal modification Mask R-CNN can be applied to detect instance-specific poses. Mask R-CNN surpasses the winner of the 2016 COCO keypoint competition, and at the same time runs at 5 fps. Mask R-CNN, therefore, can be seen more broadly as a flexible framework for instance-level recognition and can be readily extended to more complex tasks.
We have released code to facilitate future research.
2. Related Work
R-CNN: The Region-based CNN (R-CNN) approach [13] to bounding-box object detection is to attend to a manageable number of candidate object regions [42, 20] and evaluate convolutional networks [25, 24] independently on each RoI. R-CNN was extended [18, 12] to allow attending to RoIs on feature maps using RoIPool, leading to fast speed and better accuracy. Faster R-CNN [36] advanced this stream by learning the attention mechanism with a Region Proposal Network (RPN). Faster R-CNN is flexible and robust to many follow-up improvements (e.g., [38, 27, 21]), and is the current leading framework in several benchmarks.

Instance Segmentation: Driven by the effectiveness of R-CNN, many approaches to instance segmentation are based on segment proposals. Earlier methods [13, 15, 16, 9] resorted to bottom-up segments [42, 2]. DeepMask [33] and following works [34, 8] learn to propose segment candidates, which are then classified by Fast R-CNN. In these methods, segmentation precedes recognition, which is slow and less accurate. Likewise, Dai et al. [10] proposed a complex multiple-stage cascade that predicts segment proposals from bounding-box proposals, followed by classification. Instead, our method is based on parallel prediction of masks and class labels, which is simpler and more flexible.

Most recently, Li et al. [26] combined the segment proposal system in [8] and object detection system in [11] for "fully convolutional instance segmentation" (FCIS). The common idea in [8, 11, 26] is to predict a set of position-sensitive output channels fully convolutionally. These channels simultaneously address object classes, boxes, and masks, making the system fast. But FCIS exhibits systematic errors on overlapping instances and creates spurious edges (Figure 6), showing that it is challenged by the fundamental difficulties of segmenting instances.

Another family of solutions [23, 4, 3, 29] to instance segmentation are driven by the success of semantic segmentation. Starting from per-pixel classification results (e.g., FCN outputs), these methods attempt to cut the pixels of the same category into different instances. In contrast to the segmentation-first strategy of these methods, Mask R-CNN is based on an instance-first strategy. We expect a deeper incorporation of both strategies will be studied in the future.
3. Mask R-CNN
Mask R-CNN is conceptually simple: Faster R-CNN has two outputs for each candidate object, a class label and a bounding-box offset; to this we add a third branch that outputs the object mask. Mask R-CNN is thus a natural and intuitive idea. But the additional mask output is distinct from the class and box outputs, requiring extraction of much finer spatial layout of an object. Next, we introduce the key elements of Mask R-CNN, including pixel-to-pixel alignment, which is the main missing piece of Fast/Faster R-CNN.

Faster R-CNN: We begin by briefly reviewing the Faster R-CNN detector [36]. Faster R-CNN consists of two stages. The first stage, called a Region Proposal Network (RPN), proposes candidate object bounding boxes. The second stage, which is in essence Fast R-CNN [12], extracts features using RoIPool from each candidate box and performs classification and bounding-box regression. The features used by both stages can be shared for faster inference. We refer readers to [21] for latest, comprehensive comparisons between Faster R-CNN and other frameworks.

Mask R-CNN: Mask R-CNN adopts the same two-stage procedure, with an identical first stage (which is RPN). In the second stage, in parallel to predicting the class and box offset, Mask R-CNN also outputs a binary mask for each RoI. This is in contrast to most recent systems, where classification depends on mask predictions (e.g. [33, 10, 26]). Our approach follows the spirit of Fast R-CNN [12] that applies bounding-box classification and regression in parallel (which turned out to largely simplify the multi-stage pipeline of original R-CNN [13]).
Formally, during training, we define a multi-task loss on each sampled RoI as L = Lcls + Lbox + Lmask. The classification loss Lcls and bounding-box loss Lbox are identical to those defined in [12]. The mask branch has a Km²-dimensional output for each RoI, which encodes K binary masks of resolution m×m, one for each of the K classes. To this we apply a per-pixel sigmoid, and define Lmask as the average binary cross-entropy loss. For an RoI associated with ground-truth class k, Lmask is only defined on the k-th mask (other mask outputs do not contribute to the loss).

Our definition of Lmask allows the network to generate masks for every class without competition among classes; we rely on the dedicated classification branch to predict the class label used to select the output mask. This decouples mask and class prediction. This is different from common practice when applying FCNs [30] to semantic segmentation, which typically uses a per-pixel softmax and a multinomial cross-entropy loss. In that case, masks across classes compete; in our case, with a per-pixel sigmoid and a binary loss, they do not. We show by experiments that this formulation is key for good instance segmentation results.
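To make the loss definition concrete, here is a minimal PyTorch-style sketch of the per-class binary mask loss. The tensor names and shapes are illustrative assumptions, not the paper's reference implementation: mask_logits is the Km²-dimensional branch output reshaped to (N, K, m, m), gt_classes holds the ground-truth class k of each positive RoI, and gt_masks the m×m binary targets.

```python
import torch
import torch.nn.functional as F

def mask_rcnn_mask_loss(mask_logits, gt_classes, gt_masks):
    """Average binary cross-entropy on the ground-truth class channel only.

    mask_logits: (N, K, m, m) raw outputs of the mask branch for N positive RoIs
    gt_classes:  (N,) int64 ground-truth class index k for each RoI
    gt_masks:    (N, m, m) float binary targets (RoI intersected with its GT mask)
    """
    n = mask_logits.shape[0]
    # Select the k-th mask per RoI; the other K-1 channels do not contribute to the loss.
    logits_k = mask_logits[torch.arange(n), gt_classes]          # (N, m, m)
    # Per-pixel sigmoid + binary cross-entropy, averaged over pixels and RoIs.
    return F.binary_cross_entropy_with_logits(logits_k, gt_masks)

# Hypothetical usage with random tensors (K=80 classes, m=28 assumed for illustration):
if __name__ == "__main__":
    logits = torch.randn(4, 80, 28, 28)
    classes = torch.randint(0, 80, (4,))
    targets = (torch.rand(4, 28, 28) > 0.5).float()
    print(mask_rcnn_mask_loss(logits, classes, targets))
```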
Mask Representation: A mask encodes an input object’s spatial layout. Thus, unlike class labels or box offsets that are inevitably collapsed into short output vectors by fully-connected (fc) layers, extracting the spatial structure of masks can be addressed naturally by the pixel-to-pixel correspondence provided by convolutions.
Specifically, we predict an m×m mask from each RoI using an FCN [30]. This allows each layer in the mask branch to maintain the explicit m×m object spatial layout without collapsing it into a vector representation that lacks spatial dimensions. Unlike previous methods that resort to fc layers for mask prediction [33, 34, 10], our fully convolutional representation requires fewer parameters, and is more accurate as demonstrated by experiments.
This pixel-to-pixel behavior requires our RoI features, which themselves are small feature maps, to be well aligned to faithfully preserve the explicit per-pixel spatial correspondence. This motivated us to develop the following RoIAlign layer that plays a key role in mask prediction.
RoIAlign: RoIPool [12] is a standard operation for extracting a small feature map (e.g., 7×7) from each RoI. RoIPool first quantizes a floating-number RoI to the discrete granularity of the feature map, this quantized RoI is then subdivided into spatial bins which are themselves quantized, and finally feature values covered by each bin are aggregated (usually by max pooling). Quantization is performed, e.g., on a continuous coordinate x by computing [x/16], where 16 is a feature map stride and [·] is rounding; likewise, quantization is performed when dividing into bins (e.g., 7×7). These quantizations introduce misalignments between the RoI and the extracted features. While this may not impact classification, which is robust to small translations, it has a large negative effect on predicting pixel-accurate masks.
To address this, we propose an RoIAlign layer that removes the harsh quantization of RoIPool, properly aligning the extracted features with the input. Our proposed change is simple: we avoid any quantization of the RoI boundaries or bins (i.e., we use x/16 instead of [x/16]). We use bilinear interpolation [22] to compute the exact values of the input features at four regularly sampled locations in each RoI bin, and aggregate the result (using max or average), see Figure 3 for details. We note that the results are not sensitive to the exact sampling locations, or how many points are sampled, as long as no quantization is performed.
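The following NumPy sketch is an illustration under assumed names, not the paper's implementation. It contrasts the two steps: RoIPool's rounding of coordinates versus RoIAlign's bilinear sampling of the feature map at four regularly spaced points inside one bin, followed by average pooling.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Bilinearly interpolate feature map `feat` (H, W) at continuous (y, x).
    Assumes the sampling point lies inside the feature map."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, feat.shape[0] - 1), min(x0 + 1, feat.shape[1] - 1)
    wy, wx = y - y0, x - x0
    return (feat[y0, x0] * (1 - wy) * (1 - wx) + feat[y0, x1] * (1 - wy) * wx +
            feat[y1, x0] * wy * (1 - wx) + feat[y1, x1] * wy * wx)

def roi_align_bin(feat, bin_y0, bin_x0, bin_h, bin_w):
    """RoIAlign for a single bin: average of 4 regularly sampled bilinear values.
    Bin coordinates stay as floats; no rounding is applied anywhere."""
    samples = []
    for iy in range(2):            # 2x2 regular grid inside the bin
        for ix in range(2):
            y = bin_y0 + (iy + 0.5) * bin_h / 2.0
            x = bin_x0 + (ix + 0.5) * bin_w / 2.0
            samples.append(bilinear_sample(feat, y, x))
    return np.mean(samples)

# RoIPool, by contrast, would first round the RoI to feature-map cells,
# e.g. x_feat = int(round(x_img / 16)) for a stride-16 feature map,
# losing the sub-cell offset that RoIAlign preserves.
```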
RoIAlign leads to large improvements as we show in §4.2. We also compare to the RoIWarp operation proposed in [10]. Unlike RoIAlign, RoIWarp overlooked the alignment issue and was implemented in [10] as quantizing RoI just like RoIPool. So even though RoIWarp also adopts bilinear resampling motivated by [22], it performs on par with RoIPool as shown by experiments (more details in Table 2c), demonstrating the crucial role of alignment.
Network Architecture: To demonstrate the generality of our approach, we instantiate Mask R-CNN with multiple architectures. For clarity, we differentiate between: (i) the convolutional backbone architecture used for feature extraction over an entire image, and (ii) the network head for bounding-box recognition (classification and regression) and mask prediction that is applied separately to each RoI. We denote the backbone architecture using the nomenclature network-depth-features. We evaluate ResNet [19] and ResNeXt [45] networks of depth 50 or 101 layers. The original implementation of Faster R-CNN with ResNets [19] extracted features from the final convolutional layer of the 4-th stage, which we call C4. This backbone with ResNet-50, for example, is denoted by ResNet-50-C4. This is a common choice used in [19, 10, 21, 39].
We also explore another more effective backbone recently proposed by Lin et al. [27], called a Feature Pyramid Network (FPN). FPN uses a top-down architecture with lateral connections to build an in-network feature pyramid from a single-scale input. Faster R-CNN with an FPN backbone extracts RoI features from different levels of the feature pyramid according to their scale, but otherwise the rest of the approach is similar to vanilla ResNet. Using a ResNet-FPN backbone for feature extraction with Mask R-CNN gives excellent gains in both accuracy and speed. For further details on FPN, we refer readers to [27].
For the network head we closely follow architectures presented in previous work to which we add a fully convolutional mask prediction branch. Specifically, we extend the Faster R-CNN box heads from the ResNet [19] and FPN [27] papers. Details are shown in Figure 4. The head on the ResNet-C4 backbone includes the 5-th stage of ResNet (namely, the 9-layer 'res5' [19]), which is compute-intensive. For FPN, the backbone already includes res5 and thus allows for a more efficient head that uses fewer filters. We note that our mask branches have a straightforward structure. More complex designs have the potential to improve performance but are not the focus of this work.

Figure 4. Head Architecture: We extend two existing Faster R-CNN heads [19, 27]. Left/Right panels show the heads for the ResNet C4 and FPN backbones, from [19] and [27], respectively, to which a mask branch is added. Numbers denote spatial resolution and channels. Arrows denote either conv, deconv, or fc layers as can be inferred from context (conv preserves spatial dimension while deconv increases it). All convs are 3×3, except the output conv which is 1×1, deconvs are 2×2 with stride 2, and we use ReLU [31] in hidden layers. Left: 'res5' denotes ResNet's fifth stage, which for simplicity we altered so that the first conv operates on a 7×7 RoI with stride 1 (instead of 14×14 / stride 2 as in [19]). Right: '×4' denotes a stack of four consecutive convs.
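As a concrete reading of the FPN head in Figure 4 (right), here is a minimal PyTorch sketch of the mask branch: four 3×3 convs, a 2×2 stride-2 deconv, and a 1×1 output conv producing K class masks. The channel counts and number of classes are assumptions for illustration, not values taken from a released implementation.

```python
import torch
from torch import nn

class MaskHead(nn.Module):
    """Fully convolutional mask branch ('×4' conv stack + deconv + 1×1 output)."""
    def __init__(self, in_channels: int = 256, num_classes: int = 80):
        super().__init__()
        layers = []
        for _ in range(4):                       # '×4': four consecutive 3×3 convs
            layers += [nn.Conv2d(in_channels, 256, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
            in_channels = 256
        self.convs = nn.Sequential(*layers)
        # 2×2 deconv with stride 2 doubles the spatial resolution (e.g., 14 -> 28).
        self.deconv = nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2)
        self.relu = nn.ReLU(inplace=True)
        # 1×1 output conv: one m×m mask logit map per class.
        self.predictor = nn.Conv2d(256, num_classes, kernel_size=1)

    def forward(self, roi_features: torch.Tensor) -> torch.Tensor:
        x = self.convs(roi_features)             # (N, 256, 14, 14)
        x = self.relu(self.deconv(x))            # (N, 256, 28, 28)
        return self.predictor(x)                 # (N, K, 28, 28)

# Example: 14×14 RoIAlign features from an FPN level.
masks = MaskHead()(torch.randn(2, 256, 14, 14))
print(masks.shape)  # torch.Size([2, 80, 28, 28])
```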
3.1. Implementation Details
We set hyper-parameters following existing Fast/Faster R-CNN work [12, 36, 27]. Although these decisions were made for object detection in original papers [12, 36, 27], we found our instance segmentation system is robust to them.
Training: As in Fast R-CNN, an RoI is considered positive if it has IoU with a ground-truth box of at least 0.5 and negative otherwise. The mask loss is de?ned only on positive RoIs. The mask target is the intersection between an RoI and its associated ground-truth mask.
We adopt image-centric training [12]. Images are resized such that their scale (shorter edge) is 800 pixels [27]. Each mini-batch has 2 images per GPU and each image has N sampled RoIs, with a ratio of 1:3 of positive to negatives [12]. N is 64 for the C4 backbone (as in [12, 36]) and 512 for FPN (as in [27]). We train on 8 GPUs (so effective mini-batch size is 16) for 160k iterations, with a learning rate of 0.02 which is decreased by 10 at the 120k iteration. We use a weight decay of 0.0001 and momentum of 0.9. With ResNeXt [45], we train with 1 image per GPU and the same number of iterations, with a starting learning rate of 0.01. The RPN anchors span 5 scales and 3 aspect ratios, following [27]. For convenient ablation, RPN is trained separately and does not share features with Mask R-CNN, unless specified. For every entry in this paper, RPN and Mask R-CNN have the same backbones and so they are shareable.
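The hyper-parameters above can be summarized in a small configuration sketch; the key names below are invented for illustration and do not correspond to any particular code base.

```python
# Training hyper-parameters from Section 3.1 (COCO, synchronized SGD on 8 GPUs).
TRAIN_CFG = {
    "image_short_side": 800,        # images resized so the shorter edge is 800 px
    "images_per_gpu": 2,            # effective mini-batch of 16 on 8 GPUs
    "rois_per_image": {"C4": 64, "FPN": 512},
    "positive_fraction": 0.25,      # 1:3 ratio of positive to negative RoIs
    "iterations": 160_000,
    "base_lr": 0.02,                # divided by 10 at 120k iterations
    "lr_decay_steps": [120_000],
    "weight_decay": 0.0001,
    "momentum": 0.9,
    "rpn_anchor_scales": 5,
    "rpn_aspect_ratios": 3,
}
```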
Inference: At test time, the proposal number is 300 for the C4 backbone (as in [36]) and 1000 for FPN (as in [27]). We run the box prediction branch on these proposals, followed by non-maximum suppression [14]. The mask branch is then applied to the highest scoring 100 detection boxes. Although this differs from the parallel computation used in training, it speeds up inference and improves accuracy (due to the use of fewer, more accurate RoIs). The mask branch can predict K masks per RoI, but we only use the k-th mask, where k is the predicted class by the classification branch. The m×m floating-number mask output is then resized to the RoI size, and binarized at a threshold of 0.5.
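A minimal sketch of the inference-time mask step described above, using assumed tensor shapes: after NMS, the top-scoring detections keep only the mask channel of their predicted class, and the m×m soft mask is resized to the box and binarized at 0.5. This is an illustration, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def postprocess_masks(mask_logits, pred_classes, box_sizes, threshold=0.5):
    """mask_logits: (N, K, m, m) mask branch outputs for the top detections
    pred_classes:   (N,) class k predicted by the box/classification branch
    box_sizes:      list of (h, w) output sizes, one per detection box
    Returns a list of binary uint8 masks, one per detection."""
    n = mask_logits.shape[0]
    probs = mask_logits[torch.arange(n), pred_classes].sigmoid()   # keep k-th mask only
    results = []
    for i, (h, w) in enumerate(box_sizes):
        m = F.interpolate(probs[i, None, None], size=(h, w),
                          mode="bilinear", align_corners=False)[0, 0]
        results.append((m >= threshold).to(torch.uint8))           # binarize at 0.5
    return results

# Hypothetical usage for 2 detections with 28×28 mask outputs:
out = postprocess_masks(torch.randn(2, 80, 28, 28),
                        torch.tensor([17, 3]), [(64, 48), (120, 90)])
print([o.shape for o in out])
```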
Figure 5. More results of Mask R-CNN on COCO test images, using ResNet-101-FPN and running at 5 fps, with 35.7 mask AP (Table 1).
Table 1. Instance segmentation mask AP on COCO test-dev. MNC [10] and FCIS [26] are the winners of the COCO 2015 and 2016 segmentation challenges, respectively. Without bells and whistles, Mask R-CNN outperforms the more complex FCIS+++, which includes multi-scale train/test, horizontal flip test, and OHEM [38]. All entries are single-model results.
Note that since we only compute masks on the top 100 detection boxes, Mask R-CNN adds a small overhead to its Faster R-CNN counterpart (e.g., ~20% on typical models).
4. Experiments: Instance Segmentation
We perform a thorough comparison of Mask R-CNN to the state of the art along with comprehensive ablations on the COCO dataset [28]. We report the standard COCO metrics including AP (averaged over IoU thresholds), AP50, AP75, and APS, APM, APL (AP at different scales). Unless noted, AP is evaluated using mask IoU. As in previous work [5, 27], we train using the union of 80k train images and a 35k subset of val images (trainval35k), and report ablations on the remaining 5k val images (minival). We also report results on test-dev [28].
4.1. Main Results
We compare Mask R-CNN to the state-of-the-art methods in instance segmentation in Table 1. All instantiations of our model outperform baseline variants of previous state-of-the-art models. This includes MNC [10] and FCIS [26], the winners of the COCO 2015 and 2016 segmentation challenges, respectively. Without bells and whistles, Mask R-CNN with ResNet-101-FPN backbone outperforms FCIS+++ [26], which includes multi-scale train/test, horizontal flip test, and online hard example mining (OHEM) [38]. While outside the scope of this work, we expect many such improvements to be applicable to ours.

Mask R-CNN outputs are visualized in Figures 2 and 5. Mask R-CNN achieves good results even under challenging conditions. In Figure 6 we compare our Mask R-CNN baseline and FCIS+++ [26]. FCIS+++ exhibits systematic artifacts on overlapping instances, suggesting that it is challenged by the fundamental difficulty of instance segmentation. Mask R-CNN shows no such artifacts.
Figure 6. FCIS+++ [26] (top) vs. Mask R-CNN (bottom, ResNet-101-FPN). FCIS exhibits systematic artifacts on overlapping objects.
(b) Multinomial vs. Independent Masks (ResNet-50-C4): Decoupling via per-class binary masks (sigmoid) gives large gains over multinomial masks (softmax).
(e) Mask Branch (ResNet-50-FPN): Fully convolutional networks (FCN) vs. multi-layer perceptrons (MLP, fully-connected) for mask prediction. FCNs improve results as they take advantage of explicitly encoding spatial layout.
Table 2. Ablations. We train on trainval35k, test on minival, and report mask AP unless otherwise noted.
(a) Backbone Architecture: Better backbones bring expected gains: deeper networks do better, FPN outperforms C4 features, and ResNeXt improves on ResNet.
(d) RoIAlign (ResNet-50-C5, stride 32): Mask-level and box-level AP using large-stride features. Misalignments are more severe than with stride-16 features (Table 2c), resulting in big accuracy gaps.
4.2. Ablation Experiments
We run a number of ablations to analyze Mask R-CNN. Results are shown in Table 2 and discussed in detail next.
Architecture: Table 2a shows Mask R-CNN with various backbones. It benefits from deeper networks (50 vs. 101) and advanced designs including FPN and ResNeXt. We note that not all frameworks automatically benefit from deeper or advanced networks (see benchmarking in [21]).
Multinomial vs. Independent Masks: Mask R-CNN decouples mask and class prediction: as the existing box branch predicts the class label, we generate a mask for each class without competition among classes (by a per-pixel sigmoid and a binary loss). In Table 2b, we compare this to using a per-pixel softmax and a multinomial loss (as commonly used in FCN [30]). This alternative couples the tasks of mask and class prediction, and results in a severe loss in mask AP (5.5 points). This suggests that once the instance has been classified as a whole (by the box branch), it is sufficient to predict a binary mask without concern for the categories, which makes the model easier to train.
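For contrast with the per-class sigmoid loss sketched in Section 3, the multinomial alternative evaluated in Table 2b corresponds to a per-pixel softmax over classes; the snippet below is an illustrative comparison with assumed shapes, not the paper's code.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 80, 28, 28)                 # per-RoI, per-class mask logits
pixel_labels = torch.randint(0, 80, (4, 28, 28))    # per-pixel class targets
binary_target = (torch.rand(4, 28, 28) > 0.5).float()
gt_class = torch.tensor([5, 12, 40, 7])

# Multinomial (FCN-style): per-pixel softmax couples segmentation and classification.
loss_softmax = F.cross_entropy(logits, pixel_labels)

# Independent (Mask R-CNN): per-pixel sigmoid on the ground-truth class channel only.
logits_k = logits[torch.arange(4), gt_class]
loss_sigmoid = F.binary_cross_entropy_with_logits(logits_k, binary_target)
```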
(c) RoIAlign (ResNet-50-C4): Mask results with various RoI layers. Our RoIAlign layer improves AP by ~3 points and AP75 by ~5 points. Using proper alignment is the only factor that contributes to the large gap between RoI layers.

Class-Specific vs. Class-Agnostic Masks: Our default instantiation predicts class-specific masks, i.e., one mask per class. Interestingly, Mask R-CNN with class-agnostic masks (i.e., predicting a single output regardless of class) is nearly as effective: it has 29.7 mask AP vs. 30.3 for the class-specific counterpart on ResNet-50-C4. This further highlights the division of labor in our approach which largely decouples classification and segmentation.
RoIAlign: An evaluation of our proposed RoIAlign layer is shown in Table 2c. For this experiment we use the ResNet-50-C4 backbone, which has stride 16. RoIAlign improves AP by about 3 points over RoIPool, with much of the gain coming at high IoU (AP75). RoIAlign is insensitive to max/average pool; we use average in the rest of the paper. Additionally, we compare with RoIWarp proposed in MNC [10] that also adopts bilinear sampling. As discussed in §3, RoIWarp still quantizes the RoI, losing alignment with the input. As can be seen in Table 2c, RoIWarp performs on par with RoIPool and much worse than RoIAlign. This highlights that proper alignment is key.
We also evaluate RoIAlign with a ResNet-50-C5 backbone, which has an even larger stride of 32 pixels. We use the same head as in Figure 4 (right), as the res5 head is not applicable. Table 2d shows that RoIAlign improves mask AP by a massive 7.3 points, and mask AP75 by 10.5 points (50% relative improvement). Moreover, we note that with RoIAlign, using stride-32 C5 features (30.9 AP) is more accurate than using stride-16 C4 features (30.3 AP, Table 2c). RoIAlign largely resolves the long-standing challenge of using large-stride features for detection and segmentation. Finally, RoIAlign shows a gain of 1.5 mask AP and 0.5 box AP when used with FPN, which has finer multi-level strides. For keypoint detection that requires finer alignment, RoIAlign shows large gains even with FPN (Table 6).
Mask Branch: Segmentation is a pixel-to-pixel task and we exploit the spatial layout of masks by using an FCN. In Table 2e, we compare multi-layer perceptrons (MLP) and FCNs, using a ResNet-50-FPN backbone. Using FCNs gives a 2.1 mask AP gain over MLPs. We note that we choose this backbone so that the conv layers of the FCN head are not pre-trained, for a fair comparison with MLP.
4.3. Bounding Box Detection Results
We compare Mask R-CNN to the state-of-the-art COCO bounding-box object detection in Table 3. For this result, even though the full Mask R-CNN model is trained, only the classification and box outputs are used at inference (the mask output is ignored). Mask R-CNN using ResNet-101-FPN outperforms the base variants of all previous state-of-the-art models, including the single-model variant of G-RMI [21], the winner of the COCO 2016 Detection Challenge. Using ResNeXt-101-FPN, Mask R-CNN further improves results, with a margin of 3.0 points box AP over the best previous single model entry from [39] (which used Inception-ResNet-v2-TDM).
As a further comparison, we trained a version of Mask R-CNN but without the mask branch, denoted by "Faster R-CNN, RoIAlign" in Table 3. This model performs better than the model presented in [27] due to RoIAlign. On the other hand, it is 0.9 points box AP lower than Mask R-CNN. This gap of Mask R-CNN on box detection is therefore due solely to the benefits of multi-task training.
Lastly, we note that Mask R-CNN attains a small gap between its mask and box AP: e.g., 2.7 points between 37.1 (mask, Table 1) and 39.8 (box, Table 3). This indicates that our approach largely closes the gap between object detection and the more challenging instance segmentation task.
4.4. Timing
Inference: We train a ResNet-101-FPN model that shares features between the RPN and Mask R-CNN stages, following the 4-step training of Faster R-CNN [36]. This model runs at 195ms per image on an Nvidia Tesla M40 GPU (plus 15ms CPU time resizing the outputs to the original resolution), and achieves statistically the same mask AP as the unshared one. We also report that the ResNet-101-C4 variant takes ~400ms as it has a heavier box head (Figure 4), so we do not recommend using the C4 variant in practice.
Although Mask R-CNN is fast, we note that our design is not optimized for speed, and better speed/accuracy tradeoffs could be achieved [21], e.g., by varying image sizes and proposal numbers, which is beyond the scope of this paper.
Training: Mask R-CNN is also fast to train. Training with ResNet-50-FPN on COCO trainval35k takes 32 hours in our synchronized 8-GPU implementation (0.72s per 16-image mini-batch), and 44 hours with ResNet-101-FPN. In fact, fast prototyping can be completed in less than one day when training on the train set. We hope such rapid training will remove a major hurdle in this area and encourage more people to perform research on this challenging topic.
5. Mask R-CNN for Human Pose Estimation
Our framework can easily be extended to human pose estimation. We model a keypoint's location as a one-hot mask, and adopt Mask R-CNN to predict K masks, one for each of K keypoint types (e.g., left shoulder, right elbow). This task helps demonstrate the flexibility of Mask R-CNN. We note that minimal domain knowledge for human pose is exploited by our system, as the experiments are mainly to demonstrate the generality of the Mask R-CNN framework. We expect that domain knowledge (e.g., modeling structures [6]) will be complementary to our simple approach.
Figure 7. Keypoint detection results on COCO test using Mask R-CNN (ResNet-50-FPN), with person segmentation masks predicted from the same model. This model has a keypoint AP of 63.1 and runs at 5 fps.

Table 4. Keypoint detection AP on COCO test-dev. Ours is a single model (ResNet-50-FPN) that runs at 5 fps. CMU-Pose+++ [6] is the 2016 competition winner that uses multi-scale testing, post-processing with CPM [44], and filtering with an object detector, adding a cumulative ~5 points (clarified in personal communication). †: G-RMI was trained on COCO plus MPII [1] (25k images), using two models (Inception-ResNet-v2 for bounding box detection and ResNet-101 for keypoints).

Implementation Details: We make minor modifications to the segmentation system when adapting it for keypoints. For each of the K keypoints of an instance, the training target is a one-hot binary mask where only a single pixel is labeled as foreground. During training, for each visible ground-truth keypoint, we minimize the cross-entropy loss over an m²-way softmax output (which encourages a single point to be detected). We note that as in instance segmentation, the K keypoints are still treated independently. We adopt the ResNet-FPN variant, and the keypoint head architecture is similar to that in Figure 4 (right). The keypoint head consists of a stack of eight 3×3 512-d conv layers, followed by a deconv layer and 2× bilinear upscaling, producing an output resolution of 56×56. We found that a relatively high resolution output (compared to masks) is required for keypoint-level localization accuracy.
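A minimal PyTorch sketch of the keypoint variant described above: eight 3×3 512-d convs, a deconv, 2× bilinear upscaling to 56×56, and a cross-entropy loss over the flattened 56×56 output that encourages a single foreground pixel per keypoint. Shapes and names are illustrative assumptions, not the reference implementation.

```python
import torch
from torch import nn
import torch.nn.functional as F

class KeypointHead(nn.Module):
    """Keypoint branch: 8×(3×3, 512) convs -> deconv -> 2× bilinear upscale."""
    def __init__(self, in_channels: int = 256, num_keypoints: int = 17):
        super().__init__()
        layers = []
        for i in range(8):
            layers += [nn.Conv2d(in_channels if i == 0 else 512, 512, 3, padding=1),
                       nn.ReLU(inplace=True)]
        self.convs = nn.Sequential(*layers)
        self.deconv = nn.ConvTranspose2d(512, num_keypoints, kernel_size=2, stride=2)

    def forward(self, x):
        x = self.deconv(self.convs(x))                        # e.g. 14 -> 28
        return F.interpolate(x, scale_factor=2.0,             # 28 -> 56
                             mode="bilinear", align_corners=False)

def keypoint_loss(logits, target_idx):
    """Softmax cross-entropy over the 56*56 locations of each keypoint; in practice
    only visible keypoints would contribute (not filtered in this sketch).
    logits: (N, K, 56, 56); target_idx: (N, K) flattened index of the labeled pixel."""
    n, k, h, w = logits.shape
    return F.cross_entropy(logits.view(n * k, h * w), target_idx.view(n * k))

feats = torch.randn(2, 256, 14, 14)
logits = KeypointHead()(feats)                                # (2, 17, 56, 56)
print(keypoint_loss(logits, torch.randint(0, 56 * 56, (2, 17))))
```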
Models are trained on all COCO trainval35k images that contain annotated keypoints. To reduce overfitting, as this training set is smaller, we train using image scales randomly sampled from [640, 800] pixels; inference is on a single scale of 800 pixels. We train for 90k iterations, starting from a learning rate of 0.02 and reducing it by 10 at 60k and 80k iterations. We use bounding-box NMS with a threshold of 0.5. Other details are identical as in §3.1.
Main Results and Ablations: We evaluate the person keypoint AP (APkp) and experiment with a ResNet-50-FPN backbone; more backbones will be studied in the appendix. Table 4 shows that our result (62.7 APkp) is 0.9 points higher than the COCO 2016 keypoint detection winner [6] that uses a multi-stage processing pipeline (see caption of Table 4). Our method is considerably simpler and faster.
More importantly, we have a unified model that can simultaneously predict boxes, segments, and keypoints while running at 5 fps. Adding a segment branch (for the person category) improves the APkp to 63.1 (Table 4) on test-dev.

More ablations of multi-task learning on minival are in Table 5. Adding the mask branch to the box-only (i.e., Faster R-CNN) or keypoint-only versions consistently improves these tasks. However, adding the keypoint branch reduces the box/mask AP slightly, suggesting that while keypoint detection benefits from multi-task training, it does not in turn help the other tasks. Nevertheless, learning all three tasks jointly enables a unified system to efficiently predict all outputs simultaneously (Figure 7).

We also investigate the effect of RoIAlign on keypoint detection (Table 6). Though this ResNet-50-FPN backbone has finer strides (e.g., 4 pixels on the finest level), RoIAlign still shows significant improvement over RoIPool and increases APkp by 4.4 points. This is because keypoint detections are more sensitive to localization accuracy. This again indicates that alignment is essential for pixel-level localization, including masks and keypoints.
Table 5. Multi-task learning of box, mask, and keypoint about the person category, evaluated on minival. All entries are trained on the same data for fair comparisons. The backbone is ResNet50-FPN. The entries with 64.2 and 64.7 AP on minival have test-dev AP of 62.7 and 63.1, respectively (see Table 4).
Table 6. RoIAlign vs. RoIPool for keypoint detection on minival. The backbone is ResNet-50-FPN.
Given the effectiveness of Mask R-CNN for extracting object bounding boxes, masks, and keypoints, we expect it to be an effective framework for other instance-level tasks.
Appendix A: Experiments on Cityscapes
We further report instance segmentation results on the Cityscapes [7] dataset. This dataset has fine annotations for 2975 train, 500 val, and 1525 test images. It has 20k coarse training images without instance annotations, which we do not use. All images are 2048×1024 pixels. The instance segmentation task involves 8 object categories, whose numbers of instances on the fine training set are: Instance segmentation performance on this task is measured by the COCO-style mask AP (averaged over IoU thresholds); AP50 (i.e., mask AP at an IoU of 0.5) is also reported.
Implementation: We apply our Mask R-CNN models with the ResNet-FPN-50 backbone; we found the 101-layer counterpart performs similarly due to the small dataset size. We train with image scale (shorter side) randomly sampled from [800, 1024], which reduces overfitting; inference is on a single scale of 1024 pixels. We use a mini-batch size of 1 image per GPU (so 8 on 8 GPUs) and train the model for 24k iterations, starting from a learning rate of 0.01 and reducing it to 0.001 at 18k iterations. It takes ~4 hours of training on a single 8-GPU machine under this setting.
Results: Table 7 compares our results to the state of the art on the val and test sets. Without using the coarse training set, our method achieves 26.2 AP on test, which is over 30% relative improvement over the previous best entry (DIN [3]), and is also better than the concurrent work of SGN’s 25.0 [29]. Both DIN and SGN use fine + coarse data. Compared to the best entry using fine data only (17.4 AP), we achieve a ~50% improvement.
For the person and car categories, the Cityscapes dataset exhibits a large number of within-category overlapping instances (on average 6 people and 9 cars per image). We argue that within-category overlap is a core difficulty of instance segmentation. Our method shows massive improvement on these two categories over the other best entries (relative ~40% improvement on person from 21.8 to 30.5 and ~20% improvement on car from 39.4 to 46.9), even though our method does not exploit the coarse data.
A main challenge of the Cityscapes dataset is training models in a low-data regime, particularly for the categories of truck, bus, and train, which have about 200-500 training samples each. To partially remedy this issue, we further report a result using COCO pre-training. To do this, we initialize the corresponding 7 categories in Cityscapes from a pre-trained COCO Mask R-CNN model (rider being randomly initialized). We fine-tune this model for 4k iterations in which the learning rate is reduced at 3k iterations, which takes ~1 hour for training given the COCO model.
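The fine-tuning recipe above can be sketched roughly as follows, using torchvision's Mask R-CNN as a stand-in model (the paper's own implementation, the starting learning rate, and the weight-transfer filter are assumptions for illustration only).

```python
import torch
import torchvision

# Hypothetical sketch: start Cityscapes training from COCO-pretrained weights and keep
# every parameter whose shape already matches; the class-dependent output layers (which
# cover the 'rider' category with no COCO counterpart) keep their random initialization.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(
    weights=None, num_classes=9)                       # 8 Cityscapes classes + background
coco = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")

src, dst = coco.state_dict(), model.state_dict()
dst.update({k: v for k, v in src.items()
            if k in dst and v.shape == dst[k].shape})  # shape-mismatched heads stay random
model.load_state_dict(dst)

# Short schedule: 4k iterations, learning rate dropped by 10x at 3k (initial lr assumed).
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[3000], gamma=0.1)
```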
Figure 8. Mask R-CNN results on Cityscapes test (32.0 AP). The bottom-right image shows a failure prediction.
The COCO pre-trained Mask R-CNN model achieves 32.0 AP on test, almost a 6 point improvement over the fine-only counterpart. This indicates the important role the amount of training data plays. It also suggests that methods on Cityscapes might be influenced by their low-shot learning performance. We show that using COCO pre-training is an effective strategy on this dataset.
Finally, we observed a bias between the val and test AP, as is also observed from the results of [23, 4, 29]. We found that this bias is mainly caused by the truck, bus, and train categories, with the fine-only model having val/test AP of 28.8/22.8, 53.5/32.2, and 33.0/18.6, respectively. This suggests that there is a domain shift on these categories, which also have little training data. COCO pre-training helps to improve results the most on these categories; however, the domain shift persists with 38.0/30.1, 57.5/40.9, and 41.2/30.9 val/test AP, respectively. Note that for the person and car categories we do not see any such bias (val/test AP are within ±1 point).
Example results on Cityscapes are shown in Figure 8.
Table 8. Enhanced detection results of Mask R-CNN on COCO minival. Each row adds an extra component to the above row. We denote ResNeXt model by ‘X’ for notational brevity.
Appendix B: Enhanced Results on COCO
As a general framework, Mask R-CNN is compatible with complementary techniques developed for detection/segmentation, including improvements made to Fast/Faster R-CNN and FCNs. In this appendix we describe some techniques that improve over our original results. Thanks to its generality and flexibility, Mask R-CNN was used as the framework by the three winning teams in the COCO 2017 instance segmentation competition, which all significantly outperformed the previous state of the art.
Instance Segmentation and Object Detection
We report some enhanced results of Mask R-CNN in Table 8. Overall, the improvements increase mask AP 5.1 points (from 36.7 to 41.8) and box AP 7.7 points (from 39.6 to 47.3). Each model improvement increases both mask AP and box AP consistently, showing good generalization of the Mask R-CNN framework. We detail the improvements next. These results, along with future updates, can be reproduced by our released code at https://github.com/facebookresearch/Detectron, and can serve as higher baselines for future research.
Updated baseline: We start with an updated baseline with a different set of hyper-parameters. We lengthen the training to 180k iterations, in which the learning rate is reduced by 10 at 120k and 160k iterations. We also change the NMS threshold to 0.5 (from a default value of 0.3). The updated baseline has 37.0 mask AP and 40.5 box AP.
End-to-end training: All previous results used stage-wise training, i.e., training RPN as the first stage and Mask R-CNN as the second. Following [37], we evaluate end-to-end ('e2e') training that jointly trains RPN and Mask R-CNN. We adopt the 'approximate' version in [37] that only computes partial gradients in the RoIAlign layer by ignoring the gradient w.r.t. RoI coordinates. Table 8 shows that e2e training improves mask AP by 0.6 and box AP by 1.2.

ImageNet-5k pre-training: Following [45], we experiment with models pre-trained on a 5k-class subset of ImageNet (in contrast to the standard 1k-class subset). This 5× increase in pre-training data improves both mask and box AP by 1 point. As a reference, [40] used ~250× more images (300M) and reported a 2-3 box AP improvement on their baselines.
Table 9. Enhanced keypoint results of Mask R-CNN on COCO minival. Each row adds an extra component to the above row. Here we use only keypoint annotations but no mask annotations. We denote ResNet by ‘R’ and ResNeXt by ‘X’ for brevity.
Train-time augmentation: Scale augmentation at train time further improves results. During training, we randomly sample a scale from [640, 800] pixels and we increase the number of iterations to 260k (with the learning rate reduced by 10 at 200k and 240k iterations). Train-time augmentation improves mask AP by 0.6 and box AP by 0.8.
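A minimal sketch of this train-time scale jitter, with names invented for illustration: each training image is resized so that its shorter side matches a scale drawn uniformly from [640, 800].

```python
import random
from PIL import Image

def random_scale_resize(img: Image.Image, short_side_range=(640, 800),
                        max_long_side=1333) -> Image.Image:
    """Resize so the shorter edge equals a randomly sampled scale (train-time jitter)."""
    target = random.randint(*short_side_range)
    w, h = img.size
    scale = target / min(w, h)
    # Optionally cap the longer side (a common convention; an assumption here).
    scale = min(scale, max_long_side / max(w, h))
    return img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
```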
Model architecture: By upgrading the 101-layer ResNeXt to its 152-layer counterpart [19], we observe an increase of 0.5 mask AP and 0.6 box AP. This shows a deeper model can still improve results on COCO.
Using the recently proposed non-local (NL) model [43], we achieve 40.3 mask AP and 45.0 box AP. This result is without test-time augmentation, and the method runs at 3fps on an Nvidia Tesla P100 GPU at test time.
Test-time augmentation: We combine the model results evaluated using scales of [400, 1200] pixels with a step of 100 and on their horizontal flips. This gives us a single-model result of 41.8 mask AP and 47.3 box AP.
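The test-time augmentation grid described above amounts to evaluating each image at shorter-side scales 400, 500, ..., 1200 with and without a horizontal flip and merging the detections; below is a sketch of enumerating those settings (the merging logic is omitted and the callable name is assumed).

```python
# Enumerate the test-time augmentation settings: scales [400, 1200] in steps of 100,
# each with and without a horizontal flip (18 forward passes per image).
TTA_SETTINGS = [(scale, flip)
                for scale in range(400, 1201, 100)
                for flip in (False, True)]

def tta_predict(image, run_model):
    """`run_model(image, scale, flip)` is an assumed callable returning detections;
    results from all settings would then be merged (e.g., by box voting / NMS)."""
    return [run_model(image, scale, flip) for scale, flip in TTA_SETTINGS]
```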
The above result is the foundation of our submission to the COCO 2017 competition (which also used an ensemble, not discussed here). The first three winning teams for the instance segmentation task were all reportedly based on an extension of the Mask R-CNN framework.
Keypoint Detection
We report enhanced results of keypoint detection in Table 9. As an updated baseline, we extend the training schedule to 130k iterations in which the learning rate is reduced by 10 at 100k and 120k iterations. This improves APkp by about 1 point. Replacing ResNet-50 with ResNet-101 and ResNeXt-101 increases APkp to 66.1 and 67.3, respectively.

With a recent method called data distillation [35], we are able to exploit the additional 120k unlabeled images provided by COCO. In brief, data distillation is a self-training strategy that uses a model trained on labeled data to predict annotations on unlabeled images, and in turn updates the model with these new annotations. Mask R-CNN provides an effective framework for such a self-training strategy. With data distillation, Mask R-CNN APkp improves by 1.8 points to 69.1. We observe that Mask R-CNN can benefit from extra data, even if that data is unlabeled.
By using the same test-time augmentation as used for instance segmentation, we further boost APkp to 70.4.
Acknowledgements: We would like to acknowledge Ilija Radosavovic for contributions to code release and enhanced results, and the Caffe2 team for engineering support.
References
[1] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2D human pose estimation: New benchmark and state of the art analysis. In CVPR, 2014.
[2] P. Arbeláez, J. Pont-Tuset, J. T. Barron, F. Marques, and J. Malik. Multiscale combinatorial grouping. In CVPR, 2014.
[3] A. Arnab and P. H. Torr. Pixelwise instance segmentation with a dynamically instantiated network. In CVPR, 2017.
[4] M. Bai and R. Urtasun. Deep watershed transform for instance segmentation. In CVPR, 2017.
[5] S. Bell, C. L. Zitnick, K. Bala, and R. Girshick. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In CVPR, 2016.
[6] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2D pose estimation using part affinity fields. In CVPR, 2017.
[7] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
[8] J. Dai, K. He, Y. Li, S. Ren, and J. Sun. Instance-sensitive fully convolutional networks. In ECCV, 2016.
[9] J. Dai, K. He, and J. Sun. Convolutional feature masking for joint object and stuff segmentation. In CVPR, 2015.
[10] J. Dai, K. He, and J. Sun. Instance-aware semantic segmentation via multi-task network cascades. In CVPR, 2016.
[11] J. Dai, Y. Li, K. He, and J. Sun. R-FCN: Object detection via region-based fully convolutional networks. In NIPS, 2016.
[12] R. Girshick. Fast R-CNN. In ICCV, 2015.
[13] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[14] R. Girshick, F. Iandola, T. Darrell, and J. Malik. Deformable part models are convolutional neural networks. In CVPR, 2015.
[15] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Simultaneous detection and segmentation. In ECCV, 2014.
[16] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In CVPR, 2015.
[17] Z. Hayder, X. He, and M. Salzmann. Shape-aware instance segmentation. In CVPR, 2017.
[18] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.
[19] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[20] J. Hosang, R. Benenson, P. Dollár, and B. Schiele. What makes for effective detection proposals? PAMI, 2015.
[21] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. In CVPR, 2017.
[22] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In NIPS, 2015.
[23] A. Kirillov, E. Levinkov, B. Andres, B. Savchynskyy, and C. Rother. InstanceCut: From edges to instances with multicut. In CVPR, 2017.
[24] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[25] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989.
[26] Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei. Fully convolutional instance-aware semantic segmentation. In CVPR, 2017.
[27] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
[28] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[29] S. Liu, J. Jia, S. Fidler, and R. Urtasun. SGN: Sequential grouping networks for instance segmentation. In ICCV, 2017.
[30] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[31] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
[32] G. Papandreou, T. Zhu, N. Kanazawa, A. Toshev, J. Tompson, C. Bregler, and K. Murphy. Towards accurate multi-person pose estimation in the wild. In CVPR, 2017.
[33] P. O. Pinheiro, R. Collobert, and P. Dollár. Learning to segment object candidates. In NIPS, 2015.
[34] P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollár. Learning to refine object segments. In ECCV, 2016.
[35] I. Radosavovic, P. Dollár, R. Girshick, G. Gkioxari, and K. He. Data distillation: Towards omni-supervised learning. arXiv:1712.04440, 2017.
[36] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[37] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In TPAMI, 2017.
[38] A. Shrivastava, A. Gupta, and R. Girshick. Training region-based object detectors with online hard example mining. In CVPR, 2016.
[39] A. Shrivastava, R. Sukthankar, J. Malik, and A. Gupta. Beyond skip connections: Top-down modulation for object detection. arXiv:1612.06851, 2016.
[40] C. Sun, A. Shrivastava, S. Singh, and A. Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In ICCV, 2017.
[41] C. Szegedy, S. Ioffe, and V. Vanhoucke. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In ICLR Workshop, 2016.
[42] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. IJCV, 2013.
[43] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. arXiv:1711.07971, 2017.
[44] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In CVPR, 2016.
[45] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.