目标检测经典论文——Fast R-CNN论文翻译(中英文对照版):Fast R-CNN(Ross Girshick, Microsoft Research(微软研究院))
目标检测经典论文翻译汇总:[翻译汇总]
翻译pdf文件下载:[下载地址]
此版为中英文对照版,纯中文版请移步:[Fast R-CNN纯中文版]
Fast R-CNN | Ross Girshick | Microsoft Research(微软研究院) | rbg@microsoft.com
Abstract
This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection. Fast R-CNN builds on previous work to efficiently classify object proposals using deep convolutional networks. Compared to previous work, Fast R-CNN employs several innovations to improve training and testing speed while also increasing detection accuracy. Fast R-CNN trains the very deep VGG16 network 9× faster than R-CNN, is 213× faster at test-time, and achieves a higher mAP on PASCAL VOC 2012. Compared to SPPnet, Fast R-CNN trains VGG16 3× faster, tests 10× faster, and is more accurate. Fast R-CNN is implemented in Python and C++ (using Caffe) and is available under the open-source MIT License at https: //github.com/rbgirshick/fast-rcnn.
摘要
本文提出了一种快速的基于区域的卷积网络方法(Fast R-CNN)用于目标检测。Fast R-CNN建立在先前工作的基础上,使用深度卷积网络高效地对目标候选框进行分类。与之前的工作相比,Fast R-CNN采用了多项创新,在提高训练和测试速度的同时也提高了检测精度。Fast R-CNN训练非常深的VGG16网络比R-CNN快9倍,测试时快213倍,并在PASCAL VOC 2012上得到了更高的mAP。与SPPnet相比,Fast R-CNN训练VGG16快3倍,测试快10倍,并且更准确。Fast R-CNN的Python和C++(使用Caffe)实现以MIT开源许可证发布在:https://github.com/rbgirshick/fast-rcnn。
1. Introduction
Recently, deep ConvNets [14, 16] have significantly improved image classification [14] and object detection [9, 19] accuracy. Compared to image classification, object detection is a more challenging task that requires more complex methods to solve. Due to this complexity, current approaches (e.g., [9, 11, 19, 25]) train models in multi-stage pipelines that are slow and inelegant.
1. 引言
最近,深度卷积网络[14, 16]已经显著提高了图像分类[14]和目标检测[9, 19]的准确性。与图像分类相比,目标检测是一个更具挑战性的任务,需要更复杂的方法来解决。由于这种复杂性,当前的方法(例如[9, 11, 19, 25])采用多阶段pipeline训练模型,既慢又不够简洁。
Complexity arises because detection requires the accurate localization of objects, creating two primary challenges. First, numerous candidate object locations (often called “proposals”) must be processed. Second, these candidates provide only rough localization that must be refined to achieve precise localization. Solutions to these problems often compromise speed, accuracy, or simplicity.
复杂性的产生是因为检测需要目标的精确定位,这带来两个主要难点。首先,必须处理大量候选目标位置(通常称为“proposals”)。第二,这些候选框仅提供粗略定位,必须对其进行精修才能实现精确定位。这些问题的解决方案经常要在速度、准确性或简洁性之间做出妥协。
In this paper, we streamline the training process for state-of-the-art ConvNet-based object detectors [9, 11]. We propose a single-stage training algorithm that jointly learns to classify object proposals and refine their spatial locations.
在本文中,我们简化了最先进的基于卷积网络的目标检测器[9, 11]的训练过程。我们提出一个单阶段训练算法,联合学习候选框分类和修正它们的空间位置。
The resulting method can train a very deep detection network (VGG16 [20]) 9× faster than R-CNN [9] and 3× faster than SPPnet [11]. At runtime, the detection network processes images in 0.3s (excluding object proposal time) while achieving top accuracy on PASCAL VOC 2012 [7] with a mAP of 66% (vs. 62% for R-CNN).
最终方法能够训练非常深的检测网络(VGG16 [20]),比R-CNN[9]快9倍,比SPPnet[11]快3倍。在运行时,检测网络处理每张图像仅需0.3秒(不包括候选框生成时间),同时在PASCAL VOC 2012[7]上达到最高准确度,mAP为66%(R-CNN为62%)(注:所有时间均在一块超频至875MHz的Nvidia K40 GPU上测得)。
1.1. R-CNN and SPPnet
The Region-based Convolutional Network method (R-CNN) [9] achieves excellent object detection accuracy by using a deep ConvNet to classify object proposals. R-CNN, however, has notable drawbacks:
1. Training is a multi-stage pipeline. R-CNN first fine-tunes a ConvNet on object proposals using log loss. Then, it fits SVMs to ConvNet features. These SVMs act as object detectors, replacing the softmax classifier learnt by fine-tuning. In the third training stage, bounding-box regressors are learned.
2. Training is expensive in space and time. For SVM and bounding-box regressor training, features are extracted from each object proposal in each image and written to disk. With very deep networks, such as VGG16, this process takes 2.5 GPU-days for the 5k images of the VOC07 trainval set. These features require hundreds of gigabytes of storage.
3. Object detection is slow. At test-time, features are extracted from each object proposal in each test image. Detection with VGG16 takes 47s / image (on a GPU).
1.1. R-CNN與SPPnet
基于区域的卷积网络方法(R-CNN)[9]通过使用深度卷积网络对目标候选框进行分类,获得了很高的目标检测精度。然而,R-CNN具有明显的缺点:
1. 训练过程是多阶段pipeline。R-CNN首先在目标候选框上使用log损失对卷积神经网络进行fine-tune。然后,用卷积神经网络提取的特征来训练SVM。这些SVM作为目标检测器,替代通过fine-tune学习到的softmax分类器。在第三个训练阶段,学习bounding-box回归器。
2. 训练在时间和空间上的开销很大。对于SVM和bounding-box回归器的训练,需要从每张图像中的每个目标候选框提取特征并写入磁盘。对于VOC07 trainval上的5k张图像,使用VGG16这样非常深的网络时,这个过程需要2.5个GPU·天。这些特征需要数百GB的存储空间。
3. 目标检测速度很慢。在测试时,要从每张测试图像中的每个目标候选框提取特征。用VGG16检测时,每张图像需要47秒(在GPU上)。
R-CNN is slow because it performs a ConvNet forward pass for each object proposal, without sharing computation. Spatial pyramid pooling networks (SPPnets) [11] were proposed to speed up R-CNN by sharing computation. The SPPnet method computes a convolutional feature map for the entire input image and then classifies each object proposal using a feature vector extracted from the shared feature map. Features are extracted for a proposal by max-pooling the portion of the feature map inside the proposal into a fixed-size output (e.g., 6×6). Multiple output sizes are pooled and then concatenated as in spatial pyramid pooling [15]. SPPnet accelerates R-CNN by 10 to 100× at test time. Training time is also reduced by 3× due to faster proposal feature extraction.
R-CNN很慢,是因为它对每个目标候选框都执行一次卷积神经网络前向传播,而没有共享计算。空间金字塔池化网络(SPPnet)[11]被提出来通过共享计算加速R-CNN。SPPnet方法对整幅输入图像计算一个卷积特征图,然后使用从共享特征图中提取的特征向量对每个候选框进行分类。提取候选框特征的方法是,将候选框内对应的那部分特征图最大池化为固定大小的输出(例如6×6)。与空间金字塔池化[15]一样,池化多个输出尺寸,然后将它们连接起来。SPPnet在测试时将R-CNN加速10到100倍。由于候选框特征提取更快,训练时间也减少了3倍。
SPPnet also has notable drawbacks. Like R-CNN, training is a multi-stage pipeline that involves extracting features, fine-tuning a network with log loss, training SVMs, and finally fitting bounding-box regressors. Features are also written to disk. But unlike R-CNN, the fine-tuning algorithm proposed in [11] cannot update the convolutional layers that precede the spatial pyramid pooling. Unsurprisingly, this limitation (fixed convolutional layers) limits the accuracy of very deep networks.
SPPnet也有显著的缺点。像R-CNN一样,其训练过程是一个多阶段pipeline,包括提取特征、使用log损失对网络进行fine-tune、训练SVM,以及最后拟合bounding-box回归器。特征同样要写入磁盘。但与R-CNN不同,[11]中提出的fine-tune算法不能更新空间金字塔池化之前的卷积层。不出所料,这种局限性(固定的卷积层)限制了非常深的网络的精度。
1.2. Contributions
We propose a new training algorithm that fixes the disadvantages of R-CNN and SPPnet, while improving on their speed and accuracy. We call this method Fast R-CNN because it’s comparatively fast to train and test. The Fast RCNN method has several advantages:
1. Higher detection quality (mAP) than R-CNN, SPPnet
2. Training is single-stage, using a multi-task loss
3. Training can update all network layers
4. No disk storage is required for feature caching
1.2. 貢獻
我们提出一种新的训练算法,修正了R-CNN和SPPnet的缺点,同时提高了它们的速度和准确性。因为它训练和测试都比较快,我们称之为Fast R-CNN。Fast R-CNN方法有以下几个优点:
1. 比R-CNN和SPPnet具有更高的目标检测精度(mAP)。
2. 训练是使用多任务损失的单阶段训练。
3. 训练可以更新所有网络层的参数。
4. 不需要磁盘空间缓存特征。
Fast R-CNN is written in Python and C++ (Caffe [13]) and is available under the open-source MIT License at https://github.com/rbgirshick/fast-rcnn.
Fast R-CNN使用Python和C++(Caffe[13])編寫,以MIT開源許可證發(fā)布在:https://github.com/rbgirshick/fast-rcnn。
2. Fast R-CNN architecture and training
Fig. 1 illustrates the Fast R-CNN architecture. A Fast R-CNN network takes as input an entire image and a set of object proposals. The network first processes the whole image with several convolutional (conv) and max pooling layers to produce a conv feature map. Then, for each object proposal a region of interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map. Each feature vector is fed into a sequence of fully connected (fc) layers that finally branch into two sibling output layers: one that produces softmax probability estimates over K object classes plus a catch-all “background” class and another layer that outputs four real-valued numbers for each of the K object classes. Each set of 4 values encodes refined bounding-box positions for one of the K classes.
Figure 1. Fast R-CNN architecture. An input image and multiple regions of interest (RoIs) are input into a fully convolutional network. Each RoI is pooled into a fixed-size feature map and then mapped to a feature vector by fully connected layers (FCs). The network has two output vectors per RoI: softmax probabilities and per-class bounding-box regression offsets. The architecture is trained end-to-end with a multi-task loss.
2. Fast R-CNN架構與訓練
图1展示了Fast R-CNN的架构。Fast R-CNN网络将整幅图像和一组候选框作为输入。网络首先用若干卷积层(conv)和最大池化层处理整幅图像,产生卷积特征图。然后,对于每个候选框,RoI池化层从特征图中提取一个固定长度的特征向量。每个特征向量被送入一系列全连接(fc)层,最终分支为两个同级输出层:一个层输出K个目标类别加1个“背景”类别上的softmax概率估计,另一个层为K个目标类别中的每一类输出4个实数值。每组4个值编码K个类别中某一类修正后的检测框位置。
图1. Fast R-CNN架构。输入图像和多个感兴趣区域(RoI)被输入到一个全卷积网络中。每个RoI被池化为固定大小的特征图,然后通过全连接层(FC)映射为特征向量。网络对每个RoI有两个输出向量:softmax概率和每类bounding-box回归偏移量。该架构使用多任务损失进行端到端训练。
2.1. The RoI pooling layer
The RoI pooling layer uses max pooling to convert the features inside any valid region of interest into a small feature map with a fixed spatial extent of H×W (e.g., 7×7), where H and W are layer hyper-parameters that are independent of any particular RoI. In this paper, an RoI is a rectangular window into a conv feature map. Each RoI is defined by a four-tuple (r, c, h, w) that specifies its top-left corner (r, c) and its height and width (h, w).
2.1. RoI池化層
RoI池化层使用最大池化,将任何有效的RoI内的特征转换成一个具有固定空间尺寸H×W(例如7×7)的小特征图,其中H和W是层的超参数,与任何特定的RoI无关。在本文中,RoI是卷积特征图中的一个矩形窗口。每个RoI由一个四元组(r, c, h, w)定义,指定其左上角(r, c)及其高度和宽度(h, w)。
RoI max pooling works by dividing the h×w RoI window into an H ×W grid of sub-windows of approximate size h/H×w/W and then max-pooling the values in each sub-window into the corresponding output grid cell. Pooling is applied independently to each feature map channel, as in standard max pooling. The RoI layer is simply the special-case of the spatial pyramid pooling layer used in SPPnets [11] in which there is only one pyramid level. We use the pooling sub-window calculation given in [11].
RoI最大池化的做法是:将h×w的RoI窗口划分为H×W个子窗口构成的网格,每个子窗口大小约为h/H×w/W,然后对每个子窗口内的值做最大池化,结果放入对应的输出网格单元。与标准最大池化一样,池化独立地应用于每个特征图通道。RoI层只是SPPnet[11]中空间金字塔池化层的特例,即只有一个金字塔层。我们使用[11]中给出的池化子窗口计算方法。
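(译者注:下面给出一段仅作示意的NumPy代码,用来说明上述RoI最大池化的子窗口划分与逐通道取最大的过程;函数与变量名均为假设,并非论文的官方实现。)

```python
import numpy as np

def roi_max_pool(feature_map, roi, H=7, W=7):
    """对单个RoI做最大池化的示意实现(非官方代码)。
    feature_map: (C, Hf, Wf)的卷积特征图;roi: (r, c, h, w),特征图坐标下的左上角与高宽。
    返回 (C, H, W) 的固定大小输出。"""
    r, c, h, w = roi
    C = feature_map.shape[0]
    output = np.zeros((C, H, W), dtype=feature_map.dtype)
    for i in range(H):              # 输出网格的行
        for j in range(W):          # 输出网格的列
            # 第(i, j)个子窗口,大小约为 h/H × w/W
            y0 = r + int(np.floor(i * h / H))
            y1 = r + int(np.ceil((i + 1) * h / H))
            x0 = c + int(np.floor(j * w / W))
            x1 = c + int(np.ceil((j + 1) * w / W))
            window = feature_map[:, y0:y1, x0:x1].reshape(C, -1)
            if window.shape[1] > 0:                   # 空子窗口在此示意中保持为0
                output[:, i, j] = window.max(axis=1)  # 每个通道独立取最大
    return output
```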
2.2. Initializing from pre-trained networks
We experiment with three pre-trained ImageNet [4] networks, each with five max pooling layers and between five and thirteen conv layers (see Section 4.1 for network details). When a pre-trained network initializes a Fast R-CNN network, it undergoes three transformations.
2.2. 从预训练网络初始化
我们实验了三个预训练的ImageNet[4]网络,每个网络都有五个最大池化层和5至13个卷积层(网络细节见4.1节)。当用预训练网络初始化Fast R-CNN网络时,它要经历三个变换。
First, the last max pooling layer is replaced by a RoI pooling layer that is configured by setting H and W to be compatible with the net’s first fully connected layer (e.g., H = W = 7 for VGG16).
首先,最后一个最大池化层被RoI池化层取代,其H和W被设置为与网络的第一个全连接层兼容(例如,对于VGG16,H=W=7)。
Second, the network’s last fully connected layer and softmax (which were trained for 1000-way ImageNet classification) are replaced with the two sibling layers described earlier (a fully connected layer and softmax over K+1 categories and category-specific bounding-box regressors).
其次,网络的最后一个全连接层和softmax(它们原本是为1000类ImageNet分类训练的)被替换为前面描述的两个同级层(一个是K+1个类别上的全连接层加softmax,另一个是特定类别的bounding-box回归器)。
Third, the network is modified to take two data inputs: a list of images and a list of RoIs in those images.
第三,网络被修改为接受两种数据输入:图像列表和这些图像中的RoI列表。
2.3. Fine-tuning for detection
Training all network weights with back-propagation is an important capability of Fast R-CNN. First, let’s elucidate why SPPnet is unable to update weights below the spatial pyramid pooling layer.
2.3. 针对检测任务的fine-tune
用反向传播训练所有网络权重是Fast R-CNN的一项重要能力。首先,让我们阐明为什么SPPnet无法更新空间金字塔池化层以下的权重。
The root cause is that back-propagation through the SPP layer is highly inefficient when each training sample (i.e. RoI) comes from a different image, which is exactly how R-CNN and SPPnet networks are trained. The inefficiency stems from the fact that each RoI may have a very large receptive field, often spanning the entire input image. Since the forward pass must process the entire receptive field, the training inputs are large (often the entire image).
根本原因是,当每个训练样本(即RoI)来自不同的图像时,通过SPP层的反向传播非常低效,而这正是R-CNN和SPPnet网络的训练方式。低效源于每个RoI可能具有非常大的感受野,通常跨越整幅输入图像。由于前向传播必须处理整个感受野,训练输入很大(通常是整幅图像)。
We propose a more efficient training method that takes advantage of feature sharing during training. In Fast R-CNN training, stochastic gradient descent (SGD) mini-batches are sampled hierarchically, first by sampling N images and then by sampling R/N RoIs from each image. Critically, RoIs from the same image share computation and memory in the forward and backward passes. Making N small decreases mini-batch computation. For example, when using N = 2 and R = 128, the proposed training scheme is roughly 64× faster than sampling one RoI from 128 different images (i.e., the R-CNN and SPPnet strategy).
我们提出一种更高效的训练方法,在训练期间利用特征共享。在Fast R-CNN训练中,随机梯度下降(SGD)的mini-batch是分层采样的:首先采样N张图像,然后从每张图像中采样R/N个RoI。关键在于,来自同一图像的RoI在前向和反向传播中共享计算和内存。减小N就能减少mini-batch的计算量。例如,当N=2、R=128时,这种训练方案比从128张不同图像中各采样一个RoI(即R-CNN和SPPnet的策略)大约快64倍。
One concern over this strategy is it may cause slow training convergence because RoIs from the same image are correlated. This concern does not appear to be a practical issue and we achieve good results with N = 2 and R = 128 using fewer SGD iterations than R-CNN.
这个策略的一个顾虑是,它可能导致训练收敛变慢,因为来自同一图像的RoI是相关的。实践中这个问题并未出现:当N=2、R=128时,我们用比R-CNN更少的SGD迭代就获得了良好的结果。
In addition to hierarchical sampling, Fast R-CNN uses a streamlined training process with one fine-tuning stage that jointly optimizes a softmax classifier and bounding-box regressors, rather than training a softmax classifier, SVMs, and regressors in three separate stages [9, 11]. The components of this procedure (the loss, mini-batch sampling strategy, back-propagation through RoI pooling layers, and SGD hyper-parameters) are described below.
除了分层采样,Fast R-CNN还使用一个精简的训练过程:在一个fine-tune阶段中联合优化softmax分类器和bounding-box回归器,而不是在三个独立的阶段分别训练softmax分类器、SVM和回归器[9, 11]。下面将描述该过程的各个组成部分(损失函数、mini-batch采样策略、通过RoI池化层的反向传播和SGD超参数)。
Multi-task loss. A Fast R-CNN network has two sibling output layers. The first outputs a discrete probability distribution (per RoI), p = (p_0, ..., p_K), over K+1 categories. As usual, p is computed by a softmax over the K+1 outputs of a fully connected layer. The second sibling layer outputs bounding-box regression offsets, t^k = (t^k_x, t^k_y, t^k_w, t^k_h), for each of the K object classes, indexed by k. We use the parameterization for t^k given in [9], in which t^k specifies a scale-invariant translation and log-space height/width shift relative to an object proposal.
多任务损失。Fast R-CNN网络具有两个同级输出层。第一个输出K+1个类别上的离散概率分布(每个RoI一个),p = (p_0, ..., p_K)。通常,p由全连接层的K+1个输出经softmax计算得到。第二个同级层输出bounding-box回归偏移,t^k = (t^k_x, t^k_y, t^k_w, t^k_h),k为K个目标类别的索引。我们使用[9]中给出的方法对t^k进行参数化,其中t^k指定相对于候选框的尺度不变平移和对数空间的高度/宽度偏移。
Each training RoI is labeled with a ground-truth class u and a ground-truth bounding-box regression target v. We use a multi-task loss L on each labeled RoI to jointly train for classification and bounding-box regression:
in which L_cls(p, u) = -log p_u is log loss for true class u.
每个训练RoI都标注了类别真值u和bounding-box回归目标真值v。我们对每个标注的RoI使用多任务损失L来联合训练分类和bounding-box回归:
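(译者注:原文公式(1)此处为图片,未能随文字保留;按上下文,多任务损失为:)

$$ L(p, u, t^u, v) = L_{cls}(p, u) + \lambda \, [u \ge 1] \, L_{loc}(t^u, v) \qquad (1) $$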
其中L_cls(p, u) = -log p_u是类别真值u的log损失。
The second task loss, L_loc, is defined over a tuple of true bounding-box regression targets for class u, v = (v_x, v_y, v_w, v_h), and a predicted tuple t^u = (t^u_x, t^u_y, t^u_w, t^u_h), again for class u. The Iverson bracket indicator function [u ≥ 1] evaluates to 1 when u ≥ 1 and 0 otherwise. By convention the catch-all background class is labeled u = 0. For background RoIs there is no notion of a ground-truth bounding box and hence L_loc is ignored. For bounding-box regression, we use the loss
in which
is a robust L1 loss that is less sensitive to outliers than the L2 loss used in R-CNN and SPPnet. When the regression targets are unbounded, training with L2 loss can require careful tuning of learning rates in order to prevent exploding gradients. Eq. 3 eliminates this sensitivity.
第二个任务损失L_loc定义在类别u的bounding-box回归目标真值元组v = (v_x, v_y, v_w, v_h)和同样针对类别u的预测元组t^u = (t^u_x, t^u_y, t^u_w, t^u_h)之上。Iverson括号指示函数[u≥1]在u≥1时取值为1,否则为0。按照惯例,“背景”类标记为u=0。对于背景RoI,没有检测框真值的概念,因此L_loc被忽略。对于检测框回归,我们使用损失:
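(译者注:原文公式(2)此处为图片;按上下文,该损失为:)

$$ L_{loc}(t^u, v) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}(t^u_i - v_i) \qquad (2) $$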
其中:
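(译者注:原文公式(3)此处为图片;按上下文,smooth L1定义为:)

$$ \mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5\,x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases} \qquad (3) $$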
这是一个鲁棒的L1损失,与R-CNN和SPPnet中使用的L2损失相比,它对异常值不那么敏感。当回归目标无界时,使用L2损失训练可能需要仔细调整学习率以防止梯度爆炸,而公式(3)消除了这种敏感性。
The hyper-parameter λ in Eq. 1 controls the balance between the two task losses. We normalize the ground-truth regression targets v_i to have zero mean and unit variance. All experiments use λ = 1.
公式(1)中的超参数λ控制两个任务损失之间的平衡。我们将回归目标真值v_i归一化为零均值、单位方差。所有实验都使用λ=1。
We note that [6] uses a related loss to train a class-agnostic object proposal network. Different from our approach, [6] advocates for a two-network system that separates localization and classification. OverFeat [19], R-CNN [9], and SPPnet [11] also train classifiers and bounding-box localizers, however these methods use stage-wise training, which we show is suboptimal for Fast R-CNN (Section 5.1).
我们注意到,[6]使用一个相关的损失来训练一个类别无关的目标候选网络。与我们的方法不同,[6]主张采用将定位和分类分离的双网络系统。OverFeat[19]、R-CNN[9]和SPPnet[11]也训练分类器和检测框定位器,但这些方法使用分阶段训练,我们将证明这对Fast R-CNN来说不是最优的(见5.1节)。
Mini-batch sampling. During fine-tuning, each SGD mini-batch is constructed from N = 2 images, chosen uniformly at random (as is common practice, we actually iterate over permutations of the dataset). We use mini-batches of size R = 128, sampling 64 RoIs from each image. As in [9], we take 25% of the RoIs from object proposals that have intersection over union (IoU) overlap with a groundtruth bounding box of at least 0.5. These RoIs comprise the examples labeled with a foreground object class, i.e. u≥1. The remaining RoIs are sampled from object proposals that have a maximum IoU with ground truth in the interval [0.1, 0.5], following [11]. These are the background examples and are labeled with u = 0. The lower threshold of 0.1 appears to act as a heuristic for hard example mining [8]. During training, images are horizontally flipped with probability 0.5. No other data augmentation is used.
mini-batch采样。在fine-tune期间,每个SGD的mini-batch由N=2张均匀随机选择的图像构成(按照常见做法,我们实际上是在数据集的随机排列上迭代)。我们使用大小R=128的mini-batch,从每张图像中采样64个RoI。与[9]一样,我们从与检测框真值的交并比(IoU)至少为0.5的候选框中取25%的RoI。这些RoI构成标记为前景目标类别的样本,即u≥1。按照[11],其余的RoI从与检测框真值的最大IoU位于区间[0.1, 0.5]的候选框中采样。这些是背景样本,标记为u=0。0.1的阈值下限似乎起到了困难样本挖掘[8]的启发式作用。训练期间,图像以0.5的概率水平翻转。不使用其他数据增强。
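(译者注:下面用一段示意性的NumPy代码说明上述前景/背景RoI的采样规则;函数与变量名均为假设,并非官方实现。)

```python
import numpy as np

def sample_rois(max_iou_with_gt, rois_per_image=64, fg_fraction=0.25,
                fg_thresh=0.5, bg_lo=0.1, rng=np.random):
    """按每个候选框与真值的最大IoU采样前景/背景RoI的索引(示意)。
    max_iou_with_gt: 每个候选框与所有检测框真值的最大IoU,形状(N,)"""
    fg_inds = np.where(max_iou_with_gt >= fg_thresh)[0]                   # 前景:IoU >= 0.5
    bg_inds = np.where((max_iou_with_gt >= bg_lo) &
                       (max_iou_with_gt < fg_thresh))[0]                  # 背景:IoU在[0.1, 0.5),与前景互斥
    num_fg = min(int(round(fg_fraction * rois_per_image)), fg_inds.size)  # 每张图约25%前景
    num_bg = min(rois_per_image - num_fg, bg_inds.size)
    fg_sample = rng.choice(fg_inds, size=num_fg, replace=False)
    bg_sample = rng.choice(bg_inds, size=num_bg, replace=False)
    labels = np.concatenate([np.ones(num_fg, dtype=np.int64),             # 实际应为各自的类别标签u>=1
                             np.zeros(num_bg, dtype=np.int64)])           # 背景标记为u=0
    return np.concatenate([fg_sample, bg_sample]), labels
```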
Back-propagation through RoI pooling layers. Backpropagation routes derivatives through the RoI pooling layer. For clarity, we assume only one image per mini-batch (N = 1), though the extension to N > 1 is straightforward because the forward pass treats all images independently.
通过RoI池化层的反向传播。反向传播要把导数传过RoI池化层。为了表述清晰,我们假设每个mini-batch只有一张图像(N=1),但扩展到N>1是显而易见的,因为前向传播独立地处理每张图像。
Let x_i ∈ R be the i-th activation input into the RoI pooling layer and let y_rj be the layer's j-th output from the r-th RoI. The RoI pooling layer computes y_rj = x_{i*(r,j)}, in which i*(r,j) = argmax_{i' ∈ R(r,j)} x_{i'}. R(r,j) is the index set of inputs in the sub-window over which the output unit y_rj max pools. A single x_i may be assigned to several different outputs y_rj.
令x_i ∈ R是RoI池化层的第i个激活输入,令y_rj是该层对第r个RoI的第j个输出。RoI池化层计算y_rj = x_{i*(r,j)},其中i*(r,j) = argmax_{i' ∈ R(r,j)} x_{i'}。R(r,j)是输出单元y_rj所最大池化的子窗口内输入的索引集合。一个x_i可能被分配给多个不同的输出y_rj。
The RoI pooling layer’s backwards function computes partial derivative of the loss function with respect to each input variable xi by following the argmax switches:
RoI池化层的反向函数遵循argmax开关,计算损失函数关于每个输入变量x_i的偏导数:
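(译者注:原文公式(4)此处为图片;按上下文,该偏导数为:)

$$ \frac{\partial L}{\partial x_i} = \sum_{r} \sum_{j} \big[\, i = i^*(r, j) \,\big] \, \frac{\partial L}{\partial y_{rj}} \qquad (4) $$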
In words, for each mini-batch RoI r and for each pooling output unit y_rj, the partial derivative ∂L/∂y_rj is accumulated if i is the argmax selected for y_rj by max pooling. In back-propagation, the partial derivatives ∂L/∂y_rj are already computed by the backwards function of the layer on top of the RoI pooling layer.
换句话说,对于每个mini-batch的RoI r和每个池化输出单元y_rj,如果i是y_rj在最大池化时选中的argmax,则累加偏导数∂L/∂y_rj。在反向传播中,偏导数∂L/∂y_rj已经由RoI池化层上面那一层的反向函数计算好了。
SGD hyper-parameters. The fully connected layers used for softmax classification and bounding-box regression are initialized from zero-mean Gaussian distributions with standard deviations 0.01 and 0.001, respectively. Biases are initialized to 0. All layers use a per-layer learning rate of 1 for weights and 2 for biases and a global learning rate of 0.001. When training on VOC07 or VOC12 trainval we run SGD for 30k mini-batch iterations, and then lower the learning rate to 0.0001 and train for another 10k iterations. When we train on larger datasets, we run SGD for more iterations, as described later. A momentum of 0.9 and parameter decay of 0.0005 (on weights and biases) are used.
SGD超参数。用于softmax分类和检测框回归的全连接层的权重分别用标准差为0.01和0.001的零均值高斯分布初始化。偏置初始化为0。所有层的权重使用1倍全局学习率,偏置使用2倍全局学习率,全局学习率为0.001。在VOC07或VOC12 trainval上训练时,我们先运行30k次mini-batch的SGD迭代,然后把学习率降到0.0001,再训练10k次迭代。在更大的数据集上训练时,我们会运行更多的SGD迭代,见后文。动量为0.9,参数衰减为0.0005(作用于权重和偏置)。
2.4. Scale invariance
We explore two ways of achieving scale invariant object detection: (1) via “brute force” learning and (2) by using image pyramids. These strategies follow the two approaches in [11]. In the brute-force approach, each image is processed at a pre-defined pixel size during both training and testing. The network must directly learn scale-invariant object detection from the training data.
2.4. 尺度不變性
我们探索两种实现尺度不变目标检测的方法:(1)通过“brute force”学习;(2)使用图像金字塔。这些策略遵循[11]中的两种方法。在brute-force方法中,在训练和测试期间都以预定义的像素大小处理每张图像。网络必须直接从训练数据中学习尺度不变的目标检测。
The multi-scale approach, in contrast, provides approximate scale-invariance to the network through an image pyramid. At test-time, the image pyramid is used to approximately scale-normalize each object proposal. During multi-scale training, we randomly sample a pyramid scale each time an image is sampled, following [11], as a form of data augmentation. We experiment with multi-scale training for smaller networks only, due to GPU memory limits.
相反,多尺度方法通过图像金字塔为网络提供近似的尺度不变性。在测试时,图像金字塔用于对每个候选框做近似的尺度归一化。按照[11],作为数据增强的一种形式,在多尺度训练期间,每次采样图像时我们都随机采样一个金字塔尺度。由于GPU内存限制,我们只对较小的网络进行多尺度训练实验。
3. Fast R-CNN detection
Once a Fast R-CNN network is fine-tuned, detection amounts to little more than running a forward pass (assuming object proposals are pre-computed). The network takes as input an image (or an image pyramid, encoded as a list of images) and a list of R object proposals to score. At test-time, R is typically around 2000, although we will consider cases in which it is larger (≈45k). When using an image pyramid, each RoI is assigned to the scale such that the scaled RoI is closest to 224² pixels in area [11].
3. Fast R-CNN檢測
一旦Fast R-CNN网络fine-tune完毕,检测几乎就只是运行一次前向传播(假设候选框已预先计算好)。网络将一张图像(或一个编码为图像列表的图像金字塔)和R个待打分的候选框列表作为输入。在测试时,R通常在2000左右,不过我们也会考虑它更大(约45k)的情况。当使用图像金字塔时,每个RoI被分配到某个尺度,使得缩放后的RoI面积最接近224²个像素[11]。
For each test RoI r, the forward pass outputs a class posterior probability distribution p and a set of predicted bounding-box offsets relative to r (each of the K classes gets its own refined bounding-box prediction). We assign a detection confidence to r for each object class k using the estimated probability Pr(class = k | r) ≜ p_k. We then perform non-maximum suppression independently for each class using the algorithm and settings from R-CNN [9].
对于每个测试RoI r,前向传播输出类别后验概率分布p和一组相对于r的预测检测框偏移(K个类别中的每一类都得到自己修正后的检测框预测)。我们使用估计概率Pr(class = k | r) ≜ p_k为每个目标类别k给r分配检测置信度。然后,我们使用R-CNN[9]中的算法和设置,对每个类别独立地执行非极大值抑制。
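(译者注:非极大值抑制的具体算法与阈值沿用R-CNN,论文未在此展开;下面给出标准贪心NMS的NumPy示意代码,阈值取值为假设,并非论文给出的设定。)

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.3):
    """标准贪心非极大值抑制(示意)。boxes: (N,4)的[x1,y1,x2,y2],scores: (N,)"""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1 + 1) * (y2 - y1 + 1)
    order = scores.argsort()[::-1]            # 按得分从高到低排序
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # 计算当前最高分框与其余框的IoU
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1 + 1) * np.maximum(0.0, yy2 - yy1 + 1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]  # 抑制与保留框重叠过高的框
    return keep
```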
3.1. Truncated SVD for faster detection
For whole-image classification, the time spent computing the fully connected layers is small compared to the conv layers. On the contrary, for detection the number of RoIs to process is large and nearly half of the forward pass time is spent computing the fully connected layers (see Fig. 2). Large fully connected layers are easily accelerated by compressing them with truncated SVD [5, 23].
Figure 2. Timing for VGG16 before and after truncated SVD. Before SVD, fully connected layers fc6 and fc7 take 45% of the time.
3.1. 使用截斷的SVD實現(xiàn)更快的檢測
对整幅图像做分类时,与卷积层相比,计算全连接层所花的时间较少。相反,在检测中要处理的RoI数量很大,接近一半的前向传播时间花在计算全连接层上(见图2)。较大的全连接层很容易通过截断SVD[5, 23]压缩来加速。
图2. 截断SVD前后VGG16的耗时分布。在SVD之前,全连接层fc6和fc7占用45%的时间。
In this technique, a layer parameterized by the u × v weight matrix W is approximately factorized as
using SVD. In this factorization, U is a u×t matrix comprising the first t left-singular vectors of W, Σ_t is a t×t diagonal matrix containing the top t singular values of W, and V is a v×t matrix comprising the first t right-singular vectors of W. Truncated SVD reduces the parameter count from uv to t(u + v), which can be significant if t is much smaller than min(u, v). To compress a network, the single fully connected layer corresponding to W is replaced by two fully connected layers, without a non-linearity between them. The first of these layers uses the weight matrix Σ_t V^T (and no biases) and the second uses U (with the original biases associated with W). This simple compression method gives good speedups when the number of RoIs is large.
在這種技術中,層的u × v權重矩陣W通過SVD被近似分解為:
在这种分解中,U是一个u×t的矩阵,由W的前t个左奇异向量组成;Σ_t是t×t的对角矩阵,包含W的前t个奇异值;V是v×t的矩阵,由W的前t个右奇异向量组成。截断SVD把参数数量从uv减少到t(u+v),当t远小于min(u, v)时,这种减少非常显著。为了压缩网络,对应于W的单个全连接层被两个全连接层替代,两层之间没有非线性。第一层使用权重矩阵Σ_tV^T(没有偏置),第二层使用U(带有与W关联的原始偏置)。当RoI数量很大时,这种简单的压缩方法能带来很好的加速。
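(译者注:原文此处的分解公式为图片,按上下文记号应为 W ≈ U Σ_t V^T。下面用NumPy给出把一个全连接层拆成两个小层的示意代码;矩阵尺寸仅为举例,并非论文附带的实现。)

```python
import numpy as np

def compress_fc_with_truncated_svd(W, t):
    """把u×v的全连接层权重W近似分解为两层(示意):
    第一层权重为 Sigma_t @ V^T(t×v,无偏置),第二层权重为 U 的前t列(u×t,沿用原偏置)。"""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    W1 = np.diag(S[:t]) @ Vt[:t, :]   # Sigma_t V^T,形状 t×v
    W2 = U[:, :t]                     # U 的前 t 列,形状 u×t
    return W1, W2

# 用法示例:y = Wx + b 近似为 y ≈ W2 @ (W1 @ x) + b,参数量从 u*v 降到 t*(u+v)。
# 这里用小矩阵演示(论文中真实的fc6权重矩阵为25088×4096)。
u, v, t = 512, 1024, 64
rng = np.random.default_rng(0)
W = rng.standard_normal((u, v)).astype(np.float32)
b = np.zeros(u, dtype=np.float32)
x = rng.standard_normal(v).astype(np.float32)
W1, W2 = compress_fc_with_truncated_svd(W, t)
y_approx = W2 @ (W1 @ x) + b
```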
4. Main results
Three main results support this paper’s contributions:
1. State-of-the-art mAP on VOC07, 2010, and 2012
2. Fast training and testing compared to R-CNN, SPPnet
3. Fine-tuning conv layers in VGG16 improves mAP
4. 主要結果
三個主要結果支持本文的貢獻:
1. VOC07,2010和2012的最高的mAP。
2. 相比R-CNN、SPPnet,訓練和測試的速度更快。
3. 對VGG16卷積層Fine-tuning后提升了mAP。
4.1. Experimental setup
Our experiments use three pre-trained ImageNet models that are available online. The first is the CaffeNet (essentially AlexNet [14]) from R-CNN [9]. We alternatively refer to this CaffeNet as model S, for “small.” The second network is VGG_CNN_M_1024 from [3], which has the same depth as S, but is wider. We call this network model M, for “medium.” The final network is the very deep VGG16 model from [20]. Since this model is the largest, we call it model L. In this section, all experiments use single-scale training and testing (s = 600; see Section 5.2 for details).
4.1. 實驗設置
我们的实验使用三个可在线获得的预训练ImageNet模型(脚注:https://github.com/BVLC/caffe/wiki/Model-Zoo)。第一个是来自R-CNN[9]的CaffeNet(本质上就是AlexNet[14])。我们将这个CaffeNet称为模型S,即“小”模型。第二个网络是来自[3]的VGG_CNN_M_1024,其深度与S相同,但更宽。我们称这个网络为模型M,即“中等”模型。最后一个网络是来自[20]的非常深的VGG16模型。由于这个模型最大,我们称之为模型L。本节中,所有实验都使用单尺度训练和测试(s=600,详见5.2节)。
4.2. VOC 2010 and 2012 results
On these datasets, we compare Fast R-CNN (FRCN, for short) against the top methods on the comp4 (outside data) track from the public leaderboard (Table 2, Table 3). For the NUS_NIN_c2000 and BabyLearning methods, there are no associated publications at this time and we could not find exact information on the ConvNet architectures used; they are variants of the Network-in-Network design [17]. All other methods are initialized from the same pre-trained VGG16 network.
Table 2. VOC 2010 test detection average precision (%). BabyLearning uses a network based on [17]. All other methods use VGG16. Training set key: 12: VOC12 trainval, Prop.: proprietary dataset, 12+seg: 12 with segmentation annotations, 07++12: union of VOC07 trainval, VOC07 test, and VOC12 trainval.
Table 3. VOC 2012 test detection average precision (%). BabyLearning and NUS NIN c2000 use networks based on [17]. All other methods use VGG16. Training set key: see Table 2, Unk.: unknown.
4.2. VOC 2010和2012數(shù)據(jù)集上的結果
在这些数据集上,我们将Fast R-CNN(简称FRCN)与公共排行榜(脚注:http://host.robots.ox.ac.uk:8080/leaderboard)中comp4(外部数据)赛道上的领先方法进行比较(见表2、表3)。对于NUS_NIN_c2000和BabyLearning方法,目前没有相关论文发表,我们也无法找到所用ConvNet结构的确切信息;它们是Network-in-Network[17]设计的变体。所有其他方法都从同一个预训练VGG16网络初始化。
表2. VOC 2010測試檢測平均精度(%)。BabyLearning使用基于[17]的網(wǎng)絡。所有其他方法使用VGG16。訓練集關鍵字:12代表VOC12 trainval,Prop.代表專有數(shù)據(jù)集,12+seg代表具有分割注釋的VOC2012,07++12代表VOC2007 trainval、VOC2007 test和VOC2012 trainval的組合。
表3. VOC 2012測試檢測平均精度(%)。BabyLearning和NUS_NIN_c2000使用基于[17]的網(wǎng)絡。所有其他方法使用VGG16。訓練設置:見表2,Unk.代表未知。
Fast R-CNN achieves the top result on VOC12 with a mAP of 65.7% (and 68.4% with extra data). It is also two orders of magnitude faster than the other methods, which are all based on the “slow” R-CNN pipeline. On VOC10, SegDeepM [25] achieves a higher mAP than Fast R-CNN (67.2% vs. 66.1%). SegDeepM is trained on VOC12 trainval plus segmentation annotations; it is designed to boost R-CNN accuracy by using a Markov random field to reason over R-CNN detections and segmentations from the O2P [1] semantic-segmentation method. Fast R-CNN can be swapped into SegDeepM in place of R-CNN, which may lead to better results. When using the enlarged 07++12 training set (see Table 2 caption), Fast R-CNN’s mAP increases to 68.8%, surpassing SegDeepM.
Fast R-CNN在VOC12上获得最高结果,mAP为65.7%(使用额外数据时为68.4%)。它还比其他方法快两个数量级,这些方法都基于“慢速”的R-CNN流程。在VOC10上,SegDeepM[25]获得了比Fast R-CNN更高的mAP(67.2%对66.1%)。SegDeepM在VOC12 trainval加分割标注上训练;它被设计为通过马尔可夫随机场对R-CNN检测结果和O2P[1]语义分割方法的分割结果进行推理,从而提升R-CNN的精度。可以把SegDeepM中的R-CNN换成Fast R-CNN,这可能带来更好的结果。当使用扩大的07++12训练集(见表2标题)时,Fast R-CNN的mAP提高到68.8%,超过了SegDeepM。
4.3. VOC 2007 results
On VOC07, we compare Fast R-CNN to R-CNN and SPPnet. All methods start from the same pre-trained VGG16 network and use bounding-box regression. The VGG16 SPPnet results were computed by the authors of [11]. SPPnet uses five scales during both training and testing. The improvement of Fast R-CNN over SPPnet illustrates that even though Fast R-CNN uses single-scale training and testing, fine-tuning the conv layers provides a large improvement in mAP (from 63.1% to 66.9%). R-CNN achieves a mAP of 66.0%. As a minor point, SPPnet was trained without examples marked as “difficult” in PASCAL. Removing these examples improves Fast R-CNN mAP to 68.1%. All other experiments use “difficult” examples.
4.3. VOC 2007數(shù)據(jù)集上的結果
在VOC07上,我们比较Fast R-CNN与R-CNN和SPPnet。所有方法都从相同的预训练VGG16网络出发,并使用bounding-box回归。VGG16的SPPnet结果由[11]的作者计算得到。SPPnet在训练和测试期间使用五个尺度。Fast R-CNN相对SPPnet的改进说明,即使Fast R-CNN使用单尺度训练和测试,对卷积层做fine-tune也给mAP带来了很大提升(从63.1%到66.9%)。R-CNN的mAP为66.0%。另一个小细节是,SPPnet训练时没有使用PASCAL中被标记为“困难”的样本。去掉这些样本后,Fast R-CNN的mAP提升到68.1%。所有其他实验都使用了“困难”样本。
4.4. Training and testing time
Fast training and testing times are our second main result. Table 4 compares training time (hours), testing rate (seconds per image), and mAP on VOC07 between Fast RCNN, R-CNN, and SPPnet. For VGG16, Fast R-CNN processes images 146× faster than R-CNN without truncated SVD and 213× faster with it. Training time is reduced by 9×, from 84 hours to 9.5. Compared to SPPnet, Fast RCNN trains VGG16 2.7× faster (in 9.5 vs. 25.5 hours) and tests 7× faster without truncated SVD or 10× faster with it. Fast R-CNN also eliminates hundreds of gigabytes of disk storage, because it does not cache features.
Table 4. Runtime comparison between the same models in Fast R-CNN, R-CNN, and SPPnet. Fast R-CNN uses single-scale mode. SPPnet uses the five scales specified in [11]. †Timing provided by the authors of [11]. Times were measured on an Nvidia K40 GPU.
4.4. 訓練和測試時間
快速的训练和测试是我们的第二个主要成果。表4比较了Fast R-CNN、R-CNN和SPPnet之间的训练时间(小时)、测试速度(每张图像的秒数)以及在VOC07上的mAP。对于VGG16,不使用截断SVD时Fast R-CNN处理图像比R-CNN快146倍,使用截断SVD时快213倍。训练时间减少了9倍,从84小时降到9.5小时。与SPPnet相比,Fast R-CNN训练VGG16快2.7倍(9.5小时对25.5小时),不使用截断SVD时测试快7倍,使用截断SVD时快10倍。Fast R-CNN还省去了数百GB的磁盘存储,因为它不缓存特征。
表4. Fast R-CNN、R-CNN和SPPnet中相同模型之间的运行时间比较。Fast R-CNN使用单尺度模式。SPPnet使用[11]中指定的五个尺度。†标记的时间由[11]的作者提供。所有时间均在Nvidia K40 GPU上测得。
Truncated SVD. Truncated SVD can reduce detection time by more than 30% with only a small (0.3 percentage point) drop in mAP and without needing to perform additional fine-tuning after model compression. Fig. 2 illustrates how using the top 1024 singular values from the 25088×4096 matrix in VGG16’s fc6 layer and the top 256 singular values from the 4096×4096 fc7 layer reduces runtime with little loss in mAP. Further speed-ups are possible with smaller drops in mAP if one fine-tunes again after compression.
截断SVD。截断SVD可以将检测时间减少30%以上,而mAP只有很小的下降(0.3个百分点),并且无需在模型压缩后再进行额外的fine-tune。图2显示,使用VGG16 fc6层中25088×4096矩阵的前1024个奇异值和fc7层中4096×4096矩阵的前256个奇异值,可以在mAP几乎没有损失的情况下减少运行时间。如果在压缩之后再次fine-tune,则可以在mAP下降更小的情况下进一步提速。
4.5. Which layers to fine-tune?
For the less deep networks considered in the SPPnet paper [11], fine-tuning only the fully connected layers appeared to be sufficient for good accuracy. We hypothesized that this result would not hold for very deep networks. To validate that fine-tuning the conv layers is important for VGG16, we use Fast R-CNN to fine-tune, but freeze the thirteen conv layers so that only the fully connected layers learn. This ablation emulates single-scale SPPnet training and decreases mAP from 66.9% to 61.4% (Table 5). This experiment verifies our hypothesis: training through the RoI pooling layer is important for very deep nets.
Table 5. Effect of restricting which layers are fine-tuned for VGG16. Fine-tuning ≥ fc6 emulates the SPPnet training algorithm [11], but using a single scale. SPPnet L results were obtained using five scales, at a significant (7×) speed cost.
4.5. fine-tune哪些層?
对于SPPnet论文[11]中考虑的不太深的网络,仅fine-tune全连接层似乎就足以获得良好的精度。我们猜想这个结论不适用于非常深的网络。为了验证fine-tune卷积层对VGG16的重要性,我们使用Fast R-CNN进行fine-tune,但冻结13个卷积层,使得只有全连接层学习。这种消融实验模拟了单尺度的SPPnet训练,使mAP从66.9%降低到61.4%(表5)。这个实验验证了我们的假设:通过RoI池化层的训练对于非常深的网络是重要的。
表5. 限制VGG16中哪些层参与fine-tune所产生的影响。fine-tune fc6及以上的层模拟了SPPnet训练算法[11],但使用单尺度。SPPnet L的结果使用五个尺度获得,代价是显著(7倍)的速度开销。
Does this mean that all conv layers should be fine-tuned? In short, no. In the smaller networks (S and M) we find that conv1 is generic and task independent (a well-known fact [14]). Allowing conv1 to learn, or not, has no meaningful effect on mAP. For VGG16, we found it only necessary to update layers from conv3_1 and up (9 of the 13 conv layers). This observation is pragmatic: (1) updating from conv2_1 slows training by 1.3× (12.5 vs. 9.5 hours) compared to learning from conv3_1; and (2) updating from conv1_1 over-runs GPU memory. The difference in mAP when learning from conv2_1 up was only +0.3 points (Table 5, last column). All Fast R-CNN results in this paper using VGG16 fine-tune layers conv3_1 and up; all experiments with models S and M fine-tune layers conv2 and up.
这是否意味着所有卷积层都应该fine-tune?简而言之,不是。在较小的网络(S和M)中,我们发现conv1(译者注:第一个卷积层)是通用的、与任务无关的(一个众所周知的事实[14])。允许或不允许conv1学习,对mAP没有实质影响。对于VGG16,我们发现只需要更新conv3_1及以上的层(13个卷积层中的9个)。这个选择是出于实用考虑:(1)与从conv3_1开始更新相比,从conv2_1开始更新使训练变慢1.3倍(12.5小时对9.5小时);(2)从conv1_1开始更新会超出GPU内存。从conv2_1开始学习时,mAP仅提高0.3个点(表5,最后一列)。本文所有使用VGG16的Fast R-CNN结果都fine-tune conv3_1及以上的层;所有使用模型S和M的实验都fine-tune conv2及以上的层。
5. Design evaluation
We conducted experiments to understand how Fast R-CNN compares to R-CNN and SPPnet, as well as to evaluate design decisions. Following best practices, we performed these experiments on the PASCAL VOC07 dataset.
5. 設計評估
我們通過實驗來了解Fast RCNN與R-CNN和SPPnet的比較,以及評估設計決策。按照最佳實踐,我們在PASCAL VOC07數(shù)據(jù)集上進行了這些實驗。
5.1. Does multi-task training help?
Multi-task training is convenient because it avoids managing a pipeline of sequentially-trained tasks. But it also has the potential to improve results because the tasks influence each other through a shared representation (the ConvNet) [2]. Does multi-task training improve object detection accuracy in Fast R-CNN?
5.1. 多任務訓練有用嗎?
多任務訓練是方便的,因為它避免管理順序訓練任務的pipeline。但它也有可能改善結果,因為任務通過共享的表示(ConvNet)[2]相互影響。多任務訓練能提高Fast R-CNN中的目標檢測精度嗎?
To test this question, we train baseline networks that use only the classification loss, Lcls, in Eq. 1 (i.e., setting λ= 0). These baselines are printed for models S, M, and L in the first column of each group in Table 6. Note that these models do not have bounding-box regressors. Next (second column per group), we take networks that were trained with the multi-task loss (Eq. 1, λ=1), but we disable bounding-box regression at test time. This isolates the networks’ classification accuracy and allows an apples-to-apples comparison with the baseline networks.
Table 6. Multi-task training (fourth column per group) improves mAP over piecewise training (third column per group).
为了验证这一点,我们训练只使用公式(1)中分类损失L_cls的基准网络(即设置λ=0)。表6中每组的第一列给出了模型S、M和L的这些baseline结果。注意,这些模型没有bounding-box回归器。接下来(每组的第二列),我们采用用多任务损失(公式(1),λ=1)训练的网络,但在测试时禁用bounding-box回归。这样就把网络的分类准确率单独分离出来,从而可以与基准网络进行同类(apples-to-apples)比较。
表6. 多任务训练(每组第四列)相比分阶段训练(每组第三列)提升了mAP。
Across all three networks we observe that multi-task training improves pure classification accuracy relative to training for classification alone. The improvement ranges from +0.8 to +1.1 mAP points, showing a consistent positive effect from multi-task learning.
在所有三个网络中,我们观察到,多任务训练相对于仅做分类训练提高了纯分类准确率。提升幅度从+0.8到+1.1个mAP点,显示出多任务学习一致的积极效果。
Finally, we take the baseline models (trained with only the classification loss), tack on the bounding-box regression layer, and train them with L_loc while keeping all other network parameters frozen. The third column in each group shows the results of this stage-wise training scheme: mAP improves over column one, but stage-wise training underperforms multi-task training (fourth column per group).
最后,我們采用baseline模型(僅使用分類損失進行訓練),加上bounding-box回歸層,并使用Lloc訓練它們,同時保持所有其他網(wǎng)絡參數(shù)凍結。每組中的第三列顯示了這種逐級訓練方案的結果:mAP相對于第一列有改進,但逐級訓練表現(xiàn)不如多任務訓練(每組第四列)。
5.2. Scale invariance: to brute force or finesse?
We compare two strategies for achieving scale-invariant object detection: brute-force learning (single scale) and image pyramids (multi-scale). In either case, we define the scale s of an image to be the length of its shortest side.
5.2. 尺度不變性:暴力或精細?
我們比較兩個策略實現(xiàn)尺度不變物體檢測:暴力學習(單尺度)和圖像金字塔(多尺度)。在任一情況下,我們將尺度s定義為圖像短邊的長度。
All single-scale experiments use s = 600 pixels; s may be less than 600 for some images as we cap the longest image side at 1000 pixels and maintain the image’s aspect ratio. These values were selected so that VGG16 fits in GPU memory during fine-tuning. The smaller models are not memory bound and can benefit from larger values of s; however, optimizing s for each model is not our main concern. We note that PASCAL images are 384 × 473 pixels on average and thus the single-scale setting typically upsamples images by a factor of 1.6. The average effective stride at the RoI pooling layer is thus ≈ 10 pixels.
所有单尺度实验都使用s=600像素;对于某些图像,s可能小于600,因为我们保持图像的横纵比,并把最长边限制为1000像素。选择这些值是为了让VGG16在fine-tune期间不超出GPU内存。较小的模型不受内存限制,可以从更大的s值中受益;然而,为每个模型优化s不是我们主要关注的问题。我们注意到,PASCAL图像的平均大小是384×473像素,因此单尺度设置通常把图像上采样1.6倍。RoI池化层的平均有效步长因此约为10像素。
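(译者注:下面的小函数演示上述单尺度设定的缩放规则——短边尽量缩放到600像素、长边不超过1000像素;函数名为假设,仅作示意。)

```python
def image_scale_factor(height, width, target_size=600, max_size=1000):
    """返回缩放系数:短边尽量缩放到target_size,同时长边不超过max_size(示意)。"""
    short_side, long_side = float(min(height, width)), float(max(height, width))
    scale = target_size / short_side
    if scale * long_side > max_size:      # 长边超限时改由长边约束
        scale = max_size / long_side
    return scale

# 例:PASCAL平均大小约384×473 -> scale约1.56,与正文中约1.6倍的上采样一致
print(image_scale_factor(384, 473))
```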
In the multi-scale setting, we use the same five scales specified in [11] (s ∈ {480, 576, 688, 864, 1200}) to facilitate comparison with SPPnet. However, we cap the longest side at 2000 pixels to avoid exceeding GPU memory.
在多尺度模型配置中,我們使用[11]中指定的相同的五個尺度(s∈{480,576,688,864,1200}),以方便與SPPnet進行比較。但是,我們限制長邊最大為2000像素,以避免GPU內(nèi)存不足。
Table 7 shows models S and M when trained and tested with either one or five scales. Perhaps the most surprising result in [11] was that single-scale detection performs almost as well as multi-scale detection. Our findings confirm their result: deep ConvNets are adept at directly learning scale invariance. The multi-scale approach offers only a small increase in mAP at a large cost in compute time (Table 7). In the case of VGG16 (model L), we are limited to using a single scale by implementation details. Yet it achieves a mAP of 66.9%, which is slightly higher than the 66.0% reported for R-CNN [10], even though R-CNN uses “infinite” scales in the sense that each proposal is warped to a canonical size.
Table 7. Multi-scale vs. single scale. SPPnet ZF (similar to model S) results are from [11]. Larger networks with a single-scale offer the best speed / accuracy tradeoff. (L cannot use multi-scale in our implementation due to GPU memory constraints.)
表7显示了模型S和M用一个或五个尺度训练和测试时的结果。[11]中也许最令人惊讶的结果是,单尺度检测几乎与多尺度检测一样好。我们的发现印证了他们的结论:深度卷积网络擅长直接学习尺度不变性。多尺度方法以大量计算时间为代价,仅带来很小的mAP提升(表7)。对于VGG16(模型L),受实现细节限制,我们只能使用单尺度。但它仍达到了66.9%的mAP,略高于R-CNN[10]报告的66.0%,尽管从每个候选区域都被缩放到规范大小这个意义上说,R-CNN使用的是“无限”的尺度。
表7. 多尺度对比单尺度。SPPnet ZF(类似于模型S)的结果来自[11]。使用单尺度的较大网络提供了最佳的速度/精度权衡。(由于GPU内存限制,L在我们的实现中不能使用多尺度。)
Since single-scale processing offers the best tradeoff between speed and accuracy, especially for very deep models, all experiments outside of this sub-section use single-scale training and testing with s = 600 pixels.
由于单尺度处理在速度和精度之间提供了最好的权衡,特别是对于非常深的模型,本小节之外的所有实验都使用s=600像素的单尺度进行训练和测试。
5.3. Do we need more training data?
A good object detector should improve when supplied with more training data. Zhu et al. [24] found that DPM [8] mAP saturates after only a few hundred to thousand training examples. Here we augment the VOC07 trainval set with the VOC12 trainval set, roughly tripling the number of images to 16.5k, to evaluate Fast R-CNN. Enlarging the training set improves mAP on VOC07 test from 66.9% to 70.0% (Table 1). When training on this dataset we use 60k mini-batch iterations instead of 40k.
Table 1. VOC 2007 test detection average precision (%). All methods use VGG16. Training set key: 07: VOC07 trainval, 07\diff: 07 without "difficult" examples, 07+12: union of 07 and VOC12 trainval. †SPPnet results were prepared by the authors of [11].
5.3. 我們需要更多訓練數(shù)據(jù)嗎?
当提供更多训练数据时,好的目标检测器应该会进一步提升。Zhu等人[24]发现,DPM[8]的mAP在只有几百到几千个训练样本时就饱和了。这里我们用VOC12 trainval训练集来扩充VOC07 trainval训练集,使图像数量大约增至三倍,达到16.5k,以此评估Fast R-CNN。扩大训练集使VOC07测试集上的mAP从66.9%提高到70.0%(表1)。在这个数据集上训练时,我们使用60k次mini-batch迭代而不是40k次。
表1. VOC 2007测试集检测平均精度(%)。所有方法都使用VGG16。训练集关键字:07代表VOC07 trainval,07\diff代表去掉“困难”样本的07,07+12代表07与VOC12 trainval的并集。†SPPnet结果由[11]的作者提供。
We perform similar experiments for VOC10 and 2012, for which we construct a dataset of 21.5k images from the union of VOC07 trainval, test, and VOC12 trainval. When training on this dataset, we use 100k SGD iterations and lower the learning rate by 0.1× each 40k iterations (instead of each 30k). For VOC10 and 2012, mAP improves from 66.1% to 68.8% and from 65.7% to 68.4%, respectively.
我们对VOC10和2012进行了类似的实验,为此我们用VOC07 trainval、VOC07 test和VOC12 trainval的并集构造了一个包含21.5k张图像的数据集。在这个数据集上训练时,我们使用100k次SGD迭代,并且每40k次迭代(而不是每30k次)将学习率乘以0.1。对于VOC10和2012,mAP分别从66.1%提高到68.8%、从65.7%提高到68.4%。
5.4. Do SVMs outperform softmax?
Fast R-CNN uses the softmax classifier learnt during fine-tuning instead of training one-vs-rest linear SVMs post-hoc, as was done in R-CNN and SPPnet. To understand the impact of this choice, we implemented post-hoc SVM training with hard negative mining in Fast R-CNN. We use the same training algorithm and hyper-parameters as in R-CNN.
5.4. SVM分類是否優(yōu)于Softmax?
Fast R-CNN使用在fine-tune期间学习到的softmax分类器,而不是像R-CNN和SPPnet那样在训练完成后再训练一对多(one-vs-rest)的线性SVM。为了理解这种选择的影响,我们在Fast R-CNN中实现了带难负例挖掘的事后SVM训练。我们使用与R-CNN中相同的训练算法和超参数。
Table 8 shows softmax slightly outperforming SVM for all three networks, by +0.1 to +0.8 mAP points. This effect is small, but it demonstrates that “one-shot” fine-tuning is sufficient compared to previous multi-stage training approaches. We note that softmax, unlike one-vs-rest SVMs, introduces competition between classes when scoring a RoI.
Table 8. Fast R-CNN with softmax vs. SVM (VOC07 mAP).
如表8所示,对于所有三个网络,softmax都略优于SVM,mAP提高了+0.1到+0.8个点。提升虽然很小,但它表明,与先前的多阶段训练方法相比,“一次性”fine-tune已经足够。我们注意到,与一对多SVM不同,softmax在为RoI打分时引入了类别之间的竞争。
表8. 用Softmax的Fast R-CNN對比用SVM的Fast RCNN(VOC07 mAP)。
5.5. Are more proposals always better?
There are (broadly) two types of object detectors: those that use a sparse set of object proposals (e.g., selective search [21]) and those that use a dense set (e.g., DPM [8]). Classifying sparse proposals is a type of cascade [22] in which the proposal mechanism first rejects a vast number of candidates leaving the classifier with a small set to evaluate. This cascade improves detection accuracy when applied to DPM detections [21]. We find evidence that the proposal-classifier cascade also improves Fast R-CNN accuracy.
5.5. 更多的候選區(qū)域更好嗎?
(广义上说)目标检测器有两种类型:使用候选区域稀疏集合的检测器(例如selective search[21])和使用密集集合的检测器(例如DPM[8])。对稀疏候选区域进行分类是一种级联[22]:候选区域机制首先排除大量候选,只留下一个较小的集合交给分类器评估。当应用于DPM检测时,这种级联提高了检测精度[21]。我们发现候选区域-分类器级联同样提高了Fast R-CNN的精度。
Using selective search’s quality mode, we sweep from 1k to 10k proposals per image, each time re-training and re-testing model M. If proposals serve a purely computational role, increasing the number of proposals per image should not harm mAP.
使用selective search的质量模式,我们对每张图像从1k到10k个候选框进行扫描,每次都重新训练并重新测试模型M。如果候选框只起纯粹的计算作用,那么增加每张图像的候选框数量应该不会损害mAP。
We find that mAP rises and then falls slightly as the proposal count increases (Fig. 3, solid blue line). This experiment shows that swamping the deep classifier with more proposals does not help, and even slightly hurts, accuracy.
我们发现,随着候选区域数量的增加,mAP先上升然后略微下降(图3,蓝色实线)。这个实验表明,用更多的候选区域去“淹没”深度分类器没有帮助,甚至会轻微损害准确率。
This result is difficult to predict without actually running the experiment. The state-of-the-art for measuring object proposal quality is Average Recall (AR) [12]. AR correlates well with mAP for several proposal methods using R-CNN, when using a fixed number of proposals per image. Fig. 3 shows that AR (solid red line) does not correlate well with mAP as the number of proposals per image is varied. AR must be used with care; higher AR due to more proposals does not imply that mAP will increase. Fortunately, training and testing with model M takes less than 2.5 hours. Fast R-CNN thus enables efficient, direct evaluation of object proposal mAP, which is preferable to proxy metrics.
Figure 3. VOC07 test mAP and AR for various proposal schemes.
如果不实际运行实验,这个结果很难预测。衡量候选区域质量的最先进指标是平均召回率(Average Recall, AR)[12]。当每张图像使用固定数量的候选区域时,对若干使用R-CNN的候选区域方法而言,AR与mAP有良好的相关性。但图3显示,当每张图像的候选区域数量变化时,AR(红色实线)与mAP的相关性并不好。AR必须谨慎使用:更多候选区域带来的更高AR并不意味着mAP会随之增加。幸运的是,使用模型M的训练和测试只需不到2.5小时。因此,Fast R-CNN能够高效、直接地评估候选区域对mAP的影响,这比使用代理指标更可取。
圖3. 各種候選區(qū)域方案下VOC07測試的mAP和AR。
We also investigate Fast R-CNN when using densely generated boxes (over scale, position, and aspect ratio), at a rate of about 45k boxes / image. This dense set is rich enough that when each selective search box is replaced by its closest (in IoU) dense box, mAP drops only 1 point (to 57.7%, Fig. 3, blue triangle).
我们还研究了使用密集生成框(覆盖不同尺度、位置和宽高比)时的Fast R-CNN,密度约为每张图像45k个框。这个密集集合足够丰富:当每个selective search框被与其IoU最近的密集框替换时,mAP仅下降1个点(降至57.7%,图3,蓝色三角形)。
The statistics of the dense boxes differ from those of selective search boxes. Starting with 2k selective search boxes, we test mAP when adding a random sample of 1000×{2,4,6,8,10,32,45} dense boxes. For each experiment we re-train and re-test model M. When these dense boxes are added, mAP falls more strongly than when adding more selective search boxes, eventually reaching 53.0%.
密集框的统计特性与selective search框不同。从2k个selective search框开始,我们每次再随机添加1000×{2,4,6,8,10,32,45}个密集框并测试mAP。对于每个实验,我们都重新训练并重新测试模型M。添加这些密集框时,mAP的下降比添加更多selective search框时更厉害,最终降至53.0%。
We also train and test Fast R-CNN using only dense boxes (45k / image). This setting yields a mAP of 52.9% (blue diamond). Finally, we check if SVMs with hard negative mining are needed to cope with the dense box distribution. SVMs do even worse: 49.3% (blue circle).
我们还只使用密集框(每张图像45k个)来训练和测试Fast R-CNN。此设置的mAP为52.9%(蓝色菱形)。最后,我们检验是否需要带难负例挖掘的SVM来应对密集框分布。SVM的结果更糟:49.3%(蓝色圆圈)。
5.6. Preliminary MS COCO results
We applied Fast R-CNN (with VGG16) to the MS COCO dataset [18] to establish a preliminary baseline. We trained on the 80k image training set for 240k iterations and evaluated on the “test-dev” set using the evaluation server. The PASCAL-style mAP is 35.9%; the new COCO-style AP, which also averages over IoU thresholds, is 19.7%.
5.6. MS COCO初步結果
我们将Fast R-CNN(使用VGG16)应用于MS COCO数据集[18],以建立一个初步的baseline。我们在80k图像的训练集上进行了240k次迭代训练,并使用评估服务器在“test-dev”集上评估。PASCAL风格的mAP为35.9%;新的COCO风格的AP(同时在多个IoU阈值上取平均)为19.7%。
6. Conclusion
This paper proposes Fast R-CNN, a clean and fast update to R-CNN and SPPnet. In addition to reporting state-of-the-art detection results, we present detailed experiments that we hope provide new insights. Of particular note, sparse object proposals appear to improve detector quality. This issue was too costly (in time) to probe in the past, but becomes practical with Fast R-CNN. Of course, there may exist yet undiscovered techniques that allow dense boxes to perform as well as sparse proposals. Such methods, if developed, may help further accelerate object detection.
6. 結論
本文提出Fast R-CNN,一個對R-CNN和SPPnet更新的簡潔、快速版本。除了報告目前最先進的檢測結果之外,我們還提供了詳細的實驗,希望提供新的思路。特別值得注意的是,稀疏目標候選區(qū)域似乎提高了檢測器的質(zhì)量。過去這個問題代價太大(在時間上)而一直無法深入探索,但Fast R-CNN使其變得可能。當然,可能存在未發(fā)現(xiàn)的技術,使得密集框能夠達到與稀疏候選框類似的效果。如果這樣的方法被開發(fā)出來,則可以幫助進一步加速目標檢測。
Acknowledgements. I thank Kaiming He, Larry Zitnick, and Piotr Dollár for helpful discussions and encouragement.
致謝:感謝Kaiming He,Larry Zitnick和Piotr Dollár的有益的討論和鼓勵。
References
參考文獻
[1] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu. Semantic segmentation with second-order pooling. In ECCV,2012. 5
[2] R. Caruana. Multitask learning. Machine learning, 28(1), 1997. 6
[3] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In BMVC, 2014. 5
[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009. 2
[5] E. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In NIPS, 2014. 4
[6] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deep neural networks. In CVPR, 2014. 3
[7] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes (VOC) Challenge. IJCV, 2010. 1
[8] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. TPAMI, 2010. 3, 7, 8
[9] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014. 1, 3, 4, 8
[10] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Region-based convolutional networks for accurate object detection and segmentation. TPAMI, 2015. 5, 7, 8
[11] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014. 1, 2, 3, 4, 5, 6, 7
[12] J. H. Hosang, R. Benenson, P. Dollár, and B. Schiele. What makes for effective detection proposals? arXiv preprint arXiv:1502.05082, 2015. 8
[13] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proc. of the ACM International Conf. on Multimedia, 2014. 2
[14] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012. 1, 4, 6
[15] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006. 1
[16] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Comp., 1989. 1
[17] M. Lin, Q. Chen, and S. Yan. Network in network. In ICLR, 2014. 5
[18] T. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: common objects in context. arXiv e-prints, arXiv:1405.0312 [cs.CV], 2014. 8
[19] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks. In ICLR, 2014. 1, 3
[20] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015. 1, 5
[21] J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders. Selective search for object recognition. IJCV, 2013. 8
[22] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In CVPR, 2001. 8
[23] J. Xue, J. Li, and Y. Gong. Restructuring of deep neural network acoustic models with singular value decomposition. In Interspeech, 2013. 4
[24] X. Zhu, C. Vondrick, D. Ramanan, and C. Fowlkes. Do we need more training data or better models for object detection? In BMVC, 2012. 7
[25] Y. Zhu, R. Urtasun, R. Salakhutdinov, and S. Fidler. segDeepM: Exploiting segmentation and context in deep neural networks for object detection. In CVPR, 2015. 1, 5