Paper 9: Fast R-CNN
Code: available under the open-source MIT License at https://github.com/rbgirshick/fast-rcnn.
Abstract:
Fast R-CNN improves both training and testing speed while also raising detection accuracy. Fast R-CNN trains the very deep VGG16 network 9× faster than R-CNN, is 213× faster at test time, and achieves a higher mAP on PASCAL VOC 2012. Compared with SPPnet, Fast R-CNN trains VGG16 faster and is also more accurate.
The Fast R-CNN method has several advantages:
1. Higher detection quality (mAP) than R-CNN and SPPnet
2. Training is single-stage, using a multi-task loss (see the sketch after this list)
3. Training can update all network layers
4. No disk storage is required for feature caching
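To make point 2 concrete, here is a minimal sketch of a multi-task loss of this form (classification log loss plus a box-regression loss that is active only for non-background RoIs), assuming PyTorch; the function name is illustrative, and for simplicity the box predictions are assumed to be pre-gathered for each RoI's ground-truth class:

```python
# Sketch of a Fast R-CNN-style multi-task loss (assumption: PyTorch).
import torch
import torch.nn.functional as F

def multi_task_loss(cls_scores, bbox_pred, labels, bbox_targets, lam=1.0):
    # cls_scores: (R, K+1) raw class scores; labels: (R,), 0 = background
    # bbox_pred / bbox_targets: (R, 4), pre-gathered for each RoI's GT class
    loss_cls = F.cross_entropy(cls_scores, labels)  # log loss over K+1 classes
    fg = labels > 0                                 # the [u >= 1] indicator
    if fg.any():
        # smooth L1 is less sensitive to outliers than an L2 loss
        loss_loc = F.smooth_l1_loss(bbox_pred[fg], bbox_targets[fg])
    else:
        loss_loc = torch.tensor(0.0)
    return loss_cls + lam * loss_loc
```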
mAP stands for mean Average Precision, literally an average of averages: first compute the Average Precision (AP) within each class, then average those per-class APs over all classes to get the mean Average Precision.
- mAP: mean Average Precision, the mean of the per-class AP values
- AP: the area under the PR curve (explained in detail later)
- PR curve: the Precision-Recall curve
- Precision: TP / (TP + FP)
- Recall: TP / (TP + FN)
- TP: the number of detections with IoU > 0.5 (each ground truth is counted at most once)
- FP: the number of detections with IoU <= 0.5, plus any redundant detections of an already-matched ground truth
- FN: the number of ground-truth boxes that are never detected
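Putting those definitions together, here is a minimal sketch of the per-class AP computation, assuming detections have already been matched to ground truth (yielding TP/FP flags) and sorted by descending confidence; note that PASCAL VOC's official metric uses an interpolated AP, which this sketch omits:

```python
# Sketch of per-class AP from TP/FP flags sorted by confidence.
import numpy as np

def average_precision(tp_flags, num_gt):
    """tp_flags: 1 for TP, 0 for FP, one per detection, sorted by
    descending confidence. num_gt: number of ground-truth boxes."""
    tp_flags = np.asarray(tp_flags, dtype=float)
    tp = np.cumsum(tp_flags)
    fp = np.cumsum(1.0 - tp_flags)
    recall = tp / num_gt          # TP / (TP + FN)
    precision = tp / (tp + fp)    # TP / (TP + FP)
    # AP = area under the PR curve, accumulated over recall increments
    recall = np.concatenate(([0.0], recall))
    return float(np.sum((recall[1:] - recall[:-1]) * precision))

# 5 detections (sorted by confidence) against 3 ground-truth boxes
print(average_precision([1, 1, 0, 1, 0], num_gt=3))  # ~0.917
```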
Proposals: First, numerous candidate object locations (often called "proposals") must be processed.
1. Introduction
Methods:In this paper, we streamline the training process for state-of-the-art ConvNet-based object detectors [9, 11]. We propose a single-stage training algorithm that jointly learns to classify object proposals and refine their spatial locations.
Result: The resulting method can train a very deep detection network (VGG16 [20]) 9× faster than R-CNN [9] and 3× faster than SPPnet [11]. At runtime, the detection network processes images in 0.3s (excluding object proposal time) while achieving top accuracy on PASCAL VOC 2012 [7] with a mAP of 66% (vs. 62% for R-CNN).
The paper first points out R-CNN's drawbacks: The Region-based Convolutional Network method (R-CNN) [9] achieves excellent object detection accuracy by using a deep ConvNet to classify object proposals. R-CNN, however, has notable drawbacks:
1. Training is a multi-stage pipeline.
R-CNN first fine-tunes a ConvNet on object proposals using log loss. Then, it fits SVMs to ConvNet features. These SVMs act as object detectors, replacing the softmax classifier learned by fine-tuning. In the third training stage, bounding-box regressors are learned.
2. Training is expensive in space and time.
For SVM and bounding-box regressor training, features are extracted from each object proposal in each image and written to disk. With very deep networks, such as VGG16, this process takes 2.5 GPU-days for the 5k images of the VOC07 trainval set. These features require hundreds of gigabytes of storage.
3. Object detection is slow.
At test-time, features are extracted from each object proposal in each test image. Detection with VGG16 takes 47s / image (on a GPU).
Discussion of R-CNN versus SPPnet (see also spatial pyramid pooling [15] and the fine-tuning algorithm of [11]):
R-CNN is slow because it performs a ConvNet forward pass for each object proposal, without sharing computation. Spatial pyramid pooling networks (SPPnets) [11] were proposed to speed up R-CNN by sharing computation.
The SPPnet method computes a convolutional feature map for the entire input image and then classifies each object proposal using a feature vector extracted from the shared feature map. Features are extracted for a proposal by max pooling the portion of the feature map inside the proposal into a fixed-size output (e.g., 6 × 6). Multiple output sizes are pooled and then concatenated as in spatial pyramid pooling [15]. SPPnet accelerates R-CNN by 10 to 100× at test time. Training time is also reduced by 3× due to faster proposal feature extraction.
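As a rough illustration, here is a sketch of that pyramid pooling using PyTorch's adaptive max pooling as a stand-in; the grid sizes (6, 3, 1) are illustrative rather than the exact pyramid from [11]:

```python
# Sketch of spatial pyramid pooling over one proposal's feature-map crop:
# max pooling at several grid sizes, flattened and concatenated into a
# single fixed-length vector regardless of the crop's spatial size.
import torch
import torch.nn.functional as F

def spp(crop, levels=(6, 3, 1)):  # crop: (C, H, W) feature-map region
    pooled = [F.adaptive_max_pool2d(crop.unsqueeze(0), l).flatten(1)
              for l in levels]
    return torch.cat(pooled, dim=1)  # fixed length: C * (36 + 9 + 1)

vec = spp(torch.randn(512, 13, 17))
print(vec.shape)  # torch.Size([1, 23552]), independent of H and W
```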
SPPnet's drawbacks:
SPPnet also has notable drawbacks. Like R-CNN, training is a multi-stage pipeline that involves extracting features, fine-tuning a network with log loss, training SVMs, and finally fitting bounding-box regressors. Features are also written to disk. But unlike R-CNN, the fine-tuning algorithm proposed in [11] cannot update the convolutional layers that precede the spatial pyramid pooling. Unsurprisingly, this limitation (fixed convolutional layers) limits the accuracy of very deep networks.
2. Fast R-CNN architecture and training
A Fast R-CNN network takes an entire image and a set of object proposals as input. First, the network processes the whole image with several convolutional (conv) and max-pooling layers to produce a conv feature map. Then, for each object proposal, a region of interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map. Each feature vector is fed into a sequence of fully connected layers that finally branch into two sibling output layers: one produces softmax probability estimates over K object classes plus a "background" class, and the other outputs four real-valued numbers for each of the K object classes, where each set of 4 values encodes a refined bounding-box position for one of the K classes.
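A minimal sketch of the two sibling output layers just described, assuming PyTorch, VGG16's 4096-d fc7 features, and K = 20 VOC object classes; the class name is illustrative:

```python
# Sketch of Fast R-CNN's two sibling output heads (assumption: PyTorch).
import torch
import torch.nn as nn

class FastRCNNHeads(nn.Module):
    def __init__(self, in_dim=4096, K=20):  # K object classes + 1 background
        super().__init__()
        self.cls_score = nn.Linear(in_dim, K + 1)  # softmax over K+1 classes
        self.bbox_pred = nn.Linear(in_dim, 4 * K)  # 4 box values per class

    def forward(self, fc_features):
        return self.cls_score(fc_features), self.bbox_pred(fc_features)

heads = FastRCNNHeads()
scores, boxes = heads(torch.randn(128, 4096))  # one mini-batch of RoI features
print(scores.shape, boxes.shape)               # (128, 21) and (128, 80)
```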
2.1 The RoI pooling layer
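In brief, the RoI pooling layer divides each h × w RoI window into a fixed H × W grid of sub-windows (7 × 7 for VGG16) and max-pools each sub-window, so every proposal yields a fixed-size feature regardless of its shape. Below is a runnable sketch using torchvision's built-in roi_pool op; the tensor shapes are illustrative:

```python
# Sketch of RoI pooling via torchvision (shapes are illustrative).
import torch
from torchvision.ops import roi_pool

feature_map = torch.randn(1, 512, 38, 50)  # conv5 features of one image
# RoIs given as (batch_index, x1, y1, x2, y2) in input-image coordinates
rois = torch.tensor([[0, 10.0, 10.0, 200.0, 150.0],
                     [0, 50.0, 40.0, 300.0, 220.0]])
# spatial_scale maps image coordinates onto the feature map (1/16 for VGG16)
pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)  # torch.Size([2, 512, 7, 7]): a fixed size per RoI
```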
2.2 Initializing from pre-trained networks
2.3 Fine-tuning for detection
Why can't SPPnet update the weights below the spatial pyramid pooling layer? The root cause is that back-propagation through the SPP layer is highly inefficient when each training sample (i.e., each RoI) comes from a different image, which is exactly how R-CNN and SPPnet are trained. The inefficiency comes from each RoI potentially having a very large receptive field, often spanning the entire input image. Since the forward pass must process the whole receptive field, the training inputs are large (often the entire image).
Fast R-CNN's training advantage (over R-CNN and SPPnet):
We propose a more efficient training method that takes advantage of feature sharing during training. In Fast R-CNN training, stochastic gradient descent (SGD) mini-batches are sampled hierarchically, first by sampling N images and then by sampling R/N RoIs from each image. Critically, RoIs from the same image share computation and memory in the forward and backward passes. Making N small decreases mini-batch computation. For example, when using N = 2 and R = 128, the proposed training scheme is roughly 64× faster than sampling one RoI from 128 different images (i.e., the R-CNN and SPPnet strategy).
Q: In "sampling R/N RoIs from each image", what are R and N? N is the number of images sampled per mini-batch and R is the total mini-batch size (R = 128 RoIs), so each image contributes R/N RoIs; see the sketch below.
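A minimal sketch of this hierarchical sampling, where proposals_for() is a hypothetical stand-in for a real proposal source such as selective search:

```python
# Sketch of hierarchical mini-batch sampling (proposals_for is hypothetical).
import random

def proposals_for(img_id):
    # stand-in: in practice, ~2000 selective-search proposals per image
    return [(img_id, i) for i in range(2000)]

def sample_minibatch(image_ids, N=2, R=128):
    images = random.sample(image_ids, N)  # first sample N images
    batch = []
    for img_id in images:
        rois = proposals_for(img_id)
        batch.append((img_id, random.sample(rois, R // N)))  # then R/N RoIs
    return batch  # RoIs from one image share a single forward/backward pass

batch = sample_minibatch(list(range(10)))
print([(img, len(rois)) for img, rois in batch])  # e.g. [(3, 64), (7, 64)]
```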
IoU (Intersection over Union, i.e., overlap ratio):
Object detection must localize each object's bounding box: as in the image below, we not only have to localize the vehicle's bounding box, we also have to recognize that the object inside that box is a vehicle.
(figure omitted: example image with a vehicle bounding box)
- Ground-truth bounding boxes: regions manually annotated around the objects to be detected in the training images.
- Predicted bounding boxes: regions produced by our algorithm.
對(duì)于bounding box的定位精度,有一個(gè)很重要的概念: 因?yàn)槲覀兯惴ú豢赡馨俜职俑斯?biāo)注的數(shù)據(jù)完全匹配,因此就存在一個(gè)定位精度評(píng)價(jià)公式:IOU。 它定義了兩個(gè)bounding box的重疊度,如下圖所示
(figure omitted: diagram of two overlapping bounding boxes)
Computing IoU:
IoU is the overlap area of the two regions divided by the area of their union; the result is then compared against a preset threshold. In other words, for rectangles A and B, IoU = area(A ∩ B) / area(A ∪ B).
For example (figure below): the green box is the ground truth and the red box is the prediction; a runnable sketch of the computation follows.
(figure omitted: green ground-truth box versus red predicted box)
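A small self-contained sketch of that computation for axis-aligned boxes given as (x1, y1, x2, y2):

```python
# Sketch of IoU for two axis-aligned boxes (x1, y1, x2, y2).
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # intersection rectangle (empty if the boxes do not overlap)
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```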