Paper 9: Fast R-CNN
Code: available under the open-source MIT License at https://github.com/rbgirshick/fast-rcnn.
Abstract:
Fast R-CNN improves both training and testing speed while also raising detection accuracy. Fast R-CNN trains the very deep VGG16 network 9× faster than R-CNN, is 213× faster at test time, and achieves a higher mAP on PASCAL VOC 2012. Compared with SPPnet, Fast R-CNN trains VGG16 faster and is also more accurate.
The Fast R-CNN method has several advantages:
1. Higher detection quality (mAP) than R-CNN and SPPnet
2. Training is single-stage, using a multi-task loss (see the sketch after this list)
3. Training can update all network layers
4. No disk storage is required for feature caching
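To make point 2 concrete, here is a minimal sketch of a multi-task loss of this form (classification log loss plus a box-regression loss that is active only for non-background RoIs), assuming PyTorch; the function name is illustrative, and for simplicity the box predictions are assumed to be pre-gathered for each RoI's ground-truth class:

```python
# Sketch of a Fast R-CNN-style multi-task loss (assumption: PyTorch).
import torch
import torch.nn.functional as F

def multi_task_loss(cls_scores, bbox_pred, labels, bbox_targets, lam=1.0):
    # cls_scores: (R, K+1) raw class scores; labels: (R,), 0 = background
    # bbox_pred / bbox_targets: (R, 4), pre-gathered for each RoI's GT class
    loss_cls = F.cross_entropy(cls_scores, labels)  # log loss over K+1 classes
    fg = labels > 0                                 # the [u >= 1] indicator
    if fg.any():
        # smooth L1 is less sensitive to outliers than an L2 loss
        loss_loc = F.smooth_l1_loss(bbox_pred[fg], bbox_targets[fg])
    else:
        loss_loc = torch.tensor(0.0)
    return loss_cls + lam * loss_loc
```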
mAP stands for mean Average Precision, literally an average of averages: first compute the Average Precision (AP) within each class, then average those per-class APs over all classes to get the mean Average Precision.
- mAP: mean Average Precision, the mean of the per-class AP values
- AP: the area under the PR curve (explained in detail later)
- PR curve: the Precision-Recall curve
- Precision: TP / (TP + FP)
- Recall: TP / (TP + FN)
- TP: the number of detections with IoU > 0.5 (each ground truth is counted at most once)
- FP: the number of detections with IoU <= 0.5, plus any redundant detections of an already-matched ground truth
- FN: the number of ground-truth boxes that are never detected
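Putting those definitions together, here is a minimal sketch of the per-class AP computation, assuming detections have already been matched to ground truth (yielding TP/FP flags) and sorted by descending confidence; note that PASCAL VOC's official metric uses an interpolated AP, which this sketch omits:

```python
# Sketch of per-class AP from TP/FP flags sorted by confidence.
import numpy as np

def average_precision(tp_flags, num_gt):
    """tp_flags: 1 for TP, 0 for FP, one per detection, sorted by
    descending confidence. num_gt: number of ground-truth boxes."""
    tp_flags = np.asarray(tp_flags, dtype=float)
    tp = np.cumsum(tp_flags)
    fp = np.cumsum(1.0 - tp_flags)
    recall = tp / num_gt          # TP / (TP + FN)
    precision = tp / (tp + fp)    # TP / (TP + FP)
    # AP = area under the PR curve, accumulated over recall increments
    recall = np.concatenate(([0.0], recall))
    return float(np.sum((recall[1:] - recall[:-1]) * precision))

# 5 detections (sorted by confidence) against 3 ground-truth boxes
print(average_precision([1, 1, 0, 1, 0], num_gt=3))  # ~0.917
```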
Proposals: First, numerous candidate object locations (often called "proposals") must be processed.
1. Introduction
Methods:In this paper, we streamline the training process for state-of-the-art ConvNet-based object detectors [9, 11]. We propose a single-stage training algorithm that jointly learns to classify object proposals and refine their spatial locations.
Result: The resulting method can train a very deep detection network (VGG16 [20]) 9× faster than R-CNN [9] and 3× faster than SPPnet [11]. At runtime, the detection network processes images in 0.3s (excluding object proposal time) while achieving top accuracy on PASCAL VOC 2012 [7] with a mAP of 66% (vs. 62% for R-CNN).
The paper first points out R-CNN's drawbacks: The Region-based Convolutional Network method (R-CNN) [9] achieves excellent object detection accuracy by using a deep ConvNet to classify object proposals. R-CNN, however, has notable drawbacks:
1. Training is a multi-stage pipeline.
R-CNN first fine-tunes a ConvNet on object proposals using log loss. Then, it fits SVMs to ConvNet features. These SVMs act as object detectors, replacing the softmax classifier learned by fine-tuning. In the third training stage, bounding-box regressors are learned.
2. Training is expensive in space and time.
For SVM and bounding-box regressor training, features are extracted from each object proposal in each image and written to disk. With very deep networks, such as VGG16, this process takes 2.5 GPU-days for the 5k images of the VOC07 trainval set. These features require hundreds of gigabytes of storage.
3. Object detection is slow.
At test-time, features are extracted from each object proposal in each test image. Detection with VGG16 takes 47s / image (on a GPU).
Discussion of R-CNN versus SPPnet (see also spatial pyramid pooling [15] and the fine-tuning algorithm of [11]):
R-CNN is slow because it performs a ConvNet forward pass for each object proposal, without sharing computation. Spatial pyramid pooling networks (SPPnets) [11] were proposed to speed up R-CNN by sharing computation.
The SPPnet method computes a convolutional feature map for the entire input image and then classifies each object proposal using a feature vector extracted from the shared feature map. Features are extracted for a proposal by max pooling the portion of the feature map inside the proposal into a fixed-size output (e.g., 6 × 6). Multiple output sizes are pooled and then concatenated as in spatial pyramid pooling [15]. SPPnet accelerates R-CNN by 10 to 100× at test time. Training time is also reduced by 3× due to faster proposal feature extraction.
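As a rough illustration, here is a sketch of that pyramid pooling using PyTorch's adaptive max pooling as a stand-in; the grid sizes (6, 3, 1) are illustrative rather than the exact pyramid from [11]:

```python
# Sketch of spatial pyramid pooling over one proposal's feature-map crop:
# max pooling at several grid sizes, flattened and concatenated into a
# single fixed-length vector regardless of the crop's spatial size.
import torch
import torch.nn.functional as F

def spp(crop, levels=(6, 3, 1)):  # crop: (C, H, W) feature-map region
    pooled = [F.adaptive_max_pool2d(crop.unsqueeze(0), l).flatten(1)
              for l in levels]
    return torch.cat(pooled, dim=1)  # fixed length: C * (36 + 9 + 1)

vec = spp(torch.randn(512, 13, 17))
print(vec.shape)  # torch.Size([1, 23552]), independent of H and W
```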
SPPnet's drawbacks:
SPPnet also has notable drawbacks. Like R-CNN, training is a multi-stage pipeline that involves extracting features, fine-tuning a network with log loss, training SVMs, and finally fitting bounding-box regressors. Features are also written to disk. But unlike R-CNN, the fine-tuning algorithm proposed in [11] cannot update the convolutional layers that precede the spatial pyramid pooling. Unsurprisingly, this limitation (fixed convolutional layers) limits the accuracy of very deep networks.
2. Fast R-CNN architecture and training
A Fast R-CNN network takes an entire image and a set of object proposals as input. First, the network processes the whole image with several convolutional (conv) and max-pooling layers to produce a conv feature map. Then, for each object proposal, a region of interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map. Each feature vector is fed into a sequence of fully connected layers that finally branch into two sibling output layers: one produces softmax probability estimates over K object classes plus a "background" class, and the other outputs four real-valued numbers for each of the K object classes, where each set of 4 values encodes a refined bounding-box position for one of the K classes.
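A minimal sketch of the two sibling output layers just described, assuming PyTorch, VGG16's 4096-d fc7 features, and K = 20 VOC object classes; the class name is illustrative:

```python
# Sketch of Fast R-CNN's two sibling output heads (assumption: PyTorch).
import torch
import torch.nn as nn

class FastRCNNHeads(nn.Module):
    def __init__(self, in_dim=4096, K=20):  # K object classes + 1 background
        super().__init__()
        self.cls_score = nn.Linear(in_dim, K + 1)  # softmax over K+1 classes
        self.bbox_pred = nn.Linear(in_dim, 4 * K)  # 4 box values per class

    def forward(self, fc_features):
        return self.cls_score(fc_features), self.bbox_pred(fc_features)

heads = FastRCNNHeads()
scores, boxes = heads(torch.randn(128, 4096))  # one mini-batch of RoI features
print(scores.shape, boxes.shape)               # (128, 21) and (128, 80)
```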
2.1 The RoI pooling layer
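In brief, the RoI pooling layer divides each h × w RoI window into a fixed H × W grid of sub-windows (7 × 7 for VGG16) and max-pools each sub-window, so every proposal yields a fixed-size feature regardless of its shape. Below is a runnable sketch using torchvision's built-in roi_pool op; the tensor shapes are illustrative:

```python
# Sketch of RoI pooling via torchvision (shapes are illustrative).
import torch
from torchvision.ops import roi_pool

feature_map = torch.randn(1, 512, 38, 50)  # conv5 features of one image
# RoIs given as (batch_index, x1, y1, x2, y2) in input-image coordinates
rois = torch.tensor([[0, 10.0, 10.0, 200.0, 150.0],
                     [0, 50.0, 40.0, 300.0, 220.0]])
# spatial_scale maps image coordinates onto the feature map (1/16 for VGG16)
pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)  # torch.Size([2, 512, 7, 7]): a fixed size per RoI
```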
2.2 Initializing from pre-trained networks
2.3 Fine-tuning for detection
Why can't SPPnet update the weights below the spatial pyramid pooling layer? The root cause is that back-propagation through the SPP layer is highly inefficient when each training sample (i.e., each RoI) comes from a different image, which is exactly how R-CNN and SPPnet are trained. The inefficiency comes from each RoI potentially having a very large receptive field, often spanning the entire input image. Since the forward pass must process the whole receptive field, the training inputs are large (often the entire image).
Fast R-CNN's training advantage (over R-CNN and SPPnet):
We propose a more efficient training method that takes advantage of feature sharing during training. In Fast R-CNN training, stochastic gradient descent (SGD) mini-batches are sampled hierarchically, first by sampling N images and then by sampling R/N RoIs from each image. Critically, RoIs from the same image share computation and memory in the forward and backward passes. Making N small decreases mini-batch computation. For example, when using N = 2 and R = 128, the proposed training scheme is roughly 64× faster than sampling one RoI from 128 different images (i.e., the R-CNN and SPPnet strategy).
Q: In "sampling R/N RoIs from each image", what are R and N? N is the number of images sampled per mini-batch and R is the total mini-batch size (R = 128 RoIs), so each image contributes R/N RoIs; see the sketch below.
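A minimal sketch of this hierarchical sampling, where proposals_for() is a hypothetical stand-in for a real proposal source such as selective search:

```python
# Sketch of hierarchical mini-batch sampling (proposals_for is hypothetical).
import random

def proposals_for(img_id):
    # stand-in: in practice, ~2000 selective-search proposals per image
    return [(img_id, i) for i in range(2000)]

def sample_minibatch(image_ids, N=2, R=128):
    images = random.sample(image_ids, N)  # first sample N images
    batch = []
    for img_id in images:
        rois = proposals_for(img_id)
        batch.append((img_id, random.sample(rois, R // N)))  # then R/N RoIs
    return batch  # RoIs from one image share a single forward/backward pass

batch = sample_minibatch(list(range(10)))
print([(img, len(rois)) for img, rois in batch])  # e.g. [(3, 64), (7, 64)]
```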
IoU (Intersection over Union, i.e., overlap ratio):
Object detection must localize each object's bounding box: as in the image below, we not only have to localize the vehicle's bounding box, we also have to recognize that the object inside that box is a vehicle.
(figure omitted: example image with a vehicle bounding box)
- Ground-truth bounding boxes: regions manually annotated around the objects to be detected in the training images.
- Predicted bounding boxes: regions produced by our algorithm.
對(duì)于bounding box的定位精度,有一個(gè)很重要的概念: 因?yàn)槲覀兯惴ú豢赡馨俜职俑斯?biāo)注的數(shù)據(jù)完全匹配,因此就存在一個(gè)定位精度評(píng)價(jià)公式:IOU。 它定義了兩個(gè)bounding box的重疊度,如下圖所示
(figure omitted: diagram of two overlapping bounding boxes)
Computing IoU:
IoU is the overlap area of the two regions divided by the area of their union; the result is then compared against a preset threshold. In other words, for rectangles A and B, IoU = area(A ∩ B) / area(A ∪ B).
For example (figure below): the green box is the ground truth and the red box is the prediction; a runnable sketch of the computation follows.
(figure omitted: green ground-truth box versus red predicted box)
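A small self-contained sketch of that computation for axis-aligned boxes given as (x1, y1, x2, y2):

```python
# Sketch of IoU for two axis-aligned boxes (x1, y1, x2, y2).
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # intersection rectangle (empty if the boxes do not overlap)
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```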