Paper: Translation and Interpretation of 《Spatial Transformer Networks》
Contents
Translation and Interpretation of 《Spatial Transformer Networks》
Abstract
1 Introduction
2 Related Work
3 Spatial Transformers
3.1 Localisation Network
3.2 Parameterised Sampling Grid
3.3 Differentiable Image Sampling
3.4 Spatial Transformer Networks
4 Experiments
4.1 Distorted MNIST
4.2 Street View House Numbers
4.3 Fine-Grained Classification
5 Conclusion
Translation and Interpretation of 《Spatial Transformer Networks》
| Link | https://arxiv.org/pdf/1506.02025.pdf |
| Authors | Max Jaderberg, Karen Simonyan, Andrew Zisserman, Koray Kavukcuoglu (Google DeepMind, London, UK) {jaderberg,simonyan,zisserman,korayk}@google.com |
Abstract
Convolutional Neural Networks define an exceptionally powerful class of models, but are still limited by the lack of ability to be spatially invariant to the input data in a computationally and parameter efficient manner. In this work we introduce a new learnable module, the Spatial Transformer, which explicitly allows the spatial manipulation of data within the network. This differentiable module can be inserted into existing convolutional architectures, giving neural networks the ability to actively spatially transform feature maps, conditional on the feature map itself, without any extra training supervision or modification to the optimisation process. We show that the use of spatial transformers results in models which learn invariance to translation, scale, rotation and more generic warping, resulting in state-of-the-art performance on several benchmarks, and for a number of classes of transformations.
1 Introduction

Over recent years, the landscape of computer vision has been drastically altered and pushed forward through the adoption of a fast, scalable, end-to-end learning framework, the Convolutional Neural Network (CNN) [21]. Though not a recent invention, we now see a cornucopia of CNN-based models achieving state-of-the-art results in classification [19, 28, 35], localisation [31, 37], semantic segmentation [24], and action recognition [12, 32] tasks, amongst others.

A desirable property of a system which is able to reason about images is to disentangle object pose and part deformation from texture and shape. The introduction of local max-pooling layers in CNNs has helped to satisfy this property by allowing a network to be somewhat spatially invariant to the position of features. However, due to the typically small spatial support for max-pooling (e.g. 2 × 2 pixels) this spatial invariance is only realised over a deep hierarchy of max-pooling and convolutions, and the intermediate feature maps (convolutional layer activations) in a CNN are not actually invariant to large transformations of the input data [6, 22]. This limitation of CNNs is due to having only a limited, pre-defined pooling mechanism for dealing with variations in the spatial arrangement of data.
In this work we introduce a Spatial Transformer module, that can be included into a standard neural network architecture to provide spatial transformation capabilities. The action of the spatial transformer is conditioned on individual data samples, with the appropriate behaviour learnt during training for the task in question (without extra supervision). Unlike pooling layers, where the receptive fields are fixed and local, the spatial transformer module is a dynamic mechanism that can actively spatially transform an image (or a feature map) by producing an appropriate transformation for each input sample. The transformation is then performed on the entire feature map (non-locally) and can include scaling, cropping, rotations, as well as non-rigid deformations. This allows networks which include spatial transformers to not only select regions of an image that are most relevant (attention), but also to transform those regions to a canonical, expected pose to simplify recognition in the following layers. Notably, spatial transformers can be trained with standard back-propagation, allowing for end-to-end training of the models they are injected in.
Figure 1: The result of using a spatial transformer as the first layer of a fully-connected network trained for distorted MNIST digit classification. (a) The input to the spatial transformer network is an image of an MNIST digit that is distorted with random translation, scale, rotation, and clutter. (b) The localisation network of the spatial transformer predicts a transformation to apply to the input image. (c) The output of the spatial transformer, after applying the transformation. (d) The classification prediction produced by the subsequent fully-connected network on the output of the spatial transformer. The spatial transformer network (a CNN including a spatial transformer module) is trained end-to-end with only class labels – no knowledge of the groundtruth transformations is given to the system.
Spatial transformers can be incorporated into CNNs to benefit multifarious tasks, for example: (i) image classification: suppose a CNN is trained to perform multi-way classification of images according to whether they contain a particular digit – where the position and size of the digit may vary significantly with each sample (and are uncorrelated with the class); a spatial transformer that crops out and scale-normalizes the appropriate region can simplify the subsequent classification task, and lead to superior classification performance, see Fig. 1; (ii) co-localisation: given a set of images containing different instances of the same (but unknown) class, a spatial transformer can be used to localise them in each image; (iii) spatial attention: a spatial transformer can be used for tasks requiring an attention mechanism, such as in [14, 39], but is more flexible and can be trained purely with backpropagation without reinforcement learning. A key benefit of using attention is that transformed (and so attended), lower resolution inputs can be used in favour of higher resolution raw inputs, resulting in increased computational efficiency.

The rest of the paper is organised as follows: Sect. 2 discusses some work related to our own, we introduce the formulation and implementation of the spatial transformer in Sect. 3, and finally give the results of experiments in Sect. 4. Additional experiments and implementation details are given in Appendix A.
2 Related Work

In this section we discuss the prior work related to the paper, covering the central ideas of modelling transformations with neural networks [15, 16, 36], learning and analysing transformation-invariant representations [4, 6, 10, 20, 22, 33], as well as attention and detection mechanisms for feature selection [1, 7, 11, 14, 27, 29].

Early work by Hinton [15] looked at assigning canonical frames of reference to object parts, a theme which recurred in [16] where 2D affine transformations were modeled to create a generative model composed of transformed parts. The targets of the generative training scheme are the transformed input images, with the transformations between input images and targets given as an additional input to the network. The result is a generative model which can learn to generate transformed images of objects by composing parts. The notion of a composition of transformed parts is taken further by Tieleman [36], where learnt parts are explicitly affine-transformed, with the transform predicted by the network. Such generative capsule models are able to learn discriminative features for classification from transformation supervision.
The invariance and equivariance of CNN representations to input image transformations are studied in [22] by estimating the linear relationships between representations of the original and transformed images. Cohen & Welling [6] analyse this behaviour in relation to symmetry groups, which is also exploited in the architecture proposed by Gens & Domingos [10], resulting in feature maps that are more invariant to symmetry groups. Other attempts to design transformation invariant representations are scattering networks [4], and CNNs that construct filter banks of transformed filters [20, 33]. Stollenga et al. [34] use a policy based on a network's activations to gate the responses of the network's filters for a subsequent forward pass of the same image and so can allow attention to specific features. In this work, we aim to achieve invariant representations by manipulating the data rather than the feature extractors, something that was done for clustering in [9].
Figure 2: The architecture of a spatial transformer module. The input feature map U is passed to a localisation network which regresses the transformation parameters θ. The regular spatial grid G over V is transformed to the sampling grid Tθ(G), which is applied to U as described in Sect. 3.3, producing the warped output feature map V. The combination of the localisation network and sampling mechanism defines a spatial transformer.
Neural networks with selective attention manipulate the data by taking crops, and so are able to learn translation invariance. Work such as [1, 29] are trained with reinforcement learning to avoid the need for a differentiable attention mechanism, while [14] use a differentiable attention mechanism by utilising Gaussian kernels in a generative model. The work by Girshick et al. [11] uses a region proposal algorithm as a form of attention, and [7] show that it is possible to regress salient regions with a CNN. The framework we present in this paper can be seen as a generalisation of differentiable attention to any spatial transformation.
3 Spatial Transformers

In this section we describe the formulation of a spatial transformer. This is a differentiable module which applies a spatial transformation to a feature map during a single forward pass, where the transformation is conditioned on the particular input, producing a single output feature map. For multi-channel inputs, the same warping is applied to each channel. For simplicity, in this section we consider single transforms and single outputs per transformer, however we can generalise to multiple transformations, as shown in experiments.

The spatial transformer mechanism is split into three parts, shown in Fig. 2. In order of computation, first a localisation network (Sect. 3.1) takes the input feature map, and through a number of hidden layers outputs the parameters of the spatial transformation that should be applied to the feature map – this gives a transformation conditional on the input. Then, the predicted transformation parameters are used to create a sampling grid, which is a set of points where the input map should be sampled to produce the transformed output. This is done by the grid generator, described in Sect. 3.2. Finally, the feature map and the sampling grid are taken as inputs to the sampler, producing the output map sampled from the input at the grid points (Sect. 3.3).

The combination of these three components forms a spatial transformer and will now be described in more detail in the following sections.
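To make the three-part pipeline concrete, the sketch below composes a localisation network, grid generator, and sampler into one module. It is a minimal PyTorch sketch under the assumption of an affine transformation: the localisation-network layer sizes are illustrative choices rather than the paper's configuration, and PyTorch's built-in `F.affine_grid` and `F.grid_sample` stand in for the grid generator (Sect. 3.2) and bilinear sampler (Sect. 3.3).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformer(nn.Module):
    """Minimal affine spatial transformer: localisation network -> grid generator -> sampler."""
    def __init__(self, in_channels, out_size):
        super().__init__()
        self.out_size = out_size  # (H', W') of the warped output feature map V
        # Localisation network (Sect. 3.1): any net ending in a regression layer for theta.
        self.loc = nn.Sequential(
            nn.Conv2d(in_channels, 8, kernel_size=7), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(8, 10, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(10, 6),  # regression layer: the 6 affine parameters
        )
        # Start from the identity transform so the module initially performs no warp.
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, U):
        theta = self.loc(U).view(-1, 2, 3)                    # theta = f_loc(U)
        N, C = U.shape[0], U.shape[1]
        grid = F.affine_grid(theta, (N, C, *self.out_size),   # grid generator: T_theta(G)
                             align_corners=False)
        V = F.grid_sample(U, grid, align_corners=False)       # bilinear sampler
        return V
```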
3.1 Localisation Network

The localisation network takes the input feature map U ∈ R^{H×W×C} with width W, height H and C channels and outputs θ, the parameters of the transformation Tθ to be applied to the feature map: θ = f_loc(U). The size of θ can vary depending on the transformation type that is parameterised, e.g. for an affine transformation θ is 6-dimensional as in (10). The localisation network function f_loc() can take any form, such as a fully-connected network or a convolutional network, but should include a final regression layer to produce the transformation parameters θ.
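To illustrate the point that f_loc() can take any form as long as it ends in a regression layer, here is a hedged sketch of a small fully-connected localisation network for the affine case. The 28 × 28 input size and the 32-unit hidden layer are invented for the example, and initialising the regression layer to the identity transform is borrowed from the SVHN experiments in Sect. 4.2 rather than stated here.

```python
import torch
import torch.nn as nn

# theta's dimensionality depends on the transformation class:
# 6 for a full affine transform, fewer for the constrained attention transform of Eq. (2),
# more for e.g. a thin plate spline parameterised by its control points.
NUM_THETA = 6

loc_net = nn.Sequential(            # f_loc(): here a small fully-connected network
    nn.Flatten(),
    nn.Linear(28 * 28, 32), nn.ReLU(),
    nn.Linear(32, NUM_THETA),       # final regression layer producing theta
)
loc_net[-1].weight.data.zero_()     # identity-transform initialisation (assumed convention)
loc_net[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

U = torch.randn(8, 1, 28, 28)       # a batch of single-channel feature maps (illustrative size)
theta = loc_net(U).view(-1, 2, 3)   # theta = f_loc(U), reshaped into the 2x3 matrix A_theta
```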
3.2 Parameterised Sampling Grid

To perform a warping of the input feature map, each output pixel is computed by applying a sampling kernel centered at a particular location in the input feature map (this is described fully in the next section). By pixel we refer to an element of a generic feature map, not necessarily an image. In general, the output pixels are defined to lie on a regular grid G = {G_i} of pixels G_i = (x_i^t, y_i^t), forming an output feature map V ∈ R^{H′×W′×C}, where H′ and W′ are the height and width of the grid, and C is the number of channels, which is the same in the input and output. For clarity of exposition, assume for the moment that Tθ is a 2D affine transformation Aθ; in this affine case the pointwise transformation is

$$\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = T_\theta(G_i) = A_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}$$

where (x_i^t, y_i^t) are the target coordinates of the regular grid in the output feature map, (x_i^s, y_i^s) are the source coordinates in the input feature map that define the sample points, and Aθ is the affine transformation matrix. We use height and width normalised coordinates, such that −1 ≤ x_i^t, y_i^t ≤ 1 when within the spatial bounds of the output, and −1 ≤ x_i^s, y_i^s ≤ 1 when within the spatial bounds of the input (and similarly for the y coordinates). The source/target transformation and sampling is equivalent to the standard texture mapping and coordinates used in graphics [8].

The class of transformations Tθ may be more constrained, such as that used for attention

$$A_\theta = \begin{bmatrix} s & 0 & t_x \\ 0 & s & t_y \end{bmatrix} \tag{2}$$

allowing cropping, translation, and isotropic scaling by varying s, t_x, and t_y. The transformation Tθ can also be more general, such as a plane projective transformation with 8 parameters, piecewise affine, or a thin plate spline. Indeed, the transformation can have any parameterised form, provided that it is differentiable with respect to the parameters – this crucially allows gradients to be backpropagated through from the sample points Tθ(G_i) to the localisation network output θ. If the transformation is parameterised in a structured, low-dimensional way, this reduces the complexity of the task assigned to the localisation network. For instance, a generic class of structured and differentiable transformations, which is a superset of attention, affine, projective, and thin plate spline transformations, is Tθ = M_θ B, where B is a target grid representation (e.g. in (10), B is the regular grid G in homogeneous coordinates), and M_θ is a matrix parameterised by θ. In this case it is possible to not only learn how to predict θ for a sample, but also to learn B for the task at hand.
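The grid generator is a few lines of linear algebra. The NumPy sketch below applies a 2 × 3 affine matrix to the height/width-normalised regular grid in homogeneous coordinates, and then builds an attention-style transform of the form (2); the particular values of s, t_x, t_y, the 28 × 28 output size, and the final pixel-coordinate conversion convention are assumptions for illustration only.

```python
import numpy as np

def affine_sampling_grid(theta, out_h, out_w):
    """Grid generator sketch: source coords (x_s, y_s) = A_theta @ [x_t, y_t, 1]^T
    for a regular, normalised target grid G with coordinates in [-1, 1].
    theta: array of shape (2, 3)."""
    xt, yt = np.meshgrid(np.linspace(-1, 1, out_w), np.linspace(-1, 1, out_h))
    G = np.stack([xt.ravel(), yt.ravel(), np.ones(out_h * out_w)])  # homogeneous targets, (3, H'W')
    src = np.asarray(theta) @ G                                     # normalised source coords, (2, H'W')
    return src[0].reshape(out_h, out_w), src[1].reshape(out_h, out_w)

# Attention-style transform of Eq. (2): isotropic scale s and translation (t_x, t_y).
s, tx, ty = 0.5, 0.25, 0.0           # example values: attend to a half-size region right of centre
theta_attention = np.array([[s, 0.0, tx],
                            [0.0, s, ty]])
xs, ys = affine_sampling_grid(theta_attention, out_h=28, out_w=28)

# Before sampling, normalised coords are mapped to pixel units; one common convention is
#   x_pix = (xs + 1) * (W - 1) / 2,  y_pix = (ys + 1) * (H - 1) / 2
```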
3.3 Differentiable Image Sampling

To perform a spatial transformation of the input feature map, a sampler must take the set of sampling points Tθ(G), along with the input feature map U and produce the sampled output feature map V. Each (x_i^s, y_i^s) coordinate in Tθ(G) defines the spatial location in the input where a sampling kernel is applied to get the value at a particular pixel in the output V. This can be written as

$$V_i^c = \sum_n^H \sum_m^W U_{nm}^c \, k(x_i^s - m; \Phi_x)\, k(y_i^s - n; \Phi_y) \quad \forall i \in [1 \ldots H'W'],\ \forall c \in [1 \ldots C] \tag{3}$$
where Φ_x and Φ_y are the parameters of a generic sampling kernel k() which defines the image interpolation (e.g. bilinear), U_{nm}^c is the value at location (n, m) in channel c of the input, and V_i^c is the output value for pixel i at location (x_i^t, y_i^t) in channel c. Note that the sampling is done identically for each channel of the input, so every channel is transformed in an identical way (this preserves spatial consistency between channels).

In theory, any sampling kernel can be used, as long as (sub-)gradients can be defined with respect to x_i^s and y_i^s. For example, using the integer sampling kernel reduces (3) to

$$V_i^c = \sum_n^H \sum_m^W U_{nm}^c\, \delta(\lfloor x_i^s + 0.5 \rfloor - m)\, \delta(\lfloor y_i^s + 0.5 \rfloor - n) \tag{4}$$

where ⌊x + 0.5⌋ rounds x to the nearest integer and δ() is the Kronecker delta function. This sampling kernel equates to just copying the value at the nearest pixel to (x_i^s, y_i^s) to the output location (x_i^t, y_i^t). Alternatively, a bilinear sampling kernel can be used, giving

$$V_i^c = \sum_n^H \sum_m^W U_{nm}^c \max(0, 1 - |x_i^s - m|) \max(0, 1 - |y_i^s - n|) \tag{5}$$

To allow backpropagation of the loss through this sampling mechanism we can define the gradients with respect to U and G. For bilinear sampling (5) the partial derivatives are

$$\frac{\partial V_i^c}{\partial U_{nm}^c} = \sum_n^H \sum_m^W \max(0, 1 - |x_i^s - m|) \max(0, 1 - |y_i^s - n|) \tag{6}$$

$$\frac{\partial V_i^c}{\partial x_i^s} = \sum_n^H \sum_m^W U_{nm}^c \max(0, 1 - |y_i^s - n|) \begin{cases} 0 & \text{if } |m - x_i^s| \ge 1 \\ 1 & \text{if } m \ge x_i^s \\ -1 & \text{if } m < x_i^s \end{cases} \tag{7}$$

and similarly to (7) for ∂V_i^c/∂y_i^s.
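As a check on how (5) reads, here is a deliberately literal NumPy sketch of the bilinear sampler for a single-channel map, summing the kernel over every input location exactly as written; it assumes the source coordinates have already been converted from normalised to pixel units.

```python
import numpy as np

def bilinear_sample_naive(U, xs, ys):
    """Literal Eq. (5): V_i = sum_n sum_m U[n, m] * max(0, 1 - |xs_i - m|) * max(0, 1 - |ys_i - n|).
    U: (H, W) single-channel input; xs, ys: 1-D arrays of source coordinates in pixel units."""
    H, W = U.shape
    m = np.arange(W)   # column (x) indices
    n = np.arange(H)   # row (y) indices
    V = np.empty(len(xs))
    for i, (x, y) in enumerate(zip(xs, ys)):
        kx = np.maximum(0.0, 1.0 - np.abs(x - m))   # sampling kernel along x, shape (W,)
        ky = np.maximum(0.0, 1.0 - np.abs(y - n))   # sampling kernel along y, shape (H,)
        V[i] = ky @ U @ kx                          # the double sum over n and m
    return V
```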
This gives us a (sub-)differentiable sampling mechanism, allowing loss gradients to flow back not only to the input feature map (6), but also to the sampling grid coordinates (7), and therefore back to the transformation parameters θ and localisation network, since ∂x_i^s/∂θ and ∂y_i^s/∂θ can be easily derived from (10) for example. Due to discontinuities in the sampling functions, sub-gradients must be used. This sampling mechanism can be implemented very efficiently on GPU, by ignoring the sum over all input locations and instead just looking at the kernel support region for each output pixel.
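That last remark, skipping the full double sum and looking only at the kernel support region, is what a practical implementation does: for the bilinear kernel each output value depends on at most the 2 × 2 neighbourhood around its sample point. A hedged NumPy sketch of this equivalent gather-based form follows; clamping out-of-bounds indices to the border is an assumed boundary convention, since this section does not specify one.

```python
import numpy as np

def bilinear_sample_fast(U, xs, ys):
    """Equivalent to Eq. (5) for in-bounds samples, but gathers only the four neighbouring pixels.
    U: (H, W) single-channel input; xs, ys: arrays of source coordinates in pixel units."""
    H, W = U.shape
    x0 = np.floor(xs).astype(int); x1 = x0 + 1
    y0 = np.floor(ys).astype(int); y1 = y0 + 1
    # Assumed boundary handling: clamp indices so out-of-bounds samples read border pixels.
    x0c, x1c = np.clip(x0, 0, W - 1), np.clip(x1, 0, W - 1)
    y0c, y1c = np.clip(y0, 0, H - 1), np.clip(y1, 0, H - 1)
    wx1 = xs - x0; wx0 = 1.0 - wx1    # max(0, 1 - |x - m|) for the two neighbouring columns
    wy1 = ys - y0; wy0 = 1.0 - wy1    # and for the two neighbouring rows
    return (U[y0c, x0c] * wy0 * wx0 + U[y0c, x1c] * wy0 * wx1 +
            U[y1c, x0c] * wy1 * wx0 + U[y1c, x1c] * wy1 * wx1)
```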
3.4 Spatial Transformer Networks

The combination of the localisation network, grid generator, and sampler form a spatial transformer (Fig. 2). This is a self-contained module which can be dropped into a CNN architecture at any point, and in any number, giving rise to spatial transformer networks. This module is computationally very fast and does not impair the training speed, causing very little time overhead when used naively, and even speedups in attentive models due to subsequent downsampling that can be applied to the output of the transformer.

Placing spatial transformers within a CNN allows the network to learn how to actively transform the feature maps to help minimise the overall cost function of the network during training. The knowledge of how to transform each training sample is compressed and cached in the weights of the localisation network (and also the weights of the layers previous to a spatial transformer) during training. For some tasks, it may also be useful to feed the output of the localisation network, θ, forward to the rest of the network, as it explicitly encodes the transformation, and hence the pose, of a region or object.
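A hedged sketch of "dropping in" a transformer: the module below places one spatial transformer in front of a small classifier CNN and also returns θ, so the predicted pose could be fed forward as suggested above. All layer sizes are invented placeholders (the paper's actual architectures are in its Appendix A); training needs only class labels, e.g. a cross-entropy loss on the logits.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class STClassifier(nn.Module):
    """A spatial transformer dropped in front of a small classifier CNN (illustrative sizes only)."""
    def __init__(self, num_classes=10):
        super().__init__()
        # Localisation network regressing the 6 affine parameters, initialised to the identity.
        self.loc = nn.Sequential(
            nn.Conv2d(1, 8, 5), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(8, 16, 5), nn.MaxPool2d(2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 6),
        )
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))
        # Classification network operating on the transformed (pose-normalised) input.
        self.clf = nn.Sequential(
            nn.Conv2d(1, 32, 5), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(32, 32, 5), nn.MaxPool2d(2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, num_classes),
        )

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.shape, align_corners=False)   # output size equal to the input here
        x = F.grid_sample(x, grid, align_corners=False)             # actively warp the input
        return self.clf(x), theta   # theta encodes the predicted pose and can be fed forward

# End-to-end training with class labels only, e.g.:
#   logits, _ = model(images); loss = F.cross_entropy(logits, labels); loss.backward()
```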
Table 1: Left: The percentage errors for different models on different distorted MNIST datasets. The different distorted MNIST datasets we test are TC: translated and cluttered, R: rotated, RTS: rotated, translated, and scaled, P: projective distortion, E: elastic distortion. All the models used for each experiment have the same number of parameters, and same base structure for all experiments. Right: Some example test images where a spatial transformer network correctly classifies the digit but a CNN fails. (a) The inputs to the networks. (b) The transformations predicted by the spatial transformers, visualised by the grid Tθ(G). (c) The outputs of the spatial transformers. E and RTS examples use thin plate spline spatial transformers (ST-CNN TPS), while R examples use affine spatial transformers (ST-CNN Aff) with the angles of the affine transformations given. For videos showing animations of these experiments and more see https://goo.gl/qdEhUu.
It is also possible to use spatial transformers to downsample or oversample a feature map, as one can define the output dimensions H′ and W′ to be different to the input dimensions H and W. However, with sampling kernels with a fixed, small spatial support (such as the bilinear kernel), downsampling with a spatial transformer can cause aliasing effects. Finally, it is possible to have multiple spatial transformers in a CNN. Placing multiple spatial transformers at increasing depths of a network allows transformations of increasingly abstract representations, and also gives the localisation networks potentially more informative representations to base the predicted transformation parameters on. One can also use multiple spatial transformers in parallel – this can be useful if there are multiple objects or parts of interest in a feature map that should be focussed on individually. A limitation of this architecture in a purely feed-forward network is that the number of parallel spatial transformers limits the number of objects that the network can model.
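Both points, choosing H′ × W′ different from the input size and running several transformers in parallel, are easy to see in code. The PyTorch snippet below uses hand-picked, fixed θ values purely for illustration (in a real network they would come from localisation networks), and the shapes and crop parameters are arbitrary assumptions.

```python
import torch
import torch.nn.functional as F

U = torch.randn(4, 64, 32, 32)   # a convolutional feature map (illustrative shape)

# Downsampling: choose an output grid H' x W' smaller than H x W.
theta_id = torch.tensor([[1., 0., 0.], [0., 1., 0.]]).expand(4, 2, 3)   # identity transform
grid_small = F.affine_grid(theta_id, (4, 64, 16, 16), align_corners=False)
V_small = F.grid_sample(U, grid_small, align_corners=False)             # (4, 64, 16, 16)

# Two transformers in parallel, each attending to a different part of the same feature map
# (attention form of Eq. (2): isotropic scale 0.5 plus a horizontal translation).
theta_a = torch.tensor([[0.5, 0., -0.5], [0., 0.5, 0.]]).expand(4, 2, 3)  # left-of-centre crop
theta_b = torch.tensor([[0.5, 0.,  0.5], [0., 0.5, 0.]]).expand(4, 2, 3)  # right-of-centre crop
parts = [F.grid_sample(U, F.affine_grid(t, (4, 64, 16, 16), align_corners=False),
                       align_corners=False) for t in (theta_a, theta_b)]
V_parts = torch.cat(parts, dim=1)   # (4, 128, 16, 16): one block of channels per attended part
```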
4 Experiments

In this section we explore the use of spatial transformer networks on a number of supervised learning tasks. In Sect. 4.1 we begin with experiments on distorted versions of the MNIST handwriting dataset, showing the ability of spatial transformers to improve classification performance through actively transforming the input images. In Sect. 4.2 we test spatial transformer networks on a challenging real-world dataset, Street View House Numbers [25], for number recognition, showing state-of-the-art results using multiple spatial transformers embedded in the convolutional stack of a CNN. Finally, in Sect. 4.3, we investigate the use of multiple parallel spatial transformers for fine-grained classification, showing state-of-the-art performance on the CUB-200-2011 birds dataset [38] by discovering object parts and learning to attend to them. Further experiments on MNIST addition and co-localisation can be found in Appendix A.
4.1 Distorted MNIST

In this section we use the MNIST handwriting dataset as a testbed for exploring the range of transformations to which a network can learn invariance by using a spatial transformer.

We begin with experiments where we train different neural network models to classify MNIST data that has been distorted in various ways: rotation (R); rotation, scale and translation (RTS); projective transformation (P); and elastic warping (E) – note that elastic warping is destructive and cannot be inverted in some cases. The full details of the distortions used to generate this data are given in Appendix A. We train baseline fully-connected (FCN) and convolutional (CNN) neural networks, as well as networks with spatial transformers acting on the input before the classification network (ST-FCN and ST-CNN). The spatial transformer networks all use bilinear sampling, but variants use different transformation functions: an affine transformation (Aff), projective transformation (Proj), and a 16-point thin plate spline transformation (TPS) [2]. The CNN models include two max-pooling layers. All networks have approximately the same number of parameters, are trained with identical optimisation schemes (backpropagation, SGD, scheduled learning rate decrease, with a multinomial cross entropy loss), and all with three weight layers in the classification network.
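For orientation, the optimisation scheme described above maps onto standard library components roughly as follows. This is only a hedged sketch: the placeholder model, the learning rate, and the schedule step are invented values, since the actual hyperparameters are given in the paper's Appendix A.

```python
import torch
import torch.nn as nn

# Placeholder model standing in for any of the FCN / CNN / ST-FCN / ST-CNN variants,
# here assuming 60 x 60 cluttered-MNIST inputs and a 10-way classifier.
model = nn.Sequential(nn.Flatten(), nn.Linear(60 * 60, 128), nn.ReLU(), nn.Linear(128, 10))

criterion = nn.CrossEntropyLoss()                         # multinomial cross entropy loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # backpropagation with SGD
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1)  # scheduled LR decrease

def train_step(images, labels):
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()        # gradients also flow through any spatial transformer inside the model
    optimizer.step()
    return loss.item()
# scheduler.step() would be called once per epoch.
```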
Table 2: Left: The sequence error for SVHN multi-digit recognition on crops of 64 × 64 pixels (64px), and inflated crops of 128 × 128 (128px) which include more background. *The best reported result from [1] uses model averaging and Monte Carlo averaging, whereas the results from other models are from a single forward pass of a single model. Right: (a) The schematic of the ST-CNN Multi model. The transformation applied by each spatial transformer (ST) is applied to the convolutional feature map produced by the previous layer. (b) The result of multiplying out the affine transformations predicted by the four spatial transformers in ST-CNN Multi, visualised on the input image.
The results of these experiments are shown in Table 1 (left). Looking at any particular type of distortion of the data, it is clear that a spatial transformer enabled network outperforms its counterpart base network. For the case of rotation, translation, and scale distortion (RTS), the ST-CNN achieves 0.5% and 0.6% error depending on the class of transform used for Tθ, whereas a CNN, with two max-pooling layers to provide spatial invariance, achieves 0.8% error. This is in fact the same error that the ST-FCN achieves, which is without a single convolution or max-pooling layer in its network, showing that using a spatial transformer is an alternative way to achieve spatial invariance. ST-CNN models consistently perform better than ST-FCN models due to max-pooling layers in ST-CNN providing even more spatial invariance, and convolutional layers better modelling local structure. We also test our models in a noisy environment, on 60 × 60 images with translated MNIST digits and background clutter (see Fig. 1 third row for an example): an FCN gets 13.2% error, a CNN gets 3.5% error, while an ST-FCN gets 2.0% error and an ST-CNN gets 1.7% error.

Looking at the results between different classes of transformation, the thin plate spline transformation (TPS) is the most powerful, being able to reduce error on elastically deformed digits by reshaping the input into a prototype instance of the digit, reducing the complexity of the task for the classification network, and does not overfit on simpler data, e.g. R. Interestingly, the transformation of inputs for all ST models leads to a "standard" upright posed digit – this is the mean pose found in the training data. In Table 1 (right), we show the transformations performed for some test cases where a CNN is unable to correctly classify the digit, but a spatial transformer network can. Further test examples are visualised in an animation here: https://goo.gl/qdEhUu.
4.2 Street View House Numbers

We now test our spatial transformer networks on a challenging real-world dataset, Street View House Numbers (SVHN) [25]. This dataset contains around 200k real world images of house numbers, with the task to recognise the sequence of numbers in each image. There are between 1 and 5 digits in each image, with a large variability in scale and spatial arrangement.

We follow the experimental setup as in [1, 13], where the data is preprocessed by taking 64 × 64 crops around each digit sequence. We also use an additional more loosely cropped 128 × 128 dataset as in [1]. We train a baseline character sequence CNN model with 11 hidden layers leading to five independent softmax classifiers, each one predicting the digit at a particular position in the sequence. This is the character sequence model used in [19], where each classifier includes a null-character output to model variable length sequences. This model matches the results obtained in [13].
We extend this baseline CNN to include a spatial transformer immediately following the input (ST-CNN Single), where the localisation network is a four-layer CNN. We also define another extension where before each of the first four convolutional layers of the baseline CNN, we insert a spatial transformer (ST-CNN Multi), where the localisation networks are all two layer fully connected networks with 32 units per layer. In the ST-CNN Multi model, the spatial transformer before the first convolutional layer acts on the input image as with the previous experiments, however the subsequent spatial transformers deeper in the network act on the convolutional feature maps, predicting a transformation from them and transforming these feature maps (this is visualised in Table 2 (right) (a)). This allows deeper spatial transformers to predict a transformation based on richer features rather than the raw image. All networks are trained from scratch with SGD and dropout [17], with randomly initialised weights, except for the regression layers of spatial transformers which are initialised to predict the identity transform. Affine transformations and bilinear sampling kernels are used for all spatial transformer networks in these experiments.
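Below is a hedged sketch of the ST-CNN Multi pattern: spatial transformers inserted before convolutional layers, acting on feature maps rather than only on the input image, with regression layers initialised to the identity transform. The two-layer, 32-unit localisation network follows the description above, but the pooling used to flatten the feature map before it, and every other size, are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureMapST(nn.Module):
    """Affine spatial transformer acting on a convolutional feature map (ST-CNN Multi style)."""
    def __init__(self, channels):
        super().__init__()
        self.loc = nn.Sequential(
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),     # assumed way of flattening the feature map
            nn.Linear(channels * 16, 32), nn.ReLU(),   # two-layer FC localisation net, 32 units
            nn.Linear(32, 6),
        )
        # Regression layer initialised to predict the identity transform.
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, fmap):
        theta = self.loc(fmap).view(-1, 2, 3)
        grid = F.affine_grid(theta, fmap.shape, align_corners=False)
        return F.grid_sample(fmap, grid, align_corners=False)

# Interleaving with convolutions, schematically: ST -> conv1 -> ST -> conv2 -> ...
conv1, conv2 = nn.Conv2d(3, 32, 3, padding=1), nn.Conv2d(32, 64, 3, padding=1)
st1, st2 = FeatureMapST(3), FeatureMapST(32)
x = torch.randn(2, 3, 64, 64)
h = conv2(st2(conv1(st1(x))))   # deeper transformers see richer features than the raw image
```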
Table 3: Left: The accuracy on the CUB-200-2011 bird classification dataset. Spatial transformer networks with two spatial transformers (2×ST-CNN) and four spatial transformers (4×ST-CNN) in parallel achieve higher accuracy. 448px resolution images can be used with the ST-CNN without an increase in computational cost due to downsampling to 224px after the transformers. Right: The transformation predicted by the spatial transformers of 2×ST-CNN (top row) and 4×ST-CNN (bottom row) on the input image. Notably for the 2×ST-CNN, one of the transformers (shown in red) learns to detect heads, while the other (shown in green) detects the body, and similarly for the 4×ST-CNN.
The results of this experiment are shown in Table 2 (left) – the spatial transformer models obtain state-of-the-art results, reaching 3.6% error on 64 × 64 images compared to the previous state-of-the-art of 3.9% error. Interestingly, on 128 × 128 images, while other methods degrade in performance, an ST-CNN achieves 3.9% error, whereas the previous state of the art at 4.5% error is achieved with a recurrent attention model that uses an ensemble of models with Monte Carlo averaging – in contrast the ST-CNN models require only a single forward pass of a single model. This accuracy is achieved due to the fact that the spatial transformers crop and rescale the parts of the feature maps that correspond to the digit, focussing resolution and network capacity only on these areas (see Table 2 (right) (b) for some examples). In terms of computation speed, the ST-CNN Multi model is only 6% slower (forward and backward pass) than the CNN.
4.3 Fine-Grained Classification

In this section, we use a spatial transformer network with multiple transformers in parallel to perform fine-grained bird classification. We evaluate our models on the CUB-200-2011 birds dataset [38], containing 6k training images and 5.8k test images, covering 200 species of birds. The birds appear at a range of scales and orientations, are not tightly cropped, and require detailed texture and shape analysis to distinguish. In our experiments, we only use image class labels for training.

We consider a strong baseline CNN model – an Inception architecture with batch normalisation [18] pre-trained on ImageNet [26] and fine-tuned on CUB – which by itself achieves the state-of-the-art accuracy of 82.3% (previous best result is 81.0% [30]). We then train a spatial transformer network, ST-CNN, which contains 2 or 4 parallel spatial transformers, parameterised for attention and acting on the input image. Discriminative image parts, captured by the transformers, are passed to the part description sub-nets (each of which is also initialised by Inception). The resulting part representations are concatenated and classified with a single softmax layer. The whole architecture is trained on image class labels end-to-end with backpropagation (full details in Appendix A).
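The sketch below mirrors that description: parallel transformers constrained to the attention form of (2), one part-description sub-net per transformer, concatenated part features, and a single classification layer. It is a structural illustration only: the tiny convolutional sub-nets stand in for the Inception networks actually used, and the single shared localisation trunk predicting all part transforms at once is a simplifying assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartAttentionSTN(nn.Module):
    """Parallel attention transformers feeding per-part sub-nets (illustrative sizes only)."""
    def __init__(self, num_parts=2, num_classes=200, part_size=64):
        super().__init__()
        self.part_size = part_size
        self.loc = nn.Sequential(   # shared trunk predicting (s, t_x, t_y) for every part (an assumption)
            nn.Conv2d(3, 16, 7, stride=2), nn.ReLU(), nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, 3 * num_parts),
        )
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1.0, 0.0, 0.0] * num_parts))  # start on the full image
        self.part_nets = nn.ModuleList([
            nn.Sequential(nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
                          nn.AdaptiveAvgPool2d(1), nn.Flatten())
            for _ in range(num_parts)])
        self.classifier = nn.Linear(32 * num_parts, num_classes)  # single softmax layer (via cross entropy)

    def forward(self, x):
        N = x.shape[0]
        params = self.loc(x).view(N, -1, 3)     # (N, num_parts, 3): s, t_x, t_y per transformer
        feats = []
        for p, net in enumerate(self.part_nets):
            s, tx, ty = params[:, p, 0], params[:, p, 1], params[:, p, 2]
            zeros = torch.zeros_like(s)
            theta = torch.stack([s, zeros, tx, zeros, s, ty], dim=1).view(N, 2, 3)  # attention A_theta, Eq. (2)
            grid = F.affine_grid(theta, (N, 3, self.part_size, self.part_size), align_corners=False)
            feats.append(net(F.grid_sample(x, grid, align_corners=False)))
        return self.classifier(torch.cat(feats, dim=1))
```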
The results are shown in Table 3 (left). The ST-CNN achieves an accuracy of 84.1%, outperforming the baseline by 1.8%. It should be noted that there is a small (22/5794) overlap between the ImageNet training set and the CUB-200-2011 test set – removing these images from the test set results in 84.0% accuracy with the same ST-CNN. In the visualisations of the transforms predicted by 2×ST-CNN (Table 3 (right)) one can see that interesting behaviour has been learnt: one spatial transformer (red) has learnt to become a head detector, while the other (green) fixates on the central part of the body of a bird. The resulting output from the spatial transformers for the classification network is a somewhat pose-normalised representation of a bird. While previous work such as [3] explicitly defines parts of the bird, training separate detectors for these parts with supplied keypoint training data, the ST-CNN is able to discover and learn part detectors in a data-driven manner without any additional supervision. In addition, the use of spatial transformers allows us to use 448px resolution input images without any impact on performance, as the output of the transformed 448px images is downsampled to 224px before being processed.
5 Conclusion
In this paper we introduced a new self-contained module for neural networks – the spatial transformer. This module can be dropped into a network and perform explicit spatial transformations of features, opening up new ways for neural networks to model data, and is learnt in an end-to-end fashion, without making any changes to the loss function. While CNNs provide an incredibly strong baseline, we see gains in accuracy using spatial transformers across multiple tasks, resulting in state-of-the-art performance. Furthermore, the regressed transformation parameters from the spatial transformer are available as an output and could be used for subsequent tasks. While we only explore feed-forward networks in this work, early experiments show spatial transformers to be powerful in recurrent models, and useful for tasks requiring the disentangling of object reference frames, as well as easily extendable to 3D transformations (see Appendix A.3).