《PlaneNet-单幅RGB图像的分段平面重建》论文中英文对照解读
論文地址:https://arxiv.org/pdf/1804.06278.pdf
代碼地址:https://github.com/art-programmer/PlaneNet
PlaneNet: Piece-wise Planar Reconstruction from a Single RGB Image
PlaneNet:單幅RGB圖像的分段平面重建
Abstract
論文摘要
EN: This paper proposes a deep neural network (DNN) for piece-wise planar depthmap reconstruction from a single RGB image. While DNNs have brought remarkable progress to single-image depth prediction, piece-wise planar depthmap reconstruction requires a structured geometry representation, and has been a difficult task to master even for DNNs. The proposed end-to-end DNN learns to directly infer a set of plane parameters and corresponding plane segmentation masks from a single RGB image. We have generated more than 50,000 piece-wise planar depthmaps for training and testing from ScanNet, a large-scale RGBD video database. Our qualitative and quantitative evaluations demonstrate that the proposed approach outperforms baseline methods in terms of both plane segmentation and depth estimation accuracy. To the best of our knowledge, this paper presents the first end-to-end neural architecture for piece-wise planar reconstruction from a single RGB image. Code and data are available at https://github.com/art-programmer/PlaneNet.
CH: 本篇論文提出了一種深度神經網絡(DNN)來完成單幅RGB圖像的分段平面深度圖重建任務。雖然DNN在單幅圖像深度預測上取得了顯著的進步,但分段平面深度圖重建需要一個結構化的幾何表示,即使對于DNN也是很難解決的任務。我們提出的端到端DNN直接從單幅RGB圖像中推算出一組平面參數和對應的平面分割掩膜。我們從 ScanNet(一個大型 RGBD 視頻數據集)生成了超過50000張分段平面深度圖用于訓練和測試。定性和定量評估表明,我們提出的方法在平面分割和深度估計的精度方面都優于基準方法。據我們所知,本文提出了第一個用于單幅RGB圖像分段平面重建的端到端神經網絡結構。代碼和數據見 GitHub:https://github.com/art-programmer/PlaneNet
1. Introduction
1. 前言
EN: Human vision has extraordinary perceptual power in understanding high-level scene structures. Looking at a typical indoor scene (for example, Figure 1), we can immediately parse the room into a few major planes (for example, the floor, walls, and ceiling), sense the main surfaces of furniture, or identify the surface of a horizontal tabletop. Piece-wise planar geometry understanding will be key to many applications in emerging areas such as robotics or augmented reality (AR). For example, a robot needs to identify the extent of the floor to plan its movement, or to segment a tabletop for placing objects. In AR applications, planar surface detection is becoming a basic building block for placing virtual objects on a desktop, replacing floor textures, or hanging artwork on walls for interior remodeling. A fundamental problem in computer vision is to develop computational algorithms that master similar perceptual capabilities to enable such applications.
CH: 人類視覺在理解高級別場景結構方面有著非凡的感知能力。看一個典型的室內場景(比如圖一),我們能立即將這個房間解析成幾個主要的平面(比如墻、地板、天花板),感知家具的主要表面,或識別水平桌面的表面。分段平面的幾何理解對機器人、增強現實(AR)等新興領域的許多應用起著關鍵作用。例如,機器人需要識別可用于規劃移動的地板范圍,或者在放置物體時需要分割出桌面。在AR應用中,平面檢測正成為在桌面上放置虛擬物體、更換地板紋理或在墻上懸掛藝術品進行室內改造的基礎模塊。計算機視覺中的一個基礎問題,就是開發出掌握類似感知能力的計算算法來實現這樣的應用。
EN: With the proliferation of deep neural networks, single-image depthmap inference and room layout estimation have been active areas of research. However, to our surprise, little attention has been paid to the study of piece-wise planar depthmap reconstruction, which mimics this remarkable human perception in a more general form. The main challenge is that piece-wise planar depthmaps require a structured geometric representation (i.e., a set of plane parameters and their segmentation masks). In particular, we do not know the number of planes to infer, nor the order of the planes returned in the output feature vector, making the task even more challenging for deep neural networks.
CH: 隨著深度神經網絡的興起,單幅圖像的深度圖推斷和房間布局估計一直是活躍的研究領域。然而,令我們驚訝的是,分段平面深度圖重建這一方向很少有人關注,而這項任務以一種更一般的形式模仿了人類的這種非凡感知能力。主要的挑戰在于,分段平面深度圖需要一個結構化的幾何表示(i.e. 一組平面參數及其分割掩膜)。特別是,我們既不知道需要推斷的平面數量,也不知道平面在輸出特征向量中的順序,這使得該任務對深度神經網絡來說更具挑戰性。
EN: This paper proposes a novel deep neural architecture “PlaneNet” that learns to directly produce a set of plane parameters and probabilistic plane segmentation masks from a single RGB image. Following a recent work on point-set generation, we define a loss function that is agnostic to the order of planes. We further control the number of planes by allowing probabilistic plane segmentation masks to be all 0. The network also predicts a depthmap at non-planar surfaces, whose loss is defined through the probabilistic segmentation masks to allow back-propagation. We have generated more than 50,000 piece-wise planar depthmaps from ScanNet as ground truth by fitting planes to 3D points and projecting them to images. Qualitative and quantitative evaluations show that our algorithm produces significantly better plane segmentation results than the current state-of-the-art. Furthermore, our depth prediction accuracy is on par with or even superior to existing single-image depth inference techniques that are specifically trained for this task.
CH: 本篇論文提出了一個新的深度神經網絡結構“PlaneNet”,它通過學習直接從單幅RGB圖像中得到一組平面參數和概率性的平面分割掩膜。借鑒最近一項關于點集生成(point-set generation)的工作,我們定義了一個與平面順序無關的損失函數。我們通過允許概率性的平面分割掩膜全為0來進一步控制平面的數量。該網絡還預測非平面處的深度圖,其損失通過概率分割掩膜定義,從而可以進行反向傳播。我們通過將平面擬合到3D點上并投射回圖像,從 ScanNet 數據集中生成了超過50000張分段平面深度圖作為真實樣本。定性和定量評估表明:我們算法的平面分割結果相比當下最先進的技術有顯著提升。此外,我們的深度預測精度與專門針對此任務訓練的現有單幅圖像深度推斷技術相當,甚至更優。
2. Related work
2. 相關工作
EN: Multi-view piece-wise planar reconstruction. Piece-wise planar depthmap reconstruction was once an active research topic in multi-view 3D reconstruction. The task is to infer a set of plane parameters and assign a plane ID to each pixel. Most existing methods first reconstruct precise 3D points, perform plane-fitting to generate plane hypotheses, then solve a global inference problem to reconstruct a piece-wise planar depthmap. Our approach learns to directly infer plane parameters and plane segmentations from a single RGB image.
CH: 多視圖分段平面重建。分段平面深度圖重建曾經是多視圖3D重建中的一個活躍研究課題。這個任務是推斷一組平面參數并給每個像素分配一個平面ID。目前大部分的算法都是首先重建精確的3D點集,再通過平面擬合生成候選平面,然后求解一個全局推理問題來重建分段平面深度圖。我們的方法則通過學習直接從單幅RGB圖像中推斷平面參數和平面分割。
EN: Learning-based depth reconstruction. Saxena et al. pioneered a learning-based approach for depthmap inference from a single image. With the surge of deep neural networks, numerous CNN-based approaches have been proposed. However, most techniques simply produce an array of depth values (i.e., a depthmap) without plane detection or segmentation. More recently, Wang et al. enforce planarity in depth (and surface normal) predictions by inferring pixels on planar surfaces. This is the closest work to ours. However, they only produce a binary segmentation mask (i.e., whether a pixel is on a planar surface or not) without plane parameters or instance-level plane segmentation.
CH: 基于學習的深度重建。Saxena 等人率先針對單幅圖像的深度圖推斷提出了基于學習的方法。隨著深度神經網絡的興起,出現了許多基于CNN的方法。但是,大部分方法只是簡單生成一組深度數值(i.e. 深度圖),而沒有平面的檢測與分割。最近,Wang 等人通過推斷位于平面上的像素,在深度(以及表面法線)預測中強制平面性。這是與我們最接近的工作。然而,他們僅僅生成一個二值分割掩膜(i.e. 一個像素是否在平面上),而沒有平面參數或實例級別的平面分割。
EN: Layout estimation. Room layout estimation also aims at predicting dominant planes in a scene (e.g., walls, floor, and ceiling). Most traditional approaches rely on image processing heuristics to estimate vanishing points of a scene, and aggregate low-level features by a global optimization procedure. Besides low-level features, high-level information has been utilized, such as human poses or semantics. Attempts have been made to go beyond room structure, and predict object geometry. However, the reliance on hand-crafted features makes those methods less robust, and the Manhattan World assumption limits their operating ranges. Recently, Lee et al. proposed an end-to-end deep neural network, RoomNet, which simultaneously classifies a room layout type and predicts corner locations. However, their framework is not applicable to general piece-wise planar scenes.
CH: 房間布局估計。房間布局估計同樣旨在預測場景中的主要平面(e.g. 墻、地板和天花板)。大部分傳統算法依靠圖像處理的啟發式方法來估算場景中的消失點,并通過一個全局優化程序聚合底層特征。除了底層特征,還有方法利用了高級信息,比如人體姿態和語義。也有一些工作嘗試超越房間結構,預測物體的幾何形狀。但是,對手工設計特征的依賴使得這些方法的穩健性較低,曼哈頓世界假設也限制了它們的適用范圍。最近,Lee 等人提出了一個端到端的深度神經網絡 RoomNet,它能同時對房間布局類型進行分類并預測角點位置。但是,他們的框架不適用于一般情況下的分段平面場景。
EN: Line analysis. Single-image 3D reconstruction of line drawings dates back to the 1960s. The earliest attempt is probably Roberts' system, which inspired many follow-up works. In real images, extraction of line drawings is challenging. Statistical analysis of line directions, junctions, or image segments has been used to enable 3D reconstruction for architectural scenes or indoor panoramas. Attributed grammar was used to parse an image into a hierarchical graph for 3D reconstruction. However, these approaches require hand-crafted features, grammar specifications, or algorithmic rules. Our approach is purely data-driven, harnessing the power of deep neural networks.
CH: 線分析。單幅線條圖像的3D重建可以追溯到上世紀60年代。最早的嘗試大概是 Roberts 的系統,它啟發了許多后續工作。在實際圖像中,線條圖的提取具有不小的挑戰性。對線條方向、交點和圖像分割區域的統計分析已被用于建筑場景和室內全景圖的3D重建。屬性文法(attributed grammar)被用于將圖像解析成層次圖以進行3D重建。但是,這些方法需要手工設計的特征、文法定義或算法規則。我們的方法純粹由數據驅動,借助深度神經網絡的力量。
3. PlaneNet
3. PlaneNet
EN: We build our network on the Dilated Residual Network (DRN) (see Figure 2), which is a flexible framework for both global tasks (e.g., image classification) and pixel-wise prediction tasks (e.g., semantic segmentation). Given the high-resolution final feature map from the DRN, we make three output branches for the three prediction tasks.
CH: 我們基于 Dilated Residual Network (DRN) 來構建我們的網絡(圖二所示)。DRN 是一個靈活的框架,既適用于全局性任務(e.g. 圖像分類),也適用于逐像素預測任務(e.g. 語義分割)。針對 DRN 最終輸出的高分辨率特征圖,我們為三個不同的預測任務設置了三個分支。
EN: Plane parameters: For each scene, we predict a fixed number ($K$) of planar surfaces $S = \{S_1, \cdots, S_K\}$. Each surface $S_i$ is specified by three plane parameters $P_i$ (i.e., encoding a normal and an offset). We use $D_i$ to denote a depth image, which can be inferred from the parameters $P_i$. The depth value calculation requires camera intrinsic parameters, which can be estimated via vanishing point analysis, for example. In our experiments, intrinsics are given for each image through the database information.
CH: 平面參數:對于每個場景,我們預測固定數量($K$)的平面 $S = \{S_1, \cdots, S_K\}$。每個平面 $S_i$ 都由三個平面參數 $P_i$ 指定(i.e. 編碼法線和偏移量)。我們用 $D_i$ 表示深度圖像,它可以從參數 $P_i$ 推算出來。深度值的計算需要相機內參,而相機內參可以通過消失點分析等方法估算。在我們的實驗中,相機內參由數據集中每張圖像的信息提供。
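To make the parameterization concrete, below is a minimal NumPy sketch of how a per-pixel depthmap can be rendered from such a plane parameter $P_i$ and the camera intrinsics. Since the plane is encoded as the 3D point closest to the camera center, its unit normal is $n = P/\Vert P\Vert$ and its offset is $d = \Vert P\Vert$; the function name and shapes are illustrative, not taken from the paper's code.

```python
import numpy as np

def depth_from_plane(plane, intrinsics, width, height):
    """Render a per-pixel depthmap from plane parameters P (the 3D point
    on the plane closest to the camera center): n = P/||P||, d = ||P||,
    and depth(u, v) = d / (n . K^{-1} [u, v, 1]^T)."""
    d = np.linalg.norm(plane)
    n = plane / max(d, 1e-8)
    K_inv = np.linalg.inv(intrinsics)
    u, v = np.meshgrid(np.arange(width), np.arange(height))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)   # (H, W, 3) homogeneous pixels
    rays = pixels @ K_inv.T                               # camera rays with unit z
    denom = rays @ n                                      # n . ray for every pixel
    return d / np.clip(denom, 1e-8, None)                 # z-depth (degenerate rays clipped)
```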
EN: Non-planar depthmap: We model non-planar structures and infer their geometry as a standard depthmap. With abuse of notation, we treat it as the $(K+1)^{th}$ surface and denote the depthmap as $D_{K+1}$, although it does not represent planar surfaces.
CH: 非平面深度圖:我們對非平面結構進行建模,并將其幾何結構推斷為標準的深度圖。稍微濫用一下符號,我們把它當作第 $(K+1)$ 個表面,把對應的深度圖表示為 $D_{K+1}$,盡管它并不表示平面。
EN: Segmentation masks: The last output is the probabilistic segmentation masks for the $K$ planes $(M_1, \cdots, M_K)$ and the non-planar depthmap $(M_{K+1})$.
CH: 分割掩膜:最后的輸出是 $K$ 個平面的概率分割掩膜 $(M_1, \cdots, M_K)$ 以及非平面深度圖對應的掩膜 $(M_{K+1})$。
EN: In summary, the network predicts 1) plane parameters $(P_1, \cdots, P_K)$, 2) a non-planar depthmap $(D_{K+1})$, and 3) probabilistic segmentation masks $(M_1, \cdots, M_{K+1})$. We now explain more details and the loss functions for each task.
CH: 概括起來,這個網絡預測三個量:1)平面參數 $(P_1, \cdots, P_K)$,2)非平面深度圖 $(D_{K+1})$,3)概率分割掩膜 $(M_1, \cdots, M_{K+1})$。下面詳細說明每個任務的更多細節和損失函數。
3.1. Plane parameter branch
3.1. 平面參數分支
EN: The plane parameter branch starts with a global average pooling to reduce the feature map size to 1x1, followed by a fully connected layer to produce $K \times 3$ plane parameters. Neither the number of planes nor their order is known in this prediction task. Following prior works, we predict a constant number ($K$) of planes, then allow some predictions to be invalid by letting the corresponding probabilistic segmentation masks be 0. Our ground-truth generation process (see Sect. 4) produces at most 10 planes for most examples, thus we set $K = 10$ in our experiments. We define an order-agnostic loss function based on the Chamfer distance metric for the regressed plane parameters:
$$L^P = \sum_{i=1}^{K^*} \min_{j \in [1, K]} \left\Vert P_i^* - P_j \right\Vert_2^2$$
The parameterization $P_i$ is given by the 3D coordinate of the point on the plane that is closest to the camera center. $P_i^*$ is the ground truth, and $K^*$ is the number of ground-truth planes.
CH: 平面參數分支從一個全局平均池化開始,將特征圖的尺寸變成 1x1,緊接著通過一個全連接層生成 $K \times 3$ 的平面參數。在這個預測任務中,平面的數量和順序都是未知的。遵循先前的工作,我們預測固定數量($K$)的平面,然后通過使對應的概率分割掩膜為 0 來讓部分預測平面無效。我們的真實樣本生成流程(見第四節)對大多數樣本最多生成 10 個平面,因此在實驗中設置 $K = 10$。我們基于倒角距離(Chamfer distance)度量,為回歸的平面參數定義了一個與順序無關的損失函數:$$L^P = \sum_{i=1}^{K^*} \min_{j \in [1, K]} \left\Vert P_i^* - P_j \right\Vert_2^2$$ 參數 $P_i$ 由平面上最靠近相機中心的點的 3D 坐標給出。$P_i^*$ 是真實值,$K^*$ 是真實平面的數量。
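This one-sided Chamfer loss is straightforward to implement; here is a minimal NumPy sketch (the function name is illustrative): for every ground-truth plane we take the squared distance to its nearest predicted plane and sum.

```python
import numpy as np

def chamfer_plane_loss(pred_planes, gt_planes):
    """One-sided Chamfer loss L^P: for each ground-truth plane, the squared
    L2 distance to its nearest predicted plane, summed over ground truth.
    pred_planes: (K, 3); gt_planes: (K_star, 3)."""
    diff = gt_planes[:, None, :] - pred_planes[None, :, :]  # (K*, K, 3) pairwise differences
    dist2 = np.sum(diff ** 2, axis=-1)                      # (K*, K) squared distances
    return float(np.sum(np.min(dist2, axis=1)))             # min over predictions, sum over GT
```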
3.2. Plane segmentation branch
3.2. 平面分割分支
EN: The branch begins with a pyramid pooling module followed by a convolutional layer to produce a $(K+1)$-channel likelihood map for the planar and non-planar surfaces. We add a dense conditional random field (DCRF) module based on the fast inference algorithm proposed by Krähenbühl and Koltun, and jointly train the DCRF module with the preceding layers, following Zheng et al. We set the number of mean-field iterations to 5 during training and to 10 during testing. For simplicity, the bandwidth of the bilateral filter is fixed. We use the standard softmax cross entropy loss to supervise the segmentation training: $$L^M = \sum_{i=1}^{K+1} \sum_{p \in I} \mathbf{1}\left(M^{*(p)} = i\right) \log\left(1 - M_i^{(p)}\right)$$ The inner summation is over the image pixels $I$, where $M_i^{(p)}$ denotes the probability of pixel $p$ belonging to the $i^{th}$ plane, and $M^{*(p)}$ is the ground-truth plane-id for the pixel.
CH: 這個分支以一個金字塔池化模塊開始,緊接著通過一個卷積層為平面和非平面表面生成 $(K+1)$ 通道的似然圖。我們在 Krähenbühl 和 Koltun 提出的快速推理算法的基礎上添加了一個密集條件隨機場(DCRF)模塊,并遵循 Zheng 等人的做法,將 DCRF 模塊與前面的網絡層聯合訓練。我們在訓練期間將平均場迭代次數設置為 5,在測試期間設置為 10。為簡單起見,雙邊濾波器的帶寬是固定的。我們用標準的 softmax 交叉熵損失來監督分割訓練:$$L^M = \sum_{i=1}^{K+1} \sum_{p \in I} \mathbf{1}\left(M^{*(p)} = i\right) \log\left(1 - M_i^{(p)}\right)$$ 內層求和是對圖像像素 $I$ 的求和,其中 $M_i^{(p)}$ 表示像素 $p$ 屬于第 $i$ 個平面的概率,$M^{*(p)}$ 是該像素的真實平面 id。
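For reference, a minimal NumPy sketch of this mask supervision follows. Note that it uses the conventional $-\log M_i^{(p)}$ form of the softmax cross entropy named in the text (the printed equation's $\log(1 - M_i^{(p)})$ appears to be a typographical variant); shapes and the function name are illustrative.

```python
import numpy as np

def segmentation_loss(mask_logits, gt_plane_ids):
    """Softmax cross entropy over K+1 channels (K planes + 1 non-planar).
    mask_logits: (H, W, K+1) raw scores; gt_plane_ids: (H, W) ints in [0, K]."""
    shifted = mask_logits - mask_logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
    h, w = gt_plane_ids.shape
    gt_probs = probs[np.arange(h)[:, None], np.arange(w)[None, :], gt_plane_ids]
    return float(-np.log(np.clip(gt_probs, 1e-8, None)).sum())
```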
3.3. Non-planar depth branch
3.3. 非平面深度分支
EN: The branch shares the same pyramid pooling module, followed by a convolution layer to produce a 1-channel depthmap. Instead of defining a loss specifically for non-planar regions, we found that exploiting the entire ground-truth depthmap makes the overall training more effective. Specifically, we define the loss as the sum of squared depth differences between the ground truth and either a predicted plane or the non-planar depthmap, weighted by the probabilities: $$L^D = \sum_{i=1}^{K+1} \sum_{p \in I} M_i^{(p)} \left( D_i^{(p)} - D^{*(p)} \right)^2$$ $D_i^{(p)}$ denotes the depth value at pixel $p$, while $D^{*(p)}$ is the ground-truth depth value.
CH: 這個分支與平面分割分支共用同一個金字塔池化模塊,然后通過一個卷積層生成 1 通道的深度圖。我們發現,與其單獨為非平面區域定義損失,不如利用完整的真實深度圖,這樣整體訓練更有效。因此,我們將損失定義為真實深度與預測平面或非平面深度圖之間深度差的平方和,并由概率加權:$$L^D = \sum_{i=1}^{K+1} \sum_{p \in I} M_i^{(p)} \left( D_i^{(p)} - D^{*(p)} \right)^2$$ 其中 $D_i^{(p)}$ 表示像素 $p$ 處的深度值,$D^{*(p)}$ 是真實深度值。
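A minimal sketch of this probability-weighted depth supervision, assuming the $K$ plane-induced depthmaps have already been rendered from the plane parameters (e.g., with a helper like depth_from_plane above) and stacked together with the non-planar branch output:

```python
import numpy as np

def depth_loss(depth_hypotheses, mask_probs, gt_depth):
    """Probability-weighted squared depth error L^D.
    depth_hypotheses: (H, W, K+1) depthmaps (K plane-induced + 1 non-planar),
    mask_probs: (H, W, K+1) segmentation probabilities, gt_depth: (H, W)."""
    sq_err = (depth_hypotheses - gt_depth[..., None]) ** 2
    return float(np.sum(mask_probs * sq_err))
```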
4. Datasets and implementation details
4. 數據集和網絡實現細節
EN: We have generated 51,000 ground-truth piece-wise planar depthmaps (50,000 training and 1,000 testing) from ScanNet, a large-scale indoor RGB-D video database. A depthmap in a single RGB-D frame contains holes and the quality deteriorates at far distances. Our approach for ground-truth generation is to directly fit planes to a consolidated mesh and project them back to individual frames, while also exploiting the associated semantic annotations.
CH: 我們從 ScanNet(一個大型室內 RGB-D 視頻數據庫)中生成了 51,000 張分段平面深度圖作為真實樣本(50,000 張用于訓練,1,000 張用于測試)。單幀 RGB-D 圖像的深度圖包含空洞(holes),而且在距離較遠處質量會下降。我們生成真實樣本的方法是直接將平面擬合到整合后的網格模型上,再將它們投射回各個圖像幀,同時還利用了相關的語義標注。
EN: Specifically, for each sub mesh-model of the same semantic label, we treat mesh-vertices as points and repeat extracting planes by RANSAC with replacement. The inlier distance threshold is 5cm, and the process continues until 90% of the points are covered. We merge two (not necessarily adjacent) planes that span different semantic labels if the plane normal difference is below $20^\circ$, and if the larger plane fits the smaller one with a mean distance error below 5cm. We project each triangle to individual frames if its three vertices are fitted by the same plane. After projecting all the triangles, we keep only the planes whose projected area is larger than 1% of an image. We discard entire frames if the ratio of pixels covered by the planes is below 50%. For training samples, we randomly choose 90% of the scenes from ScanNet, subsample every 10 frames, compute piece-wise planar depthmaps with the above procedure, then use a final random sampling to produce 50,000 examples. The same procedure generates 1,000 testing examples from the remaining 10% of the scenes.
CH: 具體來說,對于具有相同語義標簽的每個子網格模型,我們將網格頂點視為點集,用 RANSAC 反復提取平面。內點距離閾值為 5cm,這個過程會持續到 90% 的點被覆蓋為止。如果兩個跨越不同語義標簽的平面(不一定相鄰)的法線差異小于 $20^\circ$,并且較大的平面擬合較小平面時的平均距離誤差小于 5cm,就合并這兩個平面。如果一個三角形的三個頂點被同一個平面擬合,我們就將該三角形投射到各個圖像幀中。投射完所有的三角形后,只保留投射面積大于圖像面積 1% 的平面。如果平面覆蓋的像素比例低于 50%,就丟棄整幀圖像。我們從 ScanNet 中隨機選取 90% 的場景,每 10 幀采樣一次,使用上述流程計算分段平面深度圖,然后通過最終的隨機采樣生成 50,000 個訓練樣本。相同的流程從剩余 10% 的場景中生成 1,000 個測試樣本。
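The plane-extraction step can be sketched as a basic RANSAC plane fit over mesh vertices with the paper's 5cm inlier threshold. In the described pipeline this is applied repeatedly, removing inliers after each accepted plane, until 90% of the points are covered; the helper below is an illustrative single-plane step, not the authors' code.

```python
import numpy as np

def ransac_plane(points, iters=1000, inlier_thresh=0.05):
    """Fit one plane (n . x = d) to mesh vertices by RANSAC.
    inlier_thresh of 0.05m matches the paper's 5cm threshold.
    points: (N, 3); returns a boolean inlier mask of shape (N,)."""
    best_inliers = np.zeros(len(points), dtype=bool)
    rng = np.random.default_rng(0)
    for _ in range(iters):
        sample = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(n)
        if norm < 1e-8:  # degenerate (collinear) sample
            continue
        n /= norm
        d = n @ sample[0]
        inliers = np.abs(points @ n - d) < inlier_thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return best_inliers
```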
EN: We have implemented PlaneNet using TensorFlow based on DeepLab. Our system is a 101-layer ResNet with dilated convolution, while we have followed a prior work and modified the first few layers to deal with the degridding issue. The final feature map of the DRN contains 2048 channels. We use the Adam optimizer with the initial learning rate set to 0.0003. The input image, the output plane segmentation masks, and the non-planar depthmap have a resolution of 256x192. We train our network for 50 epochs on the 50,000 training samples.
CH: 我們基于 DeepLab,使用 TensorFlow 實現了 PlaneNet。我們的網絡是帶有擴張卷積(dilated convolution)的 101 層 ResNet,并遵循先前的工作修改了前幾層,以處理 degridding(去網格化)問題。DRN 最后輸出的特征圖包含 2048 個通道。我們使用 Adam 優化器,初始學習率設置為 0.0003。輸入圖像、輸出的平面分割掩膜和非平面深度圖的分辨率均為 256x192。我們在 50,000 個訓練樣本上訓練了 50 輪。
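For reference, a hypothetical modern-TensorFlow sketch of the reported optimizer settings (the released repository uses the TF 1.x API, and the equal loss weights here are illustrative, not taken from the paper):

```python
import tensorflow as tf

# Hypothetical training setup mirroring the reported hyperparameters.
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-4)  # initial lr 0.0003

def total_loss(loss_planes, loss_masks, loss_depth):
    # Combine the three branch losses from Section 3; the equal weighting
    # is an assumption for illustration, not the paper's exact recipe.
    return loss_planes + loss_masks + loss_depth
```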
5. Experimental results
5. 實驗結果
EN: Figure 3 shows the reconstruction results for various scenes. Our end-to-end learning framework successfully recovers piece-wise planar and semantically meaningful structures from a single RGB image, such as a floor, walls, a desktop, or a computer screen. We have included more examples in the supplementary material. We now provide quantitative evaluations of plane segmentation and depth reconstruction accuracy against competing baselines, and then analyze our results further.
CH: 圖三展示了各種場景的重建結果。我們端到端的學習框架成功地從單幅RGB圖像中重建了分段平面的、有語義意義的結構,比如地板、墻面、桌面或電腦屏幕。補充材料里有更多的實例結果。接下來,我們針對平面分割和深度重建的精度與對比基準方法進行定量評估,然后對結果作進一步分析。
EN: Figure 3: Piece-wise planar depthmap reconstruction results by PlaneNet. From left to right: input image, plane segmentation, depthmap reconstruction, and 3D rendering of our depthmap. In the plane segmentation results, the black color shows non-planar surface regions.
CH: 圖三:PlaneNet 的分段平面深度圖重建結果。從左到右:輸入圖像,平面分割結果,深度圖重建結果和深度圖的3D渲染結果。在平面分割結果中,黑色顯示非平面表面區域。
5.1. Plane segmentation accuracy
5.1. 平面分割準確率
EN: Piece-wise planar reconstruction from a single RGB image is a challenging problem. While existing approaches have produced encouraging results, they are based on hand-crafted features and algorithmic designs, and may not match against big-data and deep neural network (DNN) based systems. Much better baselines would then be piece-wise planar depthmap reconstruction techniques from 3D points, where input 3D points are either given by the ground truth depthmaps or inferred by a state-of-the-art DNN-based system.
CH: 單幅RGB圖像的分段平面重建是一個有挑戰性的問題。雖然現有方法已經取得了令人鼓舞的結果,但它們都基于手工設計的特征和算法,可能無法與基于大數據和深度神經網絡(DNN)的系統相抗衡。因此,更強的基準方法是從 3D 點重建分段平面深度圖的技術,其輸入的 3D 點要么來自真實深度圖,要么由最先進的基于 DNN 的系統推算得到。
EN: In particular, to infer depthmaps, we have used a variant of PlaneNet which only has the pixel-wise depthmap branch, while following Eigen et al. to change the loss. Table 1 shows that this network, PlaneNet (Depth rep.), outperforms the current top-performers on the NYU benchmark.
CH: 特別地,為了推算深度圖,我們使用了 PlaneNet 的一個變種網絡,它只保留逐像素的深度圖分支,并參考 Eigen 等人的做法修改了損失函數。表一顯示,這個網絡 PlaneNet (Depth rep.) 在 NYU 基準上超過了目前表現最好的方法。
EN: For piece-wise planar depthmap reconstruction, we have used the following three baselines from the literature.
“NYU-Toolbox” is a plane extraction algorithm from the official NYU toolbox that extracts plane hypotheses using RANSAC, and optimizes the plane segmentation via a Markov Random Field (MRF) optimization.
Manhattan World Stereo (MWS) is very similar to NYU-Toolbox except that MWS employs the Manhattan World assumption in extracting planes and exploits vanishing lines in the pairwise terms to improve results.
Piecewise Planar Stereo (PPS) relaxes the Manhattan World assumption of MWS, and uses vanishing lines to generate better plane proposals. Please see the supplementary document for more algorithmic details on the baselines.
CH: 為了對比分段平面深度圖重建,我們使用了文獻中的三個方法作為比較基準。
NYU-Toolbox 是 NYU 官方工具箱中的平面提取算法,使用了 RANSAC 算法提取平面候選區域,然后通過馬爾可夫隨機場(MRF)來優化平面分割。
Manhattan World Stereo (MWS) 與 NYU-Toolbox 很相似,不同之處在于 MWS 在提取平面時用了曼哈頓世界的假設(Manhattan World assumption),并且用成對項中的消失線來改善結果。
Piecewise Planar Stereo (PPS) 放寬了 MWS 的曼哈頓世界假設(Manhattan World assumption),并使用消失線來生成更好的平面候選區域。關于這些基準方法的更多算法細節,請參閱補充材料。
EN: Figure 4 shows the evaluation results on two recall metrics. The first metric is the percentage of correctly predicted ground-truth planes. We consider a ground-truth plane being correctly predicted, if one of the inferred planes has 1) more than 0.5 Intersection over Union (IOU) score and 2) the mean depth difference over the overlapping region is less than a threshold. We vary this threshold from 0 to 0.6m with an increment of 0.05m to plot graphs. The second recall metric is simply the percentage of pixels that are in such overlapping regions where planes are correctly predicted. The figure shows that PlaneNet is significantly better than all the competing methods when inferred depthmaps are used. PlaneNet is even better than some competing methods that use ground-truth depthmaps. This demonstrates the effectiveness of our approach, learning to infer piece-wise planar structures from many examples.
CH: 圖四顯示了兩個召回指標的評估結果。第一個指標是被正確預測的真實平面的百分比。如果某個推斷平面滿足:1)與真實平面的 IOU(交并比)大于 0.5,2)重疊區域的平均深度差小于某一閾值,我們就認為該真實平面被正確預測。我們讓這個閾值從 0 到 0.6m 以 0.05m 為步長變化來繪制曲線。第二個召回指標就是位于被正確預測平面的重疊區域內的像素百分比。該圖顯示,在使用推算深度圖作為輸入時,PlaneNet 顯著優于所有對比方法,甚至比一些使用真實深度圖的對比方法還要好。這證明了我們方法的有效性:從大量實例中學習推斷分段平面結構。
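A sketch of the first recall metric, under the assumption that predicted and ground-truth planes are given as boolean masks with per-pixel depths (names and shapes are illustrative, not the paper's evaluation code):

```python
import numpy as np

def plane_recall(pred_masks, pred_depths, gt_masks, gt_depths, depth_thresh):
    """Fraction of ground-truth planes matched by a prediction with
    IOU > 0.5 and mean depth difference over the overlap below depth_thresh.
    *_masks: lists of boolean (H, W) arrays; *_depths: lists of (H, W) depthmaps."""
    matched = 0
    for gm, gd in zip(gt_masks, gt_depths):
        for pm, pd in zip(pred_masks, pred_depths):
            inter = np.logical_and(gm, pm)
            union = np.logical_or(gm, pm)
            if inter.sum() == 0 or inter.sum() / union.sum() <= 0.5:
                continue
            if np.abs(pd[inter] - gd[inter]).mean() < depth_thresh:
                matched += 1
                break  # each ground-truth plane is counted at most once
    return matched / max(len(gt_masks), 1)
```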
EN: Figure 4: Plane segmentation accuracy against competing baselines that use 3D points as input. Either ground-truth depthmaps or inferred depthmaps (by a DNN-based system) are used as their inputs. PlaneNet outperforms all the other methods that use inferred depthmaps. Surprisingly, PlaneNet is even better than many other methods that use ground-truth depthmaps.
CH: 圖四:與使用 3D 點作為輸入的對比基準方法的平面分割準確率比較。輸入為真實深度圖或由基于 DNN 的系統推算的深度圖。PlaneNet 優于所有使用推算深度圖的方法。出人意料的是,PlaneNet 甚至比許多使用真實深度圖的方法還要好。
EN: Figure 5 shows qualitative comparisons against existing methods with inferred depthmaps. PlaneNet produces significantly better plane segmentation results, while existing methods often generate many redundant planes where depthmaps are noisy, and fail to capture precise boundaries where the intensity edges are weak.
CH: 圖五顯示了與使用推算深度圖的現有方法的定性比較。PlaneNet 生成了明顯更好的平面分割結果,而現有方法往往在深度圖有噪聲的地方生成許多冗余平面,并且在圖像強度邊緣較弱的地方無法捕捉到精確的邊界。
EN: Figure 5: Qualitative comparisons between PlaneNet and existing methods that use inferred depthmaps as the inputs. From left to right: an input image, plane segmentation results for existing methods, and PlaneNet, respectively, and the ground-truth.
CH: 圖五:使用推算的深度圖作為輸入,PlaneNet 與現有的其他方法的定性比較。從左往右:第一列為輸入圖像,第二三四列為現有其他方法的平面分割結果,第五列為PlaneNet 的平面分割結果,第六列為真實實例。
5.2. Depth reconstruction accuracy
5.2. 深度重建的準確率
EN: While the capability to infer a plane segmentation mask and precise plane parameters is the key contribution of this work, it is also interesting to compare against depth prediction methods. This is to ensure that our structured depth prediction does not compromise per-pixel depth prediction accuracy. PlaneNet makes $(K+1)$ depth value predictions at each pixel. We pick the depth value with the maximum probability in the segmentation masks to define our depthmap.
CH: 雖然這項工作的關鍵貢獻是推斷平面分割掩膜和精確的平面參數,但與深度預測方法進行比較也很有意義,這可以確保我們的結構化深度預測不會損害逐像素深度預測的精度。PlaneNet 在每個像素處預測 $(K+1)$ 個深度值。我們選取分割掩膜中概率最大的那個深度值來構成最終的深度圖。
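This readout reduces to an argmax over the segmentation channels followed by a gather; an illustrative NumPy sketch:

```python
import numpy as np

def final_depthmap(depth_hypotheses, mask_probs):
    """Pick, at each pixel, the depth from the surface with the highest
    segmentation probability. Both inputs have shape (H, W, K+1)."""
    best = np.argmax(mask_probs, axis=-1)  # (H, W) winning surface index
    return np.take_along_axis(depth_hypotheses, best[..., None], axis=-1)[..., 0]
```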
EN: Depth accuracies are evaluated on the NYUv2 dataset at 1) planar regions, 2) boundary regions, and 3) the entire image, against three competing baselines. Eigen-VGG is a convolutional architecture that predicts both depths and surface normals. SURGE is a more recent depth inference network that optimizes planarity. FCRN is the current state-of-the-art single-image depth inference network.
CH: 深度精度在 NYUv2 數據集上的三種區域進行評估:1)平面區域,2)邊界區域,3)整幅圖像,并與三個基準方法比較。Eigen-VGG 是同時預測深度值和表面法線的卷積結構;SURGE 是較新的、對平面性進行優化的深度推斷網絡;FCRN 是目前最先進的單幅圖像深度推斷網絡。
EN: Depthmaps in NYUv2 are very noisy and ground-truth plane extraction does not work well. Thus, we fine-tune our network using only the depth loss. Note that the key factor in this training is that the network is trained to generate a depthmap through our piece-wise planar depthmap representation. To further verify the effects of this representation, we have also fine-tuned our network in the standard per-pixel depthmap representation by disabling the plane parameter and plane segmentation branches. In this version, denoted as “PlaneNet (Depth rep.)”, the entire depthmap is predicted in the $(K+1)^{th}$ depthmap $(D_{K+1})$.
CH: NYUv2 的深度圖噪聲很大,真實平面的提取效果不好。因此,我們只使用深度損失來 fine-tune 我們的網絡。注意,這里訓練的關鍵在于,網絡是通過我們的分段平面深度圖表示來生成深度圖的。為了進一步驗證這種表示的效果,我們還禁用了平面參數和平面分割兩個分支,在標準的逐像素深度圖表示下 fine-tune 網絡。在這個記為 PlaneNet (Depth rep.) 的版本中,整個深度圖由第 $(K+1)$ 個深度圖 $(D_{K+1})$ 預測。
EN: Table 1 shows the depth prediction accuracy on various metrics introduced in prior work. The left five metrics provide different error statistics, such as the relative difference (Rel) or root-mean-square error (RMSE) of the per-pixel depth errors. The right three metrics provide the ratio of pixels for which the relative difference between the predicted and the ground-truth depths is below a threshold. The table demonstrates that PlaneNet outperforms the state-of-the-art single-image depth inference techniques. As observed in prior works, the planarity constraint makes a difference in the depth prediction task, and the improvements are more significant when our piece-wise planar representation is enforced by the network.
CH: 表一展示了按先前工作中使用的各種指標衡量的深度預測精度。左邊五個指標是不同的誤差統計,比如逐像素深度誤差的相對差異(Rel)和均方根誤差(RMSE);右邊三個指標是預測深度與真實深度的相對差異低于某一閾值的像素所占的比例。該表表明,PlaneNet 優于目前最先進的單幅圖像深度推斷方法。正如先前工作中觀察到的,平面性約束在深度預測任務中能帶來提升,而當我們的網絡強制使用分段平面表示時,這種提升更加明顯。
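For concreteness, the commonly used definitions of these metrics are sketched below (Rel, RMSE, and the threshold accuracies $\delta < 1.25^k$); this follows the standard depth-estimation literature rather than the paper's exact evaluation code.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard single-image depth metrics: mean relative error, RMSE,
    and threshold accuracies delta < 1.25^k for k = 1, 2, 3."""
    pred, gt = pred.ravel(), gt.ravel()
    valid = gt > 0                         # ignore invalid ground-truth pixels
    pred, gt = pred[valid], gt[valid]
    rel = np.mean(np.abs(pred - gt) / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    ratio = np.maximum(pred / gt, gt / pred)
    deltas = [np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)]
    return rel, rmse, deltas
```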
5.3. Plane ordering consistency
5.3. 平面順序的一致性
EN: For piece-wise planar depthmap inference, the ordering ambiguity of planes is a challenge. We found that PlaneNet automatically learns a consistent ordering without supervision; for example, the floor is always returned as the second plane. In Figure 3, the colors in the plane segmentation results are defined by the order of the planes in the network output. While the ordering loses consistency for small objects or extreme camera angles, in most cases, major common surfaces such as floors and walls have a consistent ordering.
CH: 對于分段平面深度圖的推斷,平面順序的歧義性是一個挑戰。我們發現 PlaneNet 在沒有監督的情況下自動學到了一致的平面排序,例如:地板總是作為第二個平面返回。圖三中,平面分割結果的顏色就是由網絡輸出中平面的順序決定的。雖然在一些小物體或極端相機角度下排序會失去一致性,但大多數情況下,地板、墻面這些主要的常見表面具有一致的順序。
EN: We have taken advantage of this property and implemented a simple room layout estimation algorithm. More specifically, we look at reconstruction examples and manually select the plane entries that correspond to the ceiling, floor, and left/middle/right walls. For each possible room layout configuration (for example, a configuration with the floor, left and middle walls visible), we build a 3D concave shell based on the plane parameters and project it back into the image to generate the room layout. We score each configuration by the number of pixels where the constructed room layout is consistent with the inferred plane segmentation (taking the winner-take-all plane label at each pixel). We choose the room layout with the best score as our prediction. Figure 6 shows that our algorithm can generate reasonable room layout estimates even when the scene is cluttered and contains many occluding objects. Table 2 shows a quantitative evaluation on the NYUv2 303 dataset, where our method is comparable to prior art designed specifically for this task.
CH: 我們利用這一特性實現了一個簡單的房間布局估計算法。具體來說,我們在重建的實例中手動選擇對應天花板、地板和左/中/右墻的平面條目。對于每個可能的房間布局配置(例如,一個地板、左墻和中墻可見的配置),我們都根據平面參數構建一個 3D 殼體,然后將它投影回圖像生成房間布局。我們用構建的房間布局與推斷的平面分割(取每個像素上概率最大的平面標簽)一致的像素數量來為每個配置打分,最后選擇得分最高的房間布局作為預測結果。圖六顯示,即使場景很雜亂、有許多遮擋物體,我們的算法也能生成合理的房間布局估計。表二顯示,在 NYUv2 303 數據集上,我們的方法與專門針對此任務設計的方法效果相當。
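The scoring step of this layout search can be sketched as follows, assuming each candidate layout has already been projected into an integer plane-id map (a simplified illustration of the consistency count, not the authors' implementation):

```python
import numpy as np

def layout_score(layout_labels, seg_labels):
    """Count pixels where the projected layout agrees with the
    winner-take-all plane segmentation. Both are (H, W) plane-id maps."""
    return int(np.sum(layout_labels == seg_labels))

def best_layout(candidate_layouts, mask_probs):
    """Pick the candidate layout most consistent with the inferred segmentation."""
    seg_labels = np.argmax(mask_probs, axis=-1)  # (H, W) per-pixel plane label
    scores = [layout_score(lay, seg_labels) for lay in candidate_layouts]
    return candidate_layouts[int(np.argmax(scores))]
```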
EN: Figure 6: Room layout estimations. We have exploited the ordering consistency in the predicted planes to infer room layouts.
CH: 圖六:房間布局估計。我們利用預測平面的順序一致性來預測房間布局。
EN: Table 2: Room layout estimations. Quantitative evaluations against the top-performers over the NYUv2 303 dataset.
CH: 表二:房間布局估計。在 NYUv2 303 數據集上與表現最好的算法的定量比較。
5.4. Failure modes
5.4. 不足之處
EN: While achieving promising results on most images, PlaneNet has some failure modes as shown in Fig. 7. In the first example, PlaneNet generates two nearly co-planar vertical surfaces in the low-light region below the sink. In the second example, it cannot distinguish a white object on the floor from a white wall. In the third example, it misses a column structure on a wall due to the presence of object clutter. While the capability to infer precise plane parameters is already super-human, there is a lot of room for improvement in plane segmentation, especially in the absence of texture information or in the presence of clutter.
CH: 雖然在大多數圖像上取得了不錯的效果,但 PlaneNet 仍有一些失敗案例,如圖七所示。第一個例子中,PlaneNet 在水槽下方的低光照區域生成了兩個幾乎共面的垂直表面;第二個例子中,它沒有把地板上的白色物體和白色墻壁區分開來;第三個例子中,由于雜亂物體的影響,它漏掉了墻上的柱狀結構。雖然 PlaneNet 推斷精確平面參數的能力已經很強,但在平面分割方面還有很大的提升空間,尤其是在缺乏紋理信息或存在雜物的情況下。
EN: Figure 7: Typical failure modes occur in the absence of enough image texture cues or in the presence of small objects and clutter.
CH: 圖七:典型的失敗案例出現在缺乏足夠圖像紋理線索,或存在小物體和雜物的情況下。
6. Applications
6. 應用
EN: Structured geometry reconstruction is important for many applications in Augmented Reality. We demonstrate two image editing applications enabled by our piece-wise planar representation: texture insertion and replacement (see Fig. 8). We first extract Manhattan directions by using the predicted plane normals through a standard voting scheme. Given a piece-wise planar region, we define one axis of its UV coordinates by the Manhattan direction that is most parallel to the plane, while the other axis is simply the cross product of the first axis and the plane normal. Given the UV coordinates, we insert a new texture by alpha-blending or completely replace a texture with a new one. Please see the supplementary material and the video for more AR application examples.
CH: 結構化幾何重建對于增強現實中的許多應用都非常重要。我們展示了兩個由分段平面表示支持的圖像編輯應用:紋理插入和替換(見圖八)。我們首先用一個標準的投票方案,通過預測的平面法線來提取曼哈頓方向。給定一個分段平面區域,我們用與平面最平行的曼哈頓方向定義其 UV 坐標的一條軸,另一條軸就是第一條軸與平面法線的叉乘。給定 UV 坐標,我們通過 alpha-blending 插入新的紋理,或者完全替換舊紋理。更多 AR 應用實例請參閱補充材料及視頻。
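A small sketch of this UV construction, assuming a unit-length plane normal and the three Manhattan directions from the voting step (names are illustrative):

```python
import numpy as np

def plane_uv_axes(normal, manhattan_dirs):
    """Build UV texture axes for a plane: u is the Manhattan direction most
    parallel to the plane (smallest |dir . n|), and v = u x n.
    normal: unit vector (3,); manhattan_dirs: (3, 3) unit direction vectors."""
    alignment = np.abs(manhattan_dirs @ normal)  # small value = nearly in-plane
    u = manhattan_dirs[np.argmin(alignment)]
    u = u - (u @ normal) * normal                # project exactly onto the plane
    u /= np.linalg.norm(u)
    v = np.cross(u, normal)                      # second axis via cross product
    return u, v
```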
EN: Figure 8: Texture editing applications. From top to bottom, an input image, a plane segmentation result, and an edited image.
CH: 圖八:紋理編輯應用。從上到下:輸入圖像,平面分割結果,以及編輯后的圖像。
7. Conclusion and future work
7. 結論及未來的工作
EN: This paper proposes PlaneNet, the first deep neural architecture for piece-wise planar depthmap reconstruction from a single RGB image. PlaneNet learns to directly infer a set of plane parameters and their probabilistic segmentation masks. The proposed approach significantly outperforms competing baselines in the plane segmentation task. It also advances the state-of-the-art in the single image depth prediction task. An interesting future direction is to go beyond the depthmap framework and tackle structured geometry prediction problems in a full 3D space.
CH: 本論文提出了 PlaneNet,第一個用于從單幅RGB圖像重建分段平面深度圖的深度神經網絡結構。PlaneNet 通過學習直接推斷一組平面參數及其概率分割掩膜。該方法在平面分割任務中顯著優于對比基準方法,同時也推動了單幅圖像深度預測任務的最新水平。未來一個有趣的方向是超越深度圖框架,在完整的3D空間中處理結構化幾何預測問題。