3D Scene Reconstruction from a Single Monocular Photo by Estimating the Orientation of Rectangles


(No. 15 showed an error due to low photo resolution)
3D Scene Reconstruction


This is just part of my undergraduate thesis on “Scene Understanding for Robots”, which is still under construction, so this post may seem unfinished. I’m sorry for that, but you are always welcome to contact me[AT]qzane.com for details if you are interested.

Original Photo (Perspective projection)

Finding Strong Outlines and Potential Rectangles (Perspective projection)

This is actually a critical part of the task. I’m currently using some empirical methods to find potential spatial rectangles, such as checking the completeness of the angles formed between lines. However, the results are not that good, and I am trying some new models with the help of the NYU Depth Dataset V2 and the Stanford 2D-3D-Semantics Dataset.
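As an illustration of this kind of empirical grouping, here is a hypothetical helper (not the thesis code) that tests whether four detected line segments chain into a closed quadrilateral, assuming the segments come from a separate line detector and are already ordered head-to-tail:

```python
import math

def is_closed_quad(segs, tol=5.0):
    """True if four segments, ordered head-to-tail, chain into a closed
    quadrilateral: each segment's end lies within `tol` pixels of the
    next segment's start. A segment is a pair of (x, y) endpoints."""
    if len(segs) != 4:
        return False
    return all(
        math.dist(segs[i][1], segs[(i + 1) % 4][0]) <= tol
        for i in range(4)
    )

# A roughly closed quadrilateral passes; a broken chain does not.
quad = [((0, 0), (100, 2)), ((101, 0), (98, 80)),
        ((99, 81), (1, 79)), ((0, 80), (1, 1))]
```

In practice the angle-completeness check described above would be applied on top of such candidate quadruples before accepting them as rectangle projections.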

Questions and suggestions by email are always welcome!

Surface Orientation Estimation (Perspective projection)

The detailed reasoning process can be downloaded here. The following is an outline.
As shown in the picture above, O is the optical center of the camera and OZ is its optical axis. O-XYZ is a 3D orthogonal coordinate system, and π is the image plane with O′ as its image center.

We are going to estimate the orientation of the rectangle, or more specifically its normal vector n, from its projection on the image plane.

The problem can be reduced to finding the value of a single variable θ that minimizes our objective function:
objective function


Here one component of n is fixed to a constant (any other constant will lead to the same result for n).

Since our objective function has only one variable (θ), there are dozens of methods to find its minimum. I tried particle swarm optimization (PSO), and it worked well.
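A minimal 1-D PSO looks like the sketch below. The real objective comes from the rectangle geometry; here a toy quadratic stands in for it, and the swarm parameters (inertia w, attraction weights c1, c2) are typical defaults, not the thesis settings:

```python
import random

def pso_minimize(f, lo, hi, n_particles=30, n_iters=100,
                 w=0.7, c1=1.5, c2=1.5):
    """Minimal particle swarm optimization for a 1-D objective on [lo, hi]."""
    random.seed(0)                       # fixed seed for reproducibility
    xs = [random.uniform(lo, hi) for _ in range(n_particles)]
    vs = [0.0] * n_particles
    pbest = xs[:]                        # each particle's best position so far
    pbest_f = [f(x) for x in xs]
    g = min(range(n_particles), key=lambda i: pbest_f[i])
    gbest, gbest_f = pbest[g], pbest_f[g]
    for _ in range(n_iters):
        for i in range(n_particles):
            r1, r2 = random.random(), random.random()
            # velocity update: inertia + pull toward personal and global bests
            vs[i] = (w * vs[i]
                     + c1 * r1 * (pbest[i] - xs[i])
                     + c2 * r2 * (gbest - xs[i]))
            xs[i] = min(max(xs[i] + vs[i], lo), hi)   # stay inside bounds
            fx = f(xs[i])
            if fx < pbest_f[i]:
                pbest[i], pbest_f[i] = xs[i], fx
                if fx < gbest_f:
                    gbest, gbest_f = xs[i], fx
    return gbest, gbest_f

# e.g. minimizing a stand-in objective over θ ∈ [0, π]:
theta, val = pso_minimize(lambda t: (t - 0.5) ** 2, 0.0, 3.1416)
```

Any other 1-D minimizer (golden-section search, grid search plus refinement) would work just as well for a single-variable objective; PSO simply needs no derivative information.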

Once we get the value of θ, we get the normal vector n, with which we can easily calculate the orientation of the rectangle.

3D Scene Reconstruction (Orthographic projection)

Once we have the orientations of all rectangles, we can infer their relative positions from their intersections in the photo and eventually reconstruct the whole scene. Although the low resolution and some potential distortion in the photo may cause some error, they won’t have a large impact on the whole model because the orientations are calculated independently.
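The post doesn’t spell out the reconstruction procedure, but the key step can be illustrated: once a rectangle’s normal n is known, each of its pixels can be lifted to 3D by intersecting the pixel’s viewing ray with the plane n·X = d, where d is an arbitrary per-rectangle offset (fixed later, up to the global scale ambiguity, by the intersections between rectangles). A minimal sketch, assuming a pinhole camera with the principal point at the image origin:

```python
def backproject(u, v, f, n, d):
    """Intersect the viewing ray through pixel (u, v) (focal length f,
    principal point at the image origin) with the plane n . X = d.
    Returns the 3-D point on the plane, or None if the ray is parallel."""
    ray = (u, v, f)                       # direction of the viewing ray
    denom = sum(ni * ri for ni, ri in zip(n, ray))
    if abs(denom) < 1e-9:                 # ray parallel to the plane
        return None
    t = d / denom                         # ray parameter at the intersection
    return tuple(t * r for r in ray)

# e.g. a fronto-parallel plane z = 2 seen through the image center:
p = backproject(0, 0, 1, (0, 0, 1), 2.0)
```

Because every rectangle is back-projected independently, an error in one rectangle’s plane does not propagate into the others, which is why the independent orientation estimates keep the whole model stable.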


I’m currently trying to do some qualitative analysis using NYU Depth Dataset V2 and Stanford 2D-3D-Semantics Dataset.

Written on December 3, 2016