3D RECONSTRUCTION OF INDOOR SCENES FROM A SINGLE IMAGE
Mohammad Haris Baig
Motivation:
3D data is important for a range of applications, from robotics and augmented reality to animation and object recognition. However, reliable 3D data is hard to obtain; in most cases the best we can do is capture a 2D image of the scene and use it to infer the scene's 3D structure. An image captures the spatial layout of objects in a scene, the interaction of light with those objects, and the interactions of the objects with each other. This information can be exploited in different ways to recover the 3D structure of the scene. We are concerned with the single-image, uncalibrated-camera scenario, since in most cases we have only a single image of the scene and no other information about the imaging system.
Several methods recover 3D scene structure from images. Geometric techniques use properties of image formation to recover partial 3D information. In shape from defocus, the scene is captured with a camera at multiple focal settings, and the degree of defocus across the resulting images is used to infer depth. Other cues include shape from shading, where the variation of shading across a surface of uniform color reveals its shape, as well as shape from texture, shape from angular regularities, and shape from occluding contours.
More recently there has been interest in using machine learning to infer the 3D structure of scenes from single images and their corresponding depth maps. These efforts fall into two broad categories: object modeling and scene reconstruction. In object modeling, the training set consists of objects of the same type together with their 3D models, and the problem is to infer the 3D structure of an object from a single view. These approaches, however, do not scale well and are of limited use in the more general case.
Problem:
We are interested in reconstructing the 3D structure of a scene given a single image. The case for machine learning is especially strong for indoor scenes: they are man-made, exhibit strong structural regularities, and scenes of a similar kind resemble each other closely. The idea is that these regularities can be exploited to predict the 3D structure of an unknown scene from a small set of reliable observations that can be made about it.
Related Work
Two different kinds of approaches are currently used in this domain, each with advantages and disadvantages. On one hand we have parametric methods for 3D reconstruction, which in my observation center on Markov Random Fields (MRFs) [1][2][3]. These are the more structured and disciplined of the currently available techniques. While parametric methods have the strong advantage that training can be done on small datasets in a principled way, the downsides are that training takes a long time, the approach does not scale well to the large datasets now becoming available, and the large number of parameters makes it hard to interpret what the parameters represent.
On the other hand we have non-parametric methods [4], which are highly scalable, have few parameters, and train very quickly, but are less structured and disciplined and require huge datasets to work effectively.
Objective
The objective of this project is to find an effective way of reconstructing the visible 3D structure of an indoor scene from a single image. More formally, given an input image, we want to predict at every pixel location the depth, i.e. the distance from the camera center to the corresponding point in the scene.
Plan
To achieve this goal, I propose a three-part plan that can yield reasonable results within the short time frame of the project. First, I will investigate the accuracy of pixel-based 3D reconstruction using the Gaussian, Laplacian, and hybrid models suggested by [1] and [2]. Here I am interested in how different features, and combinations of features, affect reconstruction accuracy, in order to discover which features most effectively represent the 3D structure of indoor scenes under a given model.
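The practical difference between the Gaussian and Laplacian models of [1][2] is the loss applied to per-pixel depth residuals: the Gaussian model leads to a least-squares fit, while the Laplacian model leads to an L1 fit that is more robust to outliers. A minimal sketch on synthetic stand-in data (the feature dimensions and noise model are illustrative assumptions, not the features of [1]):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: per-pixel feature vectors x_i and depths d_i.
n_pixels, n_features = 500, 10
X = rng.normal(size=(n_pixels, n_features))
w_true = rng.normal(size=n_features)
d = X @ w_true + rng.laplace(scale=0.1, size=n_pixels)  # heavy-tailed noise

# Gaussian model: MAP estimate of w is ordinary least squares,
# since -log P(d | X, w) is proportional to sum_i (d_i - w^T x_i)^2.
w_gauss, *_ = np.linalg.lstsq(X, d, rcond=None)

# Laplacian model: -log P(d | X, w) is proportional to sum_i |d_i - w^T x_i|.
# Approximated here with iteratively reweighted least squares (IRLS).
w_lap = w_gauss.copy()
for _ in range(50):
    r = np.abs(d - X @ w_lap) + 1e-6   # residual magnitudes
    W = 1.0 / r                        # turns L1 into a weighted L2 problem
    w_lap = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * d))

print("Gaussian fit MAE: ", np.abs(d - X @ w_gauss).mean())
print("Laplacian fit MAE:", np.abs(d - X @ w_lap).mean())
```

Under the heavy-tailed noise above, the Laplacian fit attains a lower mean absolute error, which is the behavior that motivates the Laplacian model for real depth statistics.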
The authors of [1] and [2] extended the use of MRFs in the form of the plane-parameter MRF proposed in [3]. I am interested in investigating the role of the different features used in this context and measuring their accuracy on different kinds of indoor scene structures. I am also interested in exploring how a model trained on a dataset comprised mostly of outdoor images performs when it is evaluated on indoor scenes.
One of the key problems with parametric methods is scalability. To address it, I would like to train these models with stochastic gradient descent (SGD) followed by batch gradient descent, so that both methods can be trained on NYU_V2 [6], a larger dataset captured with an RGB-D sensor.
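The SGD-then-batch idea above can be sketched for a linear per-pixel depth model: cheap minibatch passes that scale to large datasets, followed by a few full-batch steps to polish the solution. The data here is a synthetic stand-in; in practice the (feature, depth) pairs would come from NYU_V2 frames:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for per-pixel (feature, depth) training pairs.
n, p = 5000, 20
X = rng.normal(size=(n, p))
w_true = rng.normal(size=p)
d = X @ w_true + 0.05 * rng.normal(size=n)

w = np.zeros(p)
lr, batch = 0.01, 64

# Phase 1: minibatch SGD on the squared loss -- each step touches only
# `batch` pixels, so memory and per-step cost stay small.
for epoch in range(20):
    idx = rng.permutation(n)
    for start in range(0, n, batch):
        b = idx[start:start + batch]
        grad = 2 * X[b].T @ (X[b] @ w - d[b]) / len(b)
        w -= lr * grad

# Phase 2: a few full-batch gradient steps to refine the SGD solution.
for _ in range(100):
    w -= 0.001 * (2 * X.T @ (X @ w - d) / n)

print("train RMSE:", np.sqrt(np.mean((X @ w - d) ** 2)))
```

The same two-phase schedule applies to the MRF potentials, with the gradient of the model's negative log-likelihood in place of the squared-loss gradient.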
Next, I would like to explore how non-parametric techniques compare with parametric approaches on the same dataset. In particular, I will study the work in [4] and how it might be modified to improve reconstruction quality by adding some structure. One such modification is to reconstruct patches of similar depth rather than depths at individual pixels, and to impose smoothness constraints between nearby patches.
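The non-parametric pipeline of [4] can be sketched as nearest-neighbor retrieval by a global image descriptor, pixel-wise fusion of the retrieved depth maps, and then the patch-level smoothing proposed above. All sizes and the random "database" below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy database: one global descriptor (e.g. GIST-like) and one depth
# map per training scene; dimensions are illustrative only.
n_train, feat_dim, H, W = 50, 32, 16, 16
train_feats = rng.normal(size=(n_train, feat_dim))
train_depths = rng.uniform(1.0, 5.0, size=(n_train, H, W))

def transfer_depth(query_feat, k=5, patch=4):
    """kNN depth transfer: fuse depths of the k most similar scenes,
    then average within patches to enforce local smoothness."""
    dists = np.linalg.norm(train_feats - query_feat, axis=1)
    nn = np.argsort(dists)[:k]
    fused = np.median(train_depths[nn], axis=0)   # pixel-wise median fusion
    # Patch-level structure: one depth value per patch, not per pixel.
    out = fused.copy()
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            out[i:i+patch, j:j+patch] = fused[i:i+patch, j:j+patch].mean()
    return out

depth = transfer_depth(rng.normal(size=feat_dim))
print(depth.shape)  # (16, 16)
```

Replacing the per-patch mean with a smoothness term between neighboring patches would bring this closer to the structured variant proposed in the plan.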
Finally, I would like to investigate a method suited to 3D reconstruction that combines the strengths of non-parametric methods with a parametric component, in the spirit of what the authors of [5] propose for treating segmentation as a classification problem.
Datasets
For this problem I would like to use:
1- Make3D NIPS (2005) dataset
350 training images
100 testing images
2- Make3D PAMI (2008) dataset
400 training images
125 testing images
3- NYU_V2 (2012)
1600 images
Milestone
By the milestone mark, I will demonstrate training and testing of the different models with different features, and compare them on all the datasets for the pixel-level MRFs. This requires modifying, and in part rewriting, existing code so that the models comply with the referenced papers.
By this time I will also have started modifying the plane-parameter MRF, so that comparisons can be made and we can understand which MRF better models the 3D structure of indoor scenes.
I will also present an analysis of the reconstruction capabilities of both approaches, in order to better evaluate how to combine them.
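A comparison of this kind needs a common error metric. A minimal sketch of metrics commonly reported for single-image depth estimation (RMSE, mean absolute log10 error, mean relative error), assuming predicted and ground-truth depth maps given as positive arrays:

```python
import numpy as np

def depth_errors(pred, gt):
    """Standard single-image depth metrics: RMSE (in depth units),
    mean absolute log10 error, and mean relative error."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    log10 = np.mean(np.abs(np.log10(pred) - np.log10(gt)))
    rel = np.mean(np.abs(pred - gt) / gt)
    return rmse, log10, rel

# Tiny worked example: one pixel exact, one pixel off by 2x.
rmse, log10, rel = depth_errors([2.0, 4.0], [2.0, 2.0])
```

Reporting all three on each dataset makes the pixel MRF, plane-parameter MRF, and non-parametric results directly comparable.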
References
[1] A. Saxena, S. H. Chung, and A. Y. Ng, "Learning Depth from Single Monocular Images," in Neural Information Processing Systems (NIPS) 18, 2005.
[2] A. Saxena, S. H. Chung, and A. Y. Ng, "3-D Depth Reconstruction from a Single Still Image," International Journal of Computer Vision (IJCV), Aug. 2007.
[3] A. Saxena, M. Sun, and A. Y. Ng, "Make3D: Learning 3-D Scene Structure from a Single Still Image," IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2008.
[4] J. Konrad, G. Brown, M. Wang, P. Ishwar, C. Wu, and D. Mukherjee, "Automatic 2D-to-3D image conversion using 3D examples from the Internet," in Proc. SPIE Stereoscopic Displays and Applications, vol. 8288, Jan. 2012.
[5] J. Tighe and S. Lazebnik, "Superparsing: Scalable Nonparametric Image Parsing with Superpixels," in Proc. European Conf. on Computer Vision (ECCV), 2010.
[6] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, "Indoor Segmentation and Support Inference from RGBD Images," in Proc. European Conf. on Computer Vision (ECCV), 2012.