3D scene graphs have recently emerged as a powerful high-level representation of 3D environments. A 3D scene graph models the environment as a layered graph where nodes represent spatial concepts at multiple levels of abstraction (from low-level geometry to high-level semantics including objects, places, rooms, buildings, etc.) and edges represent relations between concepts.
While 3D scene graphs can serve as an advanced “Mental Model” for robots, how to build such a rich representation in real-time is still uncharted territory.
This paper describes a real-time Spatial Perception System, a suite of algorithms to build a 3D scene graph from sensor data in real-time.
Our first contribution is to develop real-time algorithms to incrementally construct the layers of a scene graph as the robot explores the environment; these algorithms build a local ESDF around the current robot trajectory estimate, extract a topological map of places from the ESDF, and then segment the places into rooms using an approach inspired by community-detection techniques.
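The room-segmentation step can be illustrated with a small graph-clustering example. The sketch below applies networkx's modularity-based community detection to a toy places graph; it is a minimal illustration of the community-detection idea only, not Hydra's actual algorithm, and the graph, node names, and clustering choice are assumptions.

```python
# A minimal sketch of the community-detection idea behind room segmentation:
# cluster a topological "places" graph so that densely connected places form
# candidate rooms. NOT Hydra's exact algorithm; the graph below is toy data.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Hypothetical places graph: nodes are place centroids extracted from the ESDF,
# edges connect places with traversable free space between them.
places = nx.Graph()
places.add_edges_from([
    ("p0", "p1"), ("p1", "p2"), ("p2", "p0"),   # densely connected cluster -> room A
    ("p3", "p4"), ("p4", "p5"), ("p5", "p3"),   # densely connected cluster -> room B
    ("p2", "p3"),                               # narrow passage (doorway) between them
])

# Communities of the places graph serve as room candidates.
rooms = greedy_modularity_communities(places)
for i, room in enumerate(rooms):
    print(f"room {i}: {sorted(room)}")
```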
Our second contribution is to investigate loop closure detection and optimization in 3D scene graphs. We show that 3D scene graphs allow defining hierarchical descriptors for place recognition; our descriptors capture statistics across layers in the scene graph, ranging from low-level visual appearance, to summary statistics about objects and places. We then propose the first algorithm to optimize a 3D scene graph in response to loop closures; our approach relies on embedded deformation graphs to simultaneously correct all layers of the scene graph. We implement the proposed system into a highly parallelized architecture, named Hydra, that combines fast early and mid-level perception processes with slower high-level perception. We evaluate Hydra on simulated and real data and show it is able to reconstruct 3D scene graphs with an accuracy comparable with batch offline methods, while running online.
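As a rough illustration of the hierarchical-descriptor idea for loop closure detection, the sketch below checks a candidate match layer by layer, from coarse place and object statistics down to visual appearance. The descriptor contents, layer ordering, and thresholds are assumptions for illustration, not the paper's exact definitions.

```python
# A minimal sketch of hierarchical place-recognition descriptors: one vector per
# scene-graph layer, and a candidate must pass a similarity test at every layer.
# Layer contents and thresholds below are illustrative assumptions.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def hierarchical_match(query, candidate, thresholds=(0.8, 0.7, 0.6)):
    """query/candidate: dicts with 'places', 'objects', 'appearance' vectors.
    Check coarse layers first and only descend if they agree."""
    for layer, tau in zip(("places", "objects", "appearance"), thresholds):
        if cosine(query[layer], candidate[layer]) < tau:
            return False
    return True

# Hypothetical descriptors: place statistics, an object-label histogram,
# and an appearance embedding (e.g., a bag-of-words or NetVLAD-style vector).
q = {"places": np.array([3.0, 1.0]),
     "objects": np.array([2.0, 0.0, 1.0, 4.0]),
     "appearance": np.random.rand(128)}
c = {"places": np.array([3.0, 1.2]),
     "objects": np.array([2.0, 0.0, 1.0, 3.0]),
     "appearance": q["appearance"] + 0.01 * np.random.rand(128)}
print(hierarchical_match(q, c))
```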
[…] 3D Scene Graphs [4, 26, 49, 50, 63, 67] have recently emerged as powerful high-level representations of 3D environments. […]
~
REFERENCES
[1] C. Agia, J. Krishna Murthy, M. Khodeir, O. Miksik, V. Vineet, M. Mukadam, L. Paull, and F. Shkurti. Taskography: Evaluating robot task planning over large 3D scene graphs. In Conference on Robot Learning (CoRL), 2021.
[2] P. Antonante, V. Tzoumas, H. Yang, and L. Carlone. Outlier-robust estimation: Hardness, minimally tuned algorithms, and applications. IEEE Trans. Robotics, 38(1):281–301, 2021.
[3] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 5297–5307, 2016.
[4] I. Armeni, Z. He, J. Gwak, A. Zamir, M. Fischer, J. Malik, and S. Savarese. 3D scene graph: A structure for unified semantics, 3D space, and camera. In Intl. Conf. on Computer Vision (ICCV), pages 5664–5673, 2019.
[…] Figure 1. 3D Scene Graph: It consists of 4 layers that represent semantics, 3D space, and camera. Elements are nodes in the graph and have certain attributes. Edges are formed between them to denote relationships (e.g., occlusion, relative volume, etc.).
[…] we articulate that 3D space is more stable and invariant, yet connected to images and other pixel and non-pixel output domains (e.g. depth). Hence, we ground semantic information there, and project it to other desired spaces as needed (e.g., images, etc.). Specifically, this means that the information is grounded on the underlying 3D mesh of a building. This approach presents a number of useful values, such as free 3D, amodal, occlusion, and open space analysis. **More importantly, semantics can be projected onto any number of visual observations (images and videos) which provides them with annotations without additional cost.**
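The projection of mesh-grounded semantics into images can be sketched in a few lines. The example below shows only a pinhole projection of labeled mesh vertices into a registered camera view, on toy data and without z-buffering or the paper's multi-view consistency machinery; all arrays, shapes, and parameters are assumptions for illustration.

```python
# A minimal sketch of projecting mesh-grounded semantic labels into an image:
# once labels live on 3D mesh vertices, any registered camera view can be
# annotated by projecting the labeled vertices through the camera model.
import numpy as np

def project_labels(vertices_w, labels, K, T_cw, image_shape):
    """vertices_w: (N,3) mesh vertices in world frame; labels: (N,) semantic ids;
    K: 3x3 intrinsics; T_cw: 4x4 world-to-camera transform."""
    H, W = image_shape
    semantic_image = np.full((H, W), -1, dtype=int)      # -1 = unlabeled pixel
    pts_h = np.hstack([vertices_w, np.ones((len(vertices_w), 1))])
    pts_c = (T_cw @ pts_h.T).T[:, :3]                    # transform to camera frame
    in_front = pts_c[:, 2] > 0
    uv = (K @ pts_c[in_front].T).T
    uv = (uv[:, :2] / uv[:, 2:3]).round().astype(int)    # perspective division
    for (u, v), lbl in zip(uv, labels[in_front]):
        if 0 <= u < W and 0 <= v < H:
            semantic_image[v, u] = lbl                   # nearest-vertex splat, no z-buffer
    return semantic_image

# Toy data: 100 labeled vertices in front of an identity-pose camera.
sem = project_labels(np.random.rand(100, 3) + [0.0, 0.0, 2.0],
                     np.random.randint(0, 5, 100),
                     np.array([[200.0, 0.0, 64.0], [0.0, 200.0, 48.0], [0.0, 0.0, 1.0]]),
                     np.eye(4), (96, 128))
print((sem >= 0).sum(), "labeled pixels")
```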
[…]
3D Scene Graph: A Structure for Unified Semantics, 3D Space, and Camera.
~
[26] J. Krause, J. Johnson, R. Krishna, and L. Fei-Fei. A hierarchical approach for generating descriptive image paragraphs. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 3337–3345, 2017.
[…] To what extent does describing images with paragraphs differ from sentence-level captioning? […]
[…]
Figure 2. Overview of our model. Given an image (left), a region detector (comprising a convolutional network and a region proposal network) detects regions of interest and produces features for each. Region features are projected to ℝ^P, pooled to give a compact image representation, and passed to a hierarchical recurrent neural network language model comprising a sentence RNN and a word RNN. The sentence RNN determines the number of sentences to generate based on the halting distribution p_i and also generates sentence topic vectors, which are consumed by each word RNN to generate sentences.
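A rough structural sketch of the two-level decoder described in the caption is given below: a sentence RNN rolls out a topic vector and a halting probability per step, and a word RNN decodes each topic into a sentence. The GRU cells, linear halting head, dimensions, and greedy decoding are assumptions for illustration, not the paper's exact architecture or training setup.

```python
# A minimal sketch of a hierarchical (sentence RNN + word RNN) paragraph decoder.
# Module choices and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class HierarchicalCaptioner(nn.Module):
    def __init__(self, img_dim=1024, hid=512, vocab=1000, max_sents=6, max_words=20):
        super().__init__()
        self.sent_rnn = nn.GRUCell(img_dim, hid)   # sentence-level RNN
        self.halt = nn.Linear(hid, 1)              # halting head (continue/stop per step)
        self.topic = nn.Linear(hid, hid)           # sentence topic vector
        self.word_rnn = nn.GRUCell(hid, hid)       # word-level RNN
        self.word_out = nn.Linear(hid, vocab)
        self.max_sents, self.max_words = max_sents, max_words

    def forward(self, pooled_image):               # pooled region features, (B, img_dim)
        h = pooled_image.new_zeros(pooled_image.size(0), self.sent_rnn.hidden_size)
        paragraph = []
        for _ in range(self.max_sents):
            h = self.sent_rnn(pooled_image, h)
            p_continue = torch.sigmoid(self.halt(h))   # probability of another sentence
            topic = self.topic(h)
            w_h, sentence = torch.zeros_like(topic), []
            for _ in range(self.max_words):            # greedy word decoding from the topic
                w_h = self.word_rnn(topic, w_h)
                sentence.append(self.word_out(w_h).argmax(-1))
            paragraph.append(sentence)
            if (p_continue < 0.5).all():               # stop when the halting head says so
                break
        return paragraph

para = HierarchicalCaptioner()(torch.randn(1, 1024))
print(len(para), "sentences generated")
```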
[…] 6. Conclusion In this paper we have introduced the task of describing images with long, descriptive paragraphs, and presented a hierarchical approach for generation that leverages the compositional structure of both images and language. We have shown that paragraph generation is different from traditional image captioning and have tailored our model to suit these differences. Experimentally, we have demonstrated the advantages of our approach over traditional image captioning methods and shown how region-level knowledge can be effectively transferred to paragraph captioning. We have also demonstrated the benefits of our model in interpretability, generating descriptive paragraphs using only a subset of image regions. We anticipate further opportunities for knowledge transfer at the intersection of vision and language, and project that visual and lingual compositionality will continue to lie at the heart of effective paragraph generation.