Archival Workshop Publications
1. Octree Transformer: Autoregressive 3D Shape Generation on Hierarchically Structured Sequences
Abstract: Autoregressive models have proven to be very powerful in NLP text generation tasks and have lately gained popularity for image generation as well. However, they have seen limited use for the synthesis of 3D shapes so far. This is mainly due to the lack of a straightforward way to linearize 3D data, as well as to scaling problems with the length of the resulting sequences when describing complex shapes. In this work we address both of these problems. We use octrees as a compact hierarchical shape representation that can be sequentialized by traversal ordering. Moreover, we introduce an adaptive compression scheme that significantly reduces sequence lengths and thus enables their effective generation with a transformer, while still allowing fully autoregressive sampling and parallel training. We demonstrate the performance of our model by performing superresolution and comparing against the state of the art in shape generation.
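As a rough illustration of what "sequentializing an octree by traversal ordering" can look like, here is a generic breadth-first linearization; it is not the paper's adaptive compression scheme, and the voxelization depth and token encoding are invented for the example.

```python
import numpy as np
from collections import deque

def build_occupancy(points, depth):
    """Voxelize points in [0,1)^3 onto a (2^depth)^3 boolean grid."""
    res = 2 ** depth
    grid = np.zeros((res, res, res), dtype=bool)
    idx = np.clip((points * res).astype(int), 0, res - 1)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid

def octree_tokens(grid):
    """Breadth-first octree serialization: one 8-bit child-mask token per
    non-leaf node. A generic linearization, not the paper's compressed one."""
    res = grid.shape[0]
    tokens = []
    queue = deque([(0, 0, 0, res)])            # (x, y, z, size) of a node cube
    while queue:
        x, y, z, s = queue.popleft()
        if s == 1:
            continue                            # leaves carry no child mask
        h = s // 2
        mask = 0
        for i, (dx, dy, dz) in enumerate(
                [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]):
            cx, cy, cz = x + dx * h, y + dy * h, z + dz * h
            if grid[cx:cx + h, cy:cy + h, cz:cz + h].any():
                mask |= 1 << i
                queue.append((cx, cy, cz, h))   # recurse only into occupied children
        tokens.append(mask)                     # token vocabulary: 0..255
    return tokens

# Toy usage: a random point cloud serialized at depth 4.
pts = np.random.rand(2048, 3)
seq = octree_tokens(build_occupancy(pts, depth=4))
print(len(seq), seq[:8])
```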
2. 3DSSR: 3D Subscene Retrieval
Abstract: We present the task of 3D subscene retrieval (3DSSR). In this task, a user specifies a query object and a set of context objects in a 3D scene. The system then retrieves and ranks subscenes from a database of 3D scenes that best correspond to the configuration defined by the query. This formulation generalizes prior work on context-based 3D object retrieval and 3D scene retrieval. To tackle this task we present POINTCROP: a self-supervised point cloud encoder training scheme that enables retrieval of geometrically similar subscenes without relying on object category supervision. We evaluate POINTCROP against alternative methods and baselines through a suite of evaluation metrics that measure the degree of subscene correspondence. Our experiments show that POINTCROP training outperforms supervised and prior self-supervised training paradigms by 4.33% and 9.11% mAP, respectively.
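For intuition, subscene retrieval of this kind can be framed as ranking database subscenes by descriptor similarity to the query. The sketch below uses a hand-crafted placeholder descriptor where POINTCROP would use its learned, self-supervised encoder.

```python
import numpy as np

def embed_subscene(points):
    """Placeholder for a learned point cloud encoder: any function mapping an
    (N, 3) subscene to a fixed-length descriptor works here. Simple per-axis
    moment statistics are used purely for illustration."""
    centered = points - points.mean(axis=0)
    return np.concatenate([centered.std(axis=0), np.abs(centered).mean(axis=0)])

def rank_subscenes(query_pts, database):
    """Rank database subscenes by cosine similarity to the query descriptor."""
    q = embed_subscene(query_pts)
    q = q / (np.linalg.norm(q) + 1e-8)
    scores = []
    for name, pts in database.items():
        d = embed_subscene(pts)
        d = d / (np.linalg.norm(d) + 1e-8)
        scores.append((name, float(q @ d)))
    return sorted(scores, key=lambda s: -s[1])

db = {f"scene_{i}": np.random.rand(512, 3) for i in range(5)}
print(rank_subscenes(np.random.rand(256, 3), db))
```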
3. Attention-based Part Assembly for 3D Volumetric Shape Modeling
Abstract: Modeling a 3D volumetric shape as an assembly of decomposed shape parts is much more challenging, but semantically more valuable, than direct reconstruction from a full shape representation. The neural network needs to implicitly learn part relations coherently, which is typically performed by dedicated network layers that generate transformation matrices for each part. In this paper, we propose VoxAttention, a network architecture for attention-based part assembly. We further propose a variant that uses channel-wise part attention and show the advantages of this approach. Experimental results show that our method outperforms most state-of-the-art methods for the part relation-aware 3D shape modeling task.
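A minimal sketch of the general pattern of attention-based part assembly, assuming per-part feature tokens and a head that regresses one affine transform per part; the actual VoxAttention architecture and its channel-wise variant are not reproduced here.

```python
import torch
import torch.nn as nn

class PartAssembler(nn.Module):
    """Illustrative only: self-attention over per-part feature tokens, followed
    by a head that regresses a 3x4 affine transform for each part."""
    def __init__(self, feat_dim=128, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.to_transform = nn.Linear(feat_dim, 12)      # 3x4 affine per part

    def forward(self, part_feats):                        # (B, P, feat_dim)
        ctx, _ = self.attn(part_feats, part_feats, part_feats)
        theta = self.to_transform(ctx)                    # (B, P, 12)
        return theta.view(*theta.shape[:2], 3, 4)         # per-part 3x4 matrices

parts = torch.randn(2, 8, 128)                            # 2 shapes, 8 parts each
print(PartAssembler()(parts).shape)                       # torch.Size([2, 8, 3, 4])
```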
4. SepicNet: Sharp Edges Recovery by Parametric Inference of Curves in 3D Shapes
Abstract: 3D scanning, a technique for digitizing real-world objects and creating their 3D models, is used in many fields and areas. Though the quality of 3D scans depends on the technical characteristics of the 3D scanner, a common drawback is the smoothing of fine details and sharp edges of an object. We introduce SepicNet, a novel deep network for the detection and parametrization of sharp edges in 3D shapes as primitive curves. To make the network end-to-end trainable, we formulate the curve fitting in a differentiable manner. We also develop an adaptive point cloud sampling technique that captures sharp features better than uniform sampling. Experiments conducted on a newly introduced large-scale dataset of 50k 3D scans, with sharp edge annotations extracted from their parametric CAD models, demonstrate significant improvement over state-of-the-art methods.
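To illustrate why differentiable curve fitting matters for end-to-end training, here is a toy example: a single straight edge fitted to noisy points by gradient descent. SepicNet's actual curve parametrization and network are not shown.

```python
import torch

def point_to_line_distance(points, origin, direction):
    """Distance from each point to the infinite line (origin, direction)."""
    d = direction / (direction.norm() + 1e-8)
    rel = points - origin
    proj = (rel @ d).unsqueeze(-1) * d
    return (rel - proj).norm(dim=-1)

# Synthetic edge points along [1, 2, 0.5] with small noise.
t = torch.linspace(0, 1, 200).unsqueeze(-1)
points = t * torch.tensor([1.0, 2.0, 0.5]) + 0.01 * torch.randn(200, 3)

# Because the residual is differentiable, line parameters can be refined by
# gradient descent inside a larger network's training loop.
origin = torch.zeros(3, requires_grad=True)
direction = torch.tensor([1.0, 0.0, 0.0], requires_grad=True)
opt = torch.optim.Adam([origin, direction], lr=0.05)
for _ in range(300):
    opt.zero_grad()
    loss = point_to_line_distance(points, origin, direction).pow(2).mean()
    loss.backward()
    opt.step()
print(direction / direction.norm())   # ~ normalized [1, 2, 0.5] up to sign
```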
5. IPD-Net: SO(3) Invariant Primitive Decompositional Network for 3D Point Clouds
Abstract: In this paper, we propose IPD-Net (Invariant Primitive Decompositional Network), an SO(3)-invariant framework for the decomposition of a point cloud. The human cognitive system is able to identify and interpret familiar objects regardless of their orientation and abstraction. Recent research aims to bring this capability to machines for understanding the 3D world. In this work, we present a framework inspired by human cognition that decomposes point clouds into four primitive 3D shapes (plane, cylinder, cone, and sphere) and enables machines to understand objects irrespective of their orientation. We employ Implicit Invariant Features (IIF) to learn local geometric relations by implicitly representing the point cloud with enhanced geometric information that is invariant to SO(3) rotations. We also use a Spatial Rectification Unit (SRU) to extract invariant global signatures. We demonstrate the results of our proposed methodology for SO(3)-invariant decomposition on the TraceParts dataset, and show the generalizability of IPD-Net as a plugin for the downstream task of point cloud classification. We compare our classification results with state-of-the-art methods on the benchmark ModelNet40 dataset.
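As a simple illustration of SO(3) invariance, the hand-rolled local features below depend only on neighborhood distances and angles, so they are unchanged by any global rotation; they stand in for the paper's learned IIF/SRU components, which work differently.

```python
import numpy as np

def so3_invariant_local_features(points, k=8):
    """Per-point features built only from distances and angles within the
    k-neighborhood; both quantities are preserved by rotations."""
    n = len(points)
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    nbr = np.argsort(d2, axis=1)[:, 1:k + 1]               # k nearest, excluding self
    feats = []
    for i in range(n):
        rel = points[nbr[i]] - points[i]                    # (k, 3)
        dist = np.linalg.norm(rel, axis=1)
        unit = rel / (dist[:, None] + 1e-8)
        cosines = unit @ unit.T                             # angles between neighbors
        feats.append(np.concatenate([np.sort(dist),
                                     np.sort(cosines[np.triu_indices(k, 1)])]))
    return np.asarray(feats)

pts = np.random.rand(64, 3)
R, _ = np.linalg.qr(np.random.randn(3, 3))                  # random orthogonal matrix
f1 = so3_invariant_local_features(pts)
f2 = so3_invariant_local_features(pts @ R.T)
print(np.allclose(f1, f2, atol=1e-6))                       # True: rotation-invariant
```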
Non-archival Paper Presentations
6. GAPartNet: Cross-Category Domain-Generalizable Object Perception and Manipulation via Generalizable and Actionable Parts
Abstract: For years, researchers have been devoted to generalizable object perception and manipulation, where cross-category generalizability is highly desired yet underexplored. In this work, we propose to learn such cross-category skills via Generalizable and Actionable Parts (GAParts). By identifying and defining 9 GAPart classes (lids, handles, etc.) in 27 object categories, we construct a large-scale part-centric interactive dataset, GAPartNet, where we provide rich, part-level annotations (semantics, poses) for 8,489 part instances on 1,166 objects. Based on GAPartNet, we investigate three cross-category tasks: part segmentation, part pose estimation, and part-based object manipulation. Given the significant domain gaps between seen and unseen object categories, we propose a robust 3D segmentation method from the perspective of domain generalization by integrating adversarial learning techniques. Our method outperforms all existing methods by a large margin on both seen and unseen categories. Furthermore, with part segmentation and pose estimation results, we leverage the GAPart pose definition to design part-based manipulation heuristics that generalize well to unseen object categories in both the simulator and the real world.
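One standard way to "integrate adversarial learning" for domain generalization is a gradient-reversal domain discriminator; the toy sketch below shows only that generic pattern, not GAPartNet's actual segmentation network or losses.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, sign-flipped gradient in the backward pass:
    the classic gradient-reversal trick for domain-adversarial training."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None

feat_net = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 64))
seg_head = nn.Linear(64, 9)          # e.g. 9 part classes
dom_head = nn.Linear(64, 2)          # domain (seen vs. unseen-like) discriminator

x = torch.randn(32, 3)               # toy per-point inputs
part_labels = torch.randint(0, 9, (32,))
dom_labels = torch.randint(0, 2, (32,))

feats = feat_net(x)
loss = (nn.functional.cross_entropy(seg_head(feats), part_labels)
        + nn.functional.cross_entropy(dom_head(GradReverse.apply(feats, 1.0)),
                                      dom_labels))
loss.backward()                       # features are pushed to fool the domain head
```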
7. ROAD: Learning an Implicit Recursive Octree Auto-Decoder to Efficiently Encode 3D Shapes
Abstract: Compact and accurate representations of 3D shapes are central to many perception and robotics tasks. State-of-the-art learning-based methods can reconstruct single objects but scale poorly to large datasets. We present a novel recursive implicit representation that efficiently and accurately encodes large datasets of complex 3D shapes by recursively traversing an implicit octree in latent space. Our implicit Recursive Octree Auto-Decoder (ROAD) learns a hierarchically structured latent space enabling state-of-the-art reconstruction results at a compression ratio above 99%. We also propose an efficient curriculum learning scheme that naturally exploits the coarse-to-fine properties of the underlying octree spatial representation. We explore the scaling law relating latent space dimension, dataset size, and reconstruction accuracy, showing that increasing the latent space dimension is enough to scale to large shape datasets. Finally, we show that our learned latent space encodes a coarse-to-fine hierarchical structure yielding reusable latents across different levels of detail, and we provide qualitative evidence of generalization to novel shapes outside the training set.
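A toy sketch of the recursive latent-octree idea: a shared network splits a parent latent into eight child latents, and a small decoder maps a latent plus local coordinates to occupancy. ROAD's losses, node pruning, and curriculum scheme are not reproduced here.

```python
import torch
import torch.nn as nn

class RecursiveLatentOctree(nn.Module):
    """Illustrative only: recursive subdivision of latents over octree levels."""
    def __init__(self, dim=64):
        super().__init__()
        self.split = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(),
                                   nn.Linear(256, 8 * dim))
        self.decode = nn.Sequential(nn.Linear(dim + 3, 128), nn.ReLU(),
                                    nn.Linear(128, 1))
        self.dim = dim

    def expand(self, root_latent, depth):
        """Return latents of all nodes at the given octree depth."""
        latents = root_latent.unsqueeze(0)                  # (1, dim)
        for _ in range(depth):
            children = self.split(latents)                  # (N, 8*dim)
            latents = children.view(-1, self.dim)           # (8*N, dim)
        return latents

model = RecursiveLatentOctree()
root = torch.randn(64)                                      # one shape's latent code
leaves = model.expand(root, depth=3)                        # 8^3 = 512 node latents
occ = model.decode(torch.cat([leaves, torch.zeros(len(leaves), 3)], dim=1))
print(leaves.shape, occ.shape)                              # (512, 64), (512, 1)
```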
8. FLEX: Full-Body Grasping Without Full-Body Grasps
Abstract: Synthesizing 3D human avatars that interact realistically with a scene is an important problem with applications in AR/VR, video games, and robotics. Towards this goal, we address the task of generating a virtual human -- hands and full body -- grasping everyday objects. Existing methods approach this problem by collecting a 3D dataset of humans interacting with objects and training on this data. However, 1) these methods do not generalize to different object positions and orientations, or to the presence of furniture in the scene, and 2) the diversity of their generated full-body poses is very limited. In this work, we address all of the above challenges to generate realistic, diverse full-body grasps in everyday scenes without requiring any 3D full-body grasping data. Our key insight is to leverage the existence of both full-body pose and hand grasping priors, composing them using 3D geometric constraints to obtain full-body grasps. We empirically validate that these constraints can generate a variety of feasible human grasps that are superior to baselines both quantitatively and qualitatively.
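The idea of composing priors through geometric constraints can be illustrated with a toy optimization: a low-dimensional body code is pushed to satisfy a grasp-location constraint while staying near its prior. Both "priors" below are stand-ins, not the models used in FLEX.

```python
import torch

# Pretend decoder: body code -> wrist position (a stand-in for a pose prior's decoder).
body_decoder = torch.nn.Linear(16, 3)
grasp_location = torch.tensor([0.4, 0.1, 0.9])   # proposed by a hand-grasp prior

z = torch.zeros(16, requires_grad=True)
opt = torch.optim.Adam([z], lr=0.05)
for _ in range(200):
    opt.zero_grad()
    wrist = body_decoder(z)
    constraint = (wrist - grasp_location).pow(2).sum()   # 3D geometric constraint
    prior = 0.01 * z.pow(2).sum()                        # stay near the prior mean
    (constraint + prior).backward()
    opt.step()
print(body_decoder(z).detach())                  # approximately grasp_location
```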
9. Self-Supervised Learning of Action Affordances as Interaction Modes
Abstract: When humans perform a task with an articulated object, they interact with the object in only a handful of ways, while the space of all possible interactions is nearly endless. This is because humans have prior knowledge about which interactions are likely to be successful, i.e., to open a new door we first try the handle. While learning such priors without supervision is easy for humans, it is notoriously hard for machines. In this work, we tackle unsupervised learning of priors of useful interactions with articulated objects, which we call interaction modes. In contrast to prior art, we use no supervision or privileged information; we only assume access to the depth sensor in the simulator to learn the interaction modes. More precisely, we define a successful interaction as one that changes the visual environment substantially, and we learn a generative model of such interactions that can be conditioned on the desired goal state of the object. In our experiments, we show that our model covers most of the human interaction modes, outperforms existing state-of-the-art methods for affordance learning, and can generalize to objects never seen during training. Additionally, we show promising results in the goal-conditioned setup, where our model can be quickly fine-tuned to perform a given task. Overall, the learned affordances cover most interaction modes of the queried articulated object and can be fine-tuned into a goal-conditioned model.
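For concreteness, a "substantial visual change" criterion for a successful interaction could be implemented as a simple threshold on depth-image change; the thresholds below are invented for the example and are not the paper's.

```python
import numpy as np

def interaction_success(depth_before, depth_after, frac=0.05, delta=0.01):
    """An interaction counts as successful if at least `frac` of depth pixels
    moved by more than `delta` (illustrative thresholds only)."""
    changed = np.abs(depth_after - depth_before) > delta
    return changed.mean() > frac

before = np.ones((64, 64))
after = before.copy()
after[:16, :] -= 0.2                        # e.g. a drawer slid open in this region
print(interaction_success(before, after))   # True
```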
10. PartNeRF: Generating Part-Aware Editable 3D Shapes without 3D Supervision
Abstract: Impressive progress in generative models and implicit representations has given rise to methods that can generate 3D shapes of high quality. However, being able to locally control and edit shapes is another essential property that can unlock several content creation applications. Local control can be achieved with part-aware models, but existing methods require 3D supervision and cannot produce textures. In this work, we devise PartNeRF, a novel part-aware generative model for editable 3D shape synthesis that does not require any explicit 3D supervision. Our model generates objects as a set of locally defined NeRFs, augmented with an affine transformation. This enables several editing operations, such as applying transformations to parts, mixing parts from different objects, etc. To ensure distinct, manipulable parts, we enforce a hard assignment of rays to parts, which guarantees that the color of each ray is determined by a single NeRF. As a result, altering one part does not affect the appearance of the others. Evaluations on various ShapeNet categories demonstrate the ability of our model to generate editable 3D objects of improved fidelity, compared to previous part-based generative approaches that require 3D supervision or models relying on NeRFs.
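A minimal illustration of hard ray-to-part assignment, the property that guarantees each ray's color comes from a single part: here parts are reduced to constant colors and the assignment to an argmax, whereas PartNeRF uses per-part NeRFs and its own assignment rule.

```python
import numpy as np

def render_rays_hard_assignment(ray_dirs, part_centers, part_colors):
    """Each ray is assigned to exactly one part (here, the part center most
    aligned with the ray) and its color is taken from that part alone, so
    editing one part cannot change rays owned by another."""
    dirs = ray_dirs / np.linalg.norm(ray_dirs, axis=1, keepdims=True)
    ctrs = part_centers / np.linalg.norm(part_centers, axis=1, keepdims=True)
    assignment = np.argmax(dirs @ ctrs.T, axis=1)        # hard: one part per ray
    return part_colors[assignment], assignment

rays = np.random.randn(10, 3)
centers = np.array([[1.0, 0, 0], [0, 1.0, 0], [0, 0, 1.0]])
colors = np.array([[1.0, 0, 0], [0, 1.0, 0], [0, 0, 1.0]])
rgb, assign = render_rays_hard_assignment(rays, centers, colors)
print(assign, rgb.shape)
```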
11. Point2CAD: Reverse Engineering CAD Models from 3D Point Clouds
Abstract: Computer-Aided Design (CAD) model reconstruction from point clouds is an important problem at the intersection of computer vision, graphics, and machine learning. Recent advancements in this direction achieve rather reliable semantic segmentation but still struggle to produce an adequate topology of the CAD model. We propose a hybrid analytic-neural reconstruction scheme that bridges the gap between segmented point clouds and structured CAD models. To power the surface fitting stage, we propose a novel implicit neural representation of freeform surfaces, driving up the performance of our overall CAD reconstruction scheme. We evaluate our method on the ABC benchmark of CAD models and set a new state of the art for that dataset.
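To illustrate the analytic half of a hybrid analytic-neural pipeline, here is a least-squares plane fit for one segmented cluster; freeform surfaces would instead go to a learned implicit representation, which is not sketched here.

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane fit to one segmented point cluster: the plane normal
    is the direction of least variance from an SVD of the centered points."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    normal = vt[-1]
    return centroid, normal

pts = np.random.rand(500, 3)
pts[:, 2] = 0.3 + 0.001 * np.random.randn(500)    # nearly planar cluster at z ~ 0.3
c, n = fit_plane(pts)
print(np.round(np.abs(n), 3))                      # approximately [0, 0, 1]
```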
Contact Info
E-mail: kmo@nvidia.com
Acknowledgements
Website template borrowed from: https://futurecv.github.io/ (Thanks to Deepak Pathak)