Archival Workshop Publications [CVF Official Archive]
1 | Poster Session 1 | 3D Scene Angles using UL Decomposition of Planar Homography
Abstract: Proctoring during online exams often requires students to be under surveillance from a side pose, and there is a strong need to estimate the side camera’s position relative to the student’s computer screen. This work uses edge and line detectors to extract the computer screen’s boundaries and estimates the homography with respect to a rectangle of the corresponding aspect ratio, as it would appear in a frontal view. A novel Upper-Lower Decomposition of Homography (ULDH) algorithm is proposed that calculates the polar and azimuthal angles with mean errors below 5 degrees and can distinguish bad camera placements from good ones with high precision. A purpose-built dataset is created and validated for this purpose, and the software for key parts of the image-processing pipeline is made available for remote proctoring use.
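For concreteness, a minimal sketch of the angle-recovery step, using OpenCV's standard homography decomposition as a stand-in for the paper's ULDH algorithm; the corner coordinates and identity intrinsics are illustrative assumptions.

    # Recover approximate viewing angles of a screen from its homography.
    import cv2
    import numpy as np

    # Detected screen corners in the side-view image (hypothetical values).
    img_corners = np.float32([[120, 80], [530, 60], [545, 390], [110, 370]])
    # A frontal-view rectangle with the screen's 16:9 aspect ratio.
    ref_corners = np.float32([[0, 0], [160, 0], [160, 90], [0, 90]])

    H, _ = cv2.findHomography(ref_corners, img_corners)
    K = np.eye(3)  # intrinsics; assumed normalized for this sketch
    num, Rs, Ts, Ns = cv2.decomposeHomographyMat(H, K)

    # Convert the first candidate rotation into polar/azimuthal angles
    # of the screen normal relative to the camera.
    z_axis = Rs[0] @ np.array([0.0, 0.0, 1.0])
    polar = np.degrees(np.arccos(np.clip(z_axis[2], -1.0, 1.0)))
    azimuth = np.degrees(np.arctan2(z_axis[1], z_axis[0]))
    print(f"polar = {polar:.1f} deg, azimuth = {azimuth:.1f} deg")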
2 | Poster Session 1 & 2 | MRGAN: Multi-Rooted 3D Shape Representation Learning with Unsupervised Part Disentanglement
Abstract: We introduce MRGAN, a multi-rooted GAN, the first generative adversarial network to learn a part-disentangled 3D shape representation without any part supervision. The network fuses multiple branches of tree-structured graph convolution layers which produce point clouds in a controllable manner. Specifically, each branch learns to grow a different shape part, offering control over the shape generation at the part level. Our network encourages disentangled generation of semantic parts via two key ingredients: a root-mixing training strategy which helps decorrelate the different branches to facilitate disentanglement, and a set of loss terms designed with part disentanglement and shape semantics in mind. Of these, a novel convexity loss incentivizes the generation of parts that are more convex, as semantic parts tend to be. In addition, a root-dropping loss further ensures that each root seeds a single part, preventing the degeneration or over-growth of the point-producing branches. We evaluate the performance of our network on a number of 3D shape classes, and offer qualitative and quantitative comparisons to previous works and baseline approaches. We demonstrate the controllability offered by our part-disentangled representation through two applications for shape modeling: part mixing and individual part variation, without receiving segmented shapes as input.
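As one way to make the convexity term concrete, the sketch below scores a part point cloud by sampling points inside its convex hull and measuring their distance to the nearest generated point, so a part that fills its hull (i.e., is convex) scores low; this is an illustrative proxy, not the paper's exact loss.

    import numpy as np
    from scipy.spatial import ConvexHull, Delaunay, cKDTree

    def convexity_loss(part_points, n_samples=512):
        # Triangulate the hull so we can test point membership.
        hull = ConvexHull(part_points)
        tri = Delaunay(part_points[hull.vertices])
        # Rejection-sample candidate points inside the bounding box.
        lo, hi = part_points.min(0), part_points.max(0)
        samples = np.random.uniform(lo, hi, size=(n_samples * 4, 3))
        inside = samples[tri.find_simplex(samples) >= 0][:n_samples]
        if len(inside) == 0:
            return 0.0
        # Distance from each interior sample to the nearest generated point.
        dists, _ = cKDTree(part_points).query(inside)
        return float(dists.mean())

    loss = convexity_loss(np.random.randn(1024, 3))  # toy part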
3 | Poster Session 1 & 2 | ABD-Net: Attention Based Decomposition Network for 3D Point Cloud Decomposition
Abstract: In this paper, we propose the Attention Based Decomposition Network (ABD-Net) for decomposing point clouds into basic geometric shapes, namely planes, spheres, cones and cylinders. We show improved performance of 3D object classification using attention features based on primitive shapes in point clouds. Point clouds, being a simple and compact representation of 3D objects, have gained increasing popularity. They demand robust methods for feature extraction due to the unordered nature of point sets. In ABD-Net, the proposed Local Proximity Encapsulator captures the local geometric variations, along with spatial encoding, around each point of the input point sets. The encapsulated local features are then passed to the proposed Attention Feature Encoder to learn basic shapes in the point cloud. The Attention Feature Encoder models geometric relationships between the neighborhoods of all points, capturing global point-cloud information. We demonstrate the results of our proposed ABD-Net on the ANSI mechanical component and ModelNet40 datasets. We also demonstrate the effectiveness of the acquired attention features by improving the performance of 3D object classification on the ModelNet40 benchmark and comparing against state-of-the-art techniques.
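A hedged sketch of the two-stage design: k-nearest-neighbor grouping encodes local geometric variation per point, then self-attention across all points captures global context. Module names follow the abstract; the internals are assumptions, not the paper's exact layers.

    import torch
    import torch.nn as nn

    class LocalProximityEncapsulator(nn.Module):
        def __init__(self, k=16, dim=64):
            super().__init__()
            self.k = k
            self.mlp = nn.Sequential(nn.Linear(6, dim), nn.ReLU(),
                                     nn.Linear(dim, dim))

        def forward(self, pts):                       # pts: (B, N, 3)
            d = torch.cdist(pts, pts)                 # pairwise distances
            idx = d.topk(self.k, largest=False).indices         # (B, N, k)
            nbrs = torch.gather(
                pts.unsqueeze(1).expand(-1, pts.size(1), -1, -1),
                2, idx.unsqueeze(-1).expand(-1, -1, -1, 3))
            rel = nbrs - pts.unsqueeze(2)             # relative offsets
            feat = self.mlp(torch.cat(
                [pts.unsqueeze(2).expand_as(rel), rel], dim=-1))
            return feat.max(dim=2).values             # (B, N, dim)

    class AttentionFeatureEncoder(nn.Module):
        def __init__(self, dim=64, heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, x):                         # x: (B, N, dim)
            out, _ = self.attn(x, x, x)               # all-to-all attention
            return out

    pts = torch.randn(2, 1024, 3)
    feats = AttentionFeatureEncoder()(LocalProximityEncapsulator()(pts))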
Non-archival Short Presentations
4 | Poster Session 2 | Online Adaptation for Implicit Object Tracking and Shape Reconstruction in the Wild
Abstract: Tracking and reconstructing 3D objects from cluttered scenes are key components of computer vision, robotics and autonomous driving systems. While recent progress in implicit functions (e.g., DeepSDF) has shown encouraging results on high-quality 3D shape reconstruction, it is still very challenging to generalize to cluttered and partially observable LiDAR data. In this paper, we propose to leverage the continuity in video data. We introduce a novel and unified framework which utilizes a DeepSDF model to simultaneously perform object tracking and 3D reconstruction in the wild. We perform online adaptation of the DeepSDF model over the video, iteratively improving the shape reconstruction, which in turn improves tracking, and vice versa. We experiment with the Waymo dataset and show significant improvements over state-of-the-art methods for both tracking and shape reconstruction.
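A minimal sketch of what an online-adaptation step could look like, assuming a pretrained DeepSDF-style network `deepsdf(latent, points) -> sdf` and an assumed helper `transform_to_object_frame`; both names are placeholders, and the loss terms are illustrative.

    import torch

    def adapt_frame(deepsdf, latent, pose, lidar_pts, steps=20, lr=1e-2):
        # Jointly refine the shape code and object pose so that observed
        # LiDAR points lie on the zero level set of the SDF.
        latent = latent.clone().requires_grad_(True)
        pose = pose.clone().requires_grad_(True)   # e.g., (tx, ty, tz, yaw)
        opt = torch.optim.Adam([latent, pose], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            pts = transform_to_object_frame(lidar_pts, pose)  # assumed helper
            sdf = deepsdf(latent, pts)
            loss = sdf.abs().mean() + 1e-4 * latent.norm()    # surface + prior
            loss.backward()
            opt.step()
        return latent.detach(), pose.detach()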
5 | Poster Session 1 & 2 | MeronymNet: A Unified Framework for Part and Category Controllable Generation of Objects
Abstract: We introduce MeronymNet, a novel hierarchical approach for controllable, part-based generation of multi-category objects using a single unified model. We adopt a guided coarse-to-fine strategy involving semantically conditioned generation of bounding-box layouts, pixel-level part layouts and, ultimately, the object depictions. We use Graph Convolutional Networks and Deep Recurrent Networks, along with custom-designed Conditional Variational Autoencoders, to enable flexible, diverse and category-aware generation of 2-D objects in a controlled manner. The performance scores and generations reflect MeronymNet’s superior performance compared to scene generation baselines and ablative variants.
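A hedged sketch of the coarse-to-fine conditioning chain; every module below is a placeholder standing in for a full network, and the point is only how each stage conditions the next.

    def generate_object(category_label, part_list,
                        box_cvae, layout_net, appearance_net):
        # Stage 1: category- and part-conditioned bounding-box layout.
        boxes = box_cvae.sample(category_label, part_list)
        # Stage 2: pixel-level part-label maps inside the predicted boxes.
        part_layout = layout_net(category_label, boxes)
        # Stage 3: final object depiction conditioned on the part layout.
        return appearance_net(part_layout)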
6 | Poster Session 2 | Multi-Person 3D Motion Prediction with Multi-Range Transformers
Abstract: We propose a novel framework for multi-person 3D motion trajectory prediction. Our key observation is that a person’s actions and behaviors may depend strongly on the other people around them. Thus, instead of predicting each human pose trajectory in isolation, we introduce a Multi-Range Transformer model which consists of a local-range encoder for individual motion and a global-range encoder for social interactions. The Transformer decoder then performs prediction for each person by taking a corresponding pose as a query which attends to both local-range and global-range encoder features. Our model not only outperforms state-of-the-art methods on long-term 3D motion prediction, but also generates diverse social interactions. More interestingly, our model can even predict the motion of 15 people simultaneously by automatically dividing them into different interaction groups.
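A hedged sketch of the attention layout: the decoder query for one person attends first to that person's local-range features and then to global-range features covering everyone. Dimensions and layer structure are assumptions.

    import torch
    import torch.nn as nn

    class MultiRangeDecoderLayer(nn.Module):
        def __init__(self, dim=128, heads=8):
            super().__init__()
            self.local_attn = nn.MultiheadAttention(dim, heads,
                                                    batch_first=True)
            self.global_attn = nn.MultiheadAttention(dim, heads,
                                                     batch_first=True)
            self.ff = nn.Sequential(nn.Linear(dim, dim * 4), nn.ReLU(),
                                    nn.Linear(dim * 4, dim))

        def forward(self, query, local_feats, global_feats):
            # query: (B, 1, D) pose embedding of the target person
            q, _ = self.local_attn(query, local_feats, local_feats)
            g, _ = self.global_attn(q, global_feats, global_feats)
            return g + self.ff(g)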
7 | Poster Session 2 | DexMV: Imitation Learning for Dexterous Manipulation from Human Videos
Abstract: While significant progress has been made on understanding hand-object interactions in computer vision, it is still very challenging for robots to perform complex dexterous manipulation. In this paper, we propose a new platform and pipeline, DexMV (Dexterous Manipulation from Videos), for imitation learning to bridge the gap between computer vision and robot learning. We design a platform with: (i) a simulation system for complex dexterous manipulation tasks with a multi-finger robot hand and (ii) a computer vision system to record large-scale demonstrations of a human hand conducting the same tasks. In our new pipeline, we extract 3D hand and object poses from the videos, and convert them to robot demonstrations via motion retargeting. We then apply and compare multiple imitation learning algorithms with the demonstrations. We show that the demonstrations can indeed improve robot learning by a large margin and solve the complex tasks.
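A hedged sketch of the video-to-demonstration pipeline; each function below is a placeholder for a full subsystem described in the abstract.

    def video_to_demonstrations(video_frames):
        demos = []
        for frame in video_frames:
            hand_pose, obj_pose = estimate_3d_poses(frame)  # vision system
            robot_q = retarget_hand_to_robot(hand_pose)     # motion retargeting
            demos.append((robot_q, obj_pose))
        return demos
    # The resulting trajectories can then seed imitation learning
    # (e.g., behavior cloning or demonstration-augmented RL) in simulation.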
8 | Poster Session 1 & 2 | Hand-Object Contact Consistency Reasoning for Human Grasps Generation
Abstract: While predicting robot grasps with parallel-jaw grippers has been well studied and widely applied in robot manipulation tasks, natural human grasp generation with a multi-finger hand remains a very challenging problem. In this paper, we propose to generate human grasps given a 3D object in the world. Our key observation is that it is crucial to model the consistency between the hand contact points and object contact regions. That is, we encourage the prior hand contact points to be close to the object surface and, at the same time, the common object contact regions to be touched by the hand. Based on this hand-object contact consistency, we design novel objectives for training the human grasp generation model, as well as a new self-supervised task which allows the grasp generation network to be adjusted even during test time. Our experiments show significant improvement in human grasp generation over state-of-the-art approaches by a large margin. More interestingly, optimizing the model during test time with the self-supervised task yields larger gains on unseen and out-of-domain objects.
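One way to express the consistency objective as code: a chamfer-style term that pulls designated hand contact points toward the object, plus a symmetric term that pulls likely object contact regions toward the hand. Both terms are illustrative, not the paper's exact losses.

    import torch

    def contact_consistency_loss(hand_contact_pts, obj_pts, obj_contact_mask):
        # hand_contact_pts: (H, 3); obj_pts: (O, 3)
        # obj_contact_mask: (O,) bool, likely contact regions on the object
        d = torch.cdist(hand_contact_pts, obj_pts)     # (H, O)
        hand_to_obj = d.min(dim=1).values.mean()       # hand touches object
        obj_to_hand = d[:, obj_contact_mask].min(dim=0).values.mean()
        return hand_to_obj + obj_to_hand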
9 | Poster Session 1 & 2 | A-SDF: Learning Disentangled Signed Distance Functions for Articulated Shape Representation
Abstract: Recent work has made significant progress on using implicit functions as a continuous representation for 3D rigid object shape reconstruction. However, much less effort has been devoted to modeling general articulated objects. Compared to rigid objects, articulated objects have higher degrees of freedom, which makes it hard to generalize to unseen shapes. To deal with the large shape variance, we introduce Articulated Signed Distance Functions (A-SDF) to represent articulated shapes with a disentangled latent space, where we have separate codes for encoding shape and articulation. With this disentangled continuous representation, we demonstrate that we can control the articulation input and animate unseen instances with unseen joint angles. Furthermore, we propose a Test-Time Adaptation inference algorithm to adjust our model during inference. We demonstrate that our model generalizes well to out-of-distribution and unseen data, e.g., partial point clouds and real-world depth images.
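A minimal sketch of the disentangled interface: the network is queried at 3D points with separate shape and articulation codes, so animating an unseen instance amounts to holding the shape code fixed while varying the articulation code. Layer sizes are assumptions.

    import torch
    import torch.nn as nn

    class ArticulatedSDF(nn.Module):
        def __init__(self, shape_dim=256, art_dim=8, hidden=512):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(3 + shape_dim + art_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1))

        def forward(self, xyz, shape_code, art_code):
            # Broadcast the two codes to every query point.
            codes = torch.cat([shape_code, art_code]).expand(xyz.size(0), -1)
            return self.net(torch.cat([xyz, codes], dim=-1)).squeeze(-1)

    model = ArticulatedSDF()
    sdf = model(torch.randn(4096, 3), torch.zeros(256), torch.zeros(8))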
10 | Poster Session 1 | Towards Part-Based Understanding of RGB-D Scans
Abstract: Recent advances in 3D semantic scene understanding have shown impressive progress in 3D instance segmentation, enabling object-level reasoning about 3D scenes; however, a finer-grained understanding is required to enable interactions with objects and understanding of their function. Thus, we propose the task of part-based scene understanding of real-world 3D environments: from an RGB-D scan of a scene, we detect objects and, for each object, predict its decomposition into geometric part masks which, composed together, form the complete geometry of the observed object. We leverage an intermediary part graph representation to enable robust completion as well as the building of part priors, which we use to construct the final part mask predictions. Our experiments demonstrate that guiding part understanding through the part graph to part prior-based predictions significantly outperforms alternative approaches on the task of semantic part completion.
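A hedged sketch of the detection-to-parts flow; the three networks and the final completion step are placeholders for the components named in the abstract.

    def part_based_scene_understanding(rgbd_scan, detector,
                                       graph_net, prior_net):
        parts = []
        for obj in detector(rgbd_scan):              # 3D object detection
            graph = graph_net(obj)                   # intermediary part graph
            prior = prior_net(graph)                 # part prior from the graph
            parts.append(predict_part_masks(obj, prior))  # assumed final step
        return parts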
11 | Poster Session 1 & 2 | VLGrammar: Grounded Grammar Induction of Vision and Language
Abstract: While grammar is an essential representation of natural language, it also exists ubiquitously in vision to represent hierarchical part-whole structure. In this work, we study grounded grammar induction of vision and language in a joint learning framework. Specifically, we present VLGrammar, a method that uses compound probabilistic context-free grammars (compound PCFGs) to induce the language grammar and the image grammar simultaneously. We propose a novel contrastive learning framework to guide the joint learning of both modules. We collect a large-scale dataset, PARTIT, which contains human-written sentences that describe part-level semantics for 3D objects. Experiments on the PARTIT dataset show that VLGrammar outperforms all baselines in both image grammar induction and language grammar induction. The learned VLGrammar naturally benefits related downstream tasks. Specifically, it improves image unsupervised clustering accuracy by 30%, and performs well in image retrieval and text retrieval. Notably, the induced grammar shows superior generalizability, easily generalizing to unseen categories.
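A hedged sketch of the contrastive glue between the two grammar modules: an InfoNCE-style loss pulling together embeddings of paired image and sentence parses. How each embedding is derived from the compound PCFGs is abstracted away.

    import torch
    import torch.nn.functional as F

    def parse_contrastive_loss(img_emb, txt_emb, temperature=0.07):
        # img_emb, txt_emb: (B, D) embeddings of paired parses
        img = F.normalize(img_emb, dim=-1)
        txt = F.normalize(txt_emb, dim=-1)
        logits = img @ txt.t() / temperature
        targets = torch.arange(img.size(0), device=img.device)
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2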
12 | Poster Session 2 | Joint Learning of 3D Shape Retrieval and Deformation
Abstract: We propose a novel technique for producing high-quality 3D models that match a given target object image or scan. Our method is based on retrieving an existing shape from a database of 3D models and then deforming its parts to match the target shape. Unlike previous approaches that independently focus on either shape retrieval or deformation, we propose a joint learning procedure that simultaneously trains the neural deformation module along with the embedding space used by the retrieval module. This enables our network to learn a deformation-aware embedding space, so that retrieved models are more amenable to matching the target after an appropriate deformation. In fact, we use the embedding space to guide the shape pairs used to train the deformation module, so that it invests its capacity in learning deformations between meaningful shape pairs. Furthermore, our novel part-aware deformation module can work with inconsistent and diverse part structures on the source shapes. We demonstrate the benefits of our joint training not only in our novel framework, but also for other state-of-the-art neural deformation modules proposed in recent years. Lastly, we show that our jointly trained method outperforms various non-joint baselines.
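A hedged sketch of how one joint training step could couple the two modules: retrieval happens in the learned embedding space, and the deformation module trains on exactly the retrieved pair, so gradients shape both. `chamfer_distance` and the module interfaces are assumed placeholders.

    import torch

    def joint_step(encoder, deformer, db_shapes, target, opt):
        emb_db = encoder(db_shapes)                 # (N, D) database embeddings
        emb_t = encoder(target.unsqueeze(0))        # (1, D) target embedding
        idx = torch.cdist(emb_t, emb_db).argmin()   # retrieve nearest source
        deformed = deformer(db_shapes[idx], target) # deform toward the target
        fit = chamfer_distance(deformed, target)    # assumed fitting loss
        opt.zero_grad(); fit.backward(); opt.step() # updates both modules
        return fit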
13 | Poster Session 1 & 2 | DeepMetaHandles: Learning Deformation Meta-Handles of 3D Meshes with Biharmonic Coordinates
Abstract: We propose DeepMetaHandles, a 3D conditional generative model based on mesh deformation. Given a collection of 3D meshes of a category and their deformation handles (control points), our method learns a set of meta-handles for each shape, which are represented as combinations of the given handles. The disentangled meta-handles factorize all the plausible deformations of the shape, while each of them corresponds to an intuitive deformation. A new deformation can then be generated by sampling the coefficients of the meta-handles in a specific range. We employ biharmonic coordinates as the deformation function, which can smoothly propagate the control points’ translations to the entire mesh. To avoid learning zero deformation as meta-handles, we incorporate a target-fitting module which deforms the input mesh to match a random target. To enhance the plausibility of deformations, we employ a soft-rasterizer-based discriminator that projects the meshes to a 2D space. Our experiments demonstrate the superiority of the generated deformations as well as the interpretability and consistency of the learned meta-handles.
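The deformation function itself is compact enough to sketch: biharmonic coordinates map handle translations to all vertices, and a meta-handle is a learned combination of the given handles scaled by a sampled coefficient. Shapes follow the abstract; values are illustrative.

    import torch

    def deform(vertices, W, meta_handle_dirs, coeffs):
        # vertices: (V, 3); W: (V, H) biharmonic coordinates
        # meta_handle_dirs: (M, H, 3) learned handle combinations
        # coeffs: (M,) coefficients sampled in a bounded range
        handle_offsets = (coeffs[:, None, None] * meta_handle_dirs).sum(0)
        return vertices + W @ handle_offsets  # smooth propagation to the mesh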
14 | Poster Session 1 & 2 | Multi-Plane Program Induction with 3D Box Priors
Abstract: We consider two important aspects of understanding and editing images: modeling regular, program-like texture or patterns in 2D planes, and the 3D posing of these planes in the scene. Unlike prior work on image-based program synthesis, which assumes the image contains a single visible 2D plane, we present Box Program Induction (BPI), which infers a program-like scene representation that simultaneously models repeated structure on multiple 2D planes, the 3D position and orientation of the planes, and camera parameters, all from a single image. Our model assumes a box prior, i.e., that the image captures either an inner view or an outer view of a box in 3D. It uses neural networks to infer visual cues such as vanishing points or wireframe lines to guide a search-based algorithm to find the program that best explains the image. Such a holistic, structured scene representation enables 3D-aware interactive image editing operations such as inpainting missing pixels, changing camera parameters, and extrapolating the image contents.
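A hedged sketch of the guided search described above: neural cues restrict the space of candidate box programs, and each candidate is scored by how well its rendering explains the image. All functions are placeholders.

    def induce_box_program(image, cue_net, candidates, render, score):
        cues = cue_net(image)                  # vanishing points, wireframes
        best, best_score = None, float("-inf")
        for prog in candidates(cues):          # cue-guided enumeration
            s = score(render(prog), image)     # reconstruction fit
            if s > best_score:
                best, best_score = prog, s
        return best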
Contact Info
E-mail: struco3d@googlegroups.com or kaichun@cs.stanford.edu