Generative factorization for object-centric representation learning
Empowering machines to understand compositionality is considered by many (Lake et al., 2017; Lake and Baroni, 2018; Schölkopf et al., 2021) a promising path towards improved representational interpretability and out-of-distribution generalization. Yet, discovering the compositional structures of raw sensory data requires solving a factorization problem, i.e. decomposing the unstructured observations into modular components. Handling the factorization problem presents numerous technical challenges, especially in unsupervised settings which we explore to avoid the heavy burden of human annotation. In this thesis, we approach the factorization problem from a generative perspective. Specifically, we develop unsupervised machine learning models to recover the compositional data-generation mechanisms around objects from visual scene observations.
First, we present MulMON as the first feasible unsupervised solution to the multi-view object-centric representation learning problem. MulMON resolves the spatial ambiguities arising from single-image observations of static scenes, e.g. optical illusions and occlusion, with a multi-view inference design. We demonstrate that not only can MulMON perform better scene object factorization with less uncertainty than single-view methods, but it can also predict a scene's appearance and object segmentations for novel viewpoints. Next, we present a technique for latent duplicate suppression (LDS) and demonstrate its effectiveness in fixing a common scene object factorization issue that exists in various unsupervised object-centric learning models, namely inferring duplicate representations for the same objects. Finally, we present DyMON as the first unsupervised learner that can recover object-centric compositional generative mechanisms from moving-view, dynamic-scene observational data. We demonstrate that not only can DyMON factorize dynamic scenes in terms of objects, but it can also disentangle the effects of observer motion and object dynamics, which operate independently. Furthermore, we demonstrate that DyMON can predict a scene's appearance and segmentations at arbitrary times (querying across time) and from arbitrary viewpoints (querying across space), i.e. answer counterfactual questions.
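As a toy illustration of the compositional generation and space-time querying described above, the sketch below composites per-object components with softmax masks and exposes a query over viewpoint and time. The linear decoder, random weights, and image sizes are placeholder assumptions for demonstrating the compositing logic, not DyMON's actual architecture:

```python
import numpy as np

H, W, D = 8, 8, 4                        # toy image size and latent size (assumed)
rng = np.random.default_rng(0)
W_rgb = rng.standard_normal((H * W * 3, D + 4)) * 0.1   # shared toy decoder weights
W_msk = rng.standard_normal((H * W, D + 4)) * 0.1

def decode_object(z, viewpoint, t):
    """Toy per-object decoder: maps an object latent plus a space-time query
    (3-D viewpoint, scalar time) to an RGB component and a mask logit.
    A real model would use a neural decoder; a fixed linear map stands in."""
    q = np.concatenate([z, viewpoint, [t]])
    rgb = (W_rgb @ q).reshape(H, W, 3)
    logit = (W_msk @ q).reshape(H, W)
    return rgb, logit

def render(latents, viewpoint, t):
    """Composite K object components with softmax masks, the common
    mixture-decoder pattern in object-centric generative models."""
    rgbs, logits = zip(*(decode_object(z, viewpoint, t) for z in latents))
    logits = np.stack(logits)                       # (K, H, W)
    masks = np.exp(logits - logits.max(0))
    masks /= masks.sum(0)                           # masks sum to 1 per pixel
    image = (masks[..., None] * np.stack(rgbs)).sum(0)
    seg = masks.argmax(0)                           # per-pixel object id
    return image, masks, seg

latents = [rng.standard_normal(D) for _ in range(3)]
img, masks, seg = render(latents, viewpoint=np.array([1.0, 0.0, 0.5]), t=0.25)
```

Querying across space or time then amounts to re-rendering the same latents with a different `viewpoint` or `t`, without re-inferring the scene.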
The scene modeling explored in this thesis is a proof of concept, which we hope will inspire: 1) a broader range of downstream applications (e.g. "world modelling" and environment interactions) and 2) generative factorization research that targets more complex compositional structures (e.g. complex textures, multi-granularity compositions).
Learning Object-Centric Representations of Multi-Object Scenes from Multiple Views
Learning object-centric representations of multi-object scenes is a promising
approach towards machine intelligence, facilitating high-level reasoning and
control from visual sensory data. However, current approaches for unsupervised
object-centric scene representation are incapable of aggregating information
from multiple observations of a scene. As a result, these "single-view" methods
form their representations of a 3D scene based only on a single 2D observation
(view). Naturally, this leads to several inaccuracies, with these methods
falling victim to single-view spatial ambiguities. To address this, we propose
The Multi-View and Multi-Object Network (MulMON) -- a method for learning
accurate, object-centric representations of multi-object scenes by leveraging
multiple views. In order to sidestep the main technical difficulty of the
multi-object-multi-view scenario -- maintaining object correspondences across
views -- MulMON iteratively updates the latent object representations for a
scene over multiple views. To ensure that these iterative updates do indeed
aggregate spatial information to form a complete 3D scene understanding, MulMON
is asked to predict the appearance of the scene from novel viewpoints during
training. Through experiments, we show that MulMON resolves spatial
ambiguities better than single-view methods -- learning more accurate and
disentangled object representations -- and also achieves new functionality in
predicting object segmentations for novel viewpoints.
Comment: Accepted at NeurIPS 2020 (Spotlight).
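The iterative cross-view aggregation can be caricatured in one latent dimension as a product-of-Gaussians update, where each view's noisy estimate sharpens the running posterior and the same latent slot is reused across views (preserving object correspondence). This is a hand-written simplification with made-up per-view numbers, not MulMON's learned inference network:

```python
import numpy as np

def update(mu, var, mu_v, var_v):
    """Fuse the current Gaussian belief with one view's estimate.
    Product of two Gaussians: precisions add, means are precision-weighted."""
    prec = 1.0 / var + 1.0 / var_v
    new_var = 1.0 / prec
    new_mu = new_var * (mu / var + mu_v / var_v)
    return new_mu, new_var

# prior over one dimension of an object latent
mu, var = 0.0, 1.0
# assumed per-view (mean, variance) estimates of the same object slot
observations = [(0.9, 0.5), (1.1, 0.4), (1.0, 0.3)]
for mu_v, var_v in observations:
    mu, var = update(mu, var, mu_v, var_v)
```

Because precisions only accumulate, the posterior variance shrinks monotonically as views are added, mirroring the claim that multi-view aggregation reduces uncertainty relative to any single view.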
A Robust Deformable Linear Object Perception Pipeline in 3D: From Segmentation to Reconstruction
3D perception of deformable linear objects (DLOs) is crucial for DLO manipulation, yet perceiving DLOs in 3D from a single RGBD image is challenging. Previous DLO perception methods fail to extract a decent 3D DLO model because of varying textures, occlusions, and sparse or false depth information. To address these problems and provide a more robust DLO perception initialization for downstream tasks such as tracking and manipulation in complex scenarios, this paper proposes a 3D DLO perception pipeline that first segments the DLO in 2D images and post-processes the masks to eliminate false-positive segmentations, then reconstructs the DLO in 3D space to predict its occluded parts, and finally smooths the reconstructed DLO physically. By testing on a synthetic DLO dataset and further validating on a real-world dataset with seven different DLOs, we demonstrate that the proposed method is an effective and robust 3D perception pipeline, with better performance on 2D DLO segmentation and 3D DLO reconstruction than state-of-the-art algorithms.
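The reconstruct-and-smooth stages can be sketched minimally: given ordered 3-D centerline samples where occluded points are marked NaN, fill the gaps by linear interpolation and then apply a small moving average. Both steps are stand-in assumptions (the paper's reconstruction and physics-based smoothing are more sophisticated):

```python
import numpy as np

def reconstruct_dlo(points):
    """Toy DLO reconstruction: `points` is an (N, 3) array of ordered
    centerline samples with NaN rows where the DLO is occluded.
    Step 1: inpaint each coordinate by linear interpolation along the index.
    Step 2: smooth interior points with a 3-tap moving average,
    a crude stand-in for physics-based smoothing."""
    pts = points.copy()
    idx = np.arange(len(pts))
    for d in range(3):
        col = pts[:, d]
        missing = np.isnan(col)
        col[missing] = np.interp(idx[missing], idx[~missing], col[~missing])
    kernel = np.ones(3) / 3.0
    smooth = pts.copy()
    for d in range(3):
        smooth[1:-1, d] = np.convolve(pts[:, d], kernel, mode="valid")
    return smooth

# straight DLO along x with an occluded middle segment
pts = np.arange(10, dtype=float)[:, None] * np.array([1.0, 0.0, 0.0])
pts[4:6] = np.nan
out = reconstruct_dlo(pts)
```

For a straight segment, interpolation plus averaging recovers the line exactly; for curved DLOs this is only a first-order approximation of the occluded geometry.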
DUGMA: Dynamic Uncertainty-Based Gaussian Mixture Alignment
Accurately registering point clouds from a cheap low-resolution sensor is a
challenging task. Existing rigid registration methods fail to use the
physical 3D uncertainty distribution of each point from a real sensor in the
dynamic alignment process, mainly because the uncertainty model for a point is
treated as static and invariant, and it is hard to describe how these physical
uncertainty models change during registration. Additionally, the existing
Gaussian mixture alignment architecture cannot efficiently implement these
dynamic changes.
This paper proposes a simple architecture combining error estimation from
sample covariances and dual dynamic global probability alignment using the
convolution of uncertainty-based Gaussian Mixture Models (GMM) from point
clouds. Firstly, we propose an efficient way to describe the change of each 3D
uncertainty model, which represents the structure of the point cloud much
better. Unlike the invariant GMM (representing a fixed point cloud) in
traditional Gaussian mixture alignment, we use two uncertainty-based GMMs that
change and interact with each other in each iteration. In order to have a wider
basin of convergence than other local algorithms, we design a more robust
energy function by convolving efficiently the two GMMs over the whole 3D space.
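The convolution of two Gaussian mixtures over all of 3-D space has a closed form, since the integral of a product of two Gaussian densities is itself a Gaussian evaluated at the difference of means with summed covariances. The sketch below uses that identity to score alignment; equal component weights and made-up covariances are simplifying assumptions, not DUGMA's full formulation:

```python
import numpy as np

def gaussian_overlap(mu1, S1, mu2, S2):
    """Closed form for the overlap of two Gaussian components:
    integral over R^k of N(x; mu1, S1) N(x; mu2, S2) dx
    = N(mu1 - mu2; 0, S1 + S2)."""
    d = mu1 - mu2
    S = S1 + S2
    k = len(d)
    norm = 1.0 / np.sqrt((2 * np.pi) ** k * np.linalg.det(S))
    return norm * np.exp(-0.5 * d @ np.linalg.solve(S, d))

def gmm_convolution_energy(pts_a, covs_a, pts_b, covs_b):
    """Negative total overlap of two uncertainty-based GMMs built from
    two point clouds: lower energy means better alignment. Per-point
    covariances let noisier points contribute broader, weaker evidence."""
    e = 0.0
    for mu1, S1 in zip(pts_a, covs_a):
        for mu2, S2 in zip(pts_b, covs_b):
            e -= gaussian_overlap(mu1, S1, mu2, S2)
    return e

rng = np.random.default_rng(1)
cloud = rng.standard_normal((5, 3))
covs = [np.eye(3) * 0.1 for _ in range(5)]        # assumed per-point uncertainty
e_aligned = gmm_convolution_energy(cloud, covs, cloud, covs)
e_shifted = gmm_convolution_energy(cloud, covs, cloud + 5.0, covs)
```

Because every pairwise overlap is integrated over the whole space, the energy varies smoothly with misalignment, which is what gives this family of objectives a wider basin of convergence than point-to-point costs.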
Tens of thousands of trials have been conducted on hundreds of models from
multiple datasets to demonstrate the proposed method's superior performance
compared with the current state-of-the-art methods. The new dataset and code
are available from https://github.com/Canpu999
Comment: Accepted by 3DV 2018. 9 pages. arXiv admin note: text overlap with
arXiv:1707.0862
SDF-MAN: Semi-supervised Disparity Fusion with Multi-scale Adversarial Networks
Refining raw disparity maps from different algorithms to exploit their
complementary advantages is still challenging. Uncertainty estimation and
complex disparity relationships among pixels limit the accuracy and robustness
of existing methods, and there is no standard method for fusing different
kinds of depth data. In this paper, we introduce a new method to fuse disparity
maps from different sources, while incorporating supplementary information
(intensity, gradient, etc.) into a refiner network to better refine raw
disparity inputs. A discriminator network classifies disparities at different
receptive fields and scales. Assuming a Markov Random Field for the refined
disparity map produces better estimates of the true disparity distribution.
Both fully supervised and semi-supervised versions of the algorithm are
proposed. The approach includes a more robust loss function to inpaint invalid
disparity values and requires much less labeled data to train in the
semi-supervised learning mode. The algorithm can be generalized to fuse depths
from different kinds of depth sources. Experiments explored different fusion
opportunities: stereo-monocular fusion, stereo-ToF fusion and stereo-stereo
fusion. The experiments show the superiority of the proposed algorithm compared
with the most recent algorithms on public synthetic datasets (Scene Flow,
SYNTH3, our synthetic garden dataset) and real datasets (Kitti2015 dataset and
Trimbot2020 Garden dataset).
Comment: This is the authors' draft, accepted by the journal Remote Sensing. The title on arXiv differs slightly from that in Remote Sensing. Two small corrections have been made in "Performance on Kitti2015 Dataset" in this latest version (which is slightly different from the previous version in Remote Sensing).
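A hand-crafted baseline makes the fusion task concrete: combine raw disparity maps pixel-wise by confidence weighting, with invalid disparities masked out so that valid sources inpaint the holes left by others. The `<= 0` invalid-disparity convention and per-source confidence maps are assumptions for illustration; the paper's learned refiner replaces this fixed rule:

```python
import numpy as np

def fuse_disparities(maps, confidences):
    """Pixel-wise confidence-weighted fusion of raw disparity maps from
    different sources (e.g. stereo, monocular, ToF). Disparities <= 0 are
    treated as invalid and get zero weight, so any valid source fills the
    holes of the others."""
    maps = np.stack(maps).astype(float)              # (S, H, W)
    conf = np.stack(confidences).astype(float)
    conf = np.where(maps > 0, conf, 0.0)             # mask invalid pixels
    total = conf.sum(0)
    fused = (conf * maps).sum(0) / np.where(total > 0, total, 1.0)
    return fused, total > 0                          # fused map + validity mask

# two toy sources, each with one invalid (zero) pixel
a = np.array([[2.0, 0.0], [4.0, 4.0]])
b = np.array([[2.0, 3.0], [0.0, 8.0]])
fused, valid = fuse_disparities([a, b], [np.ones((2, 2)), np.ones((2, 2))])
```

Where both sources are valid the result is their weighted average; where only one is valid its value is taken directly, which is the behavior the learned refiner generalizes with supplementary intensity and gradient cues.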