
    Generative factorization for object-centric representation learning

    Empowering machines to understand compositionality is considered by many (Lake et al., 2017; Lake and Baroni, 2018; Schölkopf et al., 2021) to be a promising path towards improved representational interpretability and out-of-distribution generalization. Yet, discovering the compositional structure of raw sensory data requires solving a factorization problem, i.e. decomposing unstructured observations into modular components. The factorization problem presents numerous technical challenges, especially in the unsupervised settings we explore to avoid the heavy burden of human annotation. In this thesis, we approach the factorization problem from a generative perspective. Specifically, we develop unsupervised machine learning models to recover the compositional, object-centric data-generation mechanisms underlying visual scene observations. First, we present MulMON, the first feasible unsupervised solution to the multi-view object-centric representation learning problem. MulMON resolves the spatial ambiguities arising from single-image observations of static scenes, e.g. optical illusions and occlusion, with a multi-view inference design. We demonstrate that MulMON not only factorizes scene objects better and with less uncertainty than single-view methods, but also predicts a scene's appearance and object segmentations for novel viewpoints. Next, we present a latent duplicate suppression technique (LDS) and demonstrate its effectiveness in fixing a scene-object factorization issue common to various unsupervised object-centric learning models: inferring duplicate representations for the same object. Finally, we present DyMON, the first unsupervised learner that can recover an object-centric compositional generative mechanism from moving-view, dynamic-scene observations. We demonstrate that DyMON not only factorizes dynamic scenes in terms of objects, but also factorizes the entangled effects of observer motion and object dynamics, which operate independently. Furthermore, we demonstrate that DyMON can predict a scene's appearance and segmentations at arbitrary times (querying across time) and from arbitrary viewpoints (querying across space), i.e. answer counterfactual questions. The scene modeling explored in this thesis is a proof of concept, which we hope will inspire 1) a broader range of downstream applications (e.g. "world modelling" and environment interactions) and 2) generative factorization research that targets more complex compositional structures (e.g. complex textures, multi-granularity compositions).
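    The duplicate-representation issue that LDS targets can be illustrated with a generic check over a set of inferred object latents. The sketch below (Python/NumPy) flags slot pairs whose latent vectors are nearly identical; it only illustrates the problem, it is not the LDS technique itself, whose details are not given in this abstract, and the similarity threshold is an arbitrary assumption.

```python
# Illustrative sketch only: a generic duplicate-latent check, NOT the LDS
# technique from the thesis. Object latents are assumed to be rows of a (K, D) array.
import numpy as np

def find_duplicate_slots(latents: np.ndarray, threshold: float = 0.95):
    """Return index pairs of object latents whose cosine similarity exceeds
    `threshold`, i.e. slots that likely encode the same object."""
    normed = latents / (np.linalg.norm(latents, axis=1, keepdims=True) + 1e-8)
    sim = normed @ normed.T                      # (K, K) cosine similarities
    k = latents.shape[0]
    return [(i, j) for i in range(k) for j in range(i + 1, k) if sim[i, j] > threshold]

# Example: slot 0 and slot 2 are near-duplicates of the same object.
z = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.99, 0.01, 0.0]])
print(find_duplicate_slots(z))   # -> [(0, 2)]
```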

    Learning Object-Centric Representations of Multi-Object Scenes from Multiple Views

    Learning object-centric representations of multi-object scenes is a promising approach towards machine intelligence, facilitating high-level reasoning and control from visual sensory data. However, current approaches for unsupervised object-centric scene representation are incapable of aggregating information from multiple observations of a scene. As a result, these "single-view" methods form their representations of a 3D scene based only on a single 2D observation (view). Naturally, this leads to several inaccuracies, with these methods falling victim to single-view spatial ambiguities. To address this, we propose the Multi-View and Multi-Object Network (MulMON), a method for learning accurate, object-centric representations of multi-object scenes by leveraging multiple views. To sidestep the main technical difficulty of the multi-object, multi-view scenario (maintaining object correspondences across views), MulMON iteratively updates the latent object representations for a scene over multiple views. To ensure that these iterative updates do indeed aggregate spatial information into a complete 3D scene understanding, MulMON is trained to predict the appearance of the scene from novel viewpoints. Through experiments, we show that MulMON resolves spatial ambiguities better than single-view methods, learning more accurate and disentangled object representations, and also achieves new functionality in predicting object segmentations for novel viewpoints. Comment: Accepted at NeurIPS 2020 (Spotlight).
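    A minimal sketch of the iterative multi-view aggregation loop described above is given below. The function names (encode, refine, render) and their signatures are hypothetical placeholders for learned modules, not MulMON's actual components; the point is only the structure: latents are updated one view at a time rather than re-inferred from scratch, and training supervises them through a novel-view prediction.

```python
# Structural sketch under assumed placeholder modules; not MulMON's architecture.
from typing import Any, Callable, Sequence

def aggregate_views(views: Sequence[Any],
                    viewpoints: Sequence[Any],
                    init_latents: Any,
                    encode: Callable[[Any, Any], Any],
                    refine: Callable[[Any, Any], Any]) -> Any:
    """Iteratively update per-object latents with one observation at a time,
    so object correspondences are maintained implicitly across views."""
    latents = init_latents
    for image, viewpoint in zip(views, viewpoints):
        evidence = encode(image, viewpoint)   # view-conditioned evidence
        latents = refine(latents, evidence)   # update; do not re-infer from scratch
    return latents

def novel_view_loss(latents, query_viewpoint, target_image, render, loss_fn):
    """Training signal: the aggregated latents must explain a held-out view."""
    return loss_fn(render(latents, query_viewpoint), target_image)
```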

    A Robust Deformable Linear Object Perception Pipeline in 3D: From Segmentation to Reconstruction

    3D perception of deformable linear objects (DLOs) is crucial for DLO manipulation. However, perceiving DLOs in 3D from a single RGBD image is challenging. Previous DLO perception methods fail to extract an adequate 3D DLO model in the presence of varying textures, occlusions, and sparse or erroneous depth information. To address these problems and provide a more robust DLO perception initialization for downstream tasks such as tracking and manipulation in complex scenarios, this paper proposes a 3D DLO perception pipeline that first segments the DLO in 2D images and post-processes the masks to eliminate false-positive segmentations, then reconstructs the DLO in 3D space to predict its occluded parts, and finally smooths the reconstructed DLO physically. By testing on a synthetic DLO dataset and further validating on a real-world dataset with seven different DLOs, we demonstrate that the proposed method is an effective and robust 3D perception pipeline, outperforming state-of-the-art algorithms on both 2D DLO segmentation and 3D DLO reconstruction.
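    Two of the pipeline stages lend themselves to a short illustration: lifting a segmentation mask with depth into 3D, and smoothing the resulting curve. The sketch below assumes a pinhole camera with hypothetical intrinsics (fx, fy, cx, cy) and uses a simple moving average as a stand-in for the paper's physics-based smoothing; it is not the authors' implementation.

```python
# Hedged sketch of two pipeline stages: mask-plus-depth back-projection and
# curve smoothing. Segmentation, mask post-processing, and occlusion completion
# are omitted; the smoothing here is a plain moving average, only a stand-in.
import numpy as np

def lift_mask_to_3d(mask: np.ndarray, depth: np.ndarray, fx, fy, cx, cy):
    """Back-project masked pixels with valid depth into camera-frame 3D points.
    `mask` is boolean (H, W); `depth` is metric depth (H, W)."""
    v, u = np.nonzero(mask & (depth > 0))
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)          # (N, 3) points

def smooth_polyline(points: np.ndarray, window: int = 5) -> np.ndarray:
    """Smooth an ordered 3D polyline with a centred moving average."""
    kernel = np.ones(window) / window
    return np.stack([np.convolve(points[:, i], kernel, mode="same")
                     for i in range(3)], axis=1)
```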

    DUGMA: Dynamic Uncertainty-Based Gaussian Mixture Alignment

    Accurately registering point clouds from a cheap, low-resolution sensor is a challenging task. Existing rigid registration methods fail to use the physical 3D uncertainty distribution of each point from a real sensor in the dynamic alignment process, mainly because each point's uncertainty model is treated as static and invariant, and it is hard to describe how these physical uncertainty models change during registration. Additionally, the existing Gaussian mixture alignment architecture cannot efficiently implement these dynamic changes. This paper proposes a simple architecture combining error estimation from sample covariances and dual dynamic global probability alignment, using the convolution of uncertainty-based Gaussian Mixture Models (GMMs) built from the point clouds. First, we propose an efficient way to describe how each 3D uncertainty model changes, which represents the structure of the point cloud much better. Unlike the invariant GMM (representing a fixed point cloud) in traditional Gaussian mixture alignment, we use two uncertainty-based GMMs that change and interact with each other in each iteration. To obtain a wider basin of convergence than other local algorithms, we design a more robust energy function by efficiently convolving the two GMMs over the whole 3D space. Tens of thousands of trials on hundreds of models from multiple datasets demonstrate the proposed method's superior performance compared with the current state-of-the-art methods. The new dataset and code are available from https://github.com/Canpu999
    Comment: Accepted at 3DV 2018. 9 pages. arXiv admin note: text overlap with arXiv:1707.0862
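    The convolution-based energy has a convenient closed form because the convolution of two Gaussians is another Gaussian whose covariance is the sum of the two. The sketch below computes an uncertainty-aware overlap score between two point clouds with per-point covariances and evaluates it under a rigid transform; it illustrates the spirit of such an energy, not the paper's exact formulation, and assumes 3D points throughout.

```python
# Hedged sketch of an uncertainty-aware GMM overlap energy (illustrative only).
# Each point i in cloud A carries covariance cov_a[i]; likewise for cloud B.
# Pairwise terms use (cov_a[i] + cov_b[j]), the covariance of the Gaussian
# obtained by convolving the two component Gaussians.
import numpy as np

def gmm_overlap(mu_a, cov_a, mu_b, cov_b):
    """Sum of pairwise Gaussian overlap terms between two uncertainty-based GMMs."""
    score = 0.0
    for i in range(len(mu_a)):
        for j in range(len(mu_b)):
            diff = mu_a[i] - mu_b[j]
            cov = cov_a[i] + cov_b[j]
            norm = np.sqrt(((2 * np.pi) ** 3) * np.linalg.det(cov))
            score += np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm
    return score

def registration_energy(mu_a, cov_a, mu_b, cov_b, R, t):
    """Energy to maximise over a rigid transform (R, t) applied to cloud B;
    rotating a point also rotates its covariance: R @ cov @ R.T."""
    mu_bt = mu_b @ R.T + t
    cov_bt = np.einsum("ij,njk,lk->nil", R, cov_b, R)
    return gmm_overlap(mu_a, cov_a, mu_bt, cov_bt)
```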

    SDF-MAN: Semi-supervised Disparity Fusion with Multi-scale Adversarial Networks

    Refining raw disparity maps from different algorithms to exploit their complementary advantages is still challenging. Uncertainty estimation and complex disparity relationships among pixels limit the accuracy and robustness of existing methods, and there is no standard method for fusing different kinds of depth data. In this paper, we introduce a new method to fuse disparity maps from different sources, while incorporating supplementary information (intensity, gradient, etc.) into a refiner network to better refine the raw disparity inputs. A discriminator network classifies disparities at different receptive fields and scales. Assuming a Markov Random Field for the refined disparity map produces better estimates of the true disparity distribution. Both fully supervised and semi-supervised versions of the algorithm are proposed. The approach includes a more robust loss function to inpaint invalid disparity values and requires much less labeled data to train in the semi-supervised mode. The algorithm can be generalized to fuse depths from different kinds of depth sources. Experiments explored different fusion opportunities: stereo-monocular fusion, stereo-ToF fusion, and stereo-stereo fusion. The experiments show the superiority of the proposed algorithm compared with the most recent algorithms on public synthetic datasets (Scene Flow, SYNTH3, our synthetic garden dataset) and real datasets (the Kitti2015 dataset and the Trimbot2020 Garden dataset). Comment: This is our draft, accepted by the journal Remote Sensing; the title on arXiv differs slightly from the one in Remote Sensing. Two small corrections have been made in "Performance on Kitti2015 Dataset" in this latest version (which is slightly different from the version published in Remote Sensing).
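    The combination of inpainting invalid disparities and the Markov Random Field assumption can be illustrated with a validity-masked data term plus a pairwise smoothness term. The sketch below is an illustration in that spirit, not the paper's actual objective; the invalid-value convention (disparity <= 0) and the smoothness weight are assumptions.

```python
# Hedged sketch of a validity-masked robust fusion loss (illustrative only).
# Invalid pixels (disparity <= 0) are excluded from the data term, so the
# refiner is free to inpaint them, while a first-order smoothness term plays
# the role of an MRF-style pairwise prior on the refined map.
import numpy as np

def fusion_loss(refined: np.ndarray, ground_truth: np.ndarray, smooth_weight: float = 0.1):
    valid = ground_truth > 0                              # labelled, valid pixels only
    data_term = np.abs(refined - ground_truth)[valid].mean()
    # First-order smoothness (pairwise term) on the refined disparity map.
    dx = np.abs(np.diff(refined, axis=1)).mean()
    dy = np.abs(np.diff(refined, axis=0)).mean()
    return data_term + smooth_weight * (dx + dy)

# Example: the loss only penalises pixels where ground truth is valid.
gt = np.array([[1.0, 0.0], [2.0, 3.0]])                   # 0.0 marks invalid
pred = np.array([[1.2, 1.5], [2.1, 2.9]])
print(fusion_loss(pred, gt))
```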