283 research outputs found
The attentive robot companion: learning spatial information from observation and verbal interaction
Ziegler L. The attentive robot companion: learning spatial information from observation and verbal interaction. Bielefeld: Universität Bielefeld; 2015. This doctoral thesis investigates how a robot companion can gain a certain degree of situational awareness through observation and interaction with its surroundings. The focus lies on the representation of spatial knowledge gathered continuously over time in an indoor environment. Against the background of research on interactive service robots, the thesis also presents methods for using this knowledge in inference and verbal communication tasks. The design and application of the models are guided by the requirements of referential communication. The approach involves analysing the dynamic properties of structures in the robot’s field of view, allowing it to distinguish objects of interest from other agents and background structures. The use of multiple persistent models representing these dynamic properties enables the robot to track changes in multiple scenes over time to establish spatial and temporal references. This work includes building a coherent representation considering allocentric and egocentric aspects of spatial knowledge for these models. Spatial analysis is extended with a semantic interpretation of objects and regions. This top-down approach for generating additional context information enhances the grounding process in communication. A holistic, boosting-based classification approach using a wide range of 2D and 3D visual features anchored in the spatial representation allows the system to identify room types. The process of grounding referential descriptions from a human interlocutor in the spatial representation is evaluated through referencing furniture. This method uses a probabilistic network for handling ambiguities in the descriptions and employs a strategy for resolving conflicts.
To demonstrate the real-world applicability of these approaches, the system was deployed on the mobile robot BIRON in a realistic apartment scenario involving observation and verbal interaction with an interlocutor.
ToBI - Team of Bielefeld: The Human-Robot Interaction System for RoboCup@Home 2014
Ziegler L, Wittrowski J, Meyer zu Borgsen S, Wachsmuth S. ToBI - Team of Bielefeld: The Human-Robot Interaction System for RoboCup@Home 2014. Presented at the RoboCup 2014, João Pessoa, Brasil
Label Efficient 3D Scene Understanding
3D scene understanding models are becoming increasingly integrated into modern society. With applications ranging from autonomous driving, Augmented Reality, Virtual Reality, robotics and mapping, the demand for well-behaved models is rapidly increasing. A key requirement for training modern 3D models is high-quality manually labelled training data. Collecting training data is often the time and monetary bottleneck, limiting the size of datasets. As modern data-driven neural networks require very large datasets to achieve good generalisation, finding alternative strategies to manual labelling is sought after for many industries.
In this thesis, we present a comprehensive study on achieving 3D scene understanding with fewer labels. Specifically, we evaluate 4 approaches: existing data, synthetic data, weakly-supervised and self-supervised. Existing data looks at the potential of using readily available national mapping data as coarse labels for training a building segmentation model. We further introduce an energy-based active contour snake algorithm to improve label quality by utilising co-registered LiDAR data. This is attractive as whilst the models may still require manual labels, these labels already exist. Synthetic data also exploits already existing data which was not originally designed for training neural networks. We demonstrate a pipeline for generating a synthetic Mobile Laser Scanner dataset. We experimentally evaluate if such a synthetic dataset can be used to pre-train smaller real-world datasets, increasing the generalisation with less data.
A weakly-supervised approach is presented which allows for competitive performance on challenging real-world benchmark 3D scene understanding datasets with up to 95% less data. We propose a novel learning approach where the loss function is learnt. Our key insight is that the loss function is a local function and therefore can be trained with less data on a simpler task. Once trained our loss function can be used to train a 3D object detector using only unlabelled scenes. Our method is both flexible and very scalable, even performing well across datasets.
Finally, we propose a method which only requires a single geometric representation of each object class as supervision for 3D monocular object detection. We discuss why typical L2-like losses do not work for 3D object detection when using differentiable renderer-based optimisation. We show that the undesirable local minima that the L2-like losses fall into can be avoided with the inclusion of a Generative Adversarial Network-like loss. We achieve state-of-the-art performance on the challenging 6DoF LineMOD dataset, without any scene-level labels.
Advances in Data-Driven Analysis and Synthesis of 3D Indoor Scenes
This report surveys advances in deep learning-based modeling techniques that
address four different 3D indoor scene analysis tasks, as well as synthesis of
3D indoor scenes. We describe different kinds of representations for indoor
scenes, various indoor scene datasets available for research in the
aforementioned areas, and discuss notable works employing machine learning
models for such scene modeling tasks based on these representations.
Specifically, we focus on the analysis and synthesis of 3D indoor scenes. With
respect to analysis, we focus on four basic scene understanding tasks -- 3D
object detection, 3D scene segmentation, 3D scene reconstruction and 3D scene
similarity. And for synthesis, we mainly discuss neural scene synthesis works,
though also highlighting model-driven methods that allow for human-centric,
progressive scene synthesis. We identify the challenges involved in modeling
scenes for these tasks and the kind of machinery that needs to be developed to
adapt to the data representation, and the task setting in general. For each of
these tasks, we provide a comprehensive summary of the state-of-the-art works
across different axes such as the choice of data representation, backbone,
evaluation metric, input, output, etc., providing an organized review of the
literature. Towards the end, we discuss some interesting research directions
that have the potential to make a direct impact on the way users interact and
engage with these virtual scene models, making them an integral part of the
metaverse. Comment: Published in Computer Graphics Forum, Aug 202
High-level environment representations for mobile robots
In most robotic applications we are faced with the problem of building
a digital representation of the environment that allows the robot to
autonomously complete its tasks. This internal representation can be
used by the robot to plan a motion trajectory for its mobile base
and/or end-effector. For most man-made environments we do not have
a digital representation or it is inaccurate. Thus, the robot must
have the capability of building it autonomously. This is done by
integrating into an internal data structure incoming sensor
measurements. For this purpose, a common solution consists in solving
the Simultaneous Localization and Mapping (SLAM) problem. The map
obtained by solving a SLAM problem is called "metric" and it
describes the geometric structure of the environment. A metric map is
typically made up of low-level primitives (like points or
voxels). This means that even though it represents the shape of the
objects in the robot workspace it lacks the information of which
object a surface belongs to. Having an object-level representation of
the environment has the advantage of augmenting the set of possible
tasks that a robot may accomplish. To this end, in this thesis we
focus on two aspects. We propose a formalism to represent in a uniform
manner 3D scenes consisting of different geometric primitives,
including points, lines and planes. Consequently, we derive a local
registration and a global optimization algorithm that can exploit this
representation for robust estimation. Furthermore, we present a
Semantic Mapping system capable of building an object-based
map that can be used for complex task planning and execution. Our
system exploits effective reconstruction and recognition techniques
that require no a-priori information about the environment and can be
used under general conditions.
Representation Learning for Shape Decomposition, By Shape Decomposition
The ability to parse 3D objects into their constituent parts is essential for humans to understand and interact with the surrounding world. Imparting this skill in machines is important for various computer graphics, computer vision, and robotics tasks. Machines endowed with this skill can better interact with their surroundings and perform shape editing, texturing, recomposing, tracking, and animation. In this thesis, we ask two questions. First, how can machines decompose 3D shapes into their fundamental parts? Second, does the ability to decompose the 3D shape into these parts help learn useful 3D shape representations?
In this thesis, we focus on parsing the shape into compact representations, such as parametric surface patches and Constructive Solid Geometry (CSG) primitives, which are also widely used representations in 3D modeling in computer graphics. Inspired by the advances in neural networks for 3D shape processing, we develop neural network approaches to tackle shape decomposition. First, we present CSGNet, a network architecture to parse shapes into CSG programs, which is trained using a combination of supervised and reinforcement learning. Second, we present ParSeNet, a network architecture to decompose a shape into parametric surface patches (B-Spline) and geometric primitives (plane, cone, cylinder and sphere), trained on a large set of CAD models using supervised learning.
The training of deep neural network architectures for 3D recognition and generation tasks requires large amounts of labeled data. We explore ways to alleviate this problem by relying on shape decomposition methods to guide the learning process. Towards that end, we first study the use of freely available, albeit inconsistent, metadata from shape repositories to learn 3D shape features. Later we show that learning to decompose a 3D shape into geometric primitives also helps in learning shape representations useful for semantic segmentation tasks. Finally, since most 3D shapes encountered in real life are textured, consisting of several fine-grained semantic parts, we propose a method to learn fine-grained representations for textured 3D shapes in a self-supervised manner by incorporating 3D geometric priors.
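As a rough illustration of what a CSG program is, the sketch below composes two primitives with Boolean operations and evaluates the result at a few points. It is a minimal 2D sketch only: the primitive set, the function names (`circle`, `square`, `union`, `intersection`, `difference`) and the example program are assumptions made for illustration, and it does not reproduce CSGNet, which learns to infer such programs from shapes.

```python
from typing import Callable

# A shape is modeled as a point-membership test: True if (x, y) lies inside it.
Shape = Callable[[float, float], bool]


def circle(cx: float, cy: float, r: float) -> Shape:
    """Disc of radius r centred at (cx, cy)."""
    return lambda x, y: (x - cx) ** 2 + (y - cy) ** 2 <= r ** 2


def square(cx: float, cy: float, half: float) -> Shape:
    """Axis-aligned square centred at (cx, cy) with half side length `half`."""
    return lambda x, y: abs(x - cx) <= half and abs(y - cy) <= half


def union(a: Shape, b: Shape) -> Shape:
    return lambda x, y: a(x, y) or b(x, y)


def intersection(a: Shape, b: Shape) -> Shape:
    return lambda x, y: a(x, y) and b(x, y)


def difference(a: Shape, b: Shape) -> Shape:
    """Points inside a but not inside b."""
    return lambda x, y: a(x, y) and not b(x, y)


# Hypothetical program: a square plate with a circular hole, plus a small disc.
program = union(
    difference(square(0, 0, 2), circle(0, 0, 1)),
    circle(3, 0, 0.5),
)

print(program(0, 0))      # query a point inside the hole
print(program(1.5, 1.5))  # query a point on the plate
print(program(3, 0))      # query a point inside the extra disc
```

Representing each shape as a membership test keeps the Boolean operations one line each; a 3D version would only change the primitives and the number of coordinates.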
3D scene graph inference and refinement for vision-as-inverse-graphics
The goal of scene understanding is to interpret images,
so as to infer the objects present in a scene, their poses
and fine-grained details. This thesis focuses on methods that
can provide a much more detailed explanation of the scene than
standard bounding-boxes or pixel-level segmentation - we infer
the underlying 3D scene given only its
projection in the form of a single image.
We employ the Vision-as-Inverse-Graphics (VIG) paradigm,
which (a) infers the latent variables of a scene such
as the objects present and their properties as well as the lighting
and the camera, and (b) renders these
latent variables to reconstruct the input image.
One highly attractive aspect of the VIG approach is that it produces
a compact and interpretable representation of the 3D scene in
terms of an arbitrary number of objects, called a 'scene graph'.
This representation is of key importance, as it
can be useful e.g. if we wish to edit, refine,
interpret the scene or interact with it.
First, we investigate how the recognition models can be used to infer
the scene graph given only a single RGB image. These models are
trained using realistic synthetic images and corresponding ground
truth scene graphs, obtained from a rich stochastic scene
generator. Once the objects have been detected, each object detection
is further processed using neural networks to predict
the object and global latent variables.
This allows computing of object poses
and sizes in 3D scene coordinates, given the camera parameters. This
inference of the latent variables in the form of a 3D scene graph acts
like the encoder of an autoencoder, with graphics
rendering as the decoder.
One of the major challenges is the problem of placing the
detected objects in 3D at a reasonable size and distance with
respect to the single camera, the parameters of
which are unknown. Previous VIG approaches for
multiple objects usually only considered a fixed camera,
while we allow for variable camera pose. To infer the camera
parameters given the votes cast by the detected objects,
we introduce a Probabilistic HoughNets framework for combining
probabilistic votes, robustified with an outlier model.
Each detection provides one noisy low-dimensional manifold
in the Hough space, and by intersecting them
probabilistically we reduce the uncertainty on the camera parameters.
Given an initialization of a scene graph, its refinement typically
involves computationally expensive and inefficient
search through the latent space. Since optimization of the 3D scene
corresponding to an image is a challenging task even for a few latent variables (LVs),
previous work for multi-object scenes considered only refinement of
the geometry, but not the appearance or illumination. To overcome this
issue, we develop a framework called 'Learning Direct Optimization'
(LiDO) for optimization of the latent variables of a multi-object
scene. Instead of minimizing an error metric that compares observed
image and the render, this optimization is driven by neural networks
that make use of the auto-context in the form of a current scene graph
and its render to predict the LV update.
Our experiments show that the LiDO method converges rapidly
as it does not need to perform a search on the error landscape,
produces better solutions than error-based competitors, and is able
to handle the mismatch between the data and the fitted scene model.
We apply LiDO to a realistic synthetic dataset, and show
that the method transfers to work well with real images.
The advantages of LiDO mean that it could be a critical component
in the development of future vision-as-inverse-graphics systems.
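The idea of combining probabilistic votes with an outlier model, mentioned in the abstract above, can be sketched in one dimension. This is a minimal sketch under strong assumptions, not the Probabilistic HoughNets framework itself: each detection is assumed to cast a Gaussian vote for a single scalar camera parameter, outliers are assumed uniform over the search range, and the function name `combine_votes` and all numeric values are hypothetical.

```python
import math


def combine_votes(votes, sigma, outlier_prob, lo, hi, steps=500):
    """Return the grid value in [lo, hi] that maximizes the summed log-likelihood.

    Each vote contributes a mixture of a Gaussian centred on the vote and a
    uniform outlier component, so a single bad vote cannot dominate the result.
    """
    uniform = 1.0 / (hi - lo)
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    best_x, best_score = lo, -math.inf
    for i in range(steps + 1):
        x = lo + (hi - lo) * i / steps
        score = 0.0
        for v in votes:
            gauss = norm * math.exp(-0.5 * ((x - v) / sigma) ** 2)
            score += math.log((1.0 - outlier_prob) * gauss + outlier_prob * uniform)
        if score > best_score:
            best_x, best_score = x, score
    return best_x


# Hypothetical votes for one camera parameter; the last vote is an outlier.
votes = [1.48, 1.52, 1.50, 1.49, 3.90]
estimate = combine_votes(votes, sigma=0.05, outlier_prob=0.1, lo=0.0, hi=5.0)
naive_mean = sum(votes) / len(votes)

print(estimate, naive_mean)
```

With these values the robust estimate stays close to the cluster of consistent votes, whereas the plain mean is pulled towards the outlier.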
Exploring Natural User Abstractions For Shared Perceptual Manipulator Task Modeling & Recovery
State-of-the-art domestic robot assistants are essentially autonomous mobile manipulators capable of exerting human-scale precision grasps. To maximize utility and economy, non-technical end-users would need to be nearly as efficient as trained roboticists in control and collaboration of manipulation task behaviors. However, this remains a significant challenge given that many WIMP-style tools require superficial proficiency in robotics, 3D graphics, and computer science for rapid task modeling and recovery. Research on robot-centric collaboration has nevertheless garnered momentum in recent years; robots are now planning in partially observable environments that maintain geometries and semantic maps, presenting opportunities for non-experts to cooperatively control task behavior with autonomous-planning agents exploiting the knowledge. However, as autonomous systems are not immune to errors under perceptual difficulty, a human-in-the-loop is needed to bias autonomous planning towards recovery conditions that resume the task and avoid similar errors. In this work, we explore interactive techniques allowing non-technical users to model task behaviors and perceive cooperatively with a service robot under robot-centric collaboration. We evaluate stylus and touch modalities with which users can intuitively and effectively convey natural abstractions of high-level tasks, semantic revisions, and geometries about the world. Experiments are conducted with 'pick-and-place' tasks in an ideal 'Blocks World' environment using a Kinova JACO six degree-of-freedom manipulator. Possibilities for the architecture and interface are demonstrated with the following features: (1) semantic 'Object' and 'Location' grounding that describe function and ambiguous geometries, (2) task specification with an unordered list of goal predicates, and (3) guiding task recovery with implied scene geometries and trajectory via symmetry cues and configuration space abstraction.
Empirical results from four user studies show that our interface was strongly preferred over the control condition, demonstrating high learnability and ease of use that enable our non-technical participants to model complex tasks, provide effective recovery assistance, and exercise teleoperative control.
An intelligent robotic vision system with environment perception
Ever since the dawn of computer vision [1, 2], 3D environment reconstruction and object 6D pose estimation have been core problems. This thesis attempts to develop a novel 3D intelligent robotic vision system integrating environment reconstruction and object detection techniques to solve practical problems. Chapter 2 reviews the current state of the art in 3D vision techniques for environment reconstruction and 6D pose estimation. In Chapter 3, a novel environment reconstruction system using coloured point clouds is proposed. The evaluation experiment indicates that the proposed Algorithm 2 is effective for small-scale, large-scale, and textureless scenes. Chapter 4 presents Image-6D (see section 4.2), a learning-based algorithm for object pose estimation from a single RGB image. Contour-alignment is introduced as an efficient algorithm for pose refinement in an RGB image. This new method is evaluated on two widely used benchmark image databases, LINEMOD and Occlusion-LINEMOD. Experiments show that the proposed method surpasses other state-of-the-art RGB-based prediction approaches. Chapter 5 describes Point-6D (defined in section 5.2), a novel 6D pose estimation method using coloured point clouds as input. The performance of this new method is demonstrated on the LineMOD [3] and YCB-Video [4] datasets. Chapter 6 summarizes contributions and discusses potential future research directions. In addition, we present an intelligent 3D robotic vision system deployed in a simulated/laboratory nuclear waste disposal scenario in Appendix B. To verify the results, a simulated nuclear waste handling experiment has been successfully completed using the proposed robotic system.