1,061 research outputs found
Grounding Symbols in Multi-Modal Instructions
As robots begin to cohabit with humans in semi-structured environments, the
need arises to understand instructions involving rich variability---for
instance, learning to ground symbols in the physical world. Realistically, this
task must cope with small datasets consisting of a particular users' contextual
assignment of meaning to terms. We present a method for processing a raw stream
of cross-modal input---i.e., linguistic instructions, visual perception of a
scene and a concurrent trace of 3D eye tracking fixations---to produce the
segmentation of objects with a correspondent association to high-level
concepts. To test our framework we present experiments in a table-top object
manipulation scenario. Our results show our model learns the user's notion of
colour and shape from a small number of physical demonstrations, generalising
to identifying physical referents for novel combinations of the words.Comment: 9 pages, 8 figures, To appear in the Proceedings of the ACL workshop
Language Grounding for Robotics, Vancouver, Canad
ilastik: interactive machine learning for (bio)image analysis
We present ilastik, an easy-to-use interactive tool that brings machine-learning-based (bio)image analysis to end users without substantial computational expertise. It contains pre-defined workflows for image segmentation, object classification, counting and tracking. Users adapt the workflows to the problem at hand by interactively providing sparse training annotations for a nonlinear classifier. ilastik can process data in up to five dimensions (3D, time and number of channels). Its computational back end runs operations on-demand wherever possible, allowing for interactive prediction on data larger than RAM. Once the classifiers are trained, ilastik workflows can be applied to new data from the command line without further user interaction. We describe all ilastik workflows in detail, including three
case studies and a discussion on the expected performance
Change blindness: eradication of gestalt strategies
Arrays of eight, texture-defined rectangles were used as stimuli in a one-shot change blindness (CB) task where there was a 50% chance that one rectangle would change orientation between two successive presentations separated by an interval. CB was eliminated by cueing the target rectangle in the first stimulus, reduced by cueing in the interval and unaffected by cueing in the second presentation. This supports the idea that a representation was formed that persisted through the interval before being 'overwritten' by the second presentation (Landman et al, 2003 Vision Research 43149–164]. Another possibility is that participants used some kind of grouping or Gestalt strategy. To test this we changed the spatial position of the rectangles in the second presentation by shifting them along imaginary spokes (by ±1 degree) emanating from the central fixation point. There was no significant difference seen in performance between this and the standard task [F(1,4)=2.565, p=0.185]. This may suggest two things: (i) Gestalt grouping is not used as a strategy in these tasks, and (ii) it gives further weight to the argument that objects may be stored and retrieved from a pre-attentional store during this task
Reinforcement Learning for Active Visual Perception
Visual perception refers to automatically recognizing, detecting, or otherwise sensing the content of an image, video or scene. The most common contemporary approach to tackle a visual perception task is by training a deep neural network on a pre-existing dataset which provides examples of task success and failure, respectively. Despite remarkable recent progress across a wide range of vision tasks, many standard methodologies are static in that they lack mechanisms for adapting to any particular settings or constraints of the task at hand. The ability to adapt is desirable in many practical scenarios, since the operating regime often differs from the training setup. For example, a robot which has learnt to recognize a static set of training images may perform poorly in real-world settings, where it may view objects from unusual angles or explore poorly illuminated environments. The robot should then ideally be able to actively position itself to observe the scene from viewpoints where it is more confident, or refine its perception with only a limited amount of training data for its present operating conditions.In this thesis we demonstrate how reinforcement learning (RL) can be integrated with three fundamental visual perception tasks -- object detection, human pose estimation, and semantic segmentation -- in order to make the resulting pipelines more adaptive, accurate and/or faster. In the first part we provide object detectors with the capacity to actively select what parts of a given image to analyze and when to terminate the detection process. Several ideas are proposed and empirically evaluated, such as explicitly including the speed-accuracy trade-off in the training process, which makes it possible to specify this trade-off during inference. In the second part we consider active multi-view 3d human pose estimation in complex scenarios with multiple people. We explore this in two different contexts: i) active triangulation, which requires carefully observing each body joint from multiple viewpoints, and ii) active viewpoint selection for monocular 3d estimators, which requires considering which viewpoints yield accurate fused estimates when combined. In both settings the viewpoint selection systems face several challenges, such as partial observability resulting e.g. from occlusions. We show that RL-based methods outperform heuristic ones in accuracy, with negligible computational overhead. Finally, the thesis concludes with establishing a framework for embodied visual active learning in the context of semantic segmentation, where an agent should explore a 3d environment and actively query annotations to refine its visual perception. Our empirical results suggest that reinforcement learning can be successfully applied within this framework as well
Data-Driven Grasp Synthesis - A Survey
We review the work on data-driven grasp synthesis and the methodologies for
sampling and ranking candidate grasps. We divide the approaches into three
groups based on whether they synthesize grasps for known, familiar or unknown
objects. This structure allows us to identify common object representations and
perceptual processes that facilitate the employed data-driven grasp synthesis
technique. In the case of known objects, we concentrate on the approaches that
are based on object recognition and pose estimation. In the case of familiar
objects, the techniques use some form of a similarity matching to a set of
previously encountered objects. Finally for the approaches dealing with unknown
objects, the core part is the extraction of specific features that are
indicative of good grasps. Our survey provides an overview of the different
methodologies and discusses open problems in the area of robot grasping. We
also draw a parallel to the classical approaches that rely on analytic
formulations.Comment: 20 pages, 30 Figures, submitted to IEEE Transactions on Robotic
Gaze Embeddings for Zero-Shot Image Classification
Zero-shot image classification using auxiliary information, such as
attributes describing discriminative object properties, requires time-consuming
annotation by domain experts. We instead propose a method that relies on human
gaze as auxiliary information, exploiting that even non-expert users have a
natural ability to judge class membership. We present a data collection
paradigm that involves a discrimination task to increase the information
content obtained from gaze data. Our method extracts discriminative descriptors
from the data and learns a compatibility function between image and gaze using
three novel gaze embeddings: Gaze Histograms (GH), Gaze Features with Grid
(GFG) and Gaze Features with Sequence (GFS). We introduce two new
gaze-annotated datasets for fine-grained image classification and show that
human gaze data is indeed class discriminative, provides a competitive
alternative to expert-annotated attributes, and outperforms other baselines for
zero-shot image classification
- …