    CompILE: Compositional Imitation Learning and Execution

    We introduce Compositional Imitation Learning and Execution (CompILE): a framework for learning reusable, variable-length segments of hierarchically-structured behavior from demonstration data. CompILE uses a novel unsupervised, fully-differentiable sequence segmentation module to learn latent encodings of sequential data that can be re-composed and executed to perform new tasks. Once trained, our model generalizes to sequences of longer length and from environment instances not seen during training. We evaluate CompILE in a challenging 2D multi-task environment and a continuous control task, and show that it can find correct task boundaries and event encodings in an unsupervised manner. Latent codes and associated behavior policies discovered by CompILE can be used by a hierarchical agent, where the high-level policy selects actions in the latent code space, and the low-level, task-specific policies are simply the learned decoders. We found that our CompILE-based agent could learn given only sparse rewards, where agents without task-specific policies struggle. Comment: ICML (2019).
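
    The encode-segment-decode idea above can be illustrated with a minimal sketch (module names and sizes are hypothetical; the paper's actual model uses continuous relaxations of discrete boundary and latent variables, which this toy replaces with simple soft boundary weights):

        # Toy soft-segmentation autoencoder over action sequences. Not the
        # authors' implementation; shapes and modules are illustrative.
        import torch
        import torch.nn as nn

        class SegmentAutoencoder(nn.Module):
            def __init__(self, n_actions, hidden=64, latent=8):
                super().__init__()
                self.embed = nn.Embedding(n_actions, hidden)
                self.encoder = nn.GRU(hidden, hidden, batch_first=True)
                self.boundary = nn.Linear(hidden, 1)   # soft boundary scores
                self.to_latent = nn.Linear(hidden, latent)
                self.decoder = nn.GRU(latent, hidden, batch_first=True)
                self.logits = nn.Linear(hidden, n_actions)

            def forward(self, actions):                # actions: (batch, time)
                h, _ = self.encoder(self.embed(actions))
                b = torch.sigmoid(self.boundary(h))    # boundary probabilities
                z = self.to_latent((b * h).sum(1) / b.sum(1).clamp(min=1e-6))
                out, _ = self.decoder(z.unsqueeze(1).repeat(1, actions.size(1), 1))
                return self.logits(out), b             # per-step action logits

        model = SegmentAutoencoder(n_actions=5)
        recon, boundaries = model(torch.randint(0, 5, (2, 10)))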

    Crowdsourcing in Computer Vision

    Computer vision systems require large amounts of manually annotated data to properly learn challenging visual concepts. Crowdsourcing platforms offer an inexpensive method to capture human knowledge and understanding for a vast number of visual perception tasks. In this survey, we describe the types of annotations computer vision researchers have collected using crowdsourcing, and how they have ensured that this data is of high quality while annotation effort is minimized. We begin by discussing data collection on both classic (e.g., object recognition) and recent (e.g., visual story-telling) vision tasks. We then summarize key design decisions for creating effective data collection interfaces and workflows, and present strategies for intelligently selecting the most important data instances to annotate. Finally, we conclude with some thoughts on the future of crowdsourcing in computer vision. Comment: A 69-page meta review of the field, Foundations and Trends in Computer Graphics and Vision, 2016.
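
    "Intelligently selecting the most important data instances to annotate" points at active learning; a minimal uncertainty-sampling sketch with scikit-learn (synthetic features and a hypothetical binary task, not a system from the survey):

        # Uncertainty sampling: send the images the current model is least
        # confident about to crowd annotators next. Data here is synthetic.
        import numpy as np
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(0)
        X_labeled = rng.normal(size=(100, 32))       # features of annotated images
        y_labeled = rng.integers(0, 2, size=100)     # crowdsourced labels
        X_pool = rng.normal(size=(1000, 32))         # not yet annotated

        model = LogisticRegression().fit(X_labeled, y_labeled)
        proba = model.predict_proba(X_pool)
        uncertainty = 1.0 - proba.max(axis=1)        # low confidence = informative
        to_annotate = np.argsort(uncertainty)[-10:]  # next batch for the crowd
        print(to_annotate)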

    Robust Temporally Coherent Laplacian Protrusion Segmentation of 3D Articulated Bodies

    In motion analysis and understanding it is important to be able to fit a suitable model or structure to the temporal series of observed data, in order to describe motion patterns in a compact way and to discriminate between them. In an unsupervised context, i.e., when no prior model of the moving object(s) is available, such a structure has to be learned from the data in a bottom-up fashion. In recent times, volumetric approaches, in which the motion is captured from a number of cameras and a voxel-set representation of the body is built from the camera views, have gained ground due to attractive features such as inherent view-invariance and robustness to occlusions. Automatic, unsupervised segmentation of moving bodies along entire sequences, in a temporally-coherent and robust way, has the potential to provide a means of constructing a bottom-up model of the moving body and of tracking motion cues that may later be exploited for motion classification. Spectral methods such as locally linear embedding (LLE) can be useful in this context, as they preserve "protrusions", i.e., high-curvature regions of the 3D volume of articulated shapes, while improving their separation in a lower-dimensional space, making them easier to cluster. In this paper we therefore propose a spectral approach to unsupervised and temporally-coherent body-protrusion segmentation along time sequences. Volumetric shapes are clustered in an embedding space, clusters are propagated in time to ensure coherence, and merged or split to accommodate changes in the body's topology. Experiments on both synthetic and real sequences of dense voxel-set data support the ability of the proposed method to cluster body parts consistently over time in a fully unsupervised fashion, its robustness to sampling density and shape quality, and its potential for bottom-up model construction. Comment: 31 pages, 26 figures.
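
    The embed-then-cluster step can be sketched in a few lines (synthetic stand-in for a voxel set; the paper's method additionally propagates, merges, and splits clusters over time):

        # Project voxel coordinates with locally linear embedding, where
        # protrusions separate better, then cluster into body parts.
        import numpy as np
        from sklearn.manifold import LocallyLinearEmbedding
        from sklearn.cluster import KMeans

        rng = np.random.default_rng(0)
        voxels = rng.normal(size=(500, 3))           # stand-in voxel-set body

        lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2)
        coords = lle.fit_transform(voxels)           # lower-dimensional embedding

        labels = KMeans(n_clusters=5, n_init=10).fit_predict(coords)
        print(np.bincount(labels))                   # voxels per body-part cluster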

    Which structures are out there? : Learning predictive compositional concepts based on social sensorimotor explorations

    How do we learn to think about our world in a flexible, compositional manner? What is the actual content of a particular thought? How do we become language ready? I argue that free energy-based inference processes, which determine the learning of predictive encodings, need to incorporate additional structural learning biases that reflect those structures of our world that are behaviorally relevant for us. In particular, I argue that the inference processes and thus the resulting predictive encodings should enable (i) the distinction of space from entities, with their perceptually and behaviorally relevant properties, (ii) the flexible, temporary activation of relative spatial relations between different entities, (iii) the dynamic adaptation of the involved, distinct encodings while executing, observing, or imagining particular interactions, and (iv) the development of a – probably motor-grounded – concept of forces, which predictively encodes the results of relative spatial and property manipulations dynamically over time. Furthermore, seeing that entity interactions typically have a beginning and an end, free energy-based inference should be additionally biased towards the segmentation of continuous sensorimotor interactions and sensory experiences into events and event boundaries. Thereby, events may be characterized by particular sets of active predictive encodings. Event boundaries, on the other hand, identify those situational aspects that are critical for the commencement or the termination of a particular event, such as the establishment of object contact and contact release. I argue that the development of predictive event encodings naturally leads to the development of conceptual encodings and the possibility of composing these encodings in a highly flexible, semantic manner. Behavior is generated by means of active inference. The addition of internal motivations in the form of homeostatic variables focuses our behavior – including attention and thought – on those environmental interactions that are motivationally relevant, thus continuously striving for internal homeostasis in a goal-directed manner. As a consequence, behavior focuses cognitive development towards (believed) bodily and cognitively (including socially) relevant aspects. The capacity to integrate tools and other humans into our minds, as well as the motivation to flexibly interact with them, seems to open up the possibility of assigning roles – such as actors, instruments, and recipients – when observing, executing, or imagining particular environmental interactions. Moreover, in conjunction with predictive event encodings, this tool- and socially-oriented mental flexibilization fosters perspective taking, reasoning, and other forms of mentalizing. Finally, I discuss how these structures and mechanisms are exactly those that seem necessary to make our minds language ready.

    CEST: a Cognitive Event based Semi-automatic Technique for behavior segmentation

    This work introduces CEST, a Cognitive Event based Semi-automatic Technique for behavior segmentation. The technique was inspired by an everyday cognitive process: humans make sense of what happens to them by breaking the continuous stream of activity into smaller units, through a process known as segmentation. A cognitive theory, the Event Segmentation Theory, provides a computational and neurophysiological account of this process, describing how the detection of changes in the current situation drives boundary perception. CEST was designed with the aim of providing affective researchers with a tool to semi-automatically segment behavior, since researchers investigating behavior often need to parse their research data into simpler units, either manually or automatically. To perform segmentation, the technique combines manual annotations with the output of change-point detection algorithms: methods from time-series analysis that detect abrupt changes in a signal. CEST is inherently multidisciplinary: it is, to the best of our knowledge, the first attempt to adopt a cognitive-science perspective on the issue of (semi-)automatic behavior segmentation. CEST is a general-purpose technique, as it aims at providing a tool for segmenting behavior across research areas. In this manuscript, we detail the theories behind the design of CEST and the results of two experimental studies aimed at assessing the feasibility of the approach in both single and group scenarios. Most importantly, we present the results of the evaluation of CEST on a dataset of dance performances. We explore seven different techniques for change-point detection that could be leveraged to achieve semi-automatic segmentation through CEST and illustrate how two different Bayesian algorithms led to the highest scores. Upon selecting the best algorithms, we measured the effect of the temporal grain of the analysis on performance. Overall, our results support the idea of a semi-automatic technique for behavior segmentation. The output of the analysis mirrors cognitive-science research on segmentation and on event structure perception. The work also tackles new challenges that may arise from our approach.
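
    The change-point step at the core of CEST can be sketched with the ruptures library (the study's best results used Bayesian detectors; this PELT example only illustrates how boundaries are proposed, on a synthetic 1-D behavioral signal):

        # Detect abrupt changes in a time-series as candidate event boundaries.
        import numpy as np
        import ruptures as rpt

        rng = np.random.default_rng(0)
        signal = np.concatenate([rng.normal(0, 1, 100),   # e.g., motion energy
                                 rng.normal(4, 1, 100),   # regime change
                                 rng.normal(1, 1, 100)])  # another change

        algo = rpt.Pelt(model="rbf").fit(signal)
        boundaries = algo.predict(pen=10)    # indices of proposed boundaries
        print(boundaries)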

    Unsupervised Discovery of Extreme Weather Events Using Universal Representations of Emergent Organization

    Spontaneous self-organization is ubiquitous in systems far from thermodynamic equilibrium. While the organized structures that emerge dominate transport properties, universal representations that identify and describe these key objects remain elusive. Here, we introduce a theoretically-grounded framework for describing emergent organization that, via data-driven algorithms, is constructive in practice. Its building blocks are spacetime lightcones that embody how information propagates across a system through local interactions. We show that predictive equivalence classes of lightcones -- local causal states -- capture organized behaviors and coherent structures in complex spatiotemporal systems. Employing an unsupervised physics-informed machine learning algorithm and a high-performance computing implementation, we demonstrate the automatic discovery of coherent structures in two real-world domain-science problems. We show that local causal states identify vortices and track their power-law decay behavior in two-dimensional fluid turbulence. We then show how to detect and track familiar extreme weather events -- hurricanes and atmospheric rivers -- and discover other novel coherent structures associated with precipitation extremes in high-resolution climate data at the grid-cell level.
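
    A toy version of the lightcone construction: collect the past lightcone of every spacetime point and cluster the resulting vectors (the actual framework groups pasts by their predicted future distributions; the KMeans shortcut here only approximates that):

        # Past lightcones on a 1-D spatial field evolving in time.
        import numpy as np
        from sklearn.cluster import KMeans

        rng = np.random.default_rng(0)
        field = rng.integers(0, 2, size=(200, 64))   # (time, space) field

        depth = 3                                    # lightcone depth
        cones = []
        for t in range(depth, field.shape[0]):
            for x in range(depth, field.shape[1] - depth):
                # sites that can influence (t, x) within `depth` time steps
                past = [field[t - d, x - d:x + d + 1] for d in range(1, depth + 1)]
                cones.append(np.concatenate(past))

        states = KMeans(n_clusters=4, n_init=10).fit_predict(np.array(cones))
        print(np.bincount(states))                   # spacetime points per state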

    Visual Question Answering: A Survey of Methods and Datasets

    Visual Question Answering (VQA) is a challenging task that has received increasing attention from both the computer vision and the natural language processing communities. Given an image and a question in natural language, it requires reasoning over visual elements of the image and general knowledge to infer the correct answer. In the first part of this survey, we examine the state of the art by comparing modern approaches to the problem. We classify methods by their mechanism to connect the visual and textual modalities. In particular, we examine the common approach of combining convolutional and recurrent neural networks to map images and questions to a common feature space. We also discuss memory-augmented and modular architectures that interface with structured knowledge bases. In the second part of this survey, we review the datasets available for training and evaluating VQA systems. The various datasets contain questions at different levels of complexity, which require different capabilities and types of reasoning. We examine in depth the question/answer pairs from the Visual Genome project, and evaluate the relevance of the structured annotations of images with scene graphs for VQA. Finally, we discuss promising future directions for the field, in particular the connection to structured knowledge bases and the use of natural language processing models. Comment: 25 pages.
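
    The common CNN+RNN baseline the survey describes (map image and question into a shared space, then classify over candidate answers) can be sketched as follows (all sizes and the fusion choice are illustrative, not any specific surveyed model):

        # Minimal joint-embedding VQA model: CNN image encoder, LSTM question
        # encoder, elementwise fusion, answer classifier.
        import torch
        import torch.nn as nn

        class SimpleVQA(nn.Module):
            def __init__(self, vocab=1000, n_answers=100, dim=256):
                super().__init__()
                self.cnn = nn.Sequential(            # stand-in image encoder
                    nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, dim))
                self.embed = nn.Embedding(vocab, dim)
                self.lstm = nn.LSTM(dim, dim, batch_first=True)
                self.classify = nn.Linear(dim, n_answers)

            def forward(self, image, question):
                v = self.cnn(image)                  # (batch, dim) image code
                _, (q, _) = self.lstm(self.embed(question))
                return self.classify(v * q.squeeze(0))  # fused answer logits

        model = SimpleVQA()
        logits = model(torch.randn(2, 3, 64, 64), torch.randint(0, 1000, (2, 12)))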