Learning Behavior-Grounded Event Segmentations
The event segmentation theory (EST) postulates that humans systematically segment the continuous sensorimotor information flow into events and event boundaries. The basis for the observed segmentation tendencies, however, remains largely unknown. We introduce a computational model that grounds EST in the interaction abilities of a system. The model learns events and event boundaries based on actively gathered sensorimotor signals. It segments the signals based on principles of probabilistic predictive coding and surprise. The implemented model essentially simulates, anticipates, and learns event progressions and event transitions online while interacting with the environment by means of dynamic, predictive Bayesian models. Besides the model's event segmentation capabilities, we show that the learned encodings can be used for higher-order planning. Moreover, the encodings systematically conceptualize environmental interactions and they help to identify the factors that are critical for ensuring interaction success.
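The surprise-driven segmentation the abstract describes can be illustrated with a minimal sketch. This is not the authors' Bayesian model: here an exponential-moving-average predictor stands in for the learned event models, and the signal, smoothing rate, and threshold are all illustrative assumptions.

```python
import numpy as np

def segment_by_surprise(signal, alpha=0.2, threshold=1.0):
    """Segment a 1-D signal into events by thresholding prediction error.

    While prediction error ("surprise") stays low, the current event
    model is kept and refined; a spike in surprise marks an event
    boundary and a new event model is started.
    """
    prediction = signal[0]
    boundaries = []
    for t, x in enumerate(signal[1:], start=1):
        surprise = (x - prediction) ** 2
        if surprise > threshold:
            boundaries.append(t)      # boundary: current model failed to predict
            prediction = x            # reset: begin a new event model
        else:
            prediction += alpha * (x - prediction)  # refine current model
    return boundaries

# Two flat "events" joined by a jump: the jump is detected as a boundary.
signal = np.concatenate([np.zeros(50), 5 * np.ones(50)])
print(segment_by_surprise(signal))  # [50]
```

The same loop structure carries over to the probabilistic case by replacing the point predictor with a predictive distribution and the squared error with its negative log-likelihood.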
CompILE: Compositional Imitation Learning and Execution
We introduce Compositional Imitation Learning and Execution (CompILE): a
framework for learning reusable, variable-length segments of
hierarchically-structured behavior from demonstration data. CompILE uses a
novel unsupervised, fully-differentiable sequence segmentation module to learn
latent encodings of sequential data that can be re-composed and executed to
perform new tasks. Once trained, our model generalizes to sequences of longer
length and from environment instances not seen during training. We evaluate
CompILE in a challenging 2D multi-task environment and a continuous control
task, and show that it can find correct task boundaries and event encodings in
an unsupervised manner. Latent codes and associated behavior policies
discovered by CompILE can be used by a hierarchical agent, where the high-level
policy selects actions in the latent code space, and the low-level,
task-specific policies are simply the learned decoders. We found that our
CompILE-based agent could learn given only sparse rewards, where agents without
task-specific policies struggle. Comment: ICML 2019
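The hierarchical-agent setup the abstract describes can be caricatured in a few lines. Everything below (the two decoders, the high-level selection rule, the scalar observations) is a hypothetical stand-in for CompILE's learned components, shown only to make the control flow concrete: the high-level policy acts in the space of latent codes, and each code indexes a low-level decoder that produces actions.

```python
# Hypothetical stand-ins for learned skills: each latent code indexes a
# decoder (low-level policy) mapping observations to next observations.
decoders = {
    0: lambda obs: obs + 1,   # e.g. a learned "move right" skill
    1: lambda obs: obs - 1,   # e.g. a learned "move left" skill
}

def high_level_policy(goal, obs):
    """Pick a latent code; a trivial rule replaces the learned policy."""
    return 0 if goal > obs else 1

def act(goal, obs, steps=10):
    for _ in range(steps):
        code = high_level_policy(goal, obs)   # action in latent-code space
        obs = decoders[code](obs)             # low-level decoder executes
        if obs == goal:
            break
    return obs

print(act(goal=3, obs=0))  # 3
```

In the actual framework both levels are trained, and the decoders are the segment decoders recovered by the unsupervised segmentation module.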
Crowdsourcing in Computer Vision
Computer vision systems require large amounts of manually annotated data to
properly learn challenging visual concepts. Crowdsourcing platforms offer an
inexpensive method to capture human knowledge and understanding, for a vast
number of visual perception tasks. In this survey, we describe the types of
annotations computer vision researchers have collected using crowdsourcing, and
how they have ensured that this data is of high quality while annotation effort
is minimized. We begin by discussing data collection on both classic (e.g.,
object recognition) and recent (e.g., visual story-telling) vision tasks. We
then summarize key design decisions for creating effective data collection
interfaces and workflows, and present strategies for intelligently selecting
the most important data instances to annotate. Finally, we conclude with some
thoughts on the future of crowdsourcing in computer vision. Comment: A 69-page meta review of the field, Foundations and Trends in Computer Graphics and Vision, 201
Robust Temporally Coherent Laplacian Protrusion Segmentation of 3D Articulated Bodies
In motion analysis and understanding it is important to be able to fit a
suitable model or structure to the temporal series of observed data, in order
to describe motion patterns in a compact way, and to discriminate between them.
In an unsupervised context, i.e., no prior model of the moving object(s) is
available, such a structure has to be learned from the data in a bottom-up
fashion. In recent times, volumetric approaches in which the motion is captured
from a number of cameras and a voxel-set representation of the body is built
from the camera views, have gained ground due to attractive features such as
inherent view-invariance and robustness to occlusions. Automatic, unsupervised
segmentation of moving bodies along entire sequences, in a temporally-coherent
and robust way, has the potential to provide a means of constructing a
bottom-up model of the moving body, and track motion cues that may be later
exploited for motion classification. Spectral methods such as locally linear
embedding (LLE) can be useful in this context, as they preserve "protrusions",
i.e., high-curvature regions of the 3D volume, of articulated shapes, while
improving their separation in a lower dimensional space, making them in this
way easier to cluster. In this paper we therefore propose a spectral approach
to unsupervised and temporally-coherent body-protrusion segmentation along time
sequences. Volumetric shapes are clustered in an embedding space, clusters are
propagated in time to ensure coherence, and merged or split to accommodate
changes in the body's topology. Experiments on both synthetic and real
sequences of dense voxel-set data are shown. This supports the ability of the
proposed method to cluster body-parts consistently over time in a totally
unsupervised fashion, its robustness to sampling density and shape quality, and
its potential for bottom-up model construction. Comment: 31 pages, 26 figures
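The spectral step can be sketched with a minimal stand-in: instead of LLE, the code below bipartitions a synthetic 3D point set using the Fiedler vector (the second eigenvector of the graph Laplacian) of a Gaussian affinity graph. The point clouds, bandwidth, and two-way split are illustrative assumptions; the paper's pipeline embeds voxel sets with LLE and clusters them in the embedding space.

```python
import numpy as np

def spectral_bipartition(points, sigma=1.0):
    """Split a point set in two via the Fiedler vector of its affinity graph.

    Embed the points spectrally, then cluster in the embedding
    (here: by the sign of the Laplacian's second eigenvector).
    """
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))          # Gaussian affinity
    L = np.diag(W.sum(1)) - W                   # unnormalized graph Laplacian
    vals, vecs = np.linalg.eigh(L)              # eigenvalues in ascending order
    fiedler = vecs[:, 1]                        # second-smallest eigenvalue's vector
    return (fiedler > 0).astype(int)

# Two well-separated "protrusions" of a synthetic voxel set.
rng = np.random.default_rng(0)
limb_a = rng.normal([0, 0, 0], 0.3, size=(40, 3))
limb_b = rng.normal([5, 0, 0], 0.3, size=(40, 3))
labels = spectral_bipartition(np.vstack([limb_a, limb_b]))
print(labels[:40].std() == 0 and labels[40:].std() == 0)  # True: each limb is one cluster
```

A k-way segmentation would use the first k nontrivial eigenvectors followed by k-means, and the paper additionally propagates, merges, and splits clusters over time for temporal coherence.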
Which structures are out there? : Learning predictive compositional concepts based on social sensorimotor explorations
How do we learn to think about our world in a flexible, compositional manner? What is the actual content of a particular thought? How do we become language ready? I argue that free energy-based inference processes, which determine the learning of predictive encodings, need to incorporate additional structural learning biases that reflect those structures of our world that are behaviorally relevant for us. In particular, I argue that the inference processes and thus the resulting predictive encodings should enable (i) the distinction of space from entities, with their perceptually and behaviorally relevant properties, (ii) the flexible, temporary activation of relative spatial relations between different entities, (iii) the dynamic adaptation of the involved, distinct encodings while executing, observing, or imagining particular interactions, and (iv) the development of a – probably motor-grounded – concept of forces, which predictively encodes the results of relative spatial and property manipulations dynamically over time. Furthermore, seeing that entity interactions typically have a beginning and an end, free energy-based inference should be additionally biased towards the segmentation of continuous sensorimotor interactions and sensory experiences into events and event boundaries. Thereby, events may be characterized by particular sets of active predictive encodings. Event boundaries, on the other hand, identify those situational aspects that are critical for the commencement or the termination of a particular event, such as the establishment of object contact and contact release. I argue that the development of predictive event encodings naturally leads to the development of conceptual encodings and the possibility of composing these encodings in a highly flexible, semantic manner. Behavior is generated by means of active inference.
The addition of internal motivations in the form of homeostatic variables focuses our behavior – including attention and thought – on those environmental interactions that are motivationally relevant, thus continuously striving for internal homeostasis in a goal-directed manner. As a consequence, behavior focuses cognitive development towards (believed) bodily and cognitively (including socially) relevant aspects. The capacity to integrate tools and other humans into our minds, as well as the motivation to flexibly interact with them, seem to open up the possibility of assigning roles – such as actors, instruments, and recipients – when observing, executing, or imagining particular environmental interactions. Moreover, in conjunction with predictive event encodings, this tool- and socially-oriented mental flexibilization fosters perspective taking, reasoning, and other forms of mentalizing. Finally, I discuss how these structures and mechanisms are exactly those that seem necessary to make our minds language ready.
CEST: a Cognitive Event based Semi-automatic Technique for behavior segmentation
This work introduces CEST, a Cognitive Event based Semi-automatic Technique for behavior segmentation. The technique was inspired by an everyday cognitive process. Humans, in fact, make sense of what happens to them by breaking the continuous stream of activity into smaller units, through a process known as segmentation. A cognitive theory, the Event Segmentation Theory, provides a computational and neurophysiological account of this process, describing how the detection of changes in the current situation drives boundary perception. CEST was designed with the aim of providing affective researchers with a tool to semi-automatically segment behavior. Researchers investigating behavior, as a matter of fact, often need to parse their research data into simpler units, either manually or automatically. To perform segmentation, the technique combines manual annotations and the output of change-point detection algorithms, techniques from time-series research that afford the detection of abrupt changes in time series. CEST is inherently multidisciplinary: it is, to the best of our knowledge, the first attempt to adopt a cognitive science perspective on the issue of (semi-)automatic behavior segmentation. CEST is a general-purpose technique, as it aims at providing a tool for segmenting behavior across research areas. In this manuscript, we detail the theories behind the design of CEST and the results of two experimental studies aimed at assessing the feasibility of the approach on both single and group scenarios. Most importantly, we present the results of the evaluation of CEST on a dataset of dance performances. We explore seven different techniques for change-point detection that could be leveraged to achieve semi-automatic segmentation through CEST and illustrate how two different Bayesian algorithms led to the highest scores. Upon selecting the best algorithms, we measured the effect of the temporal grain of the analysis on the performance.
Overall, our results support the idea of a semi-automatic technique for behavior segmentation. The output of the analysis mirrors cognitive science research on segmentation and on event structure perception. The work also tackles new challenges that may arise from our approach.
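Change-point detection on a behavior stream can be sketched with a deliberately simple detector. This is a generic sliding-window method, not one of the Bayesian algorithms evaluated in the paper; the window size, threshold, and synthetic stream are illustrative assumptions.

```python
import numpy as np

def window_changepoints(series, w=20, threshold=2.0):
    """Flag abrupt changes by comparing means of adjacent sliding windows.

    A point is a candidate boundary when the mean of the window before it
    and the window after it differ by more than `threshold` pooled
    standard deviations.
    """
    points = []
    t = w
    while t <= len(series) - w:
        left, right = series[t - w:t], series[t:t + w]
        pooled = np.sqrt((left.var() + right.var()) / 2) + 1e-9
        if abs(right.mean() - left.mean()) > threshold * pooled:
            points.append(t)
            t += w          # skip ahead so one change is not reported twice
        else:
            t += 1
    return points

# Synthetic behavior stream with a regime change at t = 100.
rng = np.random.default_rng(1)
stream = np.concatenate([rng.normal(0, 1, 100), rng.normal(4, 1, 100)])
print(window_changepoints(stream))
```

In CEST the detected points are then reconciled with manual annotations, and the temporal grain of the analysis (here, the window size) is itself a tunable parameter of the evaluation.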
Unsupervised Discovery of Extreme Weather Events Using Universal Representations of Emergent Organization
Spontaneous self-organization is ubiquitous in systems far from thermodynamic
equilibrium. While organized structures that emerge dominate transport
properties, universal representations that identify and describe these key
objects remain elusive. Here, we introduce a theoretically-grounded framework
for describing emergent organization that, via data-driven algorithms, is
constructive in practice. Its building blocks are spacetime lightcones that
embody how information propagates across a system through local interactions.
We show that predictive equivalence classes of lightcones -- local causal
states -- capture organized behaviors and coherent structures in complex
spatiotemporal systems. Employing an unsupervised physics-informed machine
learning algorithm and a high-performance computing implementation, we
demonstrate automatically discovering coherent structures in two real world
domain science problems. We show that local causal states identify vortices and
track their power-law decay behavior in two-dimensional fluid turbulence. We
then show how to detect and track familiar extreme weather events -- hurricanes
and atmospheric rivers -- and discover other novel coherent structures
associated with precipitation extremes in high-resolution climate data at the
grid-cell level.
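A toy version of the local-causal-state construction can be run on a deterministic 1-D cellular automaton instead of climate data. Everything here is a simplification for illustration: lightcones are truncated to depth 1, the future reduces to the single site the past window determines, and pasts are merged by exact equality of their observed futures rather than by the statistical clustering the paper uses.

```python
import numpy as np
from collections import defaultdict

def evolve_rule90(row, steps):
    """Spacetime field of elementary CA rule 90 (XOR of the two neighbors)."""
    rows = [row]
    for _ in range(steps):
        rows.append(np.roll(rows[-1], 1) ^ np.roll(rows[-1], -1))
    return np.array(rows)

def local_causal_states(field):
    """Toy local causal states on a 1-D spacetime field.

    Each site's depth-1 past window is mapped to the set of next-step
    values observed after it; pasts with identical future sets are merged
    into one causal state (predictive equivalence class).
    """
    T, X = field.shape
    futures = defaultdict(set)
    for t in range(T - 1):
        for x in range(1, X - 1):
            past = tuple(field[t, x - 1:x + 2])    # depth-1 past window
            futures[past].add(field[t + 1, x])     # what it leads to
    states = defaultdict(list)
    for past, fut in futures.items():
        states[frozenset(fut)].append(past)        # merge predictively equivalent pasts
    return list(states.values())

field = evolve_rule90(np.array([0, 0, 0, 1, 0, 0, 0, 0]), steps=10)
print(len(local_causal_states(field)))  # 2: pasts predicting 0 vs. pasts predicting 1
```

For rule 90 the next value is the XOR of the outer neighbors, so the observed pasts collapse into exactly two predictive classes; on stochastic spatiotemporal data the equality test is replaced by a distributional comparison and the resulting states track coherent structures.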
Visual Question Answering: A Survey of Methods and Datasets
Visual Question Answering (VQA) is a challenging task that has received
increasing attention from both the computer vision and the natural language
processing communities. Given an image and a question in natural language, it
requires reasoning over visual elements of the image and general knowledge to
infer the correct answer. In the first part of this survey, we examine the
state of the art by comparing modern approaches to the problem. We classify
methods by their mechanism to connect the visual and textual modalities. In
particular, we examine the common approach of combining convolutional and
recurrent neural networks to map images and questions to a common feature
space. We also discuss memory-augmented and modular architectures that
interface with structured knowledge bases. In the second part of this survey,
we review the datasets available for training and evaluating VQA systems. The
various datasets contain questions at different levels of complexity, which
require different capabilities and types of reasoning. We examine in depth the
question/answer pairs from the Visual Genome project, and evaluate the
relevance of the structured annotations of images with scene graphs for VQA.
Finally, we discuss promising future directions for the field, in particular
the connection to structured knowledge bases and the use of natural language
processing models. Comment: 25 pages
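The common joint-embedding approach the survey describes can be sketched in a few lines of numpy. All dimensions, weights, and the elementwise-product fusion below are illustrative assumptions (the weights are random and untrained): CNN image features and RNN question features are projected into a shared space, fused, and scored against candidate answers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: CNN image features, RNN question features,
# shared joint-embedding space, and a closed answer vocabulary.
D_IMG, D_TXT, D_JOINT, N_ANSWERS = 2048, 512, 256, 10

W_img = rng.normal(0, 0.02, (D_IMG, D_JOINT))    # image projection
W_txt = rng.normal(0, 0.02, (D_TXT, D_JOINT))    # question projection
W_out = rng.normal(0, 0.02, (D_JOINT, N_ANSWERS))

def vqa_scores(img_feat, q_feat):
    """Project both modalities to the joint space, fuse by elementwise
    product, and score candidate answers (untrained weights, so only
    the shapes are meaningful here)."""
    joint = np.tanh(img_feat @ W_img) * np.tanh(q_feat @ W_txt)
    return joint @ W_out

scores = vqa_scores(rng.normal(size=D_IMG), rng.normal(size=D_TXT))
print(scores.shape)  # (10,)
```

Memory-augmented and modular variants replace the single fusion step with attention over external knowledge or with composed specialized modules, but the shared-space projection pattern above is the baseline most surveyed methods build on.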