Statistical Analysis of Dynamic Actions
Real-world action recognition applications require systems that are fast, can handle a large variety of actions without a priori knowledge of their type, need a minimal number of parameters, and require as short a learning stage as possible. In this paper, we suggest such an approach. We regard dynamic activities as long-term temporal objects, characterized by spatio-temporal features at multiple temporal scales. Based on this, we design a simple statistical distance measure between video sequences that captures the similarities in their behavioral content. This measure is nonparametric and can thus handle a wide range of complex dynamic actions. Having a behavior-based distance measure between sequences, we use it for a variety of tasks, including video indexing, temporal segmentation, and action-based video clustering. These tasks are performed without prior knowledge of the types of actions, their models, or their temporal extents.
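A minimal sketch of such a nonparametric, behavior-based distance: each sequence is summarized by an empirical histogram of frame-to-frame intensity changes (a crude single-scale stand-in for the paper's multi-scale spatio-temporal features), and histograms are compared with a chi-square distance. The function names and the single-scale feature are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def temporal_gradient_histogram(video, bins=16):
    """Summarize a video (T, H, W array) by the empirical distribution
    of its frame-to-frame absolute intensity changes.  A crude stand-in
    (assumption) for multi-scale spatio-temporal features."""
    diffs = np.abs(np.diff(video.astype(float), axis=0))
    hist, _ = np.histogram(diffs, bins=bins, range=(0, 255), density=True)
    return hist + 1e-8          # avoid empty bins in the chi-square ratio

def chi_square_distance(h1, h2):
    """Nonparametric distance between two behavior histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2))

rng = np.random.default_rng(0)
noisy = rng.integers(0, 255, size=(20, 32, 32))                       # strong dynamics
still = np.repeat(rng.integers(0, 255, size=(1, 32, 32)), 20, axis=0)  # no dynamics

d_same = chi_square_distance(temporal_gradient_histogram(still),
                             temporal_gradient_histogram(still))
d_diff = chi_square_distance(temporal_gradient_histogram(noisy),
                             temporal_gradient_histogram(still))
print(d_same, d_diff)  # identical behavior -> 0; different dynamics -> larger
```

Because the measure compares empirical distributions rather than fitted model parameters, it needs no prior knowledge of the action type, which is the property the abstract emphasizes.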
Robot Learning and Execution of Collaborative Manipulation Plans from YouTube Cooking Videos
People often watch videos on the web to learn how to cook new recipes, assemble furniture, or repair a computer. We wish to enable robots with the very same capability. This is challenging: there is large variation in manipulation actions, and some videos even involve multiple persons who collaborate by sharing and exchanging objects and tools. Furthermore, the learned representations need to be general enough to be transferable to robotic systems. On the other hand, previous work has shown that the space of human manipulation actions has a linguistic, hierarchical structure that relates actions to manipulated objects and tools. Building upon this theory of language for action, we propose a framework for understanding and executing demonstrated action sequences from full-length, unconstrained cooking videos on the web. The framework takes as input a cooking video annotated with object labels and bounding boxes, and outputs a collaborative manipulation action plan for one or more robotic arms. We demonstrate the performance of the system on a standardized dataset of 100 YouTube cooking videos, as well as on three full-length YouTube videos that include collaborative actions between two participants. We additionally propose an open-source platform for executing the learned plans both in a simulation environment and with an actual robotic arm.
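To make the input/output contract concrete, here is a minimal sketch of turning per-frame annotations into an ordered manipulation plan, with each person mapped to one arm. The `Step` record, field names, and the frame-merging rule are hypothetical simplifications, not the paper's actual schema or grammar-based representation.

```python
from dataclasses import dataclass

@dataclass
class Step:
    action: str     # e.g. "grasp", "pour" (illustrative labels)
    obj: str        # manipulated object label from the annotation
    arm: int        # robotic arm assigned to the person performing the step

def to_plan(annotations):
    """Collapse per-frame (action, object, person) annotations into an
    ordered plan, merging consecutive frames of the same action.
    A hypothetical stand-in for the paper's plan-extraction stage."""
    plan, last = [], None
    for action, obj, person in annotations:
        key = (action, obj, person)
        if key != last:                 # new step only when the triple changes
            plan.append(Step(action, obj, arm=person))
            last = key
    return plan

frames = [("grasp", "bowl", 0), ("grasp", "bowl", 0),
          ("pour", "bowl", 0), ("grasp", "spoon", 1),
          ("stir", "bowl", 1)]
plan = to_plan(frames)
print([(s.action, s.obj, s.arm) for s in plan])
# -> [('grasp', 'bowl', 0), ('pour', 'bowl', 0), ('grasp', 'spoon', 1), ('stir', 'bowl', 1)]
```

Keeping the person index on each step is what lets a single plan drive two arms for the collaborative videos mentioned in the abstract.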
Modeling Dynamic Swarms
This paper proposes the problem of modeling video sequences of dynamic swarms (DS). We define a DS as a large layout of stochastically repetitive spatial configurations of dynamic objects (swarm elements) whose motions exhibit local spatiotemporal interdependency and stationarity, i.e., the motions are similar in any small spatiotemporal neighborhood. Examples of DS abound in nature, e.g., herds of animals and flocks of birds. To capture the local spatiotemporal properties of the DS, we present a probabilistic model that learns both the spatial layout of swarm elements and their joint dynamics, which are modeled as linear transformations. To this end, a spatiotemporal neighborhood is associated with each swarm element, in which local stationarity is enforced both spatially and temporally. We assume that the prior on the swarm dynamics is distributed according to an MRF in both space and time. Embedding this model in a MAP framework, we iterate between learning the spatial layout of the swarm and its dynamics. We learn the swarm transformations using iterated conditional modes (ICM), which alternates between estimating these transformations and updating their distribution in the spatiotemporal neighborhoods. We demonstrate the validity of our method through experiments on real video sequences of birds, geese, robot swarms, and pedestrians, which confirm the applicability of our model to real-world data.
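The ICM-style update can be illustrated on a toy quadratic energy in which each element's motion parameter must both fit its observation and agree with its neighbors, a deliberately simplified stand-in (assumption) for the paper's MRF prior on linear transformations.

```python
import numpy as np

def icm_smooth(observed, neighbors, lam=2.0, iters=20):
    """ICM-style coordinate updates on per-element motion parameters t_i,
    minimizing  sum_i (t_i - d_i)^2 + lam * sum_{j in N(i)} (t_i - t_j)^2.
    Each update is the closed-form minimizer of the local conditional
    energy given the current neighbor values (the essence of ICM)."""
    t = observed.astype(float).copy()
    for _ in range(iters):
        for i in range(len(t)):
            nb = neighbors[i]
            t[i] = (observed[i] + lam * sum(t[j] for j in nb)) / (1 + lam * len(nb))
    return t

# a chain of 5 swarm elements; element 2 has a noisy outlier observation
d = np.array([1.0, 1.0, 5.0, 1.0, 1.0])
nbrs = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
t = icm_smooth(d, nbrs)
print(t)  # the outlier is pulled toward its spatiotemporal neighbors
```

The local-stationarity prior shows up here as the `lam`-weighted neighbor term: the outlier's estimate is dragged toward its neighborhood, exactly the smoothing effect the MRF prior is meant to provide.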
Across space and time: infants learn from backward and forward visual statistics
This study asked whether infants learn from backward and forward visual statistics within temporal and spatial visual streams. Two groups of 8-month-old infants were familiarized with an artificial grammar of shapes, comprised of backward and forward base pairs (i.e., two shapes linked by strong backward or forward transitional probability) and part-pairs (i.e., two shapes with weak transitional probabilities in both directions). One group viewed the continuous visual stream as a temporal sequence, while the other group viewed the same stream as a spatial array. Following familiarization, infants looked longer at test trials containing part-pairs than base pairs, even though the two had appeared with equal frequency during familiarization. This pattern of looking time was evident for both forward and backward pairs, in both the temporal and spatial conditions. Further, differences in looking time to part-pairs that were consistent or inconsistent with the predictive direction of the base pairs (forward or backward) indicated that infants were indeed sensitive to direction when presented with temporal sequences, but not when presented with spatial arrays. These results suggest that visual statistical learning is flexible in infancy and depends on the nature of the visual input.
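The distinction between base pairs and part-pairs can be made concrete by computing forward and backward transitional probabilities over adjacent items in a stream. The toy stream below is an invented illustration, not the study's actual stimulus set.

```python
from collections import Counter

def transitional_probabilities(stream):
    """Forward TP(A->B) = P(B | A) and backward TP(A->B) = P(A | B),
    estimated from adjacent pairs in a familiarization stream."""
    pairs = list(zip(stream, stream[1:]))
    pair_n = Counter(pairs)
    first_n = Counter(a for a, _ in pairs)    # counts of pair-initial items
    second_n = Counter(b for _, b in pairs)   # counts of pair-final items
    fwd = {p: n / first_n[p[0]] for p, n in pair_n.items()}
    bwd = {p: n / second_n[p[1]] for p, n in pair_n.items()}
    return fwd, bwd

# toy stream: 'AB' is a forward base pair (B always follows A),
# while 'BC' is a part-pair spanning a pair boundary
stream = list("ABCABDABCABD")
fwd, bwd = transitional_probabilities(stream)
print(fwd[("A", "B")])   # 1.0 -- strong base pair
print(fwd[("B", "C")])   # 0.5 -- weak part-pair
```

The looking-time result in the abstract corresponds to infants discriminating exactly this statistical contrast, even when pair frequencies are matched.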
Segmentation, Recognition, and Alignment of Collaborative Group Motion
Modeling and recognition of human motion in videos has broad applications in behavioral biometrics, content-based visual data analysis, security and surveillance, and the design of interactive environments. Significant progress has been made in the past two decades by way of new models, methods, and implementations. In this dissertation, we focus our attention on a relatively less investigated sub-area called collaborative group motion analysis. Collaborative group motions are those that typically involve multiple objects, wherein the motion patterns of individual objects may vary significantly in both space and time, but the collective motion pattern of the ensemble allows characterization in terms of geometry and statistics. Therefore, the motions or activities of an individual object constitute local information. A framework that synthesizes all local information into a holistic view, and explicitly characterizes interactions among objects, involves large-scale global reasoning and is of significant complexity. In this dissertation, we first review relevant previous contributions to human motion/activity modeling and recognition, and then propose several approaches to answer a sequence of traditional vision questions: 1) which of the motion elements are relevant to a group motion pattern of interest (Segmentation); 2) what the underlying motion pattern is (Recognition); and 3) how similar two motion ensembles are, and how one can be 'optimally' transformed to match the other (Alignment). Our primary practical scenario is American football plays, where the corresponding problems are 1) who the offensive players are; 2) what offensive strategy they are using; and 3) whether two plays use the same strategy, and how to remove the spatio-temporal misalignment between them due to internal or external factors.
The proposed approaches discard the traditional modeling paradigm and instead explore concise descriptors, hierarchies, stochastic mechanisms, or compact generative models to achieve both effectiveness and efficiency.
In particular, the intrinsic geometry of the spaces of the involved features/descriptors/quantities is exploited, and statistical tools are established on these nonlinear manifolds. These initial attempts have identified new challenging problems in complex motion analysis, as well as in more general tasks in video dynamics. The insights gained from nonlinear geometric modeling and analysis in this dissertation may prove useful for a broader class of computer vision applications.
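The Alignment question, removing temporal misalignment between two plays, is commonly approached with dynamic time warping as a baseline; the sketch below uses DTW on 1-D trajectories purely as an illustration, whereas the dissertation itself works with richer manifold-valued features that this toy does not model.

```python
import numpy as np

def dtw(a, b):
    """Classic dynamic-time-warping cost between two 1-D trajectories:
    the minimum cumulative |a_i - b_j| along any monotone warping path."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

play = [0.0, 1.0, 2.0, 3.0]
same_slow = [0.0, 0.0, 1.0, 1.0, 2.0, 3.0]   # same play executed more slowly
other = [3.0, 2.0, 1.0, 0.0]                  # a different strategy
print(dtw(play, same_slow), dtw(play, other))
```

Warping absorbs tempo differences (the "internal factors" of the abstract), so two executions of the same strategy align at low cost while different strategies do not.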