Joint Video and Text Parsing for Understanding Events and Answering Queries
We propose a framework for parsing video and text jointly for understanding
events and answering user queries. Our framework produces a parse graph that
represents the compositional structures of spatial information (objects and
scenes), temporal information (actions and events) and causal information
(causalities between events and fluents) in the video and text. The knowledge
representation of our framework is based on a spatial-temporal-causal And-Or
graph (S/T/C-AOG), which jointly models possible hierarchical compositions of
objects, scenes and events as well as their interactions and mutual contexts,
and specifies the prior probabilistic distribution of the parse graphs. We
present a probabilistic generative model for joint parsing that captures the
relations between the input video/text, their corresponding parse graphs and
the joint parse graph. Based on the probabilistic model, we propose a joint
parsing system consisting of three modules: video parsing, text parsing and
joint inference. Video parsing and text parsing produce two parse graphs from
the input video and text respectively. The joint inference module produces a
joint parse graph by performing matching, deduction and revision on the video
and text parse graphs. The proposed framework has the following objectives:
Firstly, we aim at deep semantic parsing of video and text that goes beyond the
traditional bag-of-words approaches; Secondly, we perform parsing and reasoning
across the spatial, temporal and causal dimensions based on the joint S/T/C-AOG
representation; Thirdly, we show that deep joint parsing facilitates subsequent
applications such as generating narrative text descriptions and answering
queries in the forms of who, what, when, where and why. We empirically
evaluated our system based on comparison against ground-truth as well as
accuracy of query answering and obtained satisfactory results.
Learning Social Affordance Grammar from Videos: Transferring Human Interactions to Human-Robot Interactions
In this paper, we present a general framework for learning social affordance
grammar as a spatiotemporal AND-OR graph (ST-AOG) from RGB-D videos of human
interactions, and transfer the grammar to humanoids to enable real-time
motion inference for human-robot interaction (HRI). Based on Gibbs sampling,
our weakly supervised grammar learning can automatically construct a
hierarchical representation of an interaction with long-term joint sub-tasks of
both agents and short-term atomic actions of individual agents. Based on a new
RGB-D video dataset with rich instances of human interactions, our experiments
with Baxter simulation, human evaluation, and a real Baxter test demonstrate that
the model learned from limited training data successfully generates human-like
behaviors in unseen scenarios and outperforms both baselines.
Comment: The 2017 IEEE International Conference on Robotics and Automation
(ICRA)
Predicting Deeper into the Future of Semantic Segmentation
The ability to predict and therefore to anticipate the future is an important
attribute of intelligence. It is also of utmost importance in real-time
systems, e.g. in robotics or autonomous driving, which depend on visual scene
understanding for decision making. While prediction of the raw RGB pixel values
in future video frames has been studied in previous work, here we introduce the
novel task of predicting semantic segmentations of future frames. Given a
sequence of video frames, our goal is to predict segmentation maps of not yet
observed video frames that lie up to a second or further in the future. We
develop an autoregressive convolutional neural network that learns to
iteratively generate multiple frames. Our results on the Cityscapes dataset
show that directly predicting future segmentations is substantially better than
predicting and then segmenting future RGB frames. Prediction results up to half
a second in the future are visually convincing and are much more accurate than
those of a baseline based on warping semantic segmentations using optical flow.
Comment: Accepted to ICCV 2017. Supplementary material available on the
authors' webpage.
What Will I Do Next? The Intention from Motion Experiment
In computer vision, video-based approaches have been widely explored for the
early classification and the prediction of actions or activities. However, it
remains unclear whether this modality (as compared to 3D kinematics) can still
be reliable for the prediction of human intentions, defined as the overarching
goal embedded in an action sequence. Since the same action can be performed
with different intentions, this problem is more challenging, yet tractable,
as shown by quantitative cognitive studies which exploit the 3D kinematics
acquired through motion capture systems. In this paper, we bridge cognitive and
computer vision studies, by demonstrating the effectiveness of video-based
approaches for the prediction of human intentions. Precisely, we propose
Intention from Motion, a new paradigm where, without using any contextual
information, we consider instantaneous grasping motor acts involving a bottle
in order to forecast why the bottle itself has been reached (to pass it, to
place it in a box, to pour, or to drink the liquid inside). We process only the
grasping onsets, casting intention prediction as a classification problem.
Leveraging our multimodal acquisition (3D motion capture data and 2D optical
videos), we compare the most commonly used 3D descriptors from cognitive
studies with state-of-the-art video-based techniques. Since the two analyses
achieve an equivalent performance, we demonstrate that computer vision tools
are effective in capturing the kinematics and addressing the cognitive problem of
human intention prediction.
Comment: 2017 IEEE Conference on Computer Vision and Pattern Recognition
Workshop
RED: Reinforced Encoder-Decoder Networks for Action Anticipation
Action anticipation aims to detect an action before it happens. Many
real-world applications in robotics and surveillance are related to this predictive
capability. Current methods address this problem by first anticipating visual
representations of future frames and then categorizing the anticipated
representations into actions. However, anticipation is based on a single past
frame's representation, which ignores the historical trend. Moreover, such methods
can only anticipate at a fixed future time. We propose a Reinforced Encoder-Decoder (RED)
network for action anticipation. RED takes multiple history representations as
input and learns to anticipate a sequence of future representations. One
salient aspect of RED is that a reinforcement module is adopted to provide
sequence-level supervision; the reward function is designed to encourage the
system to make correct predictions as early as possible. We test RED on
TVSeries, THUMOS-14 and TV-Human-Interaction datasets for action anticipation
and achieve state-of-the-art performance on all datasets.