Segmenting the Future
Predicting the future is important for decision-making in robotics
or autonomous driving systems, which rely heavily on visual scene
understanding. While prior work attempts to predict future video pixels,
anticipate activities, or forecast future scene semantic segments from the
segmentation of preceding frames, no existing method predicts future
semantic segmentation solely from preceding RGB frames in a single
end-to-end trainable model.
encoder-decoder network architecture that encodes RGB frames from the past and
decodes the future semantic segmentation. The network is coupled with a new
knowledge distillation training framework specific for the forecasting task.
Our method, seeing only the preceding video frames, implicitly models the
scene segments while simultaneously accounting for object dynamics to infer
the future scene semantic segments. Our results on Cityscapes and Apolloscape
outperform the baseline and current state-of-the-art methods. Code is available
at https://github.com/eddyhkchiu/segmenting_the_future/
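As a concrete illustration of the architecture the abstract describes, the following is a minimal PyTorch sketch of a temporal encoder-decoder that encodes past RGB frames and decodes a future segmentation map. It is not the authors' model (see the repository above for that): the per-frame CNN, the per-pixel GRU aggregation, and all sizes are illustrative assumptions.

import torch
import torch.nn as nn

# Hedged sketch: encode each past frame with a small CNN, aggregate the
# per-pixel features over time with a GRU, decode to future class logits.
class FutureSegNet(nn.Module):
    def __init__(self, num_classes=19, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(  # RGB frame -> features at 1/4 resolution
            nn.Conv2d(3, hidden, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.temporal = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.Sequential(  # features -> per-pixel class scores
            nn.ConvTranspose2d(hidden, hidden, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(hidden, num_classes, 4, stride=2, padding=1),
        )

    def forward(self, frames):  # frames: (B, T, 3, H, W) past RGB frames
        B, T, C, H, W = frames.shape
        feats = self.encoder(frames.reshape(B * T, C, H, W))
        F_, h, w = feats.shape[1:]
        # Treat each spatial location as a length-T sequence for the GRU.
        seq = feats.reshape(B, T, F_, h * w).permute(0, 3, 1, 2)
        _, last = self.temporal(seq.reshape(B * h * w, T, F_))
        agg = last.squeeze(0).reshape(B, h, w, F_).permute(0, 3, 1, 2)
        return self.decoder(agg)  # (B, num_classes, H, W) future logits

logits = FutureSegNet()(torch.randn(2, 4, 3, 64, 128))  # 4 past frames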
Dense-Captioning Events in Videos
Most natural videos contain numerous events. For example, in a video of a
"man playing a piano", the video might also contain "another man dancing" or "a
crowd clapping". We introduce the task of dense-captioning events, which
involves both detecting and describing events in a video. We propose a new
model that is able to identify all events in a single pass of the video while
simultaneously describing the detected events with natural language. Our model
introduces a variant of an existing proposal module that is designed to capture
both short as well as long events that span minutes. To capture the
dependencies between the events in a video, our model introduces a new
captioning module that uses contextual information from past and future events
to jointly describe all events. We also introduce ActivityNet Captions, a
large-scale benchmark for dense-captioning events. ActivityNet Captions
contains 20k videos amounting to 849 video hours with 100k total descriptions,
each with its own start and end time. Finally, we report the performance of
our model on dense-captioning events, video retrieval, and localization.
Comment: 16 pages, 16 figures
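To make the proposal idea concrete, here is a hedged PyTorch sketch of a multi-scale event proposal module in the spirit of the variant described above: an RNN scans clip-level features and, at each time step, scores K candidate proposals that end there with K different lengths. The feature dimension, hidden size, and K are illustrative assumptions, not the paper's values.

import torch
import torch.nn as nn

class EventProposals(nn.Module):
    def __init__(self, feat_dim=500, hidden=256, num_scales=8):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.score = nn.Linear(hidden, num_scales)  # one score per proposal length

    def forward(self, clip_feats):  # (B, T, feat_dim) clip-level features
        states, _ = self.rnn(clip_feats)
        # Score k at step t rates the interval of length_k ending at t; keeping
        # high-scoring intervals yields both short and minutes-long candidates.
        return torch.sigmoid(self.score(states))  # (B, T, K)

scores = EventProposals()(torch.randn(1, 120, 500))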
Unsupervised Visual-Linguistic Reference Resolution in Instructional Videos
We propose an unsupervised method for reference resolution in instructional
videos, where the goal is to temporally link an entity (e.g., "dressing") to
the action (e.g., "mix yogurt") that produced it. The key challenge is the
inevitable visual-linguistic ambiguities arising from the changes in both
visual appearance and referring expression of an entity in the video. This
challenge is amplified by the fact that we aim to resolve references with no
supervision. We address these challenges by learning a joint visual-linguistic
model, where linguistic cues can help resolve visual ambiguities and vice
versa. We verify our approach by training our model without supervision on more
than two thousand unstructured cooking videos from YouTube, and show that our
visual-linguistic model substantially improves upon a state-of-the-art
linguistic-only model on reference resolution in instructional videos.
Comment: CVPR 2017
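The abstract's core idea, linguistic and visual cues jointly disambiguating a reference, can be caricatured in a few lines. The toy sketch below (an assumption-laden illustration, not the paper's model) links an entity mention to the candidate past action with the highest mixed linguistic-visual similarity; the embeddings and the mixing weight alpha are hypothetical.

import torch

def link_entity(entity_txt, entity_vis, action_txt, action_vis, alpha=0.5):
    # entity_*: (D,) embeddings of the mention; action_*: (N, D) candidates.
    txt_sim = torch.cosine_similarity(entity_txt, action_txt, dim=-1)  # (N,)
    vis_sim = torch.cosine_similarity(entity_vis, action_vis, dim=-1)  # (N,)
    # A visually ambiguous entity can still be resolved by the text term,
    # and vice versa, which is the point of the joint model.
    return int((alpha * txt_sim + (1 - alpha) * vis_sim).argmax())

idx = link_entity(torch.randn(64), torch.randn(64),
                  torch.randn(5, 64), torch.randn(5, 64))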
A Deep Learning Based Behavioral Approach to Indoor Autonomous Navigation
We present a semantically rich graph representation for indoor robotic
navigation. Our graph representation encodes: semantic locations such as
offices or corridors as nodes, and navigational behaviors such as enter office
or cross a corridor as edges. In particular, our navigational behaviors operate
directly from visual inputs to produce motor controls and are implemented with
deep learning architectures. This enables the robot to avoid explicit
computation of its precise location or the geometry of the environment, and
enables navigation at a higher level of semantic abstraction. We evaluate the
effectiveness of our representation by simulating navigation tasks in a large
number of virtual environments. Our results show that, using a simple set of
perceptual and navigational behaviors, the proposed approach can successfully
guide the robot as it completes navigational missions such as going to a
specific office. Furthermore, our implementation proves effective at
controlling the selection and switching of behaviors.
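The graph representation itself is simple to picture; the sketch below shows a hypothetical instance in plain Python, where planning reduces to a breadth-first search that returns the behavior sequence to execute. Place and behavior names are illustrative; in the paper each behavior is a learned visuomotor policy, not a symbolic action.

from collections import deque

# Nodes are semantic locations; edges are (behavior, destination) pairs.
GRAPH = {
    "office-1":   [("exit office", "corridor-A")],
    "corridor-A": [("cross corridor", "corridor-B"), ("enter office", "office-1")],
    "corridor-B": [("enter office", "office-2")],
    "office-2":   [],
}

def plan(start, goal):
    # BFS over the behavior graph; returns the behavior sequence to the goal.
    queue, seen = deque([(start, [])]), {start}
    while queue:
        node, behaviors = queue.popleft()
        if node == goal:
            return behaviors
        for behavior, nxt in GRAPH[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, behaviors + [behavior]))
    return None

print(plan("office-1", "office-2"))
# ['exit office', 'cross corridor', 'enter office']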
Interpretable Visual Question Answering by Visual Grounding from Attention Supervision Mining
A key aspect of interpretable VQA models is their ability to ground their
answers in relevant regions of the image. Current approaches with this
capability rely on supervised learning and human annotated groundings to train
attention mechanisms inside the VQA architecture. Unfortunately, obtaining
human annotations specific for visual grounding is difficult and expensive. In
this work, we demonstrate that we can effectively train a VQA architecture with
grounding supervision that can be automatically obtained from available region
descriptions and object annotations. We also show that our model trained with
this mined supervision generates visual groundings that correlate better with
manually annotated groundings, while achieving state-of-the-art VQA accuracy.
Comment: 8 pages, 4 figures
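A hedged sketch of what such mined attention supervision can look like in training: alongside the usual answer loss, the model's attention over image regions is pulled toward a grounding mask derived from region descriptions and object annotations. The KL formulation and the loss weight are illustrative assumptions, not necessarily the paper's exact objective.

import torch
import torch.nn.functional as F

def vqa_loss(answer_logits, answer_gt, attn, mined_mask, w_attn=0.5):
    # attn: (B, R) attention over R regions; mined_mask: (B, R) mined grounding.
    ans_loss = F.cross_entropy(answer_logits, answer_gt)
    target = mined_mask / mined_mask.sum(dim=1, keepdim=True)  # to a distribution
    attn_loss = F.kl_div(torch.log(attn + 1e-8), target, reduction="batchmean")
    return ans_loss + w_attn * attn_loss

loss = vqa_loss(torch.randn(4, 1000), torch.randint(0, 1000, (4,)),
                torch.softmax(torch.randn(4, 36), dim=1), torch.rand(4, 36))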
Peeking into the Future: Predicting Future Person Activities and Locations in Videos
Deciphering human behaviors in videos to predict their future
paths/trajectories and activities is important in many applications. Motivated
by this idea, this paper studies predicting a pedestrian's future path jointly
with future activities. We propose an end-to-end, multi-task learning system
utilizing rich visual features about human behavioral information and
interaction with their surroundings. To facilitate training, the network is
learned with an auxiliary task of predicting the future location in which the
activity will happen. Experimental results demonstrate state-of-the-art
performance on two public benchmarks for future trajectory prediction.
Moreover, our method is able to produce meaningful future activity prediction
in addition to the path. The result provides the first empirical evidence that
joint modeling of paths and activities benefits future path prediction.
Comment: In CVPR 2019. Code, models and more results are available at:
https://next.cs.cmu.edu
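The multi-task setup can be summarized as one trunk with three heads whose losses are summed. The sketch below is an assumption-level illustration of such an objective (head shapes and weights are hypothetical; see the linked code for the actual model).

import torch
import torch.nn.functional as F

def multitask_loss(traj_pred, traj_gt, act_logits, act_gt, loc_pred, loc_gt,
                   w_act=1.0, w_loc=0.1):
    traj = F.smooth_l1_loss(traj_pred, traj_gt)  # future (x, y) trajectory
    act = F.cross_entropy(act_logits, act_gt)    # future activity label
    loc = F.mse_loss(loc_pred, loc_gt)           # auxiliary activity location
    return traj + w_act * act + w_loc * loc

loss = multitask_loss(torch.randn(4, 12, 2), torch.randn(4, 12, 2),
                      torch.randn(4, 30), torch.randint(0, 30, (4,)),
                      torch.randn(4, 2), torch.randn(4, 2))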
Action-Agnostic Human Pose Forecasting
Predicting and forecasting human dynamics is a very interesting but
challenging task with several prospective applications in robotics,
health-care, etc. Recently, several methods have been developed for human pose
forecasting; however, they often come with limitations in their settings. For
instance, previous work focuses on either short-term or long-term predictions,
sacrificing one for the other. Furthermore, prior methods include activity
labels as part of training and require them at test time. These limitations
confine the use of pose forecasting models
for real-world applications, as often there are no activity-related annotations
for testing scenarios. In this paper, we propose a new action-agnostic method
for short- and long-term human pose forecasting. To this end, we propose a new
recurrent neural network for modeling the hierarchical and multi-scale
characteristics of the human dynamics, denoted by triangular-prism RNN
(TP-RNN). Our model captures the latent hierarchical structure embedded in
temporal human pose sequences by encoding the temporal dependencies with
different time-scales. For evaluation, we run an extensive set of experiments
on Human 3.6M and Penn Action datasets and show that our method outperforms
baseline and state-of-the-art methods quantitatively and qualitatively. Codes
are available at https://github.com/eddyhkchiu/pose_forecast_wacv/
Comment: Accepted for publication in WACV 2019
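The multi-scale intuition can be reduced to two recurrent cells ticking at different rates. The sketch below is a deliberately simplified two-scale stand-in for TP-RNN (whose actual hierarchy is more elaborate); sizes and the scale factor k are illustrative assumptions.

import torch
import torch.nn as nn

class TwoScalePoseRNN(nn.Module):
    def __init__(self, pose_dim=54, hidden=128, k=4):
        super().__init__()
        self.k = k
        self.fast = nn.GRUCell(pose_dim, hidden)   # updates every frame
        self.slow = nn.GRUCell(hidden, hidden)     # updates every k frames
        self.out = nn.Linear(2 * hidden, pose_dim)

    def forward(self, poses):  # (B, T, pose_dim) observed pose sequence
        B, T, _ = poses.shape
        hf = poses.new_zeros(B, self.fast.hidden_size)
        hs = poses.new_zeros(B, self.slow.hidden_size)
        preds = []
        for t in range(T):
            hf = self.fast(poses[:, t], hf)
            if (t + 1) % self.k == 0:  # slow time-scale tick
                hs = self.slow(hf, hs)
            # Predict the next pose as a residual over the current pose.
            preds.append(poses[:, t] + self.out(torch.cat([hf, hs], dim=1)))
        return torch.stack(preds, dim=1)  # one-step-ahead predictions

pred = TwoScalePoseRNN()(torch.randn(2, 16, 54))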
Visual Forecasting by Imitating Dynamics in Natural Sequences
We introduce a general framework for visual forecasting, which directly
imitates visual sequences without additional supervision. As a result, our
model can be applied at several semantic levels and does not require any domain
knowledge or handcrafted features. We achieve this by formulating visual
forecasting as an inverse reinforcement learning (IRL) problem, and directly
imitate the dynamics in natural sequences from their raw pixel values. The key
challenge is the high-dimensional and continuous state-action space that
prohibits the application of previous IRL algorithms. We address this
computational bottleneck by extending recent progress in model-free imitation
with trainable deep feature representations, which (1) bypasses the exhaustive
state-action pair visits in dynamic programming by using a dual formulation and
(2) avoids explicit state sampling at gradient computation using a deep feature
reparametrization. This allows us to apply IRL at scale and directly imitate
the dynamics in high-dimensional continuous visual sequences from the raw pixel
values. We evaluate our approach at three different levels of abstraction, from
low-level pixels to higher-level semantics: future frame generation, action
anticipation, and visual story forecasting. At all levels, our approach
outperforms existing methods.
Comment: 10 pages, 9 figures, accepted to ICCV 2017
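At the level of the dual formulation, the training loop resembles adversarial imitation: a discriminator scores state-action feature pairs, and the forecaster is rewarded for transitions the discriminator cannot distinguish from real sequences. The sketch below is a generic GAIL-style caricature under that reading, with all dimensions assumed; the paper operates on learned deep features of raw pixels.

import torch
import torch.nn as nn

disc = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 1))
bce = nn.BCEWithLogitsLoss()

def disc_loss(real_sa, fake_sa):  # (B, 512) concatenated state-action features
    real = bce(disc(real_sa), torch.ones(real_sa.size(0), 1))
    fake = bce(disc(fake_sa), torch.zeros(fake_sa.size(0), 1))
    return real + fake

def policy_reward(fake_sa):
    # Higher when the discriminator believes the forecast transition is real;
    # this signal drives the forecaster's (policy's) update.
    return disc(fake_sa).detach()

loss = disc_loss(torch.randn(8, 512), torch.randn(8, 512))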
Agent-Centric Risk Assessment: Accident Anticipation and Risky Region Localization
For survival, a living agent must have the ability to assess risk (1) by
temporally anticipating accidents before they occur, and (2) by spatially
localizing risky regions in the environment to move away from threats. In this
paper, we take an agent-centric approach to study the accident anticipation and
risky region localization tasks. We propose a novel soft-attention Recurrent
Neural Network (RNN) that explicitly models the spatial and appearance-wise
non-linear interactions between the agent triggering the event and the other
agents or static regions involved. To test our proposed method, we introduce
the Epic Fail (EF) dataset consisting of 3000 viral videos capturing various
accidents. In the experiments, we evaluate the risk assessment accuracy both in
the temporal domain (accident anticipation) and spatial domain (risky region
localization) on our EF dataset and the Street Accident (SA) dataset. Our
method consistently outperforms other baselines on both datasets.
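The agent-centric attention can be sketched as follows: the triggering agent's feature attends over candidate region features, and the attended context drives a recurrent step that emits a per-frame risk score, with the attention weights doubling as the risky-region localization. Layer choices and sizes are illustrative assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn

class RiskRNN(nn.Module):
    def __init__(self, feat=256, hidden=256):
        super().__init__()
        self.attn = nn.Bilinear(feat, feat, 1)  # agent-region interaction score
        self.rnn = nn.GRUCell(2 * feat, hidden)
        self.risk = nn.Linear(hidden, 1)

    def forward(self, agent, regions, h):
        # agent: (B, F); regions: (B, R, F); h: (B, hidden) recurrent state
        B, R, _ = regions.shape
        scores = self.attn(agent.unsqueeze(1).repeat(1, R, 1), regions).squeeze(2)
        alpha = torch.softmax(scores, dim=1)         # soft attention over regions
        ctx = (alpha.unsqueeze(2) * regions).sum(1)  # attended risky context
        h = self.rnn(torch.cat([agent, ctx], dim=1), h)
        # alpha localizes risky regions; the sigmoid is the anticipation score.
        return torch.sigmoid(self.risk(h)), alpha, h

risk, alpha, h = RiskRNN()(torch.randn(2, 256), torch.randn(2, 5, 256),
                           torch.zeros(2, 256))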
Translating Navigation Instructions in Natural Language to a High-Level Plan for Behavioral Robot Navigation
We propose an end-to-end deep learning model for translating free-form
natural language instructions to a high-level plan for behavioral robot
navigation. We use attention models to connect information from both the user
instructions and a topological representation of the environment. We evaluate
our model's performance on a new dataset containing 10,050 pairs of navigation
instructions. Our model significantly outperforms baseline approaches.
Furthermore, our results suggest that it is possible to leverage the
environment map as a relevant knowledge base to facilitate the translation of
free-form navigational instructions.
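The translation model family the abstract describes, an encoder over instruction words and a decoder with attention emitting the behavior sequence, can be sketched as below. This is a generic attention seq2seq under assumed vocabulary and behavior-set sizes; the full model additionally attends over the topological environment map.

import torch
import torch.nn as nn

class Seq2SeqPlanner(nn.Module):
    def __init__(self, vocab=1000, behaviors=40, hidden=128):
        super().__init__()
        self.w_emb = nn.Embedding(vocab, hidden)      # instruction words
        self.b_emb = nn.Embedding(behaviors, hidden)  # plan behaviors
        self.enc = nn.GRU(hidden, hidden, batch_first=True)
        self.dec = nn.GRUCell(2 * hidden, hidden)
        self.out = nn.Linear(hidden, behaviors)

    def forward(self, words, prev_behaviors):
        # words: (B, Tw) instruction tokens; prev_behaviors: (B, Tb) plan prefix
        enc_states, h = self.enc(self.w_emb(words))
        h = h.squeeze(0)
        logits = []
        for t in range(prev_behaviors.size(1)):
            # Dot-product attention over the instruction words.
            alpha = torch.softmax((enc_states @ h.unsqueeze(2)).squeeze(2), dim=1)
            ctx = (alpha.unsqueeze(2) * enc_states).sum(1)
            h = self.dec(torch.cat([self.b_emb(prev_behaviors[:, t]), ctx], 1), h)
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)  # (B, Tb, behaviors) next-behavior scores

scores = Seq2SeqPlanner()(torch.randint(0, 1000, (2, 12)),
                          torch.randint(0, 40, (2, 5)))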