Spatio-Temporal Action Detection with Cascade Proposal and Location Anticipation
In this work, we address the problem of spatio-temporal action detection in
temporally untrimmed videos. It is an important and challenging task, as
accurately localizing human actions in both time and space is essential for
analyzing large-scale video data. To tackle this problem, we propose a cascade
proposal and location anticipation (CPLA) model for frame-level action
detection. There are several salient points of our model: (1) a cascade region
proposal network (casRPN) is adopted for action proposal generation and shows
better localization accuracy than a single region proposal network
(RPN); (2) action spatio-temporal consistencies are exploited via a location
anticipation network (LAN) and thus frame-level action detection is not
conducted independently. Frame-level detections are then linked by solving a
linking score maximization problem, and temporally trimmed into spatio-temporal
action tubes. We demonstrate the effectiveness of our model on the challenging
UCF101 and LIRIS-HARL datasets, achieving state-of-the-art performance on both.
Comment: Accepted at BMVC 2017 (oral
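The linking step mentioned in the abstract above can be illustrated with a small dynamic-programming sketch. The abstract does not define the linking score, so the score used here (detection confidence plus a weighted box overlap between consecutive frames) is an assumption borrowed from common practice, and the names `box_iou`, `link_detections` and `overlap_weight` are hypothetical. It also assumes every frame has at least one detection.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def link_detections(frames, overlap_weight=1.0):
    """Pick one detection per frame so that the summed score is maximal.

    frames: list (one entry per frame) of non-empty lists of (box, score).
    Assumed path score: sum of detection scores plus overlap_weight times
    the IoU between boxes chosen in consecutive frames.
    Returns the index of the chosen detection in each frame.
    """
    best = [score for _, score in frames[0]]  # best path score ending at each box
    back = []                                 # back-pointers, one list per transition
    for t in range(1, len(frames)):
        new_best, pointers = [], []
        for box, score in frames[t]:
            cands = [best[i] + overlap_weight * box_iou(prev_box, box)
                     for i, (prev_box, _) in enumerate(frames[t - 1])]
            i_star = max(range(len(cands)), key=cands.__getitem__)
            new_best.append(cands[i_star] + score)
            pointers.append(i_star)
        best = new_best
        back.append(pointers)
    # Trace the best path backwards from the best final detection.
    j = max(range(len(best)), key=best.__getitem__)
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    return path[::-1]


frames = [
    [((0, 0, 10, 10), 0.9), ((50, 50, 60, 60), 0.8)],
    [((51, 51, 61, 61), 0.7), ((1, 1, 11, 11), 0.6)],
    [((2, 2, 12, 12), 0.9), ((52, 52, 62, 62), 0.5)],
]
path = link_detections(frames)
```

In this toy input the overlap term keeps the path on the slowly moving top-left box even though a higher-scoring but distant box exists in the second frame. Temporal trimming of the linked tube is not covered by the sketch.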
Cascaded Boundary Regression for Temporal Action Detection
Temporal action detection in long videos is an important problem.
State-of-the-art methods address this problem by applying action classifiers on
sliding windows. Although sliding windows may contain an identifiable portion
of the actions, they may not necessarily cover the entire action instance,
which leads to inferior performance. We adapt a two-stage temporal action
detection pipeline with a Cascaded Boundary Regression (CBR) model.
Class-agnostic proposals and specific actions are detected respectively in the
first and the second stage. CBR uses temporal coordinate regression to refine
the temporal boundaries of the sliding windows. The salient aspect of the
refinement process is that, inside each stage, the temporal boundaries are
adjusted in a cascaded way by feeding the refined windows back to the system
for further boundary refinement. We test CBR on THUMOS-14 and TVSeries, and
achieve state-of-the-art performance on both datasets. The performance gain is
especially remarkable under high IoU thresholds, e.g. mAP@tIoU=0.5 on THUMOS-14
is improved from 19.0% to 31.0%.
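As a rough illustration of the two ideas in the abstract above, the sketch below computes temporal IoU (tIoU) between two segments and applies a boundary regressor repeatedly, feeding each refined window back in. The names `temporal_iou`, `cascade_refine` and `toy_regressor` are hypothetical; the toy regressor stands in for a learned model, and moving each boundary halfway toward a known ground truth is an assumption made purely for the demonstration.

```python
def temporal_iou(a, b):
    """tIoU of two segments given as (start, end) in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def cascade_refine(window, regressor, steps=3):
    """Refine (start, end) by feeding the refined window back to the regressor."""
    for _ in range(steps):
        d_start, d_end = regressor(window)  # predicted boundary offsets
        window = (window[0] + d_start, window[1] + d_end)
    return window


ground_truth = (2.0, 8.0)


def toy_regressor(window):
    # Stand-in for a learned model (assumption): move each boundary
    # halfway toward the ground-truth boundary.
    return (0.5 * (ground_truth[0] - window[0]),
            0.5 * (ground_truth[1] - window[1]))


initial = (0.0, 10.0)
refined = cascade_refine(initial, toy_regressor, steps=3)
before = temporal_iou(initial, ground_truth)
after = temporal_iou(refined, ground_truth)
```

Each pass tightens the window a little more, which is why the gain shows up most at high tIoU thresholds, where a loosely aligned sliding window would otherwise be counted as a miss.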
RED: Reinforced Encoder-Decoder Networks for Action Anticipation
Action anticipation aims to detect an action before it happens. Many real
world applications in robotics and surveillance are related to this predictive
capability. Current methods address this problem by first anticipating visual
representations of future frames and then categorizing the anticipated
representations to actions. However, anticipation is based on a single past
frame's representation, which ignores the historical trend. Moreover, these
methods can only anticipate at a fixed future time. We propose a Reinforced Encoder-Decoder (RED)
network for action anticipation. RED takes multiple history representations as
input and learns to anticipate a sequence of future representations. One
salient aspect of RED is that a reinforcement module is adopted to provide
sequence-level supervision; the reward function is designed to encourage the
system to make correct predictions as early as possible. We test RED on
TVSeries, THUMOS-14 and TV-Human-Interaction datasets for action anticipation
and achieve state-of-the-art performance on all datasets.
Occlusion Aware Unsupervised Learning of Optical Flow
It has been recently shown that a convolutional neural network can learn
optical flow estimation with unsupervised learning. However, the performance of
unsupervised methods still lags well behind that of their supervised
counterparts. Occlusion and large motion are among the major factors that
limit current unsupervised methods for learning optical flow.
In this work we introduce a new method that models occlusion explicitly,
together with a new warping scheme that facilitates the learning of large
motion. Our method shows promising results on the Flying Chairs, MPI-Sintel and
KITTI benchmark datasets. In particular, on the KITTI dataset, where abundant
unlabeled samples exist, our unsupervised method outperforms its counterpart
trained with supervised learning.
Comment: CVPR 2018 Camera-read
Activity Driven Weakly Supervised Object Detection
Weakly supervised object detection aims at reducing the amount of supervision
required to train detection models. Such models are traditionally learned from
images/videos labelled only with the object class and not the object bounding
box. In our work, we try to leverage not only the object class labels but also
the action labels associated with the data. We show that the action depicted in
the image/video can provide strong cues about the location of the associated
object. We learn a spatial prior for the object dependent on the action (e.g.
"ball" is closer to "leg of the person" in "kicking ball"), and incorporate
this prior to simultaneously train a joint object detection and action
classification model. We conducted experiments on both video datasets and image
datasets to evaluate the performance of our weakly supervised object detection
model. Our approach outperformed the current state-of-the-art (SOTA) method by
more than 6% in mAP on the Charades video dataset.
Comment: CVPR'19 camera read
LEGO: Learning Edge with Geometry all at Once by Watching Videos
Learning to estimate 3D geometry from a single image by watching unlabeled
videos via a deep convolutional network is attracting significant attention. In
this paper, we introduce a "3D as-smooth-as-possible (3D-ASAP)" prior inside
the pipeline, which enables joint estimation of edges and 3D scene, yielding
results with significant improvement in accuracy for fine detailed structures.
Specifically, we define the 3D-ASAP prior by requiring that any two points
recovered in 3D from an image should lie on an existing planar surface if no
other cues are provided. We design an unsupervised framework that Learns Edges and
Geometry (depth, normal) all at Once (LEGO). The predicted edges are embedded
into depth and surface normal smoothness terms, where pixels without edges
in-between are constrained to satisfy the prior. In our framework, the
predicted depths, normals and edges are forced to be consistent all the time.
We conduct experiments on KITTI to evaluate our estimated geometry and
CityScapes to perform edge evaluation. We show that in all of the tasks,
i.e., depth, normal and edge, our algorithm vastly outperforms other
state-of-the-art (SOTA) algorithms, demonstrating the benefits of our approach.
Comment: Accepted to CVPR 2018 as spotlight; Camera ready plus supplementary
material. Code will com