Learning to track for spatio-temporal action localization
We propose an effective approach for spatio-temporal action localization in
realistic videos. The approach first detects proposals at the frame-level and
scores them with a combination of static and motion CNN features. It then
tracks high-scoring proposals throughout the video using a
tracking-by-detection approach. Our tracker relies simultaneously on
instance-level and class-level detectors. The tracks are scored using a
spatio-temporal motion histogram, a descriptor at the track level, in
combination with the CNN features. Finally, we perform temporal localization of
the action using a sliding-window approach at the track level. We present
experimental results for spatio-temporal localization on the UCF-Sports, J-HMDB
and UCF-101 action localization datasets, where our approach outperforms the
state of the art by margins of 15%, 7% and 12% in mAP, respectively.
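As a toy illustration of the last step only, the sketch below performs
sliding-window temporal localization over the per-frame scores of a single
track; the window lengths and the mean-score criterion are assumptions made for
this example and are not the authors' implementation.

# Minimal sketch: pick the temporal window of a track with the highest mean
# per-frame score. Window lengths and scoring are illustrative assumptions.
import numpy as np

def localize_action(track_scores, window_lengths=(20, 40, 60)):
    """Return (start, end, score) of the best-scoring temporal window."""
    scores = np.asarray(track_scores, dtype=float)
    best = (0, len(scores), -np.inf)
    for length in window_lengths:
        if length > len(scores):
            continue
        # Mean score of every window of this length (cumulative-sum trick).
        csum = np.concatenate(([0.0], np.cumsum(scores)))
        means = (csum[length:] - csum[:-length]) / length
        start = int(np.argmax(means))
        if means[start] > best[2]:
            best = (start, start + length, float(means[start]))
    return best

# Example: a track whose middle frames score highest.
print(localize_action(np.concatenate([np.zeros(30), np.ones(40), np.zeros(30)])))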
Action Tubelet Detector for Spatio-Temporal Action Localization
Current state-of-the-art approaches for spatio-temporal action localization
rely on detections at the frame level that are then linked or tracked across
time. In this paper, we leverage the temporal continuity of videos instead of
operating at the frame level. We propose the ACtion Tubelet detector
(ACT-detector) that takes as input a sequence of frames and outputs tubelets,
i.e., sequences of bounding boxes with associated scores. In the same way that
state-of-the-art object detectors rely on anchor boxes, our ACT-detector is
based on anchor cuboids. We build upon the SSD framework. Convolutional
features are extracted for each frame, while scores and regressions are based
on the temporal stacking of these features, thus exploiting information from a
sequence. Our experimental results show that leveraging sequences of frames
significantly improves detection performance over using individual frames. The
gain of our tubelet detector can be explained by both more accurate scores and
more precise localization. Our ACT-detector outperforms the state-of-the-art
methods for frame-mAP and video-mAP on the J-HMDB and UCF-101 datasets, in
particular at high overlap thresholds.
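To make the anchor-cuboid idea concrete, the following PyTorch sketch stacks
per-frame features over K frames and predicts one score set per anchor cuboid
plus K per-frame box regressions. Channel sizes, K, the anchor count and the
class count are illustrative assumptions; this is not the released
ACT-detector.

# Minimal sketch of a tubelet prediction head on temporally stacked features.
import torch
import torch.nn as nn

class TubeletHead(nn.Module):
    def __init__(self, feat_channels=256, k_frames=6, num_anchors=9, num_classes=24):
        super().__init__()
        stacked = feat_channels * k_frames          # temporal stacking of per-frame features
        self.cls = nn.Conv2d(stacked, num_anchors * (num_classes + 1), 3, padding=1)
        self.reg = nn.Conv2d(stacked, num_anchors * 4 * k_frames, 3, padding=1)

    def forward(self, per_frame_feats):             # list of K tensors [B, C, H, W]
        x = torch.cat(per_frame_feats, dim=1)       # [B, K*C, H, W]
        scores = self.cls(x)                        # one score set per anchor cuboid
        boxes = self.reg(x)                         # K box offsets per anchor cuboid
        return scores, boxes

feats = [torch.randn(1, 256, 10, 10) for _ in range(6)]
scores, boxes = TubeletHead()(feats)
print(scores.shape, boxes.shape)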
EpicFlow: Edge-Preserving Interpolation of Correspondences for Optical Flow
We propose a novel approach for optical flow estimation, targeted at large
displacements with significant occlusions. It consists of two steps: i) dense
matching by edge-preserving interpolation from a sparse set of matches; ii)
variational energy minimization initialized with the dense matches. The
sparse-to-dense interpolation relies on an appropriate choice of the distance,
namely an edge-aware geodesic distance. This distance is tailored to handle
occlusions and motion boundaries -- two common and difficult issues for optical
flow computation. We also propose an approximation scheme for the geodesic
distance to allow fast computation without loss of performance. Subsequent to
the dense interpolation step, standard one-level variational energy
minimization is carried out on the dense matches to obtain the final flow
estimation. The proposed approach, called Edge-Preserving Interpolation of
Correspondences (EpicFlow) is fast and robust to large displacements. It
significantly outperforms the state of the art on MPI-Sintel and performs on
par with it on Kitti and Middlebury.
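A minimal sketch of the sparse-to-dense step is given below, with the crucial
edge-aware geodesic distance replaced by a plain Euclidean nearest-neighbour
search for brevity, and the variational refinement omitted; the number of
neighbours and the inverse-distance weighting are assumptions for this example.

# Minimal sketch: densify sparse matches by inverse-distance-weighted
# interpolation of the k nearest matches (EpicFlow instead uses an edge-aware
# geodesic distance and then variational refinement).
import numpy as np
from scipy.spatial import cKDTree

def densify_flow(match_xy, match_flow, height, width, k=4, eps=1e-6):
    """match_xy: (N, 2) pixel positions (x, y); match_flow: (N, 2) flow vectors."""
    tree = cKDTree(match_xy)
    ys, xs = np.mgrid[0:height, 0:width]
    pixels = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
    dists, idx = tree.query(pixels, k=k)
    weights = 1.0 / (dists + eps)                   # inverse-distance weights
    weights /= weights.sum(axis=1, keepdims=True)
    dense = (weights[..., None] * match_flow[idx]).sum(axis=1)
    return dense.reshape(height, width, 2)

# Example with synthetic matches on a 64x64 image.
xy = np.random.rand(50, 2) * 64
flow = np.random.randn(50, 2)
print(densify_flow(xy, flow, 64, 64).shape)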
DeepMatching: Hierarchical Deformable Dense Matching
We introduce a novel matching algorithm, called DeepMatching, to compute
dense correspondences between images. DeepMatching relies on a hierarchical,
multi-layer, correlational architecture designed for matching images and was
inspired by deep convolutional approaches. The proposed matching algorithm can
handle non-rigid deformations and repetitive textures and efficiently
determines dense correspondences in the presence of significant changes between
images. We evaluate the performance of DeepMatching, in comparison with
state-of-the-art matching algorithms, on the Mikolajczyk (Mikolajczyk et al.
2005), the MPI-Sintel (Butler et al. 2012) and the Kitti (Geiger et al. 2013)
datasets. DeepMatching outperforms the state-of-the-art algorithms and shows
excellent results, in particular for repetitive textures. We also propose a
method for estimating optical flow, called DeepFlow, by integrating
DeepMatching in the large displacement optical flow (LDOF) approach of Brox and
Malik (2011). Compared to existing matching algorithms, additional robustness
to large displacements and complex motion is obtained thanks to our matching
approach. DeepFlow obtains competitive performance on public benchmarks for
optical flow estimation.
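The sketch below gives a rough, single-level flavour of a correlational
matching architecture: normalized patch correlations followed by max-pooling
and a simple aggregation of child responses into coarser parent scores. Patch
size, pooling radius and the averaging rule are assumptions and do not
reproduce DeepMatching's actual multi-level pipeline.

# Minimal sketch of one level of patch correlation plus aggregation.
import numpy as np
from scipy.signal import correlate2d
from scipy.ndimage import maximum_filter

def patch_correlations(ref, tgt, patch=4):
    """Map each reference patch (by top-left corner) to a max-pooled correlation map."""
    maps = {}
    for y in range(0, ref.shape[0] - patch + 1, patch):
        for x in range(0, ref.shape[1] - patch + 1, patch):
            p = ref[y:y + patch, x:x + patch]
            p = (p - p.mean()) / (p.std() + 1e-6)        # normalized patch
            corr = correlate2d(tgt, p, mode='same')
            maps[(y, x)] = maximum_filter(corr, size=3)  # tolerate small deformations
    return maps

def aggregate(maps, patch=4):
    """Average four child maps into one parent score map (coarser level)."""
    parents = {}
    for (y, x) in sorted(maps):
        children = [(y, x), (y, x + patch), (y + patch, x), (y + patch, x + patch)]
        if all(c in maps for c in children):
            parents[(y, x)] = sum(maps[c] for c in children) / 4.0
    return parents

ref, tgt = np.random.rand(16, 16), np.random.rand(16, 16)
print(len(aggregate(patch_correlations(ref, tgt))))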
DeepFlow: Large displacement optical flow with deep matching
Optical flow computation is a key component in many computer vision systems
designed for tasks such as action detection or activity recognition. However,
despite several major advances over the last decade, handling large
displacements in optical flow remains an open problem. Inspired by the large
displacement optical flow of Brox and Malik, our approach, termed DeepFlow,
blends a matching algorithm with a variational approach to optical flow. We
propose a descriptor matching algorithm, tailored to the optical flow problem,
that boosts performance on fast motions. The matching algorithm builds upon a
multi-stage architecture with 6 layers, interleaving convolutions and
max-pooling, a construction akin to deep convolutional nets. Using dense
sampling, it efficiently retrieves quasi-dense correspondences and enjoys a
built-in smoothing effect on descriptor matches, a valuable asset for
integration into an energy minimization framework for optical flow estimation.
DeepFlow efficiently handles the large displacements occurring in realistic
videos and shows competitive performance on optical flow benchmarks.
Furthermore, it sets a new state of the art on the MPI-Sintel dataset.
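A schematic way to write the kind of energy such matches plug into is the
classical large-displacement objective, with a data term, a smoothness term,
and an extra term tying the flow to the precomputed matches. The exact
penalties and weights used by DeepFlow are not reproduced here, so the formula
below is only a hedged sketch of that structure:

E(\mathbf{w}) \;=\; \int_{\Omega} \Big( E_{\mathrm{data}}\big(\mathbf{w}(\mathbf{x})\big)
  \;+\; \alpha \, E_{\mathrm{smooth}}\big(\nabla \mathbf{w}(\mathbf{x})\big)
  \;+\; \beta \, \delta(\mathbf{x}) \, E_{\mathrm{match}}\big(\mathbf{w}(\mathbf{x}) - \mathbf{w}_{m}(\mathbf{x})\big) \Big)\, d\mathbf{x}

where w_m denotes the precomputed correspondences, delta(x) indicates pixels
where a match is available, and alpha, beta are weighting parameters.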
Joint learning of object and action detectors
While most existing approaches for detection in videos focus on objects or
human actions separately, we aim at jointly detecting objects performing
actions, such as cat eating or dog jumping. We introduce an end-to-end
multitask objective that jointly learns object-action relationships. We compare
it with different training objectives, validate its effectiveness for detecting
object-action pairs in videos, and show that both tasks of object and action
detection benefit from this joint learning. Moreover, the proposed architecture
can be used for zero-shot learning of actions: our multitask objective
leverages the commonalities of an action performed by different objects, e.g.
dog and cat jumping, enabling the detection of an object's actions without
training on these object-action pairs. In experiments on the A2D dataset [50],
we obtain state-of-the-art results on segmentation of object-action pairs. We
finally apply our multitask architecture to detect visual relationships between
objects in images of the VRD dataset [24].
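A minimal PyTorch sketch of a joint objective of this kind: a shared feature
feeds separate object and action classification heads and the two
cross-entropy losses are summed. The backbone, class counts and equal loss
weighting are assumptions for the example, not the paper's architecture.

# Minimal sketch: shared features, two heads, summed multitask loss.
import torch
import torch.nn as nn

class JointObjectActionHead(nn.Module):
    def __init__(self, feat_dim=512, num_objects=10, num_actions=9):
        super().__init__()
        self.object_head = nn.Linear(feat_dim, num_objects)
        self.action_head = nn.Linear(feat_dim, num_actions)

    def forward(self, feats, object_labels, action_labels):
        obj_logits = self.object_head(feats)
        act_logits = self.action_head(feats)
        loss = nn.functional.cross_entropy(obj_logits, object_labels) \
             + nn.functional.cross_entropy(act_logits, action_labels)
        return loss

head = JointObjectActionHead()
feats = torch.randn(4, 512)
loss = head(feats, torch.randint(0, 10, (4,)), torch.randint(0, 9, (4,)))
loss.backward()
print(float(loss))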
End-to-End (Instance)-Image Goal Navigation through Correspondence as an Emergent Phenomenon
Most recent work in goal oriented visual navigation resorts to large-scale
machine learning in simulated environments. The main challenge lies in learning
compact representations generalizable to unseen environments and in learning
high-capacity perception modules capable of reasoning on high-dimensional
input. The latter is particularly difficult when the goal is not given as a
category ("ObjectNav") but as an exemplar image ("ImageNav"), as the perception
module needs to learn a comparison strategy that requires solving an underlying
visual correspondence problem. This has been shown to be difficult from reward
alone or with standard auxiliary tasks. We address this problem through a
sequence of two pretext tasks, which serve as a prior for what we argue is one
of the main bottlenecks in perception: extremely wide-baseline relative pose
estimation and visibility prediction in complex scenes. The first pretext task,
cross-view completion, is a proxy for the underlying visual correspondence
problem, while the second task directly addresses detecting and finding the goal.
We propose a new dual encoder with a large-capacity binocular ViT model and
show that correspondence solutions naturally emerge from the training signals.
Experiments show significant improvements and state-of-the-art performance on
the two benchmarks, ImageNav and the Instance-ImageNav variant, where camera
intrinsics and height differ between the observation and the goal.
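As a hedged sketch of what a "binocular" encoder can look like, the snippet
below concatenates patch tokens from the observation and the goal image and
processes them jointly with a transformer, so that a cross-view comparison can
be learned end to end. The tokenizer, depth, dimensions and pooling are
assumptions, not the paper's model.

# Minimal sketch of joint (binocular) encoding of observation and goal tokens.
import torch
import torch.nn as nn

class BinocularEncoder(nn.Module):
    def __init__(self, dim=256, depth=4, heads=8, patches_per_image=64):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, 2 * patches_per_image, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, obs_tokens, goal_tokens):     # each [B, N, dim]
        tokens = torch.cat([obs_tokens, goal_tokens], dim=1) + self.pos
        out = self.encoder(tokens)
        return out.mean(dim=1)                      # pooled feature fed to the policy

enc = BinocularEncoder()
feat = enc(torch.randn(2, 64, 256), torch.randn(2, 64, 256))
print(feat.shape)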
PoseScript: Linking 3D Human Poses and Natural Language
Natural language plays a critical role in many computer vision applications,
such as image captioning, visual question answering, and cross-modal retrieval,
to provide fine-grained semantic information. Unfortunately, while human pose
is key to human understanding, current 3D human pose datasets lack detailed
language descriptions. To address this issue, we have introduced the PoseScript
dataset. This dataset pairs more than six thousand 3D human poses from AMASS
with rich human-annotated descriptions of the body parts and their spatial
relationships. Additionally, to increase the size of the dataset to a scale
that is compatible with data-hungry learning algorithms, we have proposed an
elaborate captioning process that generates automatic synthetic descriptions in
natural language from given 3D keypoints. This process extracts low-level pose
information, known as "posecodes", using a set of simple but generic rules on
the 3D keypoints. These posecodes are then combined into higher level textual
descriptions using syntactic rules. With automatic annotations, the amount of
available data scales up significantly (100k), making it possible to
effectively pretrain deep models for finetuning on human captions. To showcase
the potential of annotated poses, we present three multi-modal learning tasks
that utilize the PoseScript dataset. Firstly, we develop a pipeline that maps
3D poses and textual descriptions into a joint embedding space, allowing for
cross-modal retrieval of relevant poses from large-scale datasets. Secondly, we
establish a baseline for a text-conditioned model generating 3D poses. Thirdly,
we present a learned process for generating pose descriptions. These
applications demonstrate the versatility and usefulness of annotated poses in
various tasks and pave the way for future research in the field. (Extended
version of the ECCV 2022 paper.)
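To illustrate the flavour of a rule-based posecode, the sketch below maps a
knee angle computed from three 3D keypoints to a coarse textual category. The
joint selection, thresholds and wording are assumptions and do not reproduce
the PoseScript captioning pipeline.

# Minimal sketch: one hand-written posecode rule on 3D keypoints.
import numpy as np

def joint_angle(a, b, c):
    """Angle (degrees) at keypoint b formed by keypoints a-b-c."""
    v1 = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    v2 = np.asarray(c, dtype=float) - np.asarray(b, dtype=float)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def knee_posecode(hip, knee, ankle):
    angle = joint_angle(hip, knee, ankle)
    if angle < 100:
        return "the knee is sharply bent"
    if angle < 150:
        return "the knee is slightly bent"
    return "the leg is straight"

print(knee_posecode([0, 1, 0], [0, 0.5, 0.1], [0, 0, 0]))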