Action Tubelet Detector for Spatio-Temporal Action Localization
Current state-of-the-art approaches for spatio-temporal action localization
rely on detections at the frame level that are then linked or tracked across
time. In this paper, we leverage the temporal continuity of videos instead of
operating at the frame level. We propose the ACtion Tubelet detector
(ACT-detector) that takes as input a sequence of frames and outputs tubelets,
i.e., sequences of bounding boxes with associated scores. In the same way that
state-of-the-art object detectors rely on anchor boxes, our ACT-detector is
based on anchor cuboids. We build upon the SSD framework. Convolutional
features are extracted for each frame, while scores and regressions are based
on the temporal stacking of these features, thus exploiting information from a
sequence. Our experimental results show that leveraging sequences of frames
significantly improves detection performance over using individual frames. The
gain of our tubelet detector can be explained by both more accurate scores and
more precise localization. Our ACT-detector outperforms the state-of-the-art
methods for frame-mAP and video-mAP on the J-HMDB and UCF-101 datasets, in
particular at high overlap thresholds.
Comment: 9 pages
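A minimal sketch (not the authors' code) of the core idea described above: per-frame convolutional features are stacked along time, and a shared head scores anchor cuboids and regresses one box per frame of the tubelet. The tensor shapes, class counts, and the tiny stand-in backbone are illustrative assumptions.

```python
# Illustrative sketch of tubelet scoring/regression over stacked per-frame features.
# All names, shapes and the tiny backbone are assumptions, not the authors' implementation.
import torch
import torch.nn as nn

class TubeletHead(nn.Module):
    def __init__(self, num_frames=6, feat_dim=256, num_anchors=4, num_classes=24):
        super().__init__()
        # Scores and per-frame box offsets are predicted from temporally stacked features.
        self.cls = nn.Conv2d(num_frames * feat_dim, num_anchors * (num_classes + 1), 3, padding=1)
        self.reg = nn.Conv2d(num_frames * feat_dim, num_anchors * 4 * num_frames, 3, padding=1)

    def forward(self, per_frame_feats):
        # per_frame_feats: list of K tensors, each (B, C, H, W), one per frame.
        stacked = torch.cat(per_frame_feats, dim=1)   # (B, K*C, H, W)
        scores = self.cls(stacked)                    # one score per anchor cuboid
        offsets = self.reg(stacked)                   # 4 box offsets per frame of each tubelet
        return scores, offsets

# Usage: K frames pass through a shared 2D backbone, then the head sees all of them at once.
backbone = nn.Conv2d(3, 256, 3, padding=1)            # stand-in for the real feature extractor
frames = [torch.randn(2, 3, 64, 64) for _ in range(6)]
feats = [backbone(f) for f in frames]
scores, offsets = TubeletHead()(feats)
```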
Joint learning of object and action detectors
While most existing approaches for detection in videos focus on objects or human actions separately, we aim at jointly detecting objects performing actions, such as cat eating or dog jumping. We introduce an end-to-end multitask objective that jointly learns object-action relationships. We compare it with different training objectives, validate its effectiveness for detecting object-action pairs in videos, and show that both tasks of object and action detection benefit from this joint learning. Moreover, the proposed architecture can be used for zero-shot learning of actions: our multitask objective leverages the commonalities of an action performed by different objects, e.g. dog and cat jumping, enabling the detection of actions of an object without training on these object-action pairs. In experiments on the A2D dataset [50], we obtain state-of-the-art results on segmentation of object-action pairs. We finally apply our multitask architecture to detect visual relationships between objects in images of the VRD dataset [24].
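A hedged sketch of what an end-to-end multitask objective for joint object and action prediction could look like: a shared representation feeds two classification branches whose losses are summed, so the action branch can share what it learns across objects. The architecture, class counts, and loss weighting are assumptions for illustration, not the paper's actual model.

```python
# Illustrative multitask objective: shared features, separate object and action branches,
# joint loss. Dimensions and weighting are assumptions, not the paper's design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointObjectActionHead(nn.Module):
    def __init__(self, feat_dim=512, num_objects=8, num_actions=9):
        super().__init__()
        self.obj_branch = nn.Linear(feat_dim, num_objects)   # e.g. cat, dog, ...
        self.act_branch = nn.Linear(feat_dim, num_actions)   # e.g. eating, jumping, ...

    def forward(self, feats):
        return self.obj_branch(feats), self.act_branch(feats)

def joint_loss(obj_logits, act_logits, obj_labels, act_labels, w_obj=1.0, w_act=1.0):
    # Training both tasks together lets the action branch exploit commonalities across
    # objects (the property used for zero-shot object-action pairs).
    return w_obj * F.cross_entropy(obj_logits, obj_labels) + \
           w_act * F.cross_entropy(act_logits, act_labels)

feats = torch.randn(4, 512)                                   # pooled region features (stand-in)
obj_logits, act_logits = JointObjectActionHead()(feats)
loss = joint_loss(obj_logits, act_logits, torch.tensor([0, 1, 1, 2]), torch.tensor([3, 3, 0, 5]))
```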
Learning the What and How of Annotation in Video Object Segmentation
Video Object Segmentation (VOS) is crucial for several applications, from
video editing to video data generation. Training a VOS model requires an
abundance of manually labeled training videos. The de facto way of
annotating objects requires humans to draw detailed segmentation masks on the
target objects at each video frame. This annotation process, however, is
tedious and time-consuming. To reduce this annotation cost, in this paper, we
propose EVA-VOS, a human-in-the-loop annotation framework for video object
segmentation. Unlike the traditional approach, we introduce an agent that
predicts iteratively both which frame ("What") to annotate and which annotation
type ("How") to use. Then, the annotator annotates only the selected frame that
is used to update a VOS module, leading to significant gains in annotation
time. We conduct experiments on the MOSE and the DAVIS datasets and show
that: (a) EVA-VOS produces masks with accuracy close to human agreement
3.5x faster than the standard way of annotating videos; (b) our frame selection
achieves state-of-the-art performance; and (c) EVA-VOS yields significant
gains in annotation time compared to all other methods and baselines.
Comment: Accepted to WACV 202
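A schematic, hedged sketch of the human-in-the-loop loop described above: an agent iteratively picks which frame to annotate ("What") and which annotation type to use ("How"), the human annotates only that frame, and the VOS module is updated before masks are re-propagated. The toy classes, costs, and stopping criterion are placeholders, not the EVA-VOS implementation.

```python
# Schematic human-in-the-loop annotation loop with toy stand-ins for the agent,
# the annotator and the VOS module (placeholders, not EVA-VOS code).
import random

class ToyAgent:
    def select_frame(self, frames, masks):               # "What": frame whose mask is least trusted
        return random.randrange(len(frames))
    def select_annotation_type(self, frame_idx, masks):  # "How": cheap clicks vs. a full mask
        return "clicks" if frame_idx in masks else "mask"

class ToyAnnotator:
    def annotate(self, frame, ann_type):                 # returns (annotation, time cost in seconds)
        return f"{ann_type}-annotation", 5.0 if ann_type == "clicks" else 80.0

class ToyVOS:
    def update(self, frame, annotation): pass            # fold the new annotation into the model
    def propagate(self, frames):                         # re-estimate a mask for every frame
        return {i: f"mask-{i}" for i in range(len(frames))}

def annotate_video(frames, vos, agent, annotator, budget_seconds):
    masks, spent = {}, 0.0
    while spent < budget_seconds:
        i = agent.select_frame(frames, masks)                 # which frame to annotate
        ann_type = agent.select_annotation_type(i, masks)     # which annotation type to use
        annotation, cost = annotator.annotate(frames[i], ann_type)
        spent += cost
        vos.update(frames[i], annotation)                     # update the VOS module
        masks = vos.propagate(frames)                         # refresh masks on all frames
    return masks

print(annotate_video(list(range(30)), ToyVOS(), ToyAgent(), ToyAnnotator(), budget_seconds=200))
```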
BluNF: Blueprint Neural Field
Neural Radiance Fields (NeRFs) have revolutionized scene novel view
synthesis, offering visually realistic, precise, and robust implicit
reconstructions. While recent approaches enable NeRF editing, such as object
removal, 3D shape modification, or material property manipulation, the manual
annotation prior to such edits makes the process tedious. Additionally,
traditional 2D interaction tools lack an accurate sense of 3D space, preventing
precise manipulation and editing of scenes. In this paper, we introduce a novel
approach, called Blueprint Neural Field (BluNF), to address these editing
issues. BluNF provides a robust and user-friendly 2D blueprint, enabling
intuitive scene editing. By leveraging implicit neural representation, BluNF
constructs a blueprint of a scene using prior semantic and depth information.
The generated blueprint allows effortless editing and manipulation of NeRF
representations. We demonstrate BluNF's editability through an intuitive
click-and-change mechanism, enabling 3D manipulations, such as masking,
appearance modification, and object removal. Our approach significantly
contributes to visual content creation, paving the way for further research in
this area.
Comment: ICCV-W (AI3DCC) 2023. Project page with videos and code:
https://www.lix.polytechnique.fr/vista/projects/2023_iccvw_courant
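The abstract describes an implicit field that encodes a 2D blueprint of the scene from prior semantic and depth information, which users can click on to edit the NeRF. Below is a minimal sketch of what such a field could look like as a coordinate MLP; the architecture, input/output conventions, and class count are purely illustrative assumptions, not BluNF itself.

```python
# Illustrative coordinate-MLP "blueprint field": ground-plane (x, y) -> semantic class + height.
# Architecture and conventions are assumptions for illustration, not BluNF itself.
import torch
import torch.nn as nn

class BlueprintField(nn.Module):
    def __init__(self, num_classes=10, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes + 1),   # per-point semantic logits + a height value
        )

    def forward(self, xy):
        out = self.mlp(xy)
        return out[..., :-1], out[..., -1]        # (semantic logits, height)

# Querying the field on a grid yields a top-down blueprint a user can click on to
# select the corresponding region of the scene for editing.
xs, ys = torch.meshgrid(torch.linspace(-1, 1, 64), torch.linspace(-1, 1, 64), indexing="ij")
grid = torch.stack([xs, ys], dim=-1).reshape(-1, 2)
logits, height = BlueprintField()(grid)
blueprint = logits.argmax(dim=-1).reshape(64, 64)
```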
Reward Function Design for Crowd Simulation via Reinforcement Learning
Crowd simulation is important for video game design, since it makes it possible to
populate virtual worlds with autonomous avatars that navigate in a human-like
manner. Reinforcement learning has shown great potential in simulating virtual
crowds, but the design of the reward function is critical to achieving
effective and efficient results. In this work, we explore the design of reward
functions for reinforcement learning-based crowd simulation. We provide
theoretical insights on the validity of certain reward functions according to
their analytical properties, and evaluate them empirically on a range of
scenarios, with energy efficiency as the metric. Our experiments show that
directly minimizing the energy usage is a viable strategy as long as it is
paired with an appropriately scaled guiding potential, and enable us to study
the impact of the different reward components on the behavior of the simulated
crowd. Our findings can inform the development of new crowd simulation
techniques, and contribute to the wider study of human-like navigation.
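A hedged sketch of the kind of reward the abstract alludes to: a per-step locomotion-energy cost combined with an appropriately scaled, potential-based guiding term toward the goal. The energy model constants and the scaling factor are assumptions for illustration, not the paper's exact formulation.

```python
# Illustrative reward: negative locomotion energy plus a scaled potential-based shaping term.
# The energy constants and the scale k are assumptions, not the paper's exact values.
import numpy as np

E_S, E_W, MASS = 2.23, 1.26, 1.0     # common walking-energy constants, assumed here

def energy(speed, dt):
    # Energy spent during one step at a given speed (simple per-agent locomotion model).
    return MASS * (E_S + E_W * speed ** 2) * dt

def reward(pos, next_pos, goal, speed, dt, k=2.0):
    # Guiding potential = negative distance to goal; its difference rewards progress.
    phi_now = -np.linalg.norm(goal - pos)
    phi_next = -np.linalg.norm(goal - next_pos)
    return -energy(speed, dt) + k * (phi_next - phi_now)

# One step of an agent moving 0.5 m toward a goal 10 m away at 1.25 m/s:
pos, nxt, goal = np.array([0.0, 0.0]), np.array([0.5, 0.0]), np.array([10.0, 0.0])
print(reward(pos, nxt, goal, speed=1.25, dt=0.4))
```

If the guiding potential is scaled too weakly, the cheapest policy is simply to stand still; the scale k is what makes reaching the goal worth the energy spent, which is the trade-off the experiments above probe.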
One-shot Unsupervised Domain Adaptation with Personalized Diffusion Models
Adapting a segmentation model from a labeled source domain to a target
domain, where a single unlabeled datum is available, is one of the most
challenging problems in domain adaptation and is otherwise known as one-shot
unsupervised domain adaptation (OSUDA). Most of the prior works have addressed
the problem by relying on style transfer techniques, where the source images
are stylized to have the appearance of the target domain. Departing from the
common notion of transferring only the target "texture" information, we
leverage text-to-image diffusion models (e.g., Stable Diffusion) to generate a
synthetic target dataset with photo-realistic images that not only faithfully
depict the style of the target domain, but are also characterized by novel
scenes in diverse contexts. The text interface in our method Data AugmenTation
with diffUsion Models (DATUM) endows us with the possibility of guiding the
generation of images towards desired semantic concepts while respecting the
original spatial context of a single training image, which is not possible in
existing OSUDA methods. Extensive experiments on standard benchmarks show that
our DATUM surpasses the state-of-the-art OSUDA methods by up to +7.1%. The
implementation is available at https://github.com/yasserben/DATUM
Comment: 13 pages, 8 figures, 5 tables
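The abstract describes using a text-to-image diffusion model (e.g., Stable Diffusion) to synthesize a target-styled training set, guided by text while respecting the spatial layout of the single target image. Below is a hedged sketch using the Hugging Face diffusers img2img pipeline; it is not the DATUM pipeline (DATUM additionally personalizes the model, which this snippet does not do), and the model ID, prompts, strength, image path, and GPU assumption are illustrative.

```python
# Hedged sketch: generate target-styled images from the single target image via
# text-guided img2img with Stable Diffusion (diffusers). This is NOT the DATUM pipeline;
# the model ID, prompts, strength and file paths are illustrative assumptions.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")                                               # assumes a CUDA GPU is available

# Placeholder path to the single available target-domain image.
target_image = Image.open("single_target_image.png").convert("RGB").resize((768, 512))

prompts = [
    "a photo of a city street at night, rainy",            # steer generation toward desired
    "a photo of a city street in fog",                     # semantic concepts / contexts
]
synthetic_dataset = []
for prompt in prompts:
    out = pipe(prompt=prompt, image=target_image,
               strength=0.5,                               # keep part of the original spatial context
               guidance_scale=7.5).images[0]
    synthetic_dataset.append(out)
    out.save(f"synthetic_{len(synthetic_dataset)}.png")
```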
Understanding reinforcement learned crowds
Simulating trajectories of virtual crowds is a commonly encountered task in
Computer Graphics. Several recent works have applied Reinforcement Learning
methods to animate virtual agents; however, they often make different design
choices when it comes to the fundamental simulation setup. Each of these
choices comes with a reasonable justification for its use, so it is not obvious
what their real impact is, or how they affect the results. In this work, we
analyze some of these arbitrary choices in terms of their impact on the
learning performance, as well as the quality of the resulting simulation,
measured in terms of energy efficiency. We perform a theoretical analysis
of the properties of the reward function design, and empirically evaluate the
impact of using certain observation and action spaces on a variety of
scenarios, with the reward function and energy usage as metrics. We show that
directly using the neighboring agents' information as observation generally
outperforms the more widely used raycasting. Similarly, using nonholonomic
controls with egocentric observations tends to produce more efficient behaviors
than holonomic controls with absolute observations. Each of these choices has a
significant, and potentially nontrivial impact on the results, and so
researchers should be mindful about choosing and reporting them in their work.
Comment: Accepted for publication at MIG 202
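A hedged illustration of the two design choices compared above: observing nearby agents directly (relative positions and velocities) versus raycast distances, and nonholonomic control (forward speed, turning rate) in the agent's egocentric frame versus holonomic velocity control in absolute coordinates. The vector sizes and kinematic updates are simplified assumptions, not the paper's exact setup.

```python
# Simplified illustration of observation and action-space choices in RL crowd simulation.
# Sizes and kinematics are assumptions, not the paper's exact setup.
import numpy as np

def neighbor_observation(agent_pos, agent_vel, others, k=4):
    # Directly encode the k nearest neighbors' relative positions and velocities.
    rel = sorted(others, key=lambda o: np.linalg.norm(o["pos"] - agent_pos))[:k]
    feats = [np.concatenate([o["pos"] - agent_pos, o["vel"] - agent_vel]) for o in rel]
    return np.concatenate(feats) if feats else np.zeros(4 * k)

def nonholonomic_step(pos, heading, action, dt=0.1):
    # Egocentric, nonholonomic control: action = (forward speed, turning rate).
    speed, turn_rate = action
    heading = heading + turn_rate * dt
    pos = pos + speed * dt * np.array([np.cos(heading), np.sin(heading)])
    return pos, heading

def holonomic_step(pos, action, dt=0.1):
    # Absolute, holonomic control: action = desired velocity (vx, vy) in the world frame.
    return pos + np.asarray(action) * dt

others = [{"pos": np.random.randn(2), "vel": np.random.randn(2)} for _ in range(6)]
obs = neighbor_observation(np.zeros(2), np.zeros(2), others)
pos, heading = nonholonomic_step(np.zeros(2), 0.0, action=(1.3, 0.2))
```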