Action Tubelet Detector for Spatio-Temporal Action Localization
Current state-of-the-art approaches for spatio-temporal action localization
rely on detections at the frame level that are then linked or tracked across
time. In this paper, we leverage the temporal continuity of videos instead of
operating at the frame level. We propose the ACtion Tubelet detector
(ACT-detector) that takes as input a sequence of frames and outputs tubelets,
i.e., sequences of bounding boxes with associated scores. The same way
state-of-the-art object detectors rely on anchor boxes, our ACT-detector is
based on anchor cuboids. We build upon the SSD framework. Convolutional
features are extracted for each frame, while scores and regressions are based
on the temporal stacking of these features, thus exploiting information from a
sequence. Our experimental results show that leveraging sequences of frames
significantly improves detection performance over using individual frames. The
gain of our tubelet detector can be explained by both more accurate scores and
more precise localization. Our ACT-detector outperforms the state-of-the-art
methods for frame-mAP and video-mAP on the J-HMDB and UCF-101 datasets, in
particular at high overlap thresholds. Comment: 9 pages.
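To make the mechanism concrete, here is a minimal PyTorch sketch (not the authors' code) of a tubelet head: per-frame convolutional features are stacked along the temporal axis, and a shared head predicts one classification score per anchor cuboid plus a box regression for each frame of the tubelet. All shapes, channel sizes, and anchor counts below are illustrative assumptions.

```python
# Illustrative sketch of a tubelet head: per-frame features are stacked in time,
# then a shared head predicts frame-wise box regressions and one score per
# anchor cuboid. Shapes and channel sizes are assumptions, not the paper's.
import torch
import torch.nn as nn

class TubeletHead(nn.Module):
    def __init__(self, feat_channels=256, seq_len=6, num_anchors=9, num_classes=24):
        super().__init__()
        stacked = feat_channels * seq_len  # temporal stacking of per-frame features
        # One score per anchor cuboid and per class.
        self.cls = nn.Conv2d(stacked, num_anchors * num_classes, kernel_size=3, padding=1)
        # 4 box offsets per frame of the cuboid (a tubelet = seq_len boxes).
        self.reg = nn.Conv2d(stacked, num_anchors * seq_len * 4, kernel_size=3, padding=1)

    def forward(self, per_frame_feats):
        # per_frame_feats: (batch, seq_len, feat_channels, H, W)
        b, t, c, h, w = per_frame_feats.shape
        stacked = per_frame_feats.reshape(b, t * c, h, w)
        return self.cls(stacked), self.reg(stacked)

feats = torch.randn(2, 6, 256, 38, 38)      # e.g. one SSD feature map for each of 6 frames
scores, regressions = TubeletHead()(feats)
print(scores.shape, regressions.shape)       # (2, 9*24, 38, 38), (2, 9*6*4, 38, 38)
```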
SCAM! Transferring humans between images with Semantic Cross Attention Modulation
A large body of recent work targets semantically conditioned image
generation. Most such methods focus on the narrower task of pose transfer and
ignore the more challenging task of subject transfer that consists in not only
transferring the pose but also the appearance and background. In this work, we
introduce SCAM (Semantic Cross Attention Modulation), a system that encodes
rich and diverse information in each semantic region of the image (including
foreground and background), thus achieving precise generation with emphasis on
fine details. This is enabled by the Semantic Attention Transformer Encoder
that extracts multiple latent vectors for each semantic region, and the
corresponding generator that exploits these multiple latents by using semantic
cross attention modulation. It is trained only using a reconstruction setup,
while subject transfer is performed at test time. Our analysis shows that our
proposed architecture is successful at encoding the diversity of appearance in
each semantic region. Extensive experiments on the iDesigner and CelebAMask-HQ
datasets show that SCAM outperforms SEAN and SPADE; moreover, it sets the new
state of the art on subject transfer. Comment: Accepted at ECCV 2022.
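As an illustration of the central operation, the following is a hedged sketch of semantic cross-attention in which each pixel may only attend to the latent vectors of its own semantic region. The masking scheme, dimensions, and latent counts are assumptions for illustration, not the paper's exact design.

```python
# Minimal sketch of semantic cross-attention: each semantic region owns several
# latent vectors, and each pixel may only attend to the latents of its own region.
import torch
import torch.nn.functional as F

def semantic_cross_attention(pixel_feats, region_latents, pixel_region, latent_region):
    # pixel_feats:    (N, D)  one feature per pixel
    # region_latents: (M, D)  several latent vectors per semantic region
    # pixel_region:   (N,)    semantic label of each pixel
    # latent_region:  (M,)    semantic label each latent belongs to
    attn = pixel_feats @ region_latents.t() / pixel_feats.shape[-1] ** 0.5  # (N, M)
    mask = pixel_region.unsqueeze(1) != latent_region.unsqueeze(0)          # forbid cross-region attention
    attn = attn.masked_fill(mask, float("-inf"))
    weights = F.softmax(attn, dim=-1)
    return weights @ region_latents                                          # modulated pixel features

pixels = torch.randn(16 * 16, 64)
latents = torch.randn(3 * 8, 64)                      # 8 latents for each of 3 regions
pix_lbl = torch.randint(0, 3, (16 * 16,))
lat_lbl = torch.arange(3).repeat_interleave(8)
out = semantic_cross_attention(pixels, latents, pix_lbl, lat_lbl)
print(out.shape)  # torch.Size([256, 64])
```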
Joint learning of object and action detectors
While most existing approaches for detection in videos focus on objects or human actions separately, we aim at jointly detecting objects performing actions, such as cat eating or dog jumping. We introduce an end-to-end multitask objective that jointly learns object-action relationships. We compare it with different training objectives, validate its effectiveness for detecting object-actions in videos, and show that both tasks of object and action detection benefit from this joint learning. Moreover, the proposed architecture can be used for zero-shot learning of actions: our multitask objective leverages the commonalities of an action performed by different objects, e.g. dog and cat jumping, enabling the detection of actions of an object without training with these object-action pairs. In experiments on the A2D dataset [50], we obtain state-of-the-art results on segmentation of object-action pairs. We finally apply our multitask architecture to detect visual relationships between objects in images of the VRD dataset [24].
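A minimal sketch of what such a multitask objective could look like, assuming two independent classification heads (objects and actions) over shared box features whose losses are summed. The head structure and loss weighting are illustrative assumptions; factorizing into separate object and action heads, rather than scoring every object-action pair, is what makes zero-shot pairs possible.

```python
# Hedged sketch of a joint object-action objective: one head scores objects,
# another scores actions, and the losses are summed so the two tasks share features.
import torch
import torch.nn.functional as F

def joint_object_action_loss(box_feats, obj_head, act_head, obj_labels, act_labels, w_act=1.0):
    obj_logits = obj_head(box_feats)   # (num_boxes, num_objects)
    act_logits = act_head(box_feats)   # (num_boxes, num_actions)
    loss_obj = F.cross_entropy(obj_logits, obj_labels)
    loss_act = F.cross_entropy(act_logits, act_labels)
    return loss_obj + w_act * loss_act

feats = torch.randn(32, 512)           # shared features for 32 detected boxes
obj_head = torch.nn.Linear(512, 7)     # e.g. 7 object classes
act_head = torch.nn.Linear(512, 9)     # e.g. 9 action classes
loss = joint_object_action_loss(feats, obj_head, act_head,
                                torch.randint(0, 7, (32,)), torch.randint(0, 9, (32,)))
loss.backward()
```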
Learning the What and How of Annotation in Video Object Segmentation
Video Object Segmentation (VOS) is crucial for several applications, from
video editing to video data generation. Training a VOS model requires an
abundance of manually labeled training videos. The de-facto traditional way of
annotating objects requires humans to draw detailed segmentation masks on the
target objects at each video frame. This annotation process, however, is
tedious and time-consuming. To reduce this annotation cost, in this paper, we
propose EVA-VOS, a human-in-the-loop annotation framework for video object
segmentation. Unlike the traditional approach, we introduce an agent that
predicts iteratively both which frame ("What") to annotate and which annotation
type ("How") to use. Then, the annotator annotates only the selected frame that
is used to update a VOS module, leading to significant gains in annotation
time. We conduct experiments on the MOSE and the DAVIS datasets and we show
that: (a) EVA-VOS leads to masks with accuracy close to human agreement
3.5x faster than the standard way of annotating videos; (b) our frame selection
achieves state-of-the-art performance; (c) EVA-VOS yields significant
performance gains in terms of annotation time compared to all other methods and
baselines. Comment: Accepted to WACV 2024.
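The "What"/"How" loop can be pictured with a toy, self-contained simulation: the least confident frame is selected, and a cheaper or more expensive annotation type is chosen depending on how wrong the prediction seems. All heuristics, costs, and thresholds below are assumptions for illustration, not EVA-VOS's actual policies.

```python
# Toy sketch of the "What / How" annotation loop under an annotation-time budget.
import numpy as np

rng = np.random.default_rng(0)
num_frames = 20
confidence = rng.uniform(0.2, 0.9, num_frames)   # stand-in for per-frame VOS confidence
budget, spent = 10.0, 0.0
COST = {"clicks": 1.0, "mask": 4.0}              # assumed relative annotation times

while spent < budget:
    frame = int(np.argmin(confidence))           # "What": least confident frame
    how = "mask" if confidence[frame] < 0.4 else "clicks"   # "How": full mask only when badly wrong
    spent += COST[how]
    confidence[frame] = 1.0 if how == "mask" else min(1.0, confidence[frame] + 0.3)
    print(f"annotated frame {frame} with {how}; budget spent {spent:.1f}")
```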
BluNF: Blueprint Neural Field
Neural Radiance Fields (NeRFs) have revolutionized scene novel view
synthesis, offering visually realistic, precise, and robust implicit
reconstructions. While recent approaches enable NeRF editing, such as object
removal, 3D shape modification, or material property manipulation, the manual
annotation prior to such edits makes the process tedious. Additionally,
traditional 2D interaction tools lack an accurate sense of 3D space, preventing
precise manipulation and editing of scenes. In this paper, we introduce a novel
approach, called Blueprint Neural Field (BluNF), to address these editing
issues. BluNF provides a robust and user-friendly 2D blueprint, enabling
intuitive scene editing. By leveraging implicit neural representation, BluNF
constructs a blueprint of a scene using prior semantic and depth information.
The generated blueprint allows effortless editing and manipulation of NeRF
representations. We demonstrate BluNF's editability through an intuitive
click-and-change mechanism, enabling 3D manipulations, such as masking,
appearance modification, and object removal. Our approach significantly
contributes to visual content creation, paving the way for further research in
this area. Comment: ICCV-W (AI3DCC) 2023. Project page with videos and code:
https://www.lix.polytechnique.fr/vista/projects/2023_iccvw_courant
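To give a sense of what an implicit 2D blueprint could look like, here is a minimal sketch of a small MLP field that maps top-down coordinates to semantic-class logits. The positional encoding, layer sizes, and output choice are illustrative assumptions rather than BluNF's actual design.

```python
# Minimal sketch of a "blueprint" as an implicit field: an MLP mapping 2D
# top-down coordinates to semantic-class logits.
import torch
import torch.nn as nn

class BlueprintField(nn.Module):
    def __init__(self, num_classes=10, hidden=128, num_freqs=6):
        super().__init__()
        self.num_freqs = num_freqs
        in_dim = 2 * 2 * num_freqs                       # sin/cos positional encoding of (x, y)
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, xy):                               # xy: (N, 2) blueprint coordinates
        freqs = 2.0 ** torch.arange(self.num_freqs, device=xy.device)
        angles = xy.unsqueeze(-1) * freqs                # (N, 2, num_freqs)
        enc = torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(1)
        return self.mlp(enc)                             # per-point semantic logits

field = BlueprintField()
logits = field(torch.rand(1024, 2))                      # query the blueprint at 2D points
print(logits.shape)                                      # torch.Size([1024, 10])
```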
Short Film Dataset (SFD): A Benchmark for Story-Level Video Understanding
Recent advances in vision-language models have significantly propelled video
understanding. Existing datasets and tasks, however, have notable limitations.
Most datasets are confined to short videos with limited events and narrow
narratives. For example, datasets with instructional and egocentric videos
often document the activities of one person in a single scene. Although some
movie datasets offer richer content, they are often limited to short-term
tasks, lack publicly available videos and frequently encounter data leakage
given the use of movie forums and other resources in LLM training. To address
the above limitations, we propose the Short Film Dataset (SFD) with 1,078
publicly available amateur movies, a wide variety of genres and minimal data
leakage issues. SFD offers long-term story-oriented video tasks in the form of
multiple-choice and open-ended question answering. Our extensive experiments
emphasize the need for long-term reasoning to solve SFD tasks. Notably, we find
strong signals in movie transcripts that lead to on-par performance between people
and LLMs. We also show that current models perform significantly worse than people
when using vision data alone.
Don't drop your samples! Coherence-aware training benefits Conditional diffusion
Conditional diffusion models are powerful generative models that can leverage
various types of conditional information, such as class labels, segmentation
masks, or text captions. However, in many real-world scenarios, conditional
information may be noisy or unreliable due to human annotation errors or weak
alignment. In this paper, we propose the Coherence-Aware Diffusion (CAD), a
novel method that integrates coherence in conditional information into
diffusion models, allowing them to learn from noisy annotations without
discarding data. We assume that each data point has an associated coherence
score that reflects the quality of the conditional information. We then
condition the diffusion model on both the conditional information and the
coherence score. In this way, the model learns to ignore or discount the
conditioning when the coherence is low. We show that CAD is theoretically sound
and empirically effective on various conditional generation tasks. Moreover, we
show that leveraging coherence generates realistic and diverse samples that
respect conditional information better than models trained on cleaned datasets
where samples with low coherence have been discarded. Comment: Accepted at CVPR 2024 as a Highlight. Project page:
https://nicolas-dufour.github.io/cad.htm
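One hedged way to picture the conditioning described above: embed the coherence score and fuse it with the condition embedding before it reaches the denoiser, so the model can learn to discount low-coherence conditions. The concat-and-project fusion and the dimensions below are assumptions, not the paper's exact design.

```python
# Sketch of coherence-aware conditioning: the conditioning vector is paired with
# an embedding of its coherence score before being fed to the denoiser.
import torch
import torch.nn as nn

class CoherenceAwareCondition(nn.Module):
    def __init__(self, cond_dim=768, coh_dim=64, out_dim=768):
        super().__init__()
        self.coh_embed = nn.Sequential(nn.Linear(1, coh_dim), nn.SiLU(), nn.Linear(coh_dim, coh_dim))
        self.proj = nn.Linear(cond_dim + coh_dim, out_dim)

    def forward(self, cond, coherence):
        # cond: (B, cond_dim), e.g. a caption embedding; coherence: (B,) in [0, 1]
        c = self.coh_embed(coherence.unsqueeze(-1))
        return self.proj(torch.cat([cond, c], dim=-1))   # conditioning passed to the denoiser

fuse = CoherenceAwareCondition()
cond = torch.randn(8, 768)
coherence = torch.rand(8)                                # noisy annotations get low scores
print(fuse(cond, coherence).shape)                       # torch.Size([8, 768])
```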
Reward Function Design for Crowd Simulation via Reinforcement Learning
Crowd simulation is important for video-game design, since it makes it possible to
populate virtual worlds with autonomous avatars that navigate in a human-like
manner. Reinforcement learning has shown great potential in simulating virtual
crowds, but the design of the reward function is critical to achieving
effective and efficient results. In this work, we explore the design of reward
functions for reinforcement learning-based crowd simulation. We provide
theoretical insights on the validity of certain reward functions according to
their analytical properties, and evaluate them empirically using a range of
scenarios, using energy efficiency as the metric. Our experiments show that
directly minimizing energy usage is a viable strategy as long as it is
paired with an appropriately scaled guiding potential, and they enable us to study
the impact of the different reward components on the behavior of the simulated
crowd. Our findings can inform the development of new crowd simulation
techniques, and contribute to the wider study of human-like navigation.
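A reward in the spirit described above might combine a per-step energy cost with a scaled guiding potential measuring progress toward the goal. The energy model and all constants in this sketch are assumptions used for illustration only, not the paper's exact formulation.

```python
# Illustrative per-step reward: negative energy expenditure plus a scaled
# guiding potential that rewards progress toward the goal.
import numpy as np

def step_reward(pos, prev_pos, vel, goal, dt=0.1, e_s=2.23, e_w=1.26, potential_scale=3.0):
    energy = (e_s + e_w * np.dot(vel, vel)) * dt            # assumed energy spent this step
    progress = np.linalg.norm(prev_pos - goal) - np.linalg.norm(pos - goal)
    return -energy + potential_scale * progress             # guiding potential term

prev_pos = np.array([0.0, 0.0])
vel = np.array([1.2, 0.0])
pos = prev_pos + vel * 0.1
goal = np.array([10.0, 0.0])
print(step_reward(pos, prev_pos, vel, goal))
```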
Collaborating Foundation Models for Domain Generalized Semantic Segmentation
Domain Generalized Semantic Segmentation (DGSS) deals with training a model
on a labeled source domain with the aim of generalizing to unseen domains
during inference. Existing DGSS methods typically effectuate robust features by
means of Domain Randomization (DR). Such an approach is often limited as it can
only account for style diversification and not content. In this work, we take
an orthogonal approach to DGSS and propose to use an assembly of CoLlaborative
FOUndation models for Domain Generalized Semantic Segmentation (CLOUDS). In
detail, CLOUDS is a framework that integrates FMs of various kinds: (i) CLIP
backbone for its robust feature representation, (ii) generative models to
diversify the content, thereby covering various modes of the possible target
distribution, and (iii) Segment Anything Model (SAM) for iteratively refining
the predictions of the segmentation model. Extensive experiments show that our
CLOUDS excels in adapting from synthetic to real DGSS benchmarks and under
varying weather conditions, notably outperforming prior methods by 5.6% and
6.7% on averaged mIoU, respectively. The code is available at
https://github.com/yasserben/CLOUDS. Comment: Accepted to CVPR 2024.
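The collaboration can be sketched schematically, with the three foundation models abstracted as callables so only the control flow is shown. The function names and the self-training loop below are assumptions for illustration; the released repository contains the actual pipeline.

```python
# Schematic sketch: a generative model diversifies content, a CLIP-backboned
# segmenter predicts masks, a SAM-style refiner cleans them into pseudo-labels,
# and the segmenter is trained on the result.
import numpy as np

def clouds_style_round(images, segment, diversify, refine_masks, train_step):
    # segment:      segmentation model, image -> predicted mask
    # diversify:    generative model, image -> content-diversified variant
    # refine_masks: mask refiner, (image, mask) -> cleaner pseudo-label
    # train_step:   updates the segmentation model on (image, pseudo_label) pairs
    for img in images:
        aug = diversify(img)                       # cover more of the target distribution
        pseudo = refine_masks(aug, segment(aug))   # refined pseudo-label
        train_step(aug, pseudo)

# Dummy stand-ins just to show the call pattern.
imgs = [np.zeros((64, 64, 3)) for _ in range(4)]
clouds_style_round(
    imgs,
    segment=lambda x: np.zeros(x.shape[:2], dtype=int),
    diversify=lambda x: x + np.random.normal(0, 0.01, x.shape),
    refine_masks=lambda x, m: m,
    train_step=lambda x, y: None,
)
```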