BRACE: The Breakdancing Competition Dataset for Dance Motion Synthesis
Generative models for audio-conditioned dance motion synthesis map music
features to dance movements. Models are trained to associate motion patterns with
audio patterns, usually without explicit knowledge of the human body. This
approach relies on a few assumptions: strong music-dance correlation,
controlled motion data and relatively simple poses and movements. These
characteristics are found in all existing datasets for dance motion synthesis,
and indeed recent methods can achieve good results. We introduce a new dataset
aiming to challenge these common assumptions, compiling a set of dynamic dance
sequences displaying complex human poses. We focus on breakdancing which
features acrobatic moves and tangled postures. We source our data from the Red
Bull BC One competition videos. Estimating human keypoints from these videos is
difficult due to the complexity of the dance, as well as the recording setup with
multiple moving cameras. We adopt a hybrid labelling pipeline leveraging deep
estimation models as well as manual annotations to obtain good quality keypoint
sequences at a reduced cost. Our efforts produced the BRACE dataset, which
contains over 3 hours and 30 minutes of densely annotated poses. We test
state-of-the-art methods on BRACE, showing their limitations when evaluated on
complex sequences. Our dataset can readily foster advances in dance motion
synthesis. With intricate poses and swift movements, models are forced to go
beyond learning a mapping between modalities and to reason more effectively about
body structure and movements.
Comment: ECCV 2022. Dataset available at https://github.com/dmoltisanti/brac
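The hybrid labelling pipeline described above yields dense per-frame keypoint
sequences. The sketch below shows one plausible way to load such a sequence; the
file layout and field names (e.g. "keypoints", the JSON file name) are assumptions
for illustration, not the released BRACE format.

```python
import json
import numpy as np

def load_keypoint_sequence(path):
    """Stack per-frame 2D joints into a (frames, joints, 3) array of (x, y, confidence)."""
    with open(path) as f:
        frames = json.load(f)  # assumed: a list of per-frame keypoint records
    return np.stack([np.asarray(frame["keypoints"], dtype=np.float32)
                     for frame in frames])

seq = load_keypoint_sequence("sequence_0001.json")  # hypothetical file name
print(seq.shape)  # e.g. (T, 17, 3) for COCO-style joints
```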
AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions
This paper introduces a video dataset of spatio-temporally localized Atomic
Visual Actions (AVA). The AVA dataset densely annotates 80 atomic visual
actions in 430 15-minute video clips, where actions are localized in space and
time, resulting in 1.58M action labels with multiple labels per person
occurring frequently. The key characteristics of our dataset are: (1) the
definition of atomic visual actions, rather than composite actions; (2) precise
spatio-temporal annotations with possibly multiple annotations for each person;
(3) exhaustive annotation of these atomic actions over 15-minute video clips;
(4) people temporally linked across consecutive segments; and (5) using movies
to gather a varied set of action representations. This departs from existing
datasets for spatio-temporal action recognition, which typically provide sparse
annotations for composite actions in short video clips. We will release the
dataset publicly.
AVA, with its realistic scene and action complexity, exposes the intrinsic
difficulty of action recognition. To benchmark this, we present a novel
approach for action localization that builds upon current state-of-the-art
methods and demonstrates better performance on JHMDB and UCF101-24 categories.
While setting a new state of the art on existing datasets, the overall results
on AVA are low at 15.6% mAP, underscoring the need for developing new
approaches for video understanding.
Comment: To appear in CVPR 2018. Check the dataset page
https://research.google.com/ava/ for details.
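The annotation characteristics listed above (a person box localized in space and
time, possibly several atomic action labels per person, and temporal linking of
people across segments) can be captured in a small record type. The sketch below
is purely illustrative; its field names are assumptions and do not reflect the
released AVA annotation schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class PersonAnnotation:
    video_id: str                            # source movie clip identifier
    timestamp: float                         # keyframe time within the 15-minute clip (seconds)
    box: Tuple[float, float, float, float]   # (x1, y1, x2, y2), normalized coordinates
    action_labels: List[int] = field(default_factory=list)  # possibly several atomic actions
    person_id: int = -1                      # links the same person across consecutive segments

ann = PersonAnnotation("movie_0001", 902.0, (0.10, 0.20, 0.45, 0.95),
                       action_labels=[12, 17, 64], person_id=3)
print(len(ann.action_labels), "atomic actions for person", ann.person_id)
```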
Audio-Visual Embedding for Cross-Modal Music Video Retrieval through Supervised Deep CCA
Deep learning has shown excellent performance in learning joint
representations between different data modalities. Unfortunately, little
research focuses on cross-modal correlation learning where temporal structures
of different data modalities, such as audio and video, should be taken into
account. Retrieving music videos from a given audio query is a natural way to
search and interact with music content. In this work, we study cross-modal
music video retrieval in terms of emotion similarity. In particular, audio of
arbitrary length is used to retrieve a longer or full-length music video. To
this end, we propose a novel audio-visual embedding algorithm based on Supervised
Deep Canonical Correlation Analysis (S-DCCA), which projects audio and video into
a shared space to bridge the semantic gap between the two modalities. The embedding
also preserves the similarity between audio and visual content from different
videos sharing the same class label, as well as their temporal structure. The contribution
of our approach is twofold: i) we propose to select the top-k audio chunks with
an attention-based Long Short-Term Memory (LSTM) model, yielding a compact audio
summary that preserves local properties; ii) we propose an end-to-end deep model
for cross-modal audio-visual learning in which S-DCCA is trained to learn the
semantic correlation between the audio and visual modalities. Due to the lack of
a suitable music video dataset, we construct a dataset of 10K music videos from
the YouTube-8M dataset. Promising results in terms of MAP and precision-recall
show that our proposed model can be applied to music video retrieval.
Comment: 8 pages, 9 figures. Accepted by ISM 201
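As a rough illustration of contribution (i), the sketch below scores audio chunks
with an attention layer over LSTM states and keeps the top-k chunks as the audio
summary. Layer sizes, tensor shapes, and the omitted DCCA projection are assumptions
for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TopKChunkAttention(nn.Module):
    def __init__(self, feat_dim=128, hidden=64, k=4):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)  # one relevance score per chunk
        self.k = k

    def forward(self, chunks):            # chunks: (batch, n_chunks, feat_dim)
        states, _ = self.lstm(chunks)     # (batch, n_chunks, hidden)
        scores = self.attn(states).squeeze(-1)        # (batch, n_chunks)
        top_idx = scores.topk(self.k, dim=1).indices  # indices of the best chunks
        gather_idx = top_idx.unsqueeze(-1).expand(-1, -1, states.size(-1))
        return torch.gather(states, 1, gather_idx)    # (batch, k, hidden) summary

summary = TopKChunkAttention()(torch.randn(2, 10, 128))
print(summary.shape)  # torch.Size([2, 4, 64])
```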