SALSA: A Novel Dataset for Multimodal Group Behavior Analysis
Studying free-standing conversational groups (FCGs) in unstructured social
settings (e.g., a cocktail party) is gratifying due to the wealth of information
available at the group (mining social networks) and individual (recognizing
native behavioral and personality traits) levels. However, analyzing social
scenes involving FCGs is also highly challenging due to the difficulty in
extracting behavioral cues such as target locations, their speaking activity
and head/body pose due to crowdedness and presence of extreme occlusions. To
this end, we propose SALSA, a novel dataset facilitating multimodal and
Synergetic sociAL Scene Analysis, and make two main contributions to research
on automated social interaction analysis: (1) SALSA records social interactions
among 18 participants in a natural, indoor environment for over 60 minutes,
under the poster presentation and cocktail party contexts, presenting
difficulties in the form of low-resolution images, lighting variations,
numerous occlusions, reverberations and interfering sound sources; (2) To
alleviate these problems we facilitate multimodal analysis by recording the
social interplay using four static surveillance cameras and sociometric badges
worn by each participant, comprising a microphone, accelerometer, Bluetooth
and infrared sensors. In addition to raw data, we also provide annotations
concerning individuals' personality as well as their position, head, body
orientation and F-formation information over the entire event duration. Through
extensive experiments with state-of-the-art approaches, we show (a) the
limitations of current methods and (b) how the recorded multiple cues
synergetically aid automatic analysis of social interactions. SALSA is
available at http://tev.fbk.eu/salsa.
Comment: 14 pages, 11 figures
Windows Vista media center controlled by speech
Master's project in Informatics Engineering, submitted to the Universidade de Lisboa through the Faculdade de Ciências, 200
Language-Guided Audio-Visual Source Separation via Trimodal Consistency
We propose a self-supervised approach for learning to perform audio source
separation in videos based on natural language queries, using only unlabeled
video and audio pairs as training data. A key challenge in this task is
learning to associate the linguistic description of a sound-emitting object to
its visual features and the corresponding components of the audio waveform, all
without access to annotations during training. To overcome this challenge, we
adapt off-the-shelf vision-language foundation models to provide pseudo-target
supervision via two novel loss functions and encourage a stronger alignment
between the audio, visual and natural language modalities. During inference,
our approach can separate sounds given text, video and audio input, or given
text and audio input alone. We demonstrate the effectiveness of our
self-supervised approach on three audio-visual separation datasets, including
MUSIC, SOLOS and AudioSet, where we outperform state-of-the-art strongly
supervised approaches despite not using object detectors or text labels during
training.
Comment: Accepted at CVPR 202
Teaching a Robotic Child - Machine Learning Strategies for a Humanoid Robot from Social Interactions
- …