Audio-Visual Learning for Scene Understanding
Multimodal deep learning aims to combine the complementary information of different modalities. Among all modalities, audio and vision are the predominant ones humans use to explore the world. This thesis therefore focuses on audio-visual deep learning, designing networks that mimic how humans perceive the world.
Our research covers images, audio signals, and acoustic images. The latter provide spatial audio information: they are obtained from a planar array of microphones by combining the raw audio signals with a beamforming algorithm. Acoustic images better mimic the human auditory system, whose spatial sound cues cannot be replicated with a single microphone.
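To make the acoustic-image idea concrete, here is a minimal delay-and-sum beamforming sketch mapping multichannel recordings to a per-direction energy map. The array geometry, sample rate, and steering grid are illustrative assumptions, not the thesis's actual configuration.

```python
import numpy as np

def acoustic_image(signals, mic_xy, directions, fs, c=343.0):
    """Map multichannel audio to a coarse 'acoustic image' of energy per direction.

    signals:    (n_mics, n_samples) raw waveforms
    mic_xy:     (n_mics, 2) microphone positions on the array plane [m]
    directions: (H, W, 3) unit look vectors, one per image pixel
    """
    n_mics, n_samples = signals.shape
    spectra = np.fft.rfft(signals, axis=1)               # (n_mics, n_freq)
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)       # (n_freq,)
    H, W, _ = directions.shape
    image = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            d = directions[i, j]
            # Per-mic delay of a plane wave arriving from direction d
            # (mics lie in the z = 0 plane, so only d's in-plane part matters).
            delays = mic_xy @ d[:2] / c                  # (n_mics,)
            phases = np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
            beam = (spectra * phases).sum(axis=0)        # steered sum
            image[i, j] = np.sum(np.abs(beam) ** 2)      # beam energy
    return image / image.max()
```

Directions with coherent, delay-aligned energy across the array light up in the map, which is exactly the spatial cue a single microphone cannot provide.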
However, since microphone arrays are not widespread, we also study how to handle the missing spatialized audio modality at test time.
As a solution, we propose to distill the content of acoustic images into audio features during training, so that their absence at test time can be handled. We do this for supervised audio classification using the generalized distillation framework, which we also extend to self-supervised learning.
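A hedged sketch of the generalized-distillation objective in this privileged-modality setup: a teacher trained on acoustic images produces soft targets that an audio-only student imitates alongside the true labels. The network definitions and the imitation weight `lam` are placeholders, not the thesis's actual models.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, lam=0.5):
    """Generalized distillation objective:
    (1 - lam) * hard-label loss + lam * soft-label imitation at temperature T."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # standard rescaling so gradients stay comparable across T
    return (1 - lam) * hard + lam * soft

# Training step (teacher frozen; only the audio student is updated):
# with torch.no_grad():
#     teacher_logits = teacher(acoustic_images)
# loss = distillation_loss(student(audio), teacher_logits, labels)
```

At test time only the audio student is used, which is what makes the missing acoustic-image modality tolerable.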
Next, we devise a method for reconstructing acoustic images from a single microphone and an RGB frame. Thus, given only a standard video, we can synthesize spatial audio, which is useful for many audio-visual tasks, including sound localization.
Lastly, as another example of restoring one modality from those available, we inpaint degraded images using audio features, reconstructing the missing region so that it is not only visually plausible but also semantically consistent with the accompanying sound. This also covers cross-modal generation in the limit case of a completely missing or hidden visual modality: our method handles it naturally, generating images from sound alone.
In summary, we show how audio can help visual learning and vice versa, by transferring knowledge between the two modalities at training time in order to distill, reconstruct, or restore the missing modality at test time.
Separating Invisible Sounds Toward Universal Audiovisual Scene-Aware Sound Separation
Audio-visual sound separation typically assumes that the sounding sources are visible in the video, which excludes invisible sounds originating beyond the camera's view; current methods struggle with such sounds because they lack visible cues. This paper introduces a novel "Audio-Visual Scene-Aware Separation" (AVSA-Sep) framework. It includes a semantic parser for visible and invisible sounds and a separator for scene-informed separation. AVSA-Sep successfully separates both sound types, with joint training and cross-modal alignment enhancing effectiveness.
Comment: Accepted at ICCV 2023 - AV4D; 4 figures, 3 tables
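One way to picture scene-informed separation of a source without visible cues is a mask network conditioned on a per-source semantic embedding. The sketch below is a generic FiLM-conditioned separator under that assumption, not the AVSA-Sep architecture itself.

```python
import torch
import torch.nn as nn

class ConditionedSeparator(nn.Module):
    def __init__(self, n_freq=512, emb_dim=128, hidden=256):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(n_freq, hidden), nn.ReLU())
        self.film = nn.Linear(emb_dim, 2 * hidden)  # FiLM-style conditioning
        self.decode = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, mix_mag, source_emb):
        """mix_mag:    (B, T, n_freq) mixture magnitude spectrogram
        source_emb: (B, emb_dim) semantic embedding of the target source
        Returns a (B, T, n_freq) soft mask; masked mixture ~= target source."""
        h = self.encode(mix_mag)                        # (B, T, hidden)
        gamma, beta = self.film(source_emb).chunk(2, dim=-1)
        h = gamma.unsqueeze(1) * h + beta.unsqueeze(1)  # condition on source
        return self.decode(h)

# mask = model(mix_mag, emb); est_source = mask * mix_mag
```

Because the conditioning signal is a semantic embedding rather than a visual crop, the same mechanism can target a source whether or not it appears on screen.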
Audiovisual Saliency Prediction in Uncategorized Video Sequences based on Audio-Video Correlation
Substantial research has been done in saliency modeling to develop intelligent machines that can perceive and interpret their surroundings. However, existing models treat videos as mere image sequences, excluding any audio information and thus unable to cope with their inherently varying content. Based on the hypothesis that an audiovisual saliency model will improve over traditional saliency models for natural, uncategorized videos, this work aims to provide a generic audiovisual saliency model that augments a visual saliency map with an audio saliency map computed by synchronizing low-level audio and visual features. The proposed model was evaluated against eye-fixation data on the publicly available DIEM video dataset using several criteria. The results show that the model outperformed two state-of-the-art visual saliency models.
Comment: 9 pages, 2 figures, 4 tables
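A minimal sketch of the fusion idea: correlate the audio energy envelope with per-pixel visual change over a temporal window, and use that synchrony map to boost the visual saliency map. The window size and fusion rule are illustrative assumptions, not the paper's exact model.

```python
import numpy as np

def audiovisual_saliency(frames, audio_env, visual_sal, eps=1e-8):
    """frames:     (T, H, W) grayscale frames
    audio_env:  (T,) per-frame audio energy envelope
    visual_sal: (H, W) visual saliency map for the last frame"""
    motion = np.abs(np.diff(frames, axis=0))       # (T-1, H, W) frame change
    a = audio_env[1:] - audio_env[1:].mean()       # centered audio envelope
    m = motion - motion.mean(axis=0, keepdims=True)
    # Per-pixel Pearson correlation between audio and visual change.
    corr = (a[:, None, None] * m).sum(axis=0)
    corr /= (np.linalg.norm(a) * np.linalg.norm(m, axis=0) + eps)
    audio_sal = np.clip(corr, 0, None)             # keep positive synchrony only
    fused = visual_sal * (1.0 + audio_sal)         # audio boosts synced pixels
    return fused / (fused.max() + eps)
```

Pixels whose motion tracks the soundtrack (a talking mouth, a bouncing ball) get amplified, which is the intended advantage over purely visual saliency.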
High resolution imaging at large telescopes
Image recovery at a resolution limited only by diffraction is now possible at large telescopes. The theory of speckle image reconstruction is explained, and the current status of a video recording and digitization system for the reconstruction procedure is described. Potential applications of the process with very large telescopes are discussed, and the constraints these techniques impose on telescope design are listed.
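A hedged sketch of the classical Labeyrie speckle-interferometry step underlying such reconstruction: averaging the power spectra of many short exposures preserves diffraction-limited information that a long exposure smears out, and dividing by an unresolved reference star's average power spectrum (assumed available) calibrates out the atmosphere. Variable names are illustrative.

```python
import numpy as np

def speckle_power_spectrum(frames):
    """frames: (N, H, W) short-exposure speckle images of the same object.
    Returns the average power spectrum <|FFT(frame)|^2>."""
    spectra = np.fft.fft2(frames, axes=(-2, -1))
    return np.mean(np.abs(spectra) ** 2, axis=0)

def object_autocorrelation(obj_frames, ref_frames, eps=1e-12):
    """Estimate the object's autocorrelation at diffraction-limited resolution
    by calibrating against an unresolved reference star."""
    ratio = speckle_power_spectrum(obj_frames) / (
        speckle_power_spectrum(ref_frames) + eps
    )
    return np.real(np.fft.ifft2(ratio))  # inverse FFT of |O(f)|^2
```

This recovers the object's autocorrelation; recovering the image itself additionally requires phase-retrieval steps beyond this sketch.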
Geometry-aware multi-task learning for binaural audio generation from video
Human audio perception is inherently spatial, and videos with binaural audio simulate this spatial experience by delivering different sounds to the two ears. However, videos are typically recorded with mono audio and hence generally do not offer the rich experience of binaural audio. We propose an audio spatialization method that uses the visual information in videos to convert mono audio to binaural. We leverage the spatial and geometric information about the audio present in the video's visuals to guide the learning process, learning these geometry-aware visual features in a multi-task manner to generate rich binaural audio. We also generate a large video dataset with binaural audio in photorealistic environments to better understand and evaluate the task. Extensive evaluation on two datasets demonstrates that our method generates better binaural audio by learning more spatially coherent visual features.
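A common formulation of mono-to-binaural conversion predicts the left-right difference signal from the mono mixture plus visual features; the sketch below follows that general recipe with a placeholder fusion network, and should not be read as this paper's multi-task model.

```python
import torch
import torch.nn as nn

class Mono2Binaural(nn.Module):
    def __init__(self, n_freq=512, vis_dim=512, hidden=256):
        super().__init__()
        self.audio_enc = nn.Linear(2 * n_freq, hidden)  # re/im of mono STFT
        self.vis_proj = nn.Linear(vis_dim, hidden)
        self.head = nn.Linear(hidden, 2 * n_freq)       # re/im of (L - R)

    def forward(self, mono_stft, vis_feat):
        """mono_stft: (B, T, 2*n_freq) mono spectrogram (real/imag stacked)
        vis_feat:  (B, vis_dim) visual (e.g. geometry-aware) features"""
        h = torch.relu(
            self.audio_enc(mono_stft) + self.vis_proj(vis_feat).unsqueeze(1)
        )
        diff = self.head(h)  # predicted left-right difference spectrogram
        # Recombine: left = (mono + diff) / 2, right = (mono - diff) / 2.
        return (mono_stft + diff) / 2, (mono_stft - diff) / 2
```

Predicting only the difference channel keeps the mono content intact by construction and lets the visual stream focus on where sound should come from.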
Cross-modal Generative Model for Visual-Guided Binaural Stereo Generation
Binaural stereo audio is recorded by imitating the way the human ear receives sound, providing listeners with an immersive experience. Existing approaches leverage autoencoders and exploit visual spatial information directly to synthesize binaural stereo, resulting in a limited representation of visual guidance. For the first time, we propose a visually guided generative adversarial approach for generating binaural stereo audio from mono audio. Specifically, we develop a Stereo Audio Generation Model (SAGM), which uses shared spatio-temporal visual information to guide the generator and the discriminator separately. The shared visual information is updated alternately during adversarial training, allowing the generator and discriminator to deliver their respective guided knowledge while sharing it visually. The proposed method thus learns bidirectional complementary visual information, which facilitates the expression of visual guidance in generation. In addition, spatial perception is a crucial attribute of binaural stereo audio, so evaluating stereo spatial perception is essential; however, previous metrics fail to measure the spatial perception of audio. To this end, we propose, for the first time, a metric that measures the spatial perception of audio. The proposed metric can measure the magnitude and direction of spatial perception along the temporal dimension and, given this function, can to some extent replace demanding user studies. The proposed method achieves state-of-the-art performance on two datasets and five evaluation metrics, and qualitative experiments and user studies demonstrate that it generates spatially realistic stereo audio.
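The abstract does not specify the proposed metric, so as an illustrative stand-in, here is a simple interaural-level-difference (ILD) trajectory that captures the two quantities the abstract names: the direction (sign) and magnitude of spatial perception per time window. The window length is an arbitrary choice.

```python
import numpy as np

def ild_trajectory(left, right, fs, win_s=0.05, eps=1e-12):
    """Per-window ILD in dB; positive = left-dominant, negative = right-dominant."""
    win = int(fs * win_s)
    n = min(len(left), len(right)) // win
    ild = np.empty(n)
    for k in range(n):
        l = left[k * win:(k + 1) * win]
        r = right[k * win:(k + 1) * win]
        ild[k] = 10 * np.log10((np.sum(l**2) + eps) / (np.sum(r**2) + eps))
    return ild

# A generated clip can then be scored against ground truth, e.g.
# err = np.mean(np.abs(ild_trajectory(gl, gr, fs) - ild_trajectory(tl, tr, fs)))
```

A trajectory-level comparison of this kind is what would let an automatic score partially substitute for listening tests, as the abstract suggests of its own metric.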