Few-Shot Audio-Visual Learning of Environment Acoustics
Room impulse response (RIR) functions capture how the surrounding physical
environment transforms the sounds heard by a listener, with implications for
various applications in AR, VR, and robotics. Whereas traditional methods to
estimate RIRs assume dense geometry and/or sound measurements throughout the
environment, we explore how to infer RIRs based on a sparse set of images and
echoes observed in the space. Towards that goal, we introduce a
transformer-based method that uses self-attention to build a rich acoustic
context, then predicts RIRs of arbitrary query source-receiver locations
through cross-attention. Additionally, we design a novel training objective
that improves the match in the acoustic signature between the RIR predictions
and the targets. In experiments using a state-of-the-art audio-visual simulator
for 3D environments, we demonstrate that our method successfully generates
arbitrary RIRs, outperforming state-of-the-art methods and -- in a major
departure from traditional methods -- generalizing to novel environments in a
few-shot manner. Project: http://vision.cs.utexas.edu/projects/fs_rir. Comment: Accepted to NeurIPS 202
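A minimal PyTorch sketch of the attention pattern described above (an illustration, not the authors' implementation): self-attention builds an acoustic context from the sparse observation embeddings, and each query source-receiver pair cross-attends to that context to decode an RIR. The token dimension, the 6-D location encoding, and the RIR length are assumptions.

    # Hedged sketch: module names and sizes are illustrative assumptions.
    import torch
    import torch.nn as nn

    class FewShotRIRSketch(nn.Module):
        def __init__(self, d_model=256, n_heads=4, rir_len=4096):
            super().__init__()
            enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.context_encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
            self.query_proj = nn.Linear(6, d_model)       # source xyz + receiver xyz
            self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.rir_head = nn.Linear(d_model, rir_len)   # decode attended features to an RIR

        def forward(self, obs_tokens, src_rcv):
            # obs_tokens: (B, N, d_model) fused image/echo embeddings of N observations
            # src_rcv:    (B, Q, 6) arbitrary query source-receiver locations
            ctx = self.context_encoder(obs_tokens)        # self-attention: rich acoustic context
            q = self.query_proj(src_rcv)
            attended, _ = self.cross_attn(q, ctx, ctx)    # queries attend to the context
            return self.rir_head(attended)                # (B, Q, rir_len) predicted RIRs

    rirs = FewShotRIRSketch()(torch.randn(2, 8, 256), torch.rand(2, 3, 6))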
Semi-Supervised Sound Source Localization Based on Manifold Regularization
Conventional speaker localization algorithms, based merely on the received
microphone signals, are often sensitive to adverse conditions such as high
reverberation or a low signal-to-noise ratio (SNR). In some scenarios, e.g. in
meeting rooms or cars, it can be assumed that the source position is confined
to a predefined area, and the acoustic parameters of the environment are
approximately fixed. Such scenarios give rise to the assumption that the
acoustic samples from the region of interest have a distinct geometrical
structure. In this paper, we show that the high-dimensional acoustic samples
indeed lie on a low-dimensional manifold and can be embedded into a
low-dimensional space. Motivated by this result, we propose a semi-supervised
source localization algorithm which recovers the inverse mapping between the
acoustic samples and their corresponding locations. The idea is to use an
optimization framework based on manifold regularization, which imposes
smoothness constraints on possible solutions with respect to the manifold. The
proposed algorithm, termed Manifold Regularization for Localization (MRL), is
implemented in an adaptive manner. The initialization is conducted with only a
few labelled samples attached to their respective source locations, and the
system is then gradually adapted as new unlabelled samples (with unknown source
locations) are received. Experimental results show superior localization
performance when compared with a recently presented algorithm based on a
manifold learning approach and with the generalized cross-correlation (GCC)
algorithm as a baseline.
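A compact numpy sketch of the manifold-regularization idea behind such an approach (Laplacian-regularized least squares; not the published MRL algorithm, and without its adaptive updates): the mapping from acoustic features to source locations is fit on the few labelled samples, while the unlabelled samples constrain it, via a graph Laplacian, to vary smoothly along the manifold. The RBF affinities and the regularization weights are assumptions.

    import numpy as np

    def rbf_kernel(X, Y, gamma=1.0):
        d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)

    def laplacian_rls(X, y, n_labeled, lam_a=1e-2, lam_i=1e-1, gamma=1.0):
        """X: (n, d) acoustic features; the first n_labeled rows have locations y: (n_labeled, p)."""
        n = X.shape[0]
        K = rbf_kernel(X, X, gamma)                  # kernel over labelled + unlabelled samples
        W = rbf_kernel(X, X, gamma)                  # affinity graph (here: the same RBF weights)
        L = np.diag(W.sum(1)) - W                    # graph Laplacian encoding the manifold
        J = np.zeros((n, n)); J[:n_labeled, :n_labeled] = np.eye(n_labeled)
        Y = np.zeros((n, y.shape[1])); Y[:n_labeled] = y
        # Closed-form Laplacian-RLS solution: fit labelled points, stay smooth on the graph.
        A = J @ K + lam_a * n_labeled * np.eye(n) + lam_i * (n_labeled / n ** 2) * (L @ K)
        alpha = np.linalg.solve(A, Y)
        return lambda Xq: rbf_kernel(Xq, X, gamma) @ alpha   # maps new samples to locations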
GWA: A Large High-Quality Acoustic Dataset for Audio Processing
We present the Geometric-Wave Acoustic (GWA) dataset, a large-scale audio
dataset of over 2 million synthetic room impulse responses (IRs) and their
corresponding detailed geometric and simulation configurations. Our dataset
samples acoustic environments from over 6.8K diverse, high-quality, and
professionally designed houses represented as semantically labeled 3D meshes.
We also present a novel real-world acoustic materials assignment scheme based
on semantic matching that uses a sentence transformer model. We compute
high-quality impulse responses corresponding to accurate low-frequency and
high-frequency wave effects by automatically calibrating geometric acoustic
ray-tracing with a finite-difference time-domain wave solver. We demonstrate
the higher accuracy of our IRs by comparing them with recorded IRs from complex
real-world environments. The code and the full dataset will be released at the
time of publication. Moreover, we highlight the benefits of GWA on audio deep
learning tasks such as automated speech recognition, speech enhancement, and
speech separation. We observe significant improvements over prior synthetic IR
datasets in all tasks when using our dataset. Comment: Project webpage https://gamma.umd.edu/pro/sound/gw
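The semantic material assignment can be pictured with a short sketch (an assumption about the mechanism, not the GWA pipeline itself): mesh semantic labels and acoustic-material names are embedded with a sentence transformer, and each surface receives the material with the highest cosine similarity. The model name and the material list below are placeholders.

    from sentence_transformers import SentenceTransformer, util

    mesh_labels = ["wooden floor", "sofa", "window glass"]          # semantic labels on the 3D mesh
    materials = ["carpet", "hardwood", "fabric upholstery", "glass pane", "painted brick"]

    model = SentenceTransformer("all-MiniLM-L6-v2")
    label_emb = model.encode(mesh_labels, convert_to_tensor=True)
    mat_emb = model.encode(materials, convert_to_tensor=True)

    sim = util.cos_sim(label_emb, mat_emb)          # (num_labels, num_materials) similarities
    for label, idx in zip(mesh_labels, sim.argmax(dim=1)):
        print(f"{label} -> {materials[int(idx)]}")  # surface inherits this material's acoustics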
Contrastive Representation Learning for Acoustic Parameter Estimation
A study is presented in which a contrastive learning approach is used to
extract low-dimensional representations of the acoustic environment from
single-channel, reverberant speech signals. Convolution of room impulse
responses (RIRs) with anechoic source signals is leveraged as a data
augmentation technique that offers considerable flexibility in the design of
the upstream task. We evaluate the embeddings across three different downstream
tasks: regression of the acoustic parameters reverberation time (RT60) and
clarity index (C50), and classification of rooms as small or large.
We demonstrate that the learned representations generalize well to unseen data
and perform similarly to a fully supervised baseline. Comment: Accepted for ICASSP 2023, Camera-ready version
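A minimal sketch of the augmentation the abstract describes (details such as segment lengths are assumptions): convolving one RIR with two different anechoic utterances yields a positive pair whose only shared factor is the acoustic environment, which is what the contrastive embedding is pushed to encode.

    import numpy as np
    from scipy.signal import fftconvolve

    def reverberant_pair(rir, anechoic_a, anechoic_b):
        """Two reverberant signals sharing one room impulse response (a positive pair)."""
        x_a = fftconvolve(anechoic_a, rir, mode="full")[: len(anechoic_a)]
        x_b = fftconvolve(anechoic_b, rir, mode="full")[: len(anechoic_b)]
        return x_a, x_b

    # Toy usage with random stand-ins for an RIR and two dry speech excerpts.
    rng = np.random.default_rng(0)
    rir = rng.standard_normal(8000) * np.exp(-np.arange(8000) / 1600)   # decaying tail
    pair = reverberant_pair(rir, rng.standard_normal(16000), rng.standard_normal(16000))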
AdVerb: Visually Guided Audio Dereverberation
We present AdVerb, a novel audio-visual dereverberation framework that uses
visual cues in addition to the reverberant sound to estimate clean audio.
Although audio-only dereverberation is a well-studied problem, our approach
incorporates the complementary visual modality to perform audio
dereverberation. Given an image of the environment where the reverberated sound
signal has been recorded, AdVerb employs a novel geometry-aware cross-modal
transformer architecture that captures scene geometry and audio-visual
cross-modal relationship to generate a complex ideal ratio mask, which, when
applied to the reverberant audio, predicts the clean sound. The effectiveness of
our method is demonstrated through extensive quantitative and qualitative
evaluations. Our approach significantly outperforms traditional audio-only and
audio-visual baselines on three downstream tasks: speech enhancement, speech
recognition, and speaker verification, with relative improvements in the range
of 18% - 82% on the LibriSpeech test-clean set. We also achieve highly
satisfactory RT60 error scores on the AVSpeech dataset. Comment: Accepted at ICCV 2023. For project page, see
https://gamma.umd.edu/researchdirections/speech/adver
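The final masking step can be sketched as follows (the mask below is a placeholder; in AdVerb it is predicted by the geometry-aware cross-modal transformer, and the STFT parameters here are assumptions): the complex ideal ratio mask multiplies the reverberant signal's STFT, and the result is inverted back to a waveform.

    import torch

    def apply_cirm(reverb, mask_real, mask_imag, n_fft=512, hop=128):
        """reverb: (T,) waveform; mask_*: (freq, frames) predicted mask components."""
        window = torch.hann_window(n_fft)
        spec = torch.stft(reverb, n_fft, hop, window=window, return_complex=True)
        clean_spec = torch.complex(mask_real, mask_imag) * spec    # complex-valued masking
        return torch.istft(clean_spec, n_fft, hop, window=window, length=reverb.shape[-1])

    # Toy usage: an all-pass mask returns (approximately) the input waveform.
    x = torch.randn(16000)
    spec = torch.stft(x, 512, 128, window=torch.hann_window(512), return_complex=True)
    enhanced = apply_cirm(x, torch.ones_like(spec.real), torch.zeros_like(spec.real))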
AV-NeRF: Learning Neural Fields for Real-World Audio-Visual Scene Synthesis
Can machines recording an audio-visual scene produce realistic, matching
audio-visual experiences at novel positions and novel view directions? We
answer this question by studying a new task -- real-world audio-visual scene synthesis --
and a first-of-its-kind NeRF-based approach for multimodal learning.
Concretely, given a video recording of an audio-visual scene, the task is to
synthesize new videos with spatial audio along arbitrary novel camera
trajectories in that scene. We propose an acoustic-aware audio generation
module that integrates prior knowledge of audio propagation into NeRF, in which
we implicitly associate audio generation with the 3D geometry and material
properties of a visual environment. Furthermore, we present a coordinate
transformation module that expresses a view direction relative to the sound
source, enabling the model to learn sound source-centric acoustic fields. To
facilitate the study of this new task, we collect a high-quality Real-World
Audio-Visual Scene (RWAVS) dataset. We demonstrate the advantages of our method
on this real-world dataset and the simulation-based SoundSpaces dataset. Comment: NeurIPS 202
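A small numpy sketch of what a sound-source-centric coordinate transformation might compute (an assumption about the module, not the released code): the listener pose is re-expressed as a distance to the source and a view direction relative to the source, so the acoustic field can be learned in a source-centered frame.

    import numpy as np

    def source_centric_coords(listener_pos, listener_yaw, source_pos):
        rel = np.asarray(source_pos) - np.asarray(listener_pos)   # listener-to-source vector
        dist = np.linalg.norm(rel)
        angle_to_source = np.arctan2(rel[1], rel[0])
        # View direction relative to the source: 0 means looking straight at it.
        rel_view = np.arctan2(np.sin(listener_yaw - angle_to_source),
                              np.cos(listener_yaw - angle_to_source))
        return dist, rel_view

    print(source_centric_coords([2.0, 1.0, 1.5], np.pi / 2, [0.0, 0.0, 1.5]))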
iPhonMatchNet: Zero-Shot User-Defined Keyword Spotting Using Implicit Acoustic Echo Cancellation
In response to the increasing interest in human--machine communication across
various domains, this paper introduces a novel approach called iPhonMatchNet,
which addresses the challenge of barge-in scenarios, wherein user speech
overlaps with device playback audio, thereby creating a self-referencing
problem. The proposed model leverages implicit acoustic echo cancellation
(iAEC) techniques to increase the efficiency of user-defined keyword spotting
models, achieving a remarkable 95% reduction in mean absolute error with a
minimal increase in model size (0.13%) compared to the baseline model,
PhonMatchNet. We also present an efficient model structure and demonstrate its
capability to learn iAEC functionality without requiring a clean signal. The
findings of our study indicate that the proposed model achieves competitive
performance in real-world deployment conditions of smart devices. Comment: Submitted to ICASSP 202
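The barge-in setup can be pictured with a toy PyTorch sketch (an assumption for illustration only; it omits the phoneme-level keyword enrollment of PhonMatchNet): the model receives both the microphone mixture, which contains user speech plus the device playback echo, and the playback reference, and is trained to ignore the echo implicitly instead of relying on a separate AEC front end.

    import torch
    import torch.nn as nn

    class ImplicitAECKWSSketch(nn.Module):
        def __init__(self, feat_dim=40, hidden=64):
            super().__init__()
            self.mix_enc = nn.GRU(feat_dim, hidden, batch_first=True)
            self.ref_enc = nn.GRU(feat_dim, hidden, batch_first=True)
            self.score = nn.Linear(2 * hidden, 1)          # keyword present / absent

        def forward(self, mic_feats, playback_feats):
            _, h_mix = self.mix_enc(mic_feats)             # encodes the mixture with echo
            _, h_ref = self.ref_enc(playback_feats)        # encodes what the device is playing
            joint = torch.cat([h_mix[-1], h_ref[-1]], dim=-1)
            return torch.sigmoid(self.score(joint))

    prob = ImplicitAECKWSSketch()(torch.randn(2, 100, 40), torch.randn(2, 100, 40))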
Learning sound representations using trainable COPE feature extractors
Sound analysis research has mainly focused on speech and music
processing. The deployed methodologies are not suited to the analysis of sounds
with varying background noise, in many cases with very low signal-to-noise
ratio (SNR). In this paper, we present a method for the detection of patterns
of interest in audio signals. We propose novel trainable feature extractors,
which we call COPE (Combination of Peaks of Energy). The structure of a COPE
feature extractor is determined using a single prototype sound pattern in an
automatic configuration process, which is a type of representation learning. We
construct a set of COPE feature extractors, configured on a number of training
patterns. Then we take their responses to build feature vectors that we use in
combination with a classifier to detect and classify patterns of interest in
audio signals. We carried out experiments on four public data sets: MIVIA audio
events, MIVIA road events, ESC-10 and TU Dortmund data sets. The results that
we achieved (recognition rate equal to 91.71% on the MIVIA audio events, 94% on
the MIVIA road events, 81.25% on the ESC-10 and 94.27% on the TU Dortmund)
demonstrate the effectiveness of the proposed method and are higher than the
ones obtained by other existing approaches. The COPE feature extractors are
highly robust to variations in SNR. Real-time performance is achieved even
when a large number of feature values is computed. Comment: Accepted for publication in Pattern Recognition
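A rough sketch of the COPE idea (an assumption about the configuration step, not the published implementation): the strongest time-frequency bins of a single prototype sound stand in for energy peaks, their positions relative to a reference peak define the extractor, and the response to a new sound combines the energy found at those stored offsets while sliding over time.

    import numpy as np
    from scipy.signal import spectrogram

    def configure_cope(prototype, fs, n_peaks=10):
        _, _, S = spectrogram(prototype, fs)
        flat = np.argsort(S, axis=None)[-n_peaks:]                 # strongest bins ~ energy peaks
        peaks = np.column_stack(np.unravel_index(flat, S.shape))   # (freq_bin, frame) pairs
        ref = peaks[np.argmax(S[peaks[:, 0], peaks[:, 1]])]        # strongest peak as reference
        return ref[0], peaks - ref                                 # reference band + relative offsets

    def cope_response(signal, fs, ref_freq, offsets):
        _, _, S = spectrogram(signal, fs)
        nf, nt = S.shape
        best = 0.0
        for t in range(nt):                                        # slide the reference point in time
            vals = [S[np.clip(ref_freq + df, 0, nf - 1), np.clip(t + dt, 0, nt - 1)]
                    for df, dt in offsets]
            best = max(best, float(np.mean(vals)))                 # combined peak evidence
        return best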
Deep Learning for Distant Speech Recognition
Deep learning is an emerging technology that is considered one of the most
promising directions for reaching higher levels of artificial intelligence.
Among the other achievements, building computers that understand speech
represents a crucial leap towards intelligent machines. Despite the great
efforts of the past decades, however, a natural and robust human-machine speech
interaction still appears to be out of reach, especially when users interact
with a distant microphone in noisy and reverberant environments. These
disturbances severely hamper the intelligibility of a speech signal, making
Distant Speech Recognition (DSR) one of the major open challenges in the field.
This thesis addresses this scenario and proposes novel techniques,
architectures, and algorithms to improve the robustness of distant-talking
acoustic models. We first elaborate on methodologies for realistic data
contamination, with a particular emphasis on DNN training with simulated data.
We then investigate approaches for better exploiting speech contexts,
proposing some original methodologies for both feed-forward and recurrent
neural networks. Lastly, inspired by the idea that cooperation across different
DNNs could be the key to counteracting the harmful effects of noise and
reverberation, we propose a novel deep learning paradigm called network of deep
neural networks. The analysis of the original concepts was based on extensive
experimental validations conducted on both real and simulated data, considering
different corpora, microphone configurations, environments, noisy conditions,
and ASR tasks. Comment: PhD Thesis Unitn, 201
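The data-contamination step mentioned above can be summarized in a short sketch (parameter values are assumptions): clean close-talking speech is convolved with a room impulse response and mixed with noise at a chosen SNR before being used to train the distant-talking acoustic model.

    import numpy as np
    from scipy.signal import fftconvolve

    def contaminate(clean, rir, noise, snr_db):
        reverberant = fftconvolve(clean, rir, mode="full")[: len(clean)]
        noise = noise[: len(clean)]
        # Scale the noise to reach the requested SNR relative to the reverberant speech.
        p_sig = np.mean(reverberant ** 2)
        p_noise = np.mean(noise ** 2) + 1e-12
        noise = noise * np.sqrt(p_sig / (p_noise * 10 ** (snr_db / 10)))
        return reverberant + noise

    rng = np.random.default_rng(0)
    distant = contaminate(rng.standard_normal(16000),
                          np.exp(-np.arange(4000) / 800) * rng.standard_normal(4000),
                          rng.standard_normal(16000), snr_db=10)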