Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments
Eliminating the negative effect of non-stationary environmental noise is a long-standing research topic for automatic speech recognition that still remains an important challenge. Data-driven supervised approaches, including
ones based on deep neural networks, have recently emerged as potential
alternatives to traditional unsupervised approaches and, with sufficient training, can alleviate the shortcomings of the unsupervised methods in various
real-life acoustic environments. In this light, we review recently developed,
representative deep learning approaches for tackling non-stationary additive
and convolutional degradation of speech with the aim of providing guidelines
for those involved in the development of environmentally robust speech
recognition systems. We separately discuss single- and multi-channel techniques
developed for the front-end and back-end of speech recognition systems, as well
as joint front-end and back-end training frameworks.
Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition
Accurate recognition of cocktail party speech containing overlapping
speakers, noise and reverberation remains a highly challenging task to date.
Motivated by the invariance of visual modality to acoustic signal corruption,
an audio-visual multi-channel speech separation, dereverberation and
recognition approach featuring a full incorporation of visual information into
all system components is proposed in this paper. The efficacy of the video
input is consistently demonstrated in mask-based MVDR speech separation,
DNN-WPE or spectral mapping (SpecM) based speech dereverberation front-end and
Conformer ASR back-end. Audio-visual integrated front-end architectures
performing speech separation and dereverberation in a pipelined or joint
fashion via mask-based WPD are investigated. The error cost mismatch between
the speech enhancement front-end and ASR back-end components is minimized by
end-to-end jointly fine-tuning using either the ASR cost function alone, or its
interpolation with the speech enhancement loss. Experiments were conducted on
overlapped and reverberant speech mixtures constructed using simulation
or replay of the Oxford LRS2 dataset. The proposed audio-visual multi-channel
speech separation, dereverberation and recognition systems consistently
outperformed the comparable audio-only baseline by 9.1% and 6.2% absolute
(41.7% and 36.0% relative) word error rate (WER) reductions. Consistent speech
enhancement improvements were also obtained on PESQ, STOI and SRMR scores.
Comment: IEEE/ACM Transactions on Audio, Speech, and Language Processing
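The front-end of this system performs mask-based MVDR beamforming. As a rough, hedged illustration of that component only, the sketch below shows how estimated speech and noise time-frequency masks can be turned into spatial covariance matrices and MVDR weights for a single frequency bin; the array shapes, the Souden-style reference-channel formulation and the diagonal loading are illustrative assumptions, not the paper's implementation (whose masks come from an audio-visual estimator).

```python
import numpy as np

def mvdr_from_masks(Y, speech_mask, noise_mask, ref_ch=0):
    """Mask-based MVDR beamforming for one frequency bin (illustrative sketch).

    Y            : (channels, frames) complex STFT of the multi-channel mixture
    speech_mask  : (frames,) values in [0, 1] marking speech-dominated frames
    noise_mask   : (frames,) values in [0, 1] marking noise-dominated frames
    Returns the beamformed single-channel STFT of shape (frames,).
    """
    n_ch = Y.shape[0]
    # Mask-weighted spatial covariance matrices (channels x channels).
    Phi_s = (speech_mask * Y) @ Y.conj().T / np.maximum(speech_mask.sum(), 1e-8)
    Phi_n = (noise_mask * Y) @ Y.conj().T / np.maximum(noise_mask.sum(), 1e-8)
    Phi_n = Phi_n + 1e-6 * np.trace(Phi_n).real / n_ch * np.eye(n_ch)  # diagonal loading

    # Reference-channel MVDR weights: w = (Phi_n^-1 Phi_s) e_ref / trace(Phi_n^-1 Phi_s)
    num = np.linalg.solve(Phi_n, Phi_s)
    w = num[:, ref_ch] / np.maximum(np.trace(num).real, 1e-8)

    return w.conj() @ Y  # w^H Y, shape (frames,)
```

Each frequency bin is processed independently in this formulation; in the paper the separated output is then dereverberated (DNN-WPE or SpecM) before the Conformer ASR back-end.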
Informed algorithms for sound source separation in enclosed reverberant environments
While humans can separate a sound of interest amidst a cacophony of contending sounds in an echoic environment, machine-based methods lag behind in solving this task. This thesis thus aims at improving the performance of audio separation algorithms when they are informed, i.e. when they have access to source location information. These locations are assumed to be known a priori in this work, for example by video processing.
Initially, a multi-microphone array-based method combined with binary time-frequency masking is proposed. A robust least-squares frequency-invariant data-independent beamformer, designed with the location information, is utilized to estimate the sources. To further enhance the estimated sources, binary time-frequency masking-based post-processing is used, but cepstral-domain smoothing is required to mitigate musical noise.
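A minimal sketch of the binary-masking post-processing idea, assuming one beamformed output per source: each time-frequency unit is assigned to the source whose beamformer output dominates. The simple box smoothing used here is only a crude stand-in for the cepstral-domain smoothing described in the thesis, and all shapes and parameters are illustrative.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def binary_mask_postfilter(bf_outputs, smooth=3):
    """Binary T-F masking post-processing of beamformer outputs (sketch).

    bf_outputs : (sources, freq, frames) complex STFTs, one per beamformed source
    Returns masked STFTs of the same shape.
    """
    mag = np.abs(bf_outputs)
    # Assign each T-F unit to the source whose beamformer output dominates.
    dominant = np.argmax(mag, axis=0)
    masks = np.stack([(dominant == s).astype(float) for s in range(mag.shape[0])])
    # Soften the hard decisions with a small moving average over time and
    # frequency (a stand-in for the thesis's cepstral-domain smoothing).
    masks = uniform_filter(masks, size=(1, smooth, smooth))
    return masks * bf_outputs
```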
To tackle the under-determined case and further improve separation performance
at higher reverberation times, a two-microphone based method
which is inspired by human auditory processing and generates soft time-frequency masks is described. In this approach, interaural level difference,
interaural phase difference and mixing vectors are probabilistically modeled in the time-frequency domain and the model parameters are learned
through the expectation-maximization (EM) algorithm. A direction vector is estimated for each source, using the location information, which is used as
the mean parameter of the mixing vector model. Soft time-frequency masks are used to reconstruct the sources. A spatial covariance model is then integrated into the probabilistic model framework that encodes the spatial
characteristics of the enclosure and further improves the separation performance
in challenging scenarios, i.e. when sources are in close proximity and
when the level of reverberation is high.
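The per-unit observations that this probabilistic model works with can be made concrete with a small sketch: extracting interaural level and phase differences from a two-microphone STFT pair. The dB scaling and the small epsilon are assumptions for illustration; in the thesis these observations, together with mixing vectors whose mean direction comes from the known source locations, are modelled by the EM algorithm that yields the soft masks.

```python
import numpy as np

def ild_ipd_features(X_left, X_right, eps=1e-8):
    """Interaural level and phase differences from a two-microphone STFT pair.

    X_left, X_right : (freq, frames) complex STFTs of the two microphones
    Returns (ILD in dB, IPD wrapped to [-pi, pi]), each of shape (freq, frames).
    """
    ild = 20.0 * np.log10((np.abs(X_left) + eps) / (np.abs(X_right) + eps))
    ipd = np.angle(X_left * np.conj(X_right))
    return ild, ipd
```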
Finally, new dereverberation-based pre-processing is proposed, based on a cascade of three dereverberation stages where each enhances the two-microphone reverberant mixture. The dereverberation stages are based on amplitude spectral subtraction, where the late reverberation is estimated and suppressed. The combination of such dereverberation-based pre-processing and soft-mask separation yields the best separation performance. All methods are evaluated with real and synthetic mixtures formed, for example, from speech signals from the TIMIT database and measured room impulse responses.
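For orientation, one amplitude-spectral-subtraction dereverberation stage might look like the sketch below, which uses a common exponential-decay model of late reverberation (a delayed, attenuated copy of the observed magnitude) rather than the exact estimator in the thesis; the T60, delay and flooring values are illustrative assumptions.

```python
import numpy as np

def suppress_late_reverb(X, frame_shift_s, t60=0.5, delay_s=0.05, floor=0.1):
    """One amplitude-spectral-subtraction dereverberation stage (sketch).

    X : (freq, frames) complex STFT of a reverberant channel
    Late reverberation magnitude is estimated as a delayed, exponentially
    decayed copy of the observed magnitude and subtracted with flooring.
    """
    delay = max(1, int(round(delay_s / frame_shift_s)))
    # Amplitude decay over delay_s for a 60 dB / T60 exponential decay.
    decay = np.exp(-3.0 * np.log(10.0) * delay_s / t60)
    mag = np.abs(X)
    late = np.zeros_like(mag)
    late[:, delay:] = decay * mag[:, :-delay]      # delayed, attenuated estimate
    enhanced_mag = np.maximum(mag - late, floor * mag)
    return enhanced_mag * np.exp(1j * np.angle(X))
```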
Complex-Valued Time-Frequency Self-Attention for Speech Dereverberation
Several speech processing systems have demonstrated considerable performance
improvements when deep complex neural networks (DCNN) are coupled with
self-attention (SA) networks. However, the majority of DCNN-based studies on
speech dereverberation that employ self-attention do not explicitly account for
the inter-dependencies between real and imaginary features when computing
attention. In this study, we propose a complex-valued T-F attention (TFA)
module that models spectral and temporal dependencies by computing
two-dimensional attention maps across time and frequency dimensions. We
validate the effectiveness of our proposed complex-valued TFA module with the
deep complex convolutional recurrent network (DCCRN) using the REVERB challenge
corpus. Experimental findings indicate that integrating our complex-TFA module
with DCCRN improves overall speech quality and performance of back-end speech
applications, such as automatic speech recognition, compared to earlier
approaches for self-attention.
Comment: Interspeech 2022: ISCA Best Student Paper Award Finalist
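As a rough illustration of computing time-frequency attention jointly over the real and imaginary parts, the simplified PyTorch sketch below concatenates the two parts along the channel axis, pools over frequency and over time, and multiplies the resulting one-dimensional attention vectors into a two-dimensional map. The pooling, layer sizes and the way the parts are recombined are assumptions for illustration, not the architecture evaluated in the paper.

```python
import torch
import torch.nn as nn

class ComplexTFAttention(nn.Module):
    """Simplified sketch of a complex-valued time-frequency attention module."""

    def __init__(self, channels):
        super().__init__()
        self.time_att = nn.Sequential(
            nn.Conv1d(2 * channels, channels, kernel_size=1), nn.ReLU(),
            nn.Conv1d(channels, 2 * channels, kernel_size=1), nn.Sigmoid())
        self.freq_att = nn.Sequential(
            nn.Conv1d(2 * channels, channels, kernel_size=1), nn.ReLU(),
            nn.Conv1d(channels, 2 * channels, kernel_size=1), nn.Sigmoid())

    def forward(self, real, imag):
        # real, imag: (batch, channels, time, freq)
        x = torch.cat([real, imag], dim=1)          # attend jointly on real+imag
        a_t = self.time_att(x.mean(dim=3))          # pool over freq -> (B, 2C, T)
        a_f = self.freq_att(x.mean(dim=2))          # pool over time -> (B, 2C, F)
        att = a_t.unsqueeze(3) * a_f.unsqueeze(2)   # 2-D attention map (B, 2C, T, F)
        att_r, att_i = att.chunk(2, dim=1)
        return real * att_r, imag * att_i
```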
Integrating Plug-and-Play Data Priors with Weighted Prediction Error for Speech Dereverberation
Speech dereverberation aims to alleviate the detrimental effects of
late-reverberant components. While the weighted prediction error (WPE) method
has shown superior performance in dereverberation, there is still room for
further improvement in terms of performance and robustness in complex and noisy
environments. Recent research has highlighted the effectiveness of integrating
physics-based and data-driven methods, enhancing the performance of various
signal processing tasks while maintaining interpretability. Motivated by these
advancements, this paper presents a novel dereverberation framework, which
incorporates data-driven methods for capturing speech priors within the WPE
framework. The plug-and-play strategy (PnP), specifically the regularization by
denoising (RED) strategy, is utilized to incorporate speech prior information
learned from data during the iterations that solve the optimization problem.
Experimental results validate the effectiveness of the proposed approach.
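To make the WPE part of this framework concrete, the sketch below runs a minimal single-channel, single-frequency-bin WPE iteration: it alternates between estimating the desired-signal power and solving a weighted linear-prediction problem on delayed frames. A learned speech prior, as in the plug-and-play/RED strategy of the paper, would replace or regularize the power-estimation step; the tap count, delay and iteration count here are illustrative assumptions.

```python
import numpy as np

def wpe_single_channel(x, taps=10, delay=3, iters=3, eps=1e-8):
    """Minimal single-channel WPE iteration for one frequency bin (sketch).

    x : (frames,) complex STFT of one frequency bin
    """
    T = len(x)
    # Delayed observation matrix: row t holds x[t-delay], ..., x[t-delay-taps+1].
    X = np.zeros((T, taps), dtype=complex)
    for k in range(taps):
        shift = delay + k
        X[shift:, k] = x[:T - shift]

    d = x.copy()
    for _ in range(iters):
        lam = np.maximum(np.abs(d) ** 2, eps)   # desired-signal power estimate
        Xw = X / lam[:, None]                   # weight rows by 1 / lambda
        R = Xw.conj().T @ X                     # weighted correlation matrix
        p = Xw.conj().T @ x                     # weighted cross-correlation
        g = np.linalg.solve(R + eps * np.eye(taps), p)
        d = x - X @ g                           # subtract predicted late reverberation
    return d
```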
Spatial dissection of a soundfield using spherical harmonic decomposition
A real-world soundfield is often contributed to by multiple desired and undesired sound sources. The performance of many acoustic systems such as automatic speech recognition, audio surveillance, and teleconferencing relies on their ability to extract the desired sound components in such a mixed environment. The existing solutions to this problem are constrained by various fundamental limitations and require enforcing different priors depending on acoustic conditions such as reverberation and the spatial distribution of sound sources. With the growing emphasis on and integration of audio applications in diverse technologies such as smart home and virtual reality appliances, it is imperative to advance source separation technology in order to overcome the limitations of the traditional approaches.
To that end, we exploit the harmonic decomposition model to dissect a mixed soundfield into its underlying desired and undesired components based on source and signal characteristics. By analysing the spatial projection of a soundfield, we achieve multiple outcomes such as (i) soundfield separation with respect to distinct source regions, (ii) source separation in a mixed soundfield using modal coherence model, and (iii) direction of arrival (DOA) estimation of multiple overlapping sound sources through pattern recognition of the modal coherence of a soundfield.
We first employ an array of higher order microphones for soundfield separation in order to reduce hardware requirements and implementation complexity. Subsequently, we develop novel mathematical models for the modal coherence of noisy and reverberant soundfields that facilitate convenient ways of estimating DOA and power spectral densities, leading to robust source separation algorithms. The modal-domain approach to soundfield/source separation allows us to circumvent several practical limitations of the existing techniques and enhance the performance and robustness of the system. The proposed methods are presented with several practical applications and performance evaluations using simulated and real-life datasets.
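The modal-domain processing rests on expanding the captured soundfield into spherical harmonic coefficients. A minimal least-squares decomposition at a single frequency is sketched below, assuming SciPy's sph_harm and omitting the radial/mode-strength equalisation that a real rigid or open spherical array would require; array geometry and truncation order are illustrative.

```python
import numpy as np
from scipy.special import sph_harm

def sh_coefficients(pressure, azimuth, colatitude, order):
    """Least-squares spherical harmonic decomposition of a soundfield sample.

    pressure   : (mics,) complex sound pressure at one frequency bin
    azimuth    : (mics,) microphone azimuth angles in radians
    colatitude : (mics,) microphone colatitude (polar) angles in radians
    order      : maximum spherical harmonic order N
    Returns the (N + 1)^2 modal coefficients.
    """
    Y = np.column_stack([
        sph_harm(m, n, azimuth, colatitude)
        for n in range(order + 1)
        for m in range(-n, n + 1)
    ])                                          # (mics, (N+1)^2) SH basis matrix
    coeffs, *_ = np.linalg.lstsq(Y, pressure, rcond=None)
    return coeffs
```

Soundfield separation, the modal-coherence models and the DOA estimation described in the thesis all operate on coefficients of this kind.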
Multi-channel Conversational Speaker Separation via Neural Diarization
When dealing with overlapped speech, the performance of automatic speech
recognition (ASR) systems substantially degrades as they are designed for
single-talker speech. To enhance ASR performance in conversational or meeting
environments, continuous speaker separation (CSS) is commonly employed.
However, CSS requires a short separation window to avoid many speakers inside
the window and sequential grouping of discontinuous speech segments. To address
these limitations, we introduce a new multi-channel framework called "speaker
separation via neural diarization" (SSND) for meeting environments. Our
approach utilizes an end-to-end diarization system to identify the speech
activity of each individual speaker. By leveraging estimated speaker
boundaries, we generate a sequence of embeddings, which in turn facilitate the
assignment of speakers to the outputs of a multi-talker separation model. SSND
addresses the permutation ambiguity issue of talker-independent speaker
separation during the diarization phase through location-based training, rather
than during the separation process. This unique approach allows multiple
non-overlapped speakers to be assigned to the same output stream, making it
possible to efficiently process long segments, a task impossible with CSS.
Additionally, SSND is naturally suitable for speaker-attributed ASR. We
evaluate our proposed diarization and separation methods on the open LibriCSS
dataset, advancing state-of-the-art diarization and ASR results by a large
margin.
Comment: 10 pages, 4 figures
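To illustrate the general idea of using embeddings to assign speakers to separation outputs, the sketch below matches stream embeddings to diarization-derived speaker embeddings by cosine similarity with an optimal one-to-one assignment. This is only a generic embedding-matching illustration, not the mechanism in SSND, which resolves the permutation through location-based training during the diarization phase.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_streams_to_speakers(stream_embs, speaker_embs):
    """Assign separation output streams to diarized speakers (illustrative).

    stream_embs  : (streams, dim) embeddings extracted from separated outputs
    speaker_embs : (speakers, dim) embeddings from the diarization stage
    Returns a dict mapping stream index -> speaker index.
    """
    a = stream_embs / np.linalg.norm(stream_embs, axis=1, keepdims=True)
    b = speaker_embs / np.linalg.norm(speaker_embs, axis=1, keepdims=True)
    cost = -a @ b.T                             # negative cosine similarity
    rows, cols = linear_sum_assignment(cost)    # optimal one-to-one matching
    return dict(zip(rows.tolist(), cols.tolist()))
```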
GPU-accelerated Guided Source Separation for Meeting Transcription
Guided source separation (GSS) is a type of target-speaker extraction method
that relies on pre-computed speaker activities and blind source separation to
perform front-end enhancement of overlapped speech signals. It was first
proposed during the CHiME-5 challenge and provided significant improvements
over the delay-and-sum beamforming baseline. Despite its strengths, however,
the method has seen limited adoption for meeting transcription benchmarks
primarily due to its high computation time. In this paper, we describe our
improved implementation of GSS that leverages the power of modern GPU-based
pipelines, including batched processing of frequencies and segments, to provide
300x speed-up over CPU-based inference. The improved inference time allows us
to perform detailed ablation studies over several parameters of the GSS
algorithm -- such as context duration, number of channels, and noise class, to
name a few. We provide end-to-end reproducible pipelines for speaker-attributed
transcription of popular meeting benchmarks: LibriCSS, AMI, and AliMeeting. Our
code and recipes are publicly available: https://github.com/desh2608/gss
Comment: 7 pages, 4 figures
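Much of the reported speed-up comes from batching segments and frequencies on the GPU instead of looping over them. The sketch below shows the flavour of such batching for one core GSS step, mask-weighted spatial covariance estimation, computed for all segments, sources and frequencies in a single einsum. The tensor layout and normalisation are assumptions for illustration, not the authors' implementation (which is in the linked repository).

```python
import torch

def batched_spatial_covariances(Y, masks):
    """Batched mask-weighted spatial covariance matrices on the GPU (sketch).

    Y     : (segments, channels, freq, frames) complex STFTs
    masks : (segments, sources, freq, frames) real-valued activity masks
    Returns (segments, sources, freq, channels, channels) covariance matrices.
    """
    num = torch.einsum('bsft,bcft,bdft->bsfcd', masks.to(Y.dtype), Y, Y.conj())
    den = masks.sum(dim=-1).clamp_min(1e-8)               # (segments, sources, freq)
    return num / den[..., None, None].to(Y.dtype)
```

In GSS the masks come from a spatial mixture model guided by the pre-computed speaker activities, and the resulting covariances then drive per-frequency beamforming of the target speaker.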
HiFi-GAN: High-Fidelity Denoising and Dereverberation Based on Speech Deep Features in Adversarial Networks
Real-world audio recordings are often degraded by factors such as noise,
reverberation, and equalization distortion. This paper introduces HiFi-GAN, a
deep learning method to transform recorded speech to sound as though it had
been recorded in a studio. We use an end-to-end feed-forward WaveNet
architecture, trained with multi-scale adversarial discriminators in both the
time domain and the time-frequency domain. It relies on the deep feature
matching losses of the discriminators to improve the perceptual quality of
enhanced speech. The proposed model generalizes well to new speakers, new
speech content, and new environments. It significantly outperforms
state-of-the-art baseline methods in both objective and subjective experiments.
Comment: Accepted by INTERSPEECH 202
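The deep feature matching loss mentioned in the abstract can be illustrated with a short PyTorch sketch: an L1 distance between the intermediate discriminator activations obtained for clean and for enhanced speech, averaged over the collected feature maps. The function and argument names are assumptions for illustration; in the paper such losses come from multi-scale discriminators in both the time and time-frequency domains and are combined with the adversarial objectives.

```python
import torch
import torch.nn.functional as F

def feature_matching_loss(disc_feats_clean, disc_feats_enhanced):
    """Deep feature matching loss between discriminator activations (sketch).

    disc_feats_clean    : list of feature maps for clean speech
    disc_feats_enhanced : list of feature maps for enhanced speech
    """
    loss = 0.0
    for f_clean, f_enh in zip(disc_feats_clean, disc_feats_enhanced):
        loss = loss + F.l1_loss(f_enh, f_clean.detach())
    return loss / max(len(disc_feats_clean), 1)
```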