388 research outputs found
Attention-based Speech Enhancement Using Human Quality Perception Modelling
Perceptually-inspired objective functions such as the perceptual evaluation
of speech quality (PESQ), signal-to-distortion ratio (SDR), and short-time
objective intelligibility (STOI), have recently been used to optimize
performance of deep-learning-based speech enhancement algorithms. These
objective functions, however, do not always strongly correlate with a
listener's assessment of perceptual quality, so optimizing with these measures
often results in poorer performance in real-world scenarios. In this work, we
propose an attention-based enhancement approach that uses learned speech
embedding vectors from a mean-opinion score (MOS) prediction model and a speech
enhancement module to jointly enhance noisy speech. The MOS prediction model
estimates the perceptual MOS of speech quality, as assessed by human listeners,
directly from the audio signal. The enhancement module also employs a quantized
language model that enforces spectral constraints for better speech realism and
performance. We train the model using real-world noisy speech data that has
been captured in everyday environments and test it using unseen corpora. The
results show that our proposed approach significantly outperforms other
approaches that are optimized with objective measures, where the predicted
quality scores strongly correlate with human judgments.Comment: 11 pages, 4 figures, 3 tables, submitted in journal TASLP 202
Customizable End-to-end Optimization of Online Neural Network-supported Dereverberation for Hearing Devices
This work focuses on online dereverberation for hearing devices using the
weighted prediction error (WPE) algorithm. WPE filtering requires an estimate
of the target speech power spectral density (PSD). Recently deep neural
networks (DNNs) have been used for this task. However, these approaches
optimize the PSD estimate which only indirectly affects the WPE output, thus
potentially resulting in limited dereverberation. In this paper, we propose an
end-to-end approach specialized for online processing, that directly optimizes
the dereverberated output signal. In addition, we propose to adapt it to the
needs of different types of hearing-device users by modifying the optimization
target as well as the WPE algorithm characteristics used in training. We show
that the proposed end-to-end approach outperforms the traditional and
conventional DNN-supported WPEs on a noise-free version of the WHAMR! dataset.Comment: \copyright 2022 IEEE. Personal use of this material is permitted.
Permission from IEEE must be obtained for all other uses, in any current or
future media, including reprinting/republishing this material for advertising
or promotional purposes, creating new collective works, for resale or
redistribution to servers or lists, or reuse of any copyrighted component of
this work in other work
Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition
Accurate recognition of cocktail party speech containing overlapping
speakers, noise and reverberation remains a highly challenging task to date.
Motivated by the invariance of visual modality to acoustic signal corruption,
an audio-visual multi-channel speech separation, dereverberation and
recognition approach featuring a full incorporation of visual information into
all system components is proposed in this paper. The efficacy of the video
input is consistently demonstrated in mask-based MVDR speech separation,
DNN-WPE or spectral mapping (SpecM) based speech dereverberation front-end and
Conformer ASR back-end. Audio-visual integrated front-end architectures
performing speech separation and dereverberation in a pipelined or joint
fashion via mask-based WPD are investigated. The error cost mismatch between
the speech enhancement front-end and ASR back-end components is minimized by
end-to-end jointly fine-tuning using either the ASR cost function alone, or its
interpolation with the speech enhancement loss. Experiments were conducted on
the mixture overlapped and reverberant speech data constructed using simulation
or replay of the Oxford LRS2 dataset. The proposed audio-visual multi-channel
speech separation, dereverberation and recognition systems consistently
outperformed the comparable audio-only baseline by 9.1% and 6.2% absolute
(41.7% and 36.0% relative) word error rate (WER) reductions. Consistent speech
enhancement improvements were also obtained on PESQ, STOI and SRMR scores.Comment: IEEE/ACM Transactions on Audio, Speech, and Language Processin
Naaloss: Rethinking the objective of speech enhancement
Reducing noise interference is crucial for automatic speech recognition (ASR)
in a real-world scenario. However, most single-channel speech enhancement (SE)
generates "processing artifacts" that negatively affect ASR performance. Hence,
in this study, we suggest a Noise- and Artifacts-aware loss function, NAaLoss,
to ameliorate the influence of artifacts from a novel perspective. NAaLoss
considers the loss of estimation, de-artifact, and noise ignorance, enabling
the learned SE to individually model speech, artifacts, and noise. We examine
two SE models (simple/advanced) learned with NAaLoss under various input
scenarios (clean/noisy) using two configurations of the ASR system
(with/without noise robustness). Experiments reveal that NAaLoss significantly
improves the ASR performance of most setups while preserving the quality of SE
toward perception and intelligibility. Furthermore, we visualize artifacts
through waveforms and spectrograms, and explain their impact on ASR
A Comparison of Perceptually Motivated Loss Functions for Binary Mask Estimation in Speech Separation
This work proposes and compares perceptually motivated loss functions for deep learning based binary mask estimation for speech separation. Previous loss functions have focused on maximising classification accuracy of mask estimation but we now propose loss functions that aim to maximise the hit mi- nus false-alarm (HIT-FA) rate which is known to correlate more closely to speech intelligibility. The baseline loss function is bi- nary cross-entropy (CE), a standard loss function used in binary mask estimation, which maximises classification accuracy. We propose first a loss function that maximises the HIT-FA rate in- stead of classification accuracy. We then propose a second loss function that is a hybrid between CE and HIT-FA, providing a balance between classification accuracy and HIT-FA rate. Eval- uations of the perceptually motivated loss functions with the GRID database show improvements to HIT-FA rate and ESTOI across babble and factory noises. Further tests then explore ap- plication of the perceptually motivated loss functions to a larger vocabulary dataset
Bio-motivated features and deep learning for robust speech recognition
Mención Internacional en el título de doctorIn spite of the enormous leap forward that the Automatic Speech
Recognition (ASR) technologies has experienced over the last five years
their performance under hard environmental condition is still far from
that of humans preventing their adoption in several real applications.
In this thesis the challenge of robustness of modern automatic speech
recognition systems is addressed following two main research lines.
The first one focuses on modeling the human auditory system to
improve the robustness of the feature extraction stage yielding to novel
auditory motivated features. Two main contributions are produced.
On the one hand, a model of the masking behaviour of the Human
Auditory System (HAS) is introduced, based on the non-linear filtering
of a speech spectro-temporal representation applied simultaneously
to both frequency and time domains. This filtering is accomplished
by using image processing techniques, in particular mathematical
morphology operations with an specifically designed Structuring Element
(SE) that closely resembles the masking phenomena that take
place in the cochlea. On the other hand, the temporal patterns of
auditory-nerve firings are modeled. Most conventional acoustic features
are based on short-time energy per frequency band discarding
the information contained in the temporal patterns. Our contribution
is the design of several types of feature extraction schemes based on
the synchrony effect of auditory-nerve activity, showing that the modeling
of this effect can indeed improve speech recognition accuracy in
the presence of additive noise. Both models are further integrated into
the well known Power Normalized Cepstral Coefficients (PNCC).
The second research line addresses the problem of robustness in
noisy environments by means of the use of Deep Neural Networks
(DNNs)-based acoustic modeling and, in particular, of Convolutional
Neural Networks (CNNs) architectures. A deep residual network
scheme is proposed and adapted for our purposes, allowing Residual
Networks (ResNets), originally intended for image processing tasks,
to be used in speech recognition where the network input is small
in comparison with usual image dimensions. We have observed that
ResNets on their own already enhance the robustness of the whole system
against noisy conditions. Moreover, our experiments demonstrate
that their combination with the auditory motivated features devised
in this thesis provide significant improvements in recognition accuracy
in comparison to other state-of-the-art CNN-based ASR systems
under mismatched conditions, while maintaining the performance in
matched scenarios.
The proposed methods have been thoroughly tested and compared
with other state-of-the-art proposals for a variety of datasets and
conditions. The obtained results prove that our methods outperform
other state-of-the-art approaches and reveal that they are suitable for
practical applications, specially where the operating conditions are
unknown.El objetivo de esta tesis se centra en proponer soluciones al problema
del reconocimiento de habla robusto; por ello, se han llevado a cabo
dos líneas de investigación.
En la primera líınea se han propuesto esquemas de extracción de características novedosos, basados en el modelado del comportamiento
del sistema auditivo humano, modelando especialmente los fenómenos
de enmascaramiento y sincronía. En la segunda, se propone mejorar
las tasas de reconocimiento mediante el uso de técnicas de
aprendizaje profundo, en conjunto con las características propuestas.
Los métodos propuestos tienen como principal objetivo, mejorar la
precisión del sistema de reconocimiento cuando las condiciones de
operación no son conocidas, aunque el caso contrario también ha sido
abordado.
En concreto, nuestras principales propuestas son los siguientes:
Simular el sistema auditivo humano con el objetivo de mejorar
la tasa de reconocimiento en condiciones difíciles, principalmente
en situaciones de alto ruido, proponiendo esquemas de
extracción de características novedosos.
Siguiendo esta dirección, nuestras principales propuestas se detallan a continuación:
• Modelar el comportamiento de enmascaramiento del sistema
auditivo humano, usando técnicas del procesado de
imagen sobre el espectro, en concreto, llevando a cabo el
diseño de un filtro morfológico que captura este efecto.
• Modelar el efecto de la sincroní que tiene lugar en el nervio
auditivo.
• La integración de ambos modelos en los conocidos Power
Normalized Cepstral Coefficients (PNCC).
La aplicación de técnicas de aprendizaje profundo con el objetivo
de hacer el sistema más robusto frente al ruido, en particular
con el uso de redes neuronales convolucionales profundas, como
pueden ser las redes residuales.
Por último, la aplicación de las características propuestas en
combinación con las redes neuronales profundas, con el objetivo
principal de obtener mejoras significativas, cuando las condiciones
de entrenamiento y test no coinciden.Programa Oficial de Doctorado en Multimedia y ComunicacionesPresidente: Javier Ferreiros López.- Secretario: Fernando Díaz de María.- Vocal: Rubén Solera Ureñ
Deep Learning for Audio Signal Processing
Given the recent surge in developments of deep learning, this article
provides a review of the state-of-the-art deep learning techniques for audio
signal processing. Speech, music, and environmental sound processing are
considered side-by-side, in order to point out similarities and differences
between the domains, highlighting general methods, problems, key references,
and potential for cross-fertilization between areas. The dominant feature
representations (in particular, log-mel spectra and raw waveform) and deep
learning models are reviewed, including convolutional neural networks, variants
of the long short-term memory architecture, as well as more audio-specific
neural network models. Subsequently, prominent deep learning application areas
are covered, i.e. audio recognition (automatic speech recognition, music
information retrieval, environmental sound detection, localization and
tracking) and synthesis and transformation (source separation, audio
enhancement, generative models for speech, sound, and music synthesis).
Finally, key issues and future questions regarding deep learning applied to
audio signal processing are identified.Comment: 15 pages, 2 pdf figure
- …