233 research outputs found
Recurrent neural networks for multi-microphone speech separation
This thesis takes the classical signal processing problem of separating the speech of a target speaker from a real-world audio recording containing noise, background interference — from competing speech or other non-speech sources —, and reverberation, and seeks data-driven solutions based on supervised learning methods, particularly recurrent neural networks (RNNs). Such speech separation methods can inject robustness in automatic speech recognition (ASR) systems and have been an active area of research for the past two decades. We particularly focus on applications where multi-channel recordings are available.
Stand-alone beamformers cannot simultaneously suppress diffuse-noise and protect the desired signal from any distortions. Post-filters complement the beamformers in obtaining the minimum mean squared error (MMSE) estimate of the desired signal. Time-frequency (TF) masking — a method having roots in computational auditory scene analysis (CASA) — is a suitable candidate for post-filtering, but the challenge lies in estimating the TF masks. The use of RNNs — in particular the bi-directional long short-term memory (BLSTM) architecture — as a post-filter estimating TF masks for a delay-and-sum beamformer (DSB) — using magnitude spectral and phase-based features — is proposed.
The data—recorded in 4 challenging realistic environments—from the CHiME-3 challenge is used. Two different TF masks — Wiener filter and log-ratio — are identified as suitable targets for learning. The separated speech is evaluated based on objective speech intelligibility measures: short-term objective intelligibility (STOI) and frequency-weighted segmental SNR (fwSNR). The word error rates (WERs) as reported by the previous state-of-the-art ASR back-end — when fed with the test data of the CHiME-3 challenge — are interpreted against the objective scores for understanding the relationships of the latter with the former. Overall, a consistent improvement in the objective scores brought in by the RNNs is observed compared to that of feed-forward neural networks and a baseline MVDR beamformer
Bio-motivated features and deep learning for robust speech recognition
Mención Internacional en el tÃtulo de doctorIn spite of the enormous leap forward that the Automatic Speech
Recognition (ASR) technologies has experienced over the last five years
their performance under hard environmental condition is still far from
that of humans preventing their adoption in several real applications.
In this thesis the challenge of robustness of modern automatic speech
recognition systems is addressed following two main research lines.
The first one focuses on modeling the human auditory system to
improve the robustness of the feature extraction stage yielding to novel
auditory motivated features. Two main contributions are produced.
On the one hand, a model of the masking behaviour of the Human
Auditory System (HAS) is introduced, based on the non-linear filtering
of a speech spectro-temporal representation applied simultaneously
to both frequency and time domains. This filtering is accomplished
by using image processing techniques, in particular mathematical
morphology operations with an specifically designed Structuring Element
(SE) that closely resembles the masking phenomena that take
place in the cochlea. On the other hand, the temporal patterns of
auditory-nerve firings are modeled. Most conventional acoustic features
are based on short-time energy per frequency band discarding
the information contained in the temporal patterns. Our contribution
is the design of several types of feature extraction schemes based on
the synchrony effect of auditory-nerve activity, showing that the modeling
of this effect can indeed improve speech recognition accuracy in
the presence of additive noise. Both models are further integrated into
the well known Power Normalized Cepstral Coefficients (PNCC).
The second research line addresses the problem of robustness in
noisy environments by means of the use of Deep Neural Networks
(DNNs)-based acoustic modeling and, in particular, of Convolutional
Neural Networks (CNNs) architectures. A deep residual network
scheme is proposed and adapted for our purposes, allowing Residual
Networks (ResNets), originally intended for image processing tasks,
to be used in speech recognition where the network input is small
in comparison with usual image dimensions. We have observed that
ResNets on their own already enhance the robustness of the whole system
against noisy conditions. Moreover, our experiments demonstrate
that their combination with the auditory motivated features devised
in this thesis provide significant improvements in recognition accuracy
in comparison to other state-of-the-art CNN-based ASR systems
under mismatched conditions, while maintaining the performance in
matched scenarios.
The proposed methods have been thoroughly tested and compared
with other state-of-the-art proposals for a variety of datasets and
conditions. The obtained results prove that our methods outperform
other state-of-the-art approaches and reveal that they are suitable for
practical applications, specially where the operating conditions are
unknown.El objetivo de esta tesis se centra en proponer soluciones al problema
del reconocimiento de habla robusto; por ello, se han llevado a cabo
dos lÃneas de investigación.
En la primera lÃınea se han propuesto esquemas de extracción de caracterÃsticas novedosos, basados en el modelado del comportamiento
del sistema auditivo humano, modelando especialmente los fenómenos
de enmascaramiento y sincronÃa. En la segunda, se propone mejorar
las tasas de reconocimiento mediante el uso de técnicas de
aprendizaje profundo, en conjunto con las caracterÃsticas propuestas.
Los métodos propuestos tienen como principal objetivo, mejorar la
precisión del sistema de reconocimiento cuando las condiciones de
operación no son conocidas, aunque el caso contrario también ha sido
abordado.
En concreto, nuestras principales propuestas son los siguientes:
Simular el sistema auditivo humano con el objetivo de mejorar
la tasa de reconocimiento en condiciones difÃciles, principalmente
en situaciones de alto ruido, proponiendo esquemas de
extracción de caracterÃsticas novedosos.
Siguiendo esta dirección, nuestras principales propuestas se detallan a continuación:
• Modelar el comportamiento de enmascaramiento del sistema
auditivo humano, usando técnicas del procesado de
imagen sobre el espectro, en concreto, llevando a cabo el
diseño de un filtro morfológico que captura este efecto.
• Modelar el efecto de la sincronà que tiene lugar en el nervio
auditivo.
• La integración de ambos modelos en los conocidos Power
Normalized Cepstral Coefficients (PNCC).
La aplicación de técnicas de aprendizaje profundo con el objetivo
de hacer el sistema más robusto frente al ruido, en particular
con el uso de redes neuronales convolucionales profundas, como
pueden ser las redes residuales.
Por último, la aplicación de las caracterÃsticas propuestas en
combinación con las redes neuronales profundas, con el objetivo
principal de obtener mejoras significativas, cuando las condiciones
de entrenamiento y test no coinciden.Programa Oficial de Doctorado en Multimedia y ComunicacionesPresidente: Javier Ferreiros López.- Secretario: Fernando DÃaz de MarÃa.- Vocal: Rubén Solera Ureñ
Deep neural network techniques for monaural speech enhancement: state of the art analysis
Deep neural networks (DNN) techniques have become pervasive in domains such
as natural language processing and computer vision. They have achieved great
success in these domains in task such as machine translation and image
generation. Due to their success, these data driven techniques have been
applied in audio domain. More specifically, DNN models have been applied in
speech enhancement domain to achieve denosing, dereverberation and
multi-speaker separation in monaural speech enhancement. In this paper, we
review some dominant DNN techniques being employed to achieve speech
separation. The review looks at the whole pipeline of speech enhancement from
feature extraction, how DNN based tools are modelling both global and local
features of speech and model training (supervised and unsupervised). We also
review the use of speech-enhancement pre-trained models to boost speech
enhancement process. The review is geared towards covering the dominant trends
with regards to DNN application in speech enhancement in speech obtained via a
single speaker.Comment: conferenc
COMPARISON METRICS AND PERFORMANCE ESTIMATIONS FOR DEEP BEAMFORMING DEEP NEURAL NETWORK BASED AUTOMATIC SPEECH RECOGNITION SYSTEMS USING MICROPHONE-ARRAYS
Automatic Speech Recognition (ASR) functionality, the automatic translation of speech into text, is on the rise today and is required for various use-cases, scenarios, and applications. An ASR engine by itself faces difficulties when encountering live input of audio data, regardless of how sophisticated and advanced it may be. That is especially true, under the circumstances such as a noisy ambient environment, multiple speakers, or faulty microphones. These kinds of challenges characterize a realistic scenario for an ASR system. ASR functionality continues to evolve toward more comprehensive End-to-End (E2E) solutions. E2E solution development focuses on three significant characteristics. The solution has to be robust enough to show endurance against external interferences. Also, it has to maintain flexibility, so it can easily extend in expectation of adapting to new scenarios or in order to achieve better performance. Lastly, we expect the solution to be modular enough to fit into new applications conveniently. Such an E2E ASR solution may include several additional micro-modules of speech enhancements besides the ASR engine, which is very complicated by itself. Adding these micro-modules can enhance the robustness and improve the overall system’s performance. Examples of such possible micro-modules include noise cancellation and speech separation, multi-microphone arrays, and adaptive beamformer(s). Being a comprehensive solution built of numerous micro-modules is technologically challenging to implement and challenging to integrate into resource-limited mobile systems. By offloading the complex computations to a server on the cloud, the system can fit more easily in less capable computing devices. Nevertheless, that compute offloading comes with the cost of giving up on real-time analysis, and increasing the overall system bandwidth. In addition, offloading to a server must have connectivity to the cloud over the internet. To find the optimal trade-offs between performance, Hardware (HW) and Software (SW) requirements or limitations, maximal computation time allowed for real-time analysis, and the detection accuracy, one should first define the different metrics used for the evaluation of such an E2E ASR system. Secondly, one needs to determine the extent of correlation between those metrics, plus the ability to forecast the impact each variation has on the others. This research presents novel progress in optimally designing a robust E2E-ASR system targeted for mobile, resource-limited devices. First, we describe evaluation metrics for each domain of interest, spread over vast engineering subjects. Here, we emphasize any bindings between the metrics across domains and the degree of impact derived from a change in the system’s specifications or constraints. Second, we present the effectiveness of applying machine learning techniques that can generalize and provide results of improved overall performance and robustness. Third, we present an approach of substituting architectures, changing algorithms, and approximating complex computations by utilizing a custom dedicated hardware acceleration in order to replace the traditional state-of-the-art SW-based solutions, thus providing real-time analysis capabilities to resource-limited systems
Using deep learning methods for supervised speech enhancement in noisy and reverberant environments
In real world environments, the speech signals received by our ears are usually a combination of different sounds that include not only the target speech, but also acoustic interference like music, background noise, and competing speakers. This interference has negative effect on speech perception and degrades the performance of speech processing applications such as automatic speech recognition (ASR), speaker identification, and hearing aid devices. One way to solve this problem is using source separation algorithms to separate the desired speech from the interfering sounds. Many source separation algorithms have been proposed to improve the performance of ASR systems and hearing aid devices, but it is still challenging for these systems to work efficiently in noisy and reverberant environments. On the other hand, humans have a remarkable ability to separate desired sounds and listen to a specific talker among noise and other talkers. Inspired by the capabilities of human auditory system, a popular method known as auditory scene analysis (ASA) was proposed to separate different sources in a two stage process of segmentation and grouping. The main goal of source separation in ASA is to estimate time frequency masks that optimally match and separate noise signals from a mixture of speech and noise. In this work, multiple algorithms are proposed to improve upon source separation in noisy and reverberant acoustic environment. First, a simple and novel algorithm is proposed to increase the discriminability between two sound sources by scaling (magnifying) the head-related transfer function of the interfering source. Experimental results from applications of this algorithm show a significant increase in the quality of the recovered target speech. Second, a time frequency masking-based source separation algorithm is proposed that can separate a male speaker from a female speaker in reverberant conditions by using the spatial cues of the source signals. Furthermore, the proposed algorithm has the ability to preserve the location of the sources after separation. Three major aims are proposed for supervised speech separation based on deep neural networks to estimate either the time frequency masks or the clean speech spectrum. Firstly, a novel monaural acoustic feature set based on a gammatone filterbank is presented to be used as the input of the deep neural network (DNN) based speech separation model, which shows significant improvement in objective speech intelligibility and speech quality in different testing conditions. Secondly, a complementary binaural feature set is proposed to increase the ability of source separation in adverse environment with non-stationary background noise and high reverberation using 2-channel recordings. Experimental results show that the combination of spatial features with this complementary feature set improves significantly the speech intelligibility and speech quality in noisy and reverberant conditions. Thirdly, a novel dilated convolution neural network is proposed to improve the generalization of the monaural supervised speech enhancement model to different untrained speakers, unseen noises and simulated rooms. This model increases the speech intelligibility and speech quality of the recovered speech significantly, while being computationally more efficient and requiring less memory in comparison to other models. In addition, the proposed model is modified with recurrent layers and dilated causal convolution layers for real-time processing. This model is causal which makes it suitable for implementation in hearing aid devices and ASR system, while having fewer trainable parameters and using only information about previous time frames in output prediction. The main goal of the proposed algorithms are to increase the intelligibility and the quality of the recovered speech from noisy and reverberant environments, which has the potential to improve both speech processing applications and signal processing strategies for hearing aid and cochlear implant technology
Advanced deep neural networks for speech separation and enhancement
Ph. D. Thesis.Monaural speech separation and enhancement aim to remove noise interference from the noisy speech mixture recorded by a single microphone, which
causes a lack of spatial information. Deep neural network (DNN) dominates speech separation and enhancement. However, there are still challenges in DNN-based methods, including choosing proper training targets
and network structures, refining generalization ability and model capacity
for unseen speakers and noises, and mitigating the reverberations in room
environments. This thesis focuses on improving separation and enhancement
performance in the real-world environment.
The first contribution in this thesis is to address monaural speech separation and enhancement within reverberant room environment by designing
new training targets and advanced network structures. The second contribution to this thesis is on improving the enhancement performance by proposing a multi-scale feature recalibration convolutional bidirectional gate recurrent unit (GRU) network (MCGN). The third contribution is to improve the
model capacity of the network and retain the robustness in the enhancement
performance. A convolutional fusion network (CFN) is proposed, which exploits the group convolutional fusion unit (GCFU).
The proposed speech enhancement methods are evaluated with various
challenging datasets. The proposed methods are assessed with the stateof-the-art techniques and performance measures to confirm that this thesis
contributes novel solution
- …