Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments
Eliminating the negative effect of non-stationary environmental noise is a
long-standing research topic for automatic speech recognition that still
remains an important challenge. Data-driven supervised approaches, including
ones based on deep neural networks, have recently emerged as potential
alternatives to traditional unsupervised approaches and, with sufficient
training, can alleviate the shortcomings of the unsupervised methods in various
real-life acoustic environments. In this light, we review recently developed,
representative deep learning approaches for tackling non-stationary additive
and convolutional degradation of speech with the aim of providing guidelines
for those involved in the development of environmentally robust speech
recognition systems. We separately discuss single- and multi-channel techniques
developed for the front-end and back-end of speech recognition systems, as well
as joint front-end and back-end training frameworks.
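For reference, the non-stationary additive and convolutional degradation discussed above corresponds to the standard signal model in which clean speech is filtered by a room or channel impulse response and corrupted by additive noise; a minimal statement of this textbook model (not quoted from the paper itself) is:

```latex
% Observed signal y: clean speech x convolved with the room/channel
% impulse response h (convolutional distortion), plus additive noise n.
y[t] = (x * h)[t] + n[t]
```

Single-channel front-ends typically estimate x from y alone, while multi-channel front-ends additionally exploit spatial information across microphones.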
Light Gated Recurrent Units for Speech Recognition
A field that has directly benefited from the recent advances in deep learning
is Automatic Speech Recognition (ASR). Despite the great achievements of the
past decades, however, a natural and robust human-machine speech interaction
still appears to be out of reach, especially in challenging environments
characterized by significant noise and reverberation. To improve robustness,
modern speech recognizers often employ acoustic models based on Recurrent
Neural Networks (RNNs), which are naturally able to exploit large time contexts
and long-term speech modulations. It is thus of great interest to continue the
study of proper techniques for improving the effectiveness of RNNs in
processing speech signals.
In this paper, we revise one of the most popular RNN models, namely Gated
Recurrent Units (GRUs), and propose a simplified architecture that turned out
to be very effective for ASR. The contribution of this work is two-fold: First,
we analyze the role played by the reset gate, showing that a significant
redundancy with the update gate occurs. As a result, we propose to remove the
former from the GRU design, leading to a more efficient and compact single-gate
model. Second, we propose to replace hyperbolic tangent with ReLU activations.
This variation couples well with batch normalization and could help the model
learn long-term dependencies without numerical issues.
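To make the proposed single-gate design concrete, here is a minimal PyTorch sketch of a Li-GRU cell following the description above (reset gate removed, ReLU candidate activation, batch normalization on the feed-forward input projections); initialization, dropout, and other training details from the paper are omitted:

```python
import torch
import torch.nn as nn

class LiGRUCell(nn.Module):
    """Single-gate GRU variant: no reset gate, ReLU instead of tanh,
    batch norm on the input-to-hidden contributions."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.wz = nn.Linear(input_size, hidden_size, bias=False)
        self.uz = nn.Linear(hidden_size, hidden_size, bias=False)
        self.wh = nn.Linear(input_size, hidden_size, bias=False)
        self.uh = nn.Linear(hidden_size, hidden_size, bias=False)
        self.bn_z = nn.BatchNorm1d(hidden_size)
        self.bn_h = nn.BatchNorm1d(hidden_size)

    def forward(self, x_t, h_prev):
        # Update gate; the (redundant) reset gate is removed entirely.
        z = torch.sigmoid(self.bn_z(self.wz(x_t)) + self.uz(h_prev))
        # Candidate state: ReLU replaces the hyperbolic tangent.
        h_cand = torch.relu(self.bn_h(self.wh(x_t)) + self.uh(h_prev))
        # Interpolate between the previous and the candidate state.
        return z * h_prev + (1.0 - z) * h_cand
```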
Results show that the proposed architecture, called Light GRU (Li-GRU), not
only reduces the per-epoch training time by more than 30% over a standard GRU,
but also consistently improves the recognition accuracy across different tasks,
input features, noisy conditions, as well as across different ASR paradigms,
ranging from standard DNN-HMM speech recognizers to end-to-end CTC models.
Deep Learning for Distant Speech Recognition
Deep learning is an emerging technology that is considered one of the most
promising directions for reaching higher levels of artificial intelligence.
Among other achievements, building computers that understand speech
represents a crucial leap towards intelligent machines. Despite the great
efforts of the past decades, however, a natural and robust human-machine speech
interaction still appears to be out of reach, especially when users interact
with a distant microphone in noisy and reverberant environments. The latter
disturbances severely hamper the intelligibility of a speech signal, making
Distant Speech Recognition (DSR) one of the major open challenges in the field.
This thesis addresses the latter scenario and proposes some novel techniques,
architectures, and algorithms to improve the robustness of distant-talking
acoustic models. We first elaborate on methodologies for realistic data
contamination, with a particular emphasis on DNN training with simulated data.
We then investigate approaches for better exploiting speech contexts,
proposing some original methodologies for both feed-forward and recurrent
neural networks. Lastly, inspired by the idea that cooperation across different
DNNs could be the key for counteracting the harmful effects of noise and
reverberation, we propose a novel deep learning paradigm called network of deep
neural networks. The analysis of the original concepts was based on extensive
experimental validations conducted on both real and simulated data, considering
different corpora, microphone configurations, environments, noisy conditions,
and ASR tasks.
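As a rough illustration of the data-contamination methodology mentioned above, the following Python sketch reverberates a clean utterance with a measured room impulse response and adds noise at a target SNR; the function and variable names are illustrative, not taken from the thesis:

```python
import numpy as np

def contaminate(clean, rir, noise, snr_db):
    """Simulate distant-talking speech: convolve clean speech with a
    room impulse response, then add noise scaled to the target SNR."""
    reverberant = np.convolve(clean, rir)[: len(clean)]  # convolutional distortion
    noise = noise[: len(reverberant)]
    sig_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so the resulting signal-to-noise ratio is snr_db.
    scale = np.sqrt(sig_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return reverberant + scale * noise
```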
Speech Enhancement and Dereverberation with Diffusion-based Generative Models
In this work, we build upon our previous publication and use diffusion-based
generative models for speech enhancement. We present a detailed overview of the
diffusion process that is based on a stochastic differential equation and delve
into an extensive theoretical examination of its implications. As opposed to usual
conditional generation tasks, we do not start the reverse process from pure
Gaussian noise but from a mixture of noisy speech and Gaussian noise. This
matches our forward process which moves from clean speech to noisy speech by
including a drift term. We show that this procedure enables using only 30
diffusion steps to generate high-quality clean speech estimates. By adapting
the network architecture, we are able to significantly improve the speech
enhancement performance, indicating that the network, rather than the
formalism, was the main limitation of our original approach. In an extensive
cross-dataset evaluation, we show that the improved method can compete with
recent discriminative models and achieves better generalization when evaluating
on a different corpus than used for training. We complement the results with an
instrumental evaluation using real-world noisy recordings and a listening
experiment, in which our proposed method is rated best. Examining different
sampler configurations for solving the reverse process allows us to balance the
performance and computational speed of the proposed method. Moreover, we show
that the proposed method is also suitable for dereverberation and thus not
limited to additive background noise removal. Code and audio examples are
available online, see https://github.com/sp-uhh/sgmse.
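For orientation, the drift-based forward process described above has the Ornstein-Uhlenbeck-like form sketched below (general shape only; the stiffness γ and the diffusion coefficient g(t) are defined precisely in the paper):

```latex
% Forward SDE: the mean of the process drifts from clean speech x_0
% toward the noisy mixture y while Gaussian noise is injected via g(t).
\mathrm{d}x_t = \gamma\,(y - x_t)\,\mathrm{d}t + g(t)\,\mathrm{d}w_t
```

Because the forward process terminates near the noisy mixture rather than at pure Gaussian noise, the reverse process can be initialized from noisy speech plus Gaussian noise, which is what makes the reported 30-step generation feasible.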
Learning Feature Representation for Automatic Speech Recognition
Feature extraction in automatic speech recognition (ASR) can be regarded
as learning representations from lower-level to more abstract higher-level features.
Lower-level features can be viewed as features from the signal domain,
such as perceptual linear predictive (PLP) and Mel-frequency cepstral coefficients
(MFCCs) features. Higher-level feature representations can be considered
as bottleneck features (BNFs) learned using deep neural networks
(DNNs). In this thesis, we focus on improving feature extraction at different
levels mainly for ASR.
The first part of this thesis focuses on learning features from the signal
domain that help ASR. Hand-crafted spectral and cepstral features such as
MFCC are the main features used in most conventional ASR systems; all are
inspired by physiological models of the human auditory system. However, some
aspects of the signal such as pitch cannot be easily extracted from spectral
features, but are found to be useful for ASR. We explore a new algorithm to extract
a pitch feature directly from the signal for ASR and show that this feature, appended to the other features, gives consistent improvements in various languages,
especially tonal languages.
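The abstract does not spell out the algorithm; as a generic, hypothetical illustration of extracting a frame-level pitch feature directly from the waveform and appending it to spectral features, consider an autocorrelation-based sketch (the search range and frame handling are placeholder choices, not the thesis' method):

```python
import numpy as np

def pitch_feature(frame, fs, fmin=60.0, fmax=400.0):
    """Crude autocorrelation pitch estimate (Hz) for one frame;
    a stand-in for the thesis' actual extractor."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)  # plausible pitch lags
    lag = lo + np.argmax(ac[lo:hi])
    return fs / lag

# Appending the per-frame pitch track to spectral features:
# features = np.hstack([mfcc_frames, pitch_track[:, None]])
```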
We then investigate replacing the conventional features with features learned jointly with the acoustic model from the signal domain, using time-domain and frequency-domain approaches.
The results show that our time-domain joint feature learning setup matches the state-of-the-art performance of MFCCs, while our frequency-domain setup outperforms them on various datasets.
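A common way to realize such time-domain joint feature learning is sketched below, under the assumption of a standard learnable convolutional filter bank trained end-to-end with the acoustic model (the thesis' exact architecture and hyperparameters may differ):

```python
import torch
import torch.nn as nn

class LearnableFilterbank(nn.Module):
    """Learnable time-domain front-end: a bank of 1-D convolutional
    filters applied to the raw waveform; illustrative sizes roughly
    match 25 ms windows with a 10 ms hop at 16 kHz."""
    def __init__(self, n_filters=40, filter_len=400, hop=160):
        super().__init__()
        self.conv = nn.Conv1d(1, n_filters, filter_len, stride=hop)

    def forward(self, wav):                  # wav: (batch, samples)
        x = self.conv(wav.unsqueeze(1))      # (batch, n_filters, frames)
        return torch.log1p(torch.abs(x))     # compressed filter-bank-like features
```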
Joint feature extraction results in learning data- or language-dependent filter
banks, which can degrade performance in unseen noise and channel conditions
or other languages. To tackle this, we investigate joint universal feature
learning across different languages using the proposed direct-from-signal
setups. We then investigate the filter banks learned in this setup and propose
a new set of features as an extension to conventional Mel filter banks. The results
show consistent word error rate (WER) improvements, especially in clean
conditions.
The second part of this thesis focuses on learning higher-level feature embeddings.
We investigate learning and transferring deep feature representations
across different domains using multi-task learning and weight transfer
approaches. These have been adopted to explicitly learn intermediate-level features that are useful for several different tasks.
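As a minimal sketch of the multi-task setup described above (a shared trunk learning an intermediate embedding, with task-specific output heads; layer sizes and the transfer mechanics are illustrative assumptions):

```python
import torch.nn as nn

class MultiTaskDNN(nn.Module):
    """Shared trunk learns an intermediate (bottleneck-like) embedding;
    task-specific heads are trained jointly. For weight transfer, the
    trained trunk is copied into a model for the target domain."""
    def __init__(self, in_dim, emb_dim, out_dims):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.ReLU(),
            nn.Linear(1024, emb_dim), nn.ReLU(),
        )
        self.heads = nn.ModuleList(nn.Linear(emb_dim, d) for d in out_dims)

    def forward(self, x):
        emb = self.trunk(x)                  # shared representation
        return [head(emb) for head in self.heads]
```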
The PASCAL CHiME Speech Separation and Recognition Challenge
Distant microphone speech recognition systems that operate with human-like robustness remain a distant goal. The key difficulty is that operating in everyday listening conditions entails processing a speech signal that is reverberantly mixed into a noise background composed of multiple competing sound sources. This paper describes a recent speech recognition evaluation that was designed to bring together researchers from multiple communities in order to foster novel approaches to this problem. The task was to identify keywords from sentences reverberantly mixed into audio backgrounds binaurally recorded in a busy domestic environment. The challenge was designed to model the essential difficulties of the multisource environment problem while remaining on a scale that would make it accessible to a wide audience. Compared to previous ASR evaluations, a particular novelty of the task is that the utterances to be recognised were provided in a continuous audio background rather than as pre-segmented utterances, thus allowing a range of background modelling techniques to be employed. The challenge attracted thirteen submissions. This paper describes the challenge problem, provides an overview of the systems that were entered, and presents a comparison alongside both a baseline recognition system and human performance. The paper discusses insights gained from the challenge and lessons learnt for the design of future such evaluations.