Environmental Noise Embeddings for Robust Speech Recognition
We propose a novel deep neural network architecture for speech recognition
that explicitly employs knowledge of the background environmental noise within
a deep neural network acoustic model. A deep neural network is used to predict
the acoustic environment in which the system is being used. The discriminative
embedding generated at the bottleneck layer of this network is then
concatenated with traditional acoustic features as input to a deep neural
network acoustic model. Through a series of experiments on the Resource Management,
CHiME-3, and Aurora4 tasks, we show that the proposed approach significantly
improves speech recognition accuracy in noisy and highly reverberant
environments, outperforming multi-condition training, noise-aware training,
the i-vector framework, and multi-task learning on both in-domain and unseen
noise.
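A minimal PyTorch sketch of the architecture as described, assuming illustrative layer sizes and environment classes (none of these values come from the paper): a noise classifier with a bottleneck layer, whose embedding is concatenated with the acoustic features consumed by the acoustic model.

```python
import torch
import torch.nn as nn

class NoiseEmbedder(nn.Module):
    """Predicts the acoustic environment; the bottleneck output is the embedding."""
    def __init__(self, feat_dim=40, bottleneck_dim=30, num_envs=6):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, 512), nn.ReLU(),
            nn.Linear(512, bottleneck_dim),  # bottleneck layer
        )
        self.classifier = nn.Linear(bottleneck_dim, num_envs)

    def forward(self, feats):
        emb = self.encoder(feats)           # discriminative embedding
        return emb, self.classifier(emb)    # embedding + environment logits

class AcousticModel(nn.Module):
    """Acoustic model whose input is [acoustic features ; noise embedding]."""
    def __init__(self, feat_dim=40, bottleneck_dim=30, num_states=2000):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + bottleneck_dim, 1024), nn.ReLU(),
            nn.Linear(1024, num_states),
        )

    def forward(self, feats, noise_emb):
        return self.net(torch.cat([feats, noise_emb], dim=-1))

feats = torch.randn(8, 40)                  # a batch of frame-level features
embedder, am = NoiseEmbedder(), AcousticModel()
emb, env_logits = embedder(feats)
state_logits = am(feats, emb.detach())      # embedding as auxiliary input
```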
Multi-Objective Learning and Mask-Based Post-Processing for Deep Neural Network Based Speech Enhancement
We propose a multi-objective framework that learns both secondary targets not
directly related to the intended task of speech enhancement (SE) and the
primary target of clean log-power spectra (LPS) features, which are used
directly to construct the enhanced speech signals. In deep neural network
(DNN)-based SE, we introduce an auxiliary structure to learn secondary
continuous features, such as mel-frequency cepstral coefficients (MFCCs), and
categorical information, such as the ideal binary mask (IBM), and integrate it
into the original DNN architecture for joint optimization of all the
parameters. This joint estimation scheme imposes additional constraints not
available in the direct prediction of LPS, and potentially improves the
learning of the primary target. Furthermore, the learned secondary information
as a byproduct can be used for other purposes, e.g., the IBM-based
post-processing in this work. A series of experiments shows that joint LPS and
MFCC learning improves SE performance, and that IBM-based post-processing
further enhances the listening quality of the reconstructed speech.
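A hedged PyTorch sketch of the joint-optimization idea: a shared trunk with a primary LPS head plus auxiliary MFCC and IBM heads, trained with a weighted sum of losses. Dimensions, loss weights, and data here are placeholders, not the paper's setup.

```python
import torch
import torch.nn as nn

class MultiObjectiveSE(nn.Module):
    def __init__(self, in_dim=257, lps_dim=257, mfcc_dim=13):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, 1024), nn.ReLU(),
                                   nn.Linear(1024, 1024), nn.ReLU())
        self.lps_head = nn.Linear(1024, lps_dim)    # primary target
        self.mfcc_head = nn.Linear(1024, mfcc_dim)  # auxiliary continuous target
        self.ibm_head = nn.Linear(1024, lps_dim)    # auxiliary binary mask logits

    def forward(self, noisy_lps):
        h = self.trunk(noisy_lps)
        return self.lps_head(h), self.mfcc_head(h), self.ibm_head(h)

model = MultiObjectiveSE()
noisy = torch.randn(32, 257)
lps_hat, mfcc_hat, ibm_logits = model(noisy)
# Joint loss over all targets (clean_lps, clean_mfcc, ibm are training labels).
clean_lps, clean_mfcc = torch.randn(32, 257), torch.randn(32, 13)
ibm = (torch.rand(32, 257) > 0.5).float()
loss = (nn.functional.mse_loss(lps_hat, clean_lps)
        + 0.1 * nn.functional.mse_loss(mfcc_hat, clean_mfcc)
        + 0.1 * nn.functional.binary_cross_entropy_with_logits(ibm_logits, ibm))
loss.backward()
```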
Spoofing Detection Goes Noisy: An Analysis of Synthetic Speech Detection in the Presence of Additive Noise
Automatic speaker verification (ASV) technology has recently been finding its way
to end-user applications for secure access to personal data, smart services or
physical facilities. Similar to other biometric technologies, speaker
verification is vulnerable to spoofing attacks where an attacker masquerades as
a particular target speaker via impersonation, replay, text-to-speech (TTS) or
voice conversion (VC) techniques to gain illegitimate access to the system. We
focus on TTS and VC that represent the most flexible, high-end spoofing
attacks. Most of the prior studies on synthesized or converted speech detection
report their findings using high-quality clean recordings. Meanwhile, the
performance of spoofing detectors in the presence of additive noise, an
important consideration in practical ASV implementations, remains largely
unknown. To this end, we analyze the suitability of state-of-the-art synthetic
speech detectors under additive noise with a special focus on front-end
features. Our comparison includes eight acoustic feature sets, five related to
spectral magnitude and three to spectral phase information. Our extensive
experiments on the ASVspoof 2015 corpus reveal several important findings. Firstly,
all the countermeasures break down even at relatively high signal-to-noise
ratios (SNRs) and fail to generalize to noisy conditions. Secondly, speech
enhancement is not found to be helpful. Thirdly, the GMM back-end generally outperforms
the more involved i-vector back-end. Fourthly, concerning the compared
features, the Mel-frequency cepstral coefficients (MFCCs) and subband spectral
centroid magnitude coefficients (SCMCs) perform the best on average though the
winner method depends on SNR and noise type. Finally, a study with two score
fusion strategies shows that combining different feature based systems improves
recognition accuracy for known and unknown attacks in both clean and noisy
conditions.
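For concreteness, a sketch of a generic GMM back-end of the kind referred to above, assuming scikit-learn and synthetic features: one GMM per class (human vs. spoofed), scored by the per-utterance average log-likelihood ratio. Component counts and features are placeholders, not the paper's configuration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
human_feats = rng.normal(0.0, 1.0, size=(5000, 20))   # stand-in MFCC frames
spoof_feats = rng.normal(0.5, 1.2, size=(5000, 20))

gmm_human = GaussianMixture(n_components=32, covariance_type="diag").fit(human_feats)
gmm_spoof = GaussianMixture(n_components=32, covariance_type="diag").fit(spoof_feats)

def llr_score(utt_feats):
    """Average per-frame log-likelihood ratio; higher means more human-like."""
    return gmm_human.score(utt_feats) - gmm_spoof.score(utt_feats)

test_utt = rng.normal(0.0, 1.0, size=(300, 20))       # one test utterance
print(llr_score(test_utt))
```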
Improving Deep Speech Denoising by Noisy2Noisy Signal Mapping
Existing deep learning-based speech denoising approaches require clean speech
signals to be available for training. This paper presents a self-supervised
deep learning-based approach that improves speech denoising in real-world audio
environments without requiring the availability of clean speech signals.
A fully convolutional neural network is trained by using two noisy realizations
of the same speech signal, one used as the input and the other as the output of
the network. Extensive experimentations are conducted to show the superiority
of the developed deep speech denoising approach over the conventional
supervised deep speech denoising approach, based on four commonly used
performance metrics as well as actual field-testing outcomes.
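A minimal PyTorch sketch of the noisy-to-noisy training target, assuming a toy 1-D convolutional network and synthetic data: the input and the regression target are two independent noisy realizations of the same underlying signal, so no clean reference is required.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Conv1d(1, 16, 9, padding=4), nn.ReLU(),
                    nn.Conv1d(16, 1, 9, padding=4))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

clean = torch.sin(torch.linspace(0, 100, 16000)).view(1, 1, -1)  # stand-in speech
for step in range(10):
    noisy_in = clean + 0.3 * torch.randn_like(clean)   # first noisy realization
    noisy_tgt = clean + 0.3 * torch.randn_like(clean)  # independent second one
    loss = nn.functional.mse_loss(net(noisy_in), noisy_tgt)
    opt.zero_grad(); loss.backward(); opt.step()
# With independent zero-mean noise, the MSE minimizer approaches the clean signal.
```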
On the application of reservoir computing networks for noisy image recognition
Reservoir Computing Networks (RCNs) are a special type of single-layer recurrent neural network in which the input and recurrent connections are randomly generated and only the output weights are trained. Besides the ability to process temporal information, the key strengths of RCNs are easy training and robustness against noise. Recently, we introduced a simple strategy to tune the parameters of RCNs, and evaluation in the domain of noise-robust speech recognition proved this method effective. The aim of this work is to extend that study to the field of image processing, by showing that the proposed parameter tuning procedure is equally valid there and confirming that RCNs are apt at temporal modeling and robust with respect to noise. In particular, we investigate the potential of RCNs for achieving competitive performance on the well-known MNIST dataset by following the aforementioned parameter optimization strategy. Moreover, we achieve good noise-robust recognition by using such a network to denoise images and supplying them to a recognizer trained solely on clean images. The experiments demonstrate that the proposed RCN-based handwritten digit recognizer achieves an error rate of 0.81 percent on the clean test data of the MNIST benchmark, and that the proposed RCN-based denoiser can effectively reduce the error rate under various types of noise.
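A sketch of a reservoir computing network in the usual echo-state style, under assumed hyperparameters (reservoir size, leak rate, spectral radius): input and recurrent weights are fixed random matrices, and only a ridge-regression readout is trained.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_res, n_out = 28, 500, 10            # e.g. one MNIST image row per step
W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
W = rng.normal(0, 1, (n_res, n_res))
W *= 0.9 / max(abs(np.linalg.eigvals(W)))   # scale spectral radius below 1

def reservoir_states(seq, leak=0.3):
    x = np.zeros(n_res)
    for u in seq:                            # drive reservoir with input sequence
        x = (1 - leak) * x + leak * np.tanh(W_in @ u + W @ x)
    return x                                 # final state summarizes the sequence

# Train only the linear readout with ridge regression (one-hot targets).
X = np.stack([reservoir_states(rng.normal(size=(28, n_in))) for _ in range(100)])
Y = np.eye(n_out)[rng.integers(0, n_out, 100)]
W_out = np.linalg.solve(X.T @ X + 1e-2 * np.eye(n_res), X.T @ Y)
pred = (X @ W_out).argmax(axis=1)            # class predictions
```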
Speech Recognition Front End Without Information Loss
Speech representation and modelling in high-dimensional spaces of acoustic
waveforms, or a linear transformation thereof, is investigated with the aim of
improving the robustness of automatic speech recognition to additive noise. The
motivation behind this approach is twofold: (i) the information in acoustic
waveforms that is usually removed in the process of extracting low-dimensional
features might aid robust recognition by virtue of structured redundancy
analogous to channel coding, (ii) linear feature domains allow for exact noise
adaptation, as opposed to representations that involve non-linear processing
which makes noise adaptation challenging. Thus, we develop a generative
framework for phoneme modelling in high-dimensional linear feature domains, and
use it in phoneme classification and recognition tasks. Results show that
classification and recognition in this framework perform better than analogous
PLP and MFCC classifiers below 18 dB SNR. A combination of the high-dimensional
and MFCC features at the likelihood level performs uniformly better than either
of the individual representations across all noise levels.
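The "exact noise adaptation" property of linear feature domains can be made concrete with a short worked example: if the features are y = Aw for waveform w, and the noise n is additive and independent, a Gaussian class model N(mu, Sigma) in the feature domain becomes exactly N(mu + A mu_n, Sigma + A Sigma_n A^T) under noise. All matrices below are synthetic stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
d_wave, d_feat = 160, 32
A = rng.normal(size=(d_feat, d_wave)) / np.sqrt(d_wave)  # linear feature map

mu = rng.normal(size=d_feat)        # clean-speech class mean (feature domain)
Sigma = np.eye(d_feat)              # clean-speech class covariance
mu_n = np.zeros(d_wave)             # noise mean in the waveform domain
Sigma_n = 0.1 * np.eye(d_wave)      # noise covariance in the waveform domain

# Exact adapted model for noisy observations y = A (w + n):
mu_noisy = mu + A @ mu_n
Sigma_noisy = Sigma + A @ Sigma_n @ A.T
```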
Enhancement and Recognition of Reverberant and Noisy Speech by Extending Its Coherence
Most speech enhancement algorithms make use of the short-time Fourier
transform (STFT), which is a simple and flexible time-frequency decomposition
that estimates the short-time spectrum of a signal. However, the duration of
STFT frames is inherently limited by the nonstationarity of speech
signals. The main contribution of this paper is a demonstration of speech
enhancement and automatic speech recognition in the presence of reverberation
and noise by extending the length of analysis windows. We accomplish this
extension by performing enhancement in the short-time fan-chirp transform
(STFChT) domain, an overcomplete time-frequency representation that is coherent
with speech signals over longer analysis window durations than the STFT. This
extended coherence is gained by using a linear model of fundamental frequency
variation of voiced speech signals. Our approach centers on using a
single-channel minimum mean-square error log-spectral amplitude (MMSE-LSA)
estimator proposed by Habets, which scales coefficients in a time-frequency
domain to suppress noise and reverberation. In the case of multiple
microphones, we preprocess the data with either a minimum variance
distortionless response (MVDR) beamformer, or a delay-and-sum beamformer (DSB).
We evaluate our algorithm on both speech enhancement and recognition tasks for
the REVERB challenge dataset. Compared to the same processing done in the STFT
domain, our approach achieves significant improvement in terms of objective
enhancement metrics (including PESQ, the ITU-T standard measure of speech
quality). In terms of automatic speech recognition (ASR) performance as
measured by word error rate (WER), our experiments indicate that the STFT with
a long window is more effective for ASR.
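For reference, a sketch of the classic MMSE-LSA gain function that this approach applies per time-frequency coefficient; the same gain rule applies whether the coefficients come from the STFT or the STFChT. The a-priori and a-posteriori SNRs (xi, gamma) are assumed given; estimating them is a separate step not shown here.

```python
import numpy as np
from scipy.special import exp1  # exponential integral E1(v)

def mmse_lsa_gain(xi, gamma):
    """MMSE log-spectral amplitude gain: G = xi/(1+xi) * exp(0.5 * E1(v))."""
    v = xi * gamma / (1.0 + xi)
    return (xi / (1.0 + xi)) * np.exp(0.5 * exp1(v))

noisy_coeffs = np.array([1.0 + 0.5j, 0.2 - 0.1j])   # stand-in T-F coefficients
xi = np.array([3.0, 0.3])                           # a-priori SNR per bin
gamma = np.array([4.0, 0.8])                        # a-posteriori SNR per bin
enhanced = mmse_lsa_gain(xi, gamma) * noisy_coeffs  # scale to suppress noise
```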
A study on speech enhancement using exponent-only floating point quantized neural network (EOFP-QNN)
Numerous studies have investigated the effectiveness of neural network
quantization on pattern classification tasks. The present study, for the first
time, investigated the performance of speech enhancement (a regression task in
speech processing) using a novel exponent-only floating-point quantized neural
network (EOFP-QNN). The proposed EOFP-QNN consists of two stages:
mantissa-quantization and exponent-quantization. In the mantissa-quantization
stage, EOFP-QNN learns how to quantize the mantissa bits of the model
parameters while preserving the regression accuracy with the least possible
mantissa precision. In the exponent-quantization stage, the exponent part of the
parameters is further quantized without causing any additional performance
degradation. We evaluated the proposed EOFP quantization technique on two types
of neural networks, namely, bidirectional long short-term memory (BLSTM) and
fully convolutional neural network (FCN), on a speech enhancement task.
Experimental results showed that the model sizes can be significantly reduced
(the model sizes of the quantized BLSTM and FCN models were only 18.75% and
21.89%, respectively, compared to those of the original models) while
maintaining satisfactory speech-enhancement performance.
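A hedged sketch of the exponent-only direction of the idea, using raw float32 bit manipulation: truncate mantissa bits while keeping sign and exponent. The actual EOFP-QNN learns how many mantissa bits to keep; here the count is fixed by hand for illustration.

```python
import numpy as np

def quantize_mantissa(weights, keep_bits=0):
    """Keep `keep_bits` of the 23 float32 mantissa bits, zero the rest."""
    bits = weights.astype(np.float32).view(np.uint32)
    mask = np.uint32((0xFFFFFFFF << (23 - keep_bits)) & 0xFFFFFFFF)
    return (bits & mask).view(np.float32)

w = np.array([0.123456, -1.98765, 3.14159], dtype=np.float32)
print(quantize_mantissa(w, keep_bits=0))  # exponent-only: signed powers of two
print(quantize_mantissa(w, keep_bits=3))  # a few mantissa bits retained
```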
Reinforcement Learning Based Speech Enhancement for Robust Speech Recognition
Conventional deep neural network (DNN)-based speech enhancement (SE)
approaches aim to minimize the mean square error (MSE) between enhanced speech
and clean reference. The MSE-optimized model may not directly improve the
performance of an automatic speech recognition (ASR) system. If the target is
to minimize the recognition error, the recognition results should be used to
design the objective function for optimizing the SE model. However, the
structure of an ASR system, which consists of multiple units, such as acoustic
and language models, is usually complex and not differentiable. In this study,
we proposed to adopt the reinforcement learning algorithm to optimize the SE
model based on the recognition results. We evaluated the proposed SE system on
the Mandarin Chinese broadcast news corpus (MATBN). Experimental results
demonstrate that the proposed method can effectively improve the ASR results
with notable character error rate reductions of 12.40% and 19.23% for
signal-to-noise ratios of 0 dB and 5 dB, respectively.
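A hedged REINFORCE-style sketch of the general idea: treat the SE model's per-frame output as a stochastic policy, run the non-differentiable recognizer on the enhanced speech, and scale the policy gradient by a recognition-based reward. The discrete gain actions and the run_asr_reward stub are assumptions for illustration, not the paper's formulation.

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(257, 256), nn.ReLU(), nn.Linear(256, 8))
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)
gain_levels = torch.linspace(0.1, 1.0, 8)   # discrete per-frame gain actions

def run_asr_reward(enhanced):
    # Hypothetical stand-in: in practice, decode `enhanced` with the ASR
    # system and return e.g. the negative character error rate.
    return torch.randn(())

noisy_lps = torch.randn(100, 257)                        # one utterance of frames
dist = torch.distributions.Categorical(logits=policy(noisy_lps))
actions = dist.sample()                                  # per-frame gain choices
enhanced = gain_levels[actions].unsqueeze(-1) * noisy_lps
reward = run_asr_reward(enhanced)                        # non-differentiable signal
loss = -(reward.detach() * dist.log_prob(actions).sum())  # REINFORCE objective
opt.zero_grad(); loss.backward(); opt.step()
```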
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
We show that an end-to-end deep learning approach can be used to recognize
either English or Mandarin Chinese speech--two vastly different languages.
Because it replaces entire pipelines of hand-engineered components with neural
networks, end-to-end learning allows us to handle a wide variety of speech,
including noisy environments, accents, and different languages. Key to our
approach is our application of HPC techniques, resulting in a 7x speedup over
our previous system. Because of this efficiency, experiments that previously
took weeks now run in days. This enables us to iterate more quickly to identify
superior architectures and algorithms. As a result, in several cases, our
system is competitive with the transcription of human workers when benchmarked
on standard datasets. Finally, using a technique called Batch Dispatch with
GPUs in the data center, we show that our system can be inexpensively deployed
in an online setting, delivering low latency when serving users at scale.
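A toy Python sketch of the batch-dispatch idea: requests queue up briefly, then one forward pass serves the whole batch, trading a small queueing delay for much higher GPU throughput. All names, timings, and sizes here are assumptions, not the paper's implementation.

```python
import queue
import threading
import time

import torch
import torch.nn as nn

model = nn.Linear(161, 29)        # stand-in for the served acoustic model
requests = queue.Queue()          # (features, reply-queue) pairs from clients

def dispatcher(max_batch=32, wait_s=0.01):
    while True:
        batch = [requests.get()]  # block until at least one request arrives
        deadline = time.monotonic() + wait_s
        while len(batch) < max_batch:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(requests.get(timeout=remaining))
            except queue.Empty:
                break
        feats = torch.stack([x for x, _ in batch])
        with torch.no_grad():
            outs = model(feats)   # one forward pass serves the whole batch
        for (_, reply), out in zip(batch, outs):
            reply.put(out)

threading.Thread(target=dispatcher, daemon=True).start()
reply = queue.Queue()
requests.put((torch.randn(161), reply))
print(reply.get().shape)          # torch.Size([29])
```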