Exploring the robustness of features and enhancement on speech recognition systems in highly-reverberant real environments
This paper evaluates the robustness of a DNN-HMM-based speech recognition
system in highly-reverberant real environments using the HRRE database. The
performance of locally-normalized filter bank (LNFB) and Mel filter bank
(MelFB) features in combination with Non-negative Matrix Factorization (NMF),
Suppression of Slowly-varying components and the Falling edge (SSF), and
Weighted Prediction Error (WPE) enhancement methods is discussed and
evaluated. Two training conditions were considered: clean and reverberated
(Reverb). With Reverb training, the use of WPE and LNFB provides WERs that are
3% and 20% lower on average than SSF and NMF, respectively. WPE and MelFB
provide WERs that are 11% and 24% lower on average than SSF and NMF,
respectively. With clean training, which represents a significant mismatch
between testing and training conditions, LNFB features clearly outperform MelFB
features. The results show that different types of training, parametrization,
and enhancement techniques may work better for a specific combination of
speaker-microphone distance and reverberation time. This suggests that there
could be some degree of complementarity between systems trained with different
enhancement and parametrization methods. Comment: 5 pages
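The abstract above compares LNFB against standard MelFB features. As a minimal sketch of how MelFB log-energies are typically computed (assuming the common HTK-style mel scale, 16 kHz audio, and 23 triangular filters; the paper's exact settings are not given in the abstract):

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel-scale conversion (an assumption; other variants exist).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=23, n_fft=512, sr=16000):
    # Triangular filters equally spaced on the mel scale.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    return fb

# Log mel filter bank energies for a single (random stand-in) speech frame.
frame = np.random.randn(512)
spectrum = np.abs(np.fft.rfft(frame)) ** 2
log_mel = np.log(mel_filterbank() @ spectrum + 1e-10)
```

LNFB differs by applying a local normalization across neighbouring bands, which is what gives it the robustness advantage under clean-training mismatch reported above.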
Spoofing Detection Goes Noisy: An Analysis of Synthetic Speech Detection in the Presence of Additive Noise
Automatic speaker verification (ASV) technology is recently finding its way
to end-user applications for secure access to personal data, smart services or
physical facilities. Similar to other biometric technologies, speaker
verification is vulnerable to spoofing attacks where an attacker masquerades as
a particular target speaker via impersonation, replay, text-to-speech (TTS) or
voice conversion (VC) techniques to gain illegitimate access to the system. We
focus on TTS and VC that represent the most flexible, high-end spoofing
attacks. Most of the prior studies on synthesized or converted speech detection
report their findings using high-quality clean recordings. Meanwhile, the
performance of spoofing detectors in the presence of additive noise, an
important consideration in practical ASV implementations, remains largely
unknown. To this end, we analyze the suitability of state-of-the-art synthetic
speech detectors under additive noise with a special focus on front-end
features. Our comparison includes eight acoustic feature sets, five related to
spectral magnitude and three to spectral phase information. Our extensive
experiments on the ASVspoof 2015 corpus reveal several important findings. Firstly,
all the countermeasures break down even at relatively high signal-to-noise
ratios (SNRs) and fail to generalize to noisy conditions. Secondly, speech
enhancement is not found helpful. Thirdly, GMM back-end generally outperforms
the more involved i-vector back-end. Fourthly, concerning the compared
features, the Mel-frequency cepstral coefficients (MFCCs) and subband spectral
centroid magnitude coefficients (SCMCs) perform the best on average though the
winner method depends on SNR and noise type. Finally, a study with two score
fusion strategies shows that combining different feature-based systems improves
recognition accuracy for known and unknown attacks in both clean and noisy
conditions. Comment: 23 pages, 7 figures
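The GMM back-end that the study finds competitive scores an utterance by the average per-frame log-likelihood ratio between a "human" and a "spoof" model. A minimal sketch with toy diagonal-covariance GMMs standing in for models trained on real MFCC features (all parameters here are illustrative, not from the paper):

```python
import numpy as np

def gmm_loglik(X, weights, means, variances):
    # Per-frame log-likelihood under a diagonal-covariance GMM.
    # X: (frames, dim); weights: (K,); means, variances: (K, dim).
    diff = X[:, None, :] - means[None, :, :]
    log_comp = (np.log(weights)
                - 0.5 * np.sum(diff ** 2 / variances
                               + np.log(2.0 * np.pi * variances), axis=2))
    m = log_comp.max(axis=1, keepdims=True)  # stable log-sum-exp
    return (m + np.log(np.sum(np.exp(log_comp - m), axis=1,
                              keepdims=True))).ravel()

def spoofing_score(X, gmm_human, gmm_spoof):
    # Average per-frame log-likelihood ratio; higher -> more human-like.
    return float(np.mean(gmm_loglik(X, *gmm_human))
                 - np.mean(gmm_loglik(X, *gmm_spoof)))

# Toy 2-component GMMs; in practice both are EM-trained on labelled data.
rng = np.random.default_rng(0)
dim = 4
gmm_human = (np.array([0.5, 0.5]), rng.normal(0, 1, (2, dim)), np.ones((2, dim)))
gmm_spoof = (np.array([0.5, 0.5]), rng.normal(5, 1, (2, dim)), np.ones((2, dim)))
human_like = rng.normal(0, 1, (50, dim))
score = spoofing_score(human_like, gmm_human, gmm_spoof)
```

Thresholding this score gives the accept/reject decision; the paper's finding is that under additive noise the score distributions of the two classes collapse together.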
Speech Recognition Front End Without Information Loss
Speech representation and modelling in high-dimensional spaces of acoustic
waveforms, or a linear transformation thereof, is investigated with the aim of
improving the robustness of automatic speech recognition to additive noise. The
motivation behind this approach is twofold: (i) the information in acoustic
waveforms that is usually removed in the process of extracting low-dimensional
features might aid robust recognition by virtue of structured redundancy
analogous to channel coding, (ii) linear feature domains allow for exact noise
adaptation, as opposed to representations that involve non-linear processing
which makes noise adaptation challenging. Thus, we develop a generative
framework for phoneme modelling in high-dimensional linear feature domains, and
use it in phoneme classification and recognition tasks. Results show that
classification and recognition in this framework perform better than analogous
PLP and MFCC classifiers below 18 dB SNR. A combination of the high-dimensional
and MFCC features at the likelihood level performs uniformly better than either
of the individual representations across all noise levels.
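The "exact noise adaptation" property claimed for linear feature domains can be stated concretely: if the clean-speech model and the noise are both Gaussian in a linear domain, the noisy-speech model follows in closed form. A minimal sketch with an empirical sanity check (toy 2-D parameters, chosen here for illustration):

```python
import numpy as np

def adapt_noisy_model(mu_clean, cov_clean, mu_noise, cov_noise):
    # In a linear feature domain, y = x + n with independent Gaussian x and n
    # gives an exactly Gaussian y: means and covariances simply add.
    # Log-spectral or cepstral domains lose this property, which is why
    # noise adaptation there needs approximations such as VTS.
    return mu_clean + mu_noise, cov_clean + cov_noise

# Empirical check of the exactness claim (fixed seed, large sample).
rng = np.random.default_rng(1)
mu_x, cov_x = np.array([1.0, -1.0]), np.diag([1.0, 2.0])
mu_n, cov_n = np.array([0.5, 0.5]), np.diag([0.5, 0.5])
x = rng.multivariate_normal(mu_x, cov_x, 200000)
n = rng.multivariate_normal(mu_n, cov_n, 200000)
mu_y, cov_y = adapt_noisy_model(mu_x, cov_x, mu_n, cov_n)
```

This is the mechanism that lets the high-dimensional classifiers above be adapted to a new noise level without retraining.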
GEDI: Gammachirp Envelope Distortion Index for Predicting Intelligibility of Enhanced Speech
In this study, we propose a new concept, the gammachirp envelope distortion
index (GEDI), based on the signal-to-distortion ratio in the auditory envelope,
SDRenv, to predict the intelligibility of speech enhanced by nonlinear
algorithms. The objective of GEDI is to calculate the distortion between
enhanced and clean-speech representations in the domain of a temporal envelope
extracted by the gammachirp auditory filterbank and modulation filterbank. We
also extend GEDI with multi-resolution analysis (mr-GEDI) to predict the speech
intelligibility of sounds under non-stationary noise conditions. We evaluate
GEDI in terms of speech intelligibility predictions of speech sounds enhanced
by a classic spectral subtraction and a Wiener filtering method. The
predictions are compared with human results for various signal-to-noise ratio
conditions with additive pink and babble noises. The results showed that
mr-GEDI predicted the intelligibility curves better than short-time objective
intelligibility (STOI) measure, extended-STOI (ESTOI) measure, and hearing-aid
speech perception index (HASPI) under pink-noise conditions, and better than
HASPI under babble-noise conditions. The mr-GEDI method does not present an
overestimation tendency and is considered a more conservative approach than
STOI and ESTOI. Therefore, the evaluation with mr-GEDI may provide additional
information in the development of speech enhancement algorithms. Comment: Preprint, 37 pages, 6 tables, 9 figures
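The core quantity behind GEDI, the signal-to-distortion ratio between clean and enhanced temporal envelopes, can be sketched as below. Note this is a deliberately simplified stand-in: GEDI extracts envelopes with a gammachirp auditory filterbank and a modulation filterbank, whereas here a crude rectify-and-smooth envelope is assumed for illustration:

```python
import numpy as np

def temporal_envelope(x, win=32):
    # Crude envelope: rectification followed by a moving average
    # (a stand-in for the gammachirp/modulation filterbank front end).
    return np.convolve(np.abs(x), np.ones(win) / win, mode="same")

def sdr_env(clean, enhanced, win=32):
    # Signal-to-distortion ratio in the envelope domain (dB): the
    # quantity GEDI maps to an intelligibility prediction.
    e_c = temporal_envelope(clean, win)
    e_e = temporal_envelope(enhanced, win)
    return float(10.0 * np.log10(np.sum(e_c ** 2)
                                 / (np.sum((e_c - e_e) ** 2) + 1e-12)))

# Toy signals: a modulated carrier and a noisier version of it.
rng = np.random.default_rng(2)
clean = np.sin(2 * np.pi * 0.02 * np.arange(4000)) * rng.standard_normal(4000)
noisy = clean + 0.5 * rng.standard_normal(4000)
```

Higher SDRenv means less envelope distortion, hence a higher predicted intelligibility; the multi-resolution variant repeats this at several envelope time scales.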
ASSERT: Anti-Spoofing with Squeeze-Excitation and Residual neTworks
We present JHU's system submission to the ASVspoof 2019 Challenge:
Anti-Spoofing with Squeeze-Excitation and Residual neTworks (ASSERT).
Anti-spoofing has gathered increasing attention since the inauguration of
the ASVspoof Challenges, and ASVspoof 2019 is dedicated to addressing attacks of
all three major types: text-to-speech, voice conversion, and replay. Built upon
previous research on deep neural networks (DNNs), ASSERT is a pipeline for
DNN-based anti-spoofing. ASSERT has four components: feature
engineering, DNN models, network optimization and system combination, where the
DNN models are variants of squeeze-excitation and residual networks. We
conducted an ablation study of the effectiveness of each component on the
ASVspoof 2019 corpus, and experimental results showed that ASSERT obtained more
than 93% and 17% relative improvements over the baseline systems in the two
sub-challenges of ASVspoof 2019, ranking ASSERT among the top performing
systems. Code and pretrained models will be made publicly available. Comment: Submitted to Interspeech 2019, Graz, Austria
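The squeeze-excitation mechanism named in the title reweights feature-map channels by a learned, input-dependent gate. A minimal numpy sketch (the real ASSERT models are trained deep networks; the shapes and random weights here are purely illustrative):

```python
import numpy as np

def squeeze_excitation(feature_map, w1, w2):
    # Squeeze: global average pooling collapses each channel's
    # time-frequency plane to a single descriptor.
    z = feature_map.mean(axis=(1, 2))                        # (C,)
    # Excitation: bottleneck MLP (ReLU then sigmoid) yields per-channel
    # gates in (0, 1) that rescale the original feature map.
    gates = 1.0 / (1.0 + np.exp(-(w2 @ np.maximum(w1 @ z, 0.0))))
    return feature_map * gates[:, None, None]

rng = np.random.default_rng(3)
C, r = 8, 2                                   # channels, reduction ratio
fm = rng.standard_normal((C, 16, 16))         # a (C, time, freq) feature map
w1 = rng.standard_normal((C // r, C)) * 0.1   # squeeze C -> C/r
w2 = rng.standard_normal((C, C // r)) * 0.1   # expand back to C
out = squeeze_excitation(fm, w1, w2)
```

Because the gates are computed from the whole time-frequency plane, the block lets the network emphasize channels that capture spoofing artefacts regardless of where they occur in the utterance.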
A Deep Variational Convolutional Neural Network for Robust Speech Recognition in the Waveform Domain
We investigate the potential of probabilistic neural networks for learning of
robust waveform-based acoustic models. To that end, we consider a deep
convolutional network that first decomposes speech into frequency sub-bands via
an adaptive parametric convolutional block where filters are specified by
cosine modulations of compactly supported windows. The network then employs
standard non-parametric wide-pass filters, i.e., 1D convolutions, to extract
the most relevant spectro-temporal patterns while gradually compressing the
structured high dimensional representation generated by the parametric block.
We rely on a probabilistic parametrization of the proposed architecture and
learn the model using stochastic variational inference. This requires
evaluation of an analytically intractable integral defining the
Kullback-Leibler divergence term responsible for regularization, for which we
propose an effective approximation based on the Gauss-Hermite quadrature. Our
empirical results demonstrate a superior performance of the proposed approach
over relevant waveform-based baselines and indicate that it could lead to
robustness. Moreover, the approach outperforms a recently proposed deep
convolutional network for learning of robust acoustic models with standard
filterbank features.
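The Gauss-Hermite quadrature used above to approximate the intractable KL term is a standard device for Gaussian expectations, and can be sketched in a few lines (the toy integrand below is illustrative, not the paper's actual KL integrand):

```python
import numpy as np

def gh_expectation(f, mu, sigma, order=20):
    # E_{x ~ N(mu, sigma^2)}[f(x)] by Gauss-Hermite quadrature: substitute
    # x = mu + sqrt(2)*sigma*t, so the Gaussian weight becomes exp(-t^2)
    # and the integral reduces to a weighted sum at the Hermite nodes.
    nodes, weights = np.polynomial.hermite.hermgauss(order)
    values = f(mu + np.sqrt(2.0) * sigma * nodes)
    return float(np.sum(weights * values) / np.sqrt(np.pi))

# Sanity check against a closed form: E[x^2] = mu^2 + sigma^2 = 1.25.
approx = gh_expectation(lambda x: x ** 2, mu=1.0, sigma=0.5)
```

With an order-n rule the approximation is exact for polynomial integrands up to degree 2n-1, which is why a modest number of nodes suffices inside stochastic variational inference.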
Speech Enhancement Based on Reducing the Detail Portion of Speech Spectrograms in Modulation Domain via Discrete Wavelet Transform
In this paper, we propose a novel speech enhancement (SE) method by
exploiting the discrete wavelet transform (DWT). This new method reduces the
amount of fast time-varying portion, viz. the DWT-wise detail component, in the
spectrogram of speech signals so as to highlight the speech-dominant component
and achieves better speech quality. A particularity of this new method is that
it is completely unsupervised and requires no prior information about the clean
speech and noise in the processed utterance. The presented DWT-based SE method
with various scaling factors for the detail part is evaluated with a subset of
Aurora-2 database, and the PESQ metric is used to indicate the quality of
processed speech signals. The preliminary results show that the processed
speech signals reveal a higher PESQ score in comparison with the original
counterparts. Furthermore, we show that this method can still enhance the
signal by totally discarding the detail part (setting the respective scaling
factor to zero), revealing that the spectrogram can be down-sampled and thus
compressed without the cost of lowered quality. In addition, we integrate this
new method with conventional speech enhancement algorithms, including spectral
subtraction, Wiener filtering, and spectral MMSE estimation, and show that the
resulting integration outperforms each respective component method. As a
result, this new method is quite effective in improving speech quality and
complements the other SE methods well. Comment: 4 pages, 4 figures, to appear in ISCSLP 201
Deep Scattering Spectrum
A scattering transform defines a locally translation invariant representation
which is stable to time-warping deformations. It extends MFCC representations
by computing modulation spectrum coefficients of multiple orders, through
cascades of wavelet convolutions and modulus operators. Second-order scattering
coefficients characterize transient phenomena such as attacks and amplitude
modulation. A frequency transposition invariant representation is obtained by
applying a scattering transform along log-frequency. State-of-the-art
classification results are obtained for musical genre and phone classification
on the GTZAN and TIMIT databases, respectively.
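The wavelet-modulus cascade described above can be sketched in miniature. This is a toy illustration only: it uses simple Gabor atoms and global averaging in place of a proper wavelet filterbank and lowpass, and the centre frequencies are arbitrary:

```python
import numpy as np

def gabor(center, width, n=64):
    # Complex Gabor atom: a windowed complex exponential, standing in
    # for the analytic wavelets of a real scattering transform.
    t = np.arange(-n // 2, n // 2)
    return np.exp(-0.5 * (t / width) ** 2) * np.exp(2j * np.pi * center * t)

def scatter2(x, centers=(0.05, 0.1, 0.2), width=8.0):
    # First order: modulus of wavelet-like convolutions, then averaging --
    # roughly the information carried by MFCC-type features.
    s1, s2 = [], []
    for c1 in centers:
        u1 = np.abs(np.convolve(x, gabor(c1, width), mode="same"))
        s1.append(u1.mean())
        # Second order: cascade the same operation on the first-order
        # modulus, recovering amplitude-modulation detail (attacks,
        # tremolo) that the averaging discards. Only modulation
        # frequencies below the carrier are kept.
        for c2 in centers:
            if c2 >= c1:
                continue
            u2 = np.abs(np.convolve(u1, gabor(c2, width), mode="same"))
            s2.append(u2.mean())
    return np.array(s1), np.array(s2)

rng = np.random.default_rng(4)
x = rng.standard_normal(512)
s1, s2 = scatter2(x)
```

The first-order coefficients play the role of the MFCC-like description; the second-order ones are the extra modulation-spectrum coefficients the paper credits for characterizing transients.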
Speech Recognition by Machine, A Review
This paper presents a brief survey on Automatic Speech Recognition and
discusses the major themes and advances made in the past 60 years of research,
so as to provide a technological perspective and an appreciation of the
fundamental progress that has been accomplished in this important area of
speech communication. After years of research and development, the accuracy of
automatic speech recognition remains one of the important research challenges
(e.g., variations of context, speakers, and environment). The design of a
speech recognition system requires careful attention to the following issues:
Definition of various types of speech classes, speech representation, feature
extraction techniques, speech classifiers, database and performance evaluation.
The problems existing in ASR and the various techniques developed by
researchers to solve them are presented in chronological order. The authors
hope that this work will be a contribution to the area of speech recognition.
The objective of this review paper is to
summarize and compare some of the well known methods used in various stages of
speech recognition system and identify research topic and applications which
are at the forefront of this exciting and challenging field. Comment: 25 pages, IEEE format, International Journal of Computer Science and Information Security (IJCSIS), December 2009, ISSN 1947-5500, http://sites.google.com/site/ijcsis
Adverse Conditions and ASR Techniques for Robust Speech User Interface
The main motivation for Automatic Speech Recognition (ASR) is efficient
interfaces to computers, and for the interfaces to be natural and truly useful,
they should provide coverage for a large group of users. The purpose of these
tasks is to further improve man-machine communication. ASR systems exhibit
unacceptable degradations in performance when the acoustical environments used
for training and testing the system are not the same. The goal of this research
is to increase the robustness of the speech recognition systems with respect to
changes in the environment. A system can be labeled as environment-independent
if the recognition accuracy for a new environment is the same or higher than
that obtained when the system is retrained for that environment. Attaining such
performance remains a long-standing research goal. This paper elaborates on
some of the difficulties with Automatic Speech Recognition (ASR), classifies
them into speaker characteristics and environmental conditions, and suggests
some techniques to compensate for variations in the speech signal.
It focuses on robustness with respect to speaker variations and changes in the
acoustical environment. We discuss several external factors that change the
environment and physiological differences that affect the performance of a
speech recognition system, followed by techniques that help in designing a
robust ASR system. Comment: 10 pages, 2 tables