Exploring the robustness of features and enhancement on speech recognition systems in highly-reverberant real environments
This paper evaluates the robustness of a DNN-HMM-based speech recognition
system in highly-reverberant real environments using the HRRE database. The
performance of locally-normalized filter bank (LNFB) and Mel filter bank
(MelFB) features in combination with Non-negative Matrix Factorization (NMF),
Suppression of Slowly-varying components and the Falling edge (SSF), and
Weighted Prediction Error (WPE) enhancement methods is discussed and
evaluated. Two training conditions were considered: clean and reverberated
(Reverb). With Reverb training, the use of WPE and LNFB provides WERs that are
3% and 20% lower on average than SSF and NMF, respectively. WPE and MelFB
provide WERs that are 11% and 24% lower on average than SSF and NMF,
respectively. With clean training, which represents a significant mismatch
between testing and training conditions, LNFB features clearly outperform MelFB
features. The results show that different types of training, parametrization,
and enhancement techniques may work better for a specific combination of
speaker-microphone distance and reverberation time. This suggests that there
could be some degree of complementarity between systems trained with different
enhancement and parametrization methods. Comment: 5 pages
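The abstract above compares LNFB against standard MelFB features. As a minimal sketch of how MelFB log-energies are typically computed (assuming the common HTK-style mel scale, 16 kHz audio, and 23 triangular filters; the paper's exact settings are not given in the abstract):

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel-scale conversion (an assumption; other variants exist).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=23, n_fft=512, sr=16000):
    # Triangular filters equally spaced on the mel scale.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    return fb

# Log mel filter bank energies for a single (random stand-in) speech frame.
frame = np.random.randn(512)
spectrum = np.abs(np.fft.rfft(frame)) ** 2
log_mel = np.log(mel_filterbank() @ spectrum + 1e-10)
```

LNFB differs by applying a local normalization across neighbouring bands, which is what gives it the robustness advantage under clean-training mismatch reported above.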
Spoofing Detection Goes Noisy: An Analysis of Synthetic Speech Detection in the Presence of Additive Noise
Automatic speaker verification (ASV) technology is recently finding its way
to end-user applications for secure access to personal data, smart services or
physical facilities. Similar to other biometric technologies, speaker
verification is vulnerable to spoofing attacks where an attacker masquerades as
a particular target speaker via impersonation, replay, text-to-speech (TTS) or
voice conversion (VC) techniques to gain illegitimate access to the system. We
focus on TTS and VC that represent the most flexible, high-end spoofing
attacks. Most of the prior studies on synthesized or converted speech detection
report their findings using high-quality clean recordings. Meanwhile, the
performance of spoofing detectors in the presence of additive noise, an
important consideration in practical ASV implementations, remains largely
unknown. To this end, we analyze the suitability of state-of-the-art synthetic
speech detectors under additive noise with a special focus on front-end
features. Our comparison includes eight acoustic feature sets, five related to
spectral magnitude and three to spectral phase information. Our extensive
experiments on the ASVspoof 2015 corpus reveal several important findings. Firstly,
all the countermeasures break down even at relatively high signal-to-noise
ratios (SNRs) and fail to generalize to noisy conditions. Secondly, speech
enhancement is not found helpful. Thirdly, GMM back-end generally outperforms
the more involved i-vector back-end. Fourthly, concerning the compared
features, the Mel-frequency cepstral coefficients (MFCCs) and subband spectral
centroid magnitude coefficients (SCMCs) perform the best on average though the
winner method depends on SNR and noise type. Finally, a study with two score
fusion strategies shows that combining different feature-based systems improves
recognition accuracy for known and unknown attacks in both clean and noisy
conditions. Comment: 23 pages, 7 figures
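The GMM back-end that the study finds competitive scores an utterance by the average per-frame log-likelihood ratio between a "human" and a "spoof" model. A minimal sketch with toy diagonal-covariance GMMs standing in for models trained on real MFCC features (all parameters here are illustrative, not from the paper):

```python
import numpy as np

def gmm_loglik(X, weights, means, variances):
    # Per-frame log-likelihood under a diagonal-covariance GMM.
    # X: (frames, dim); weights: (K,); means, variances: (K, dim).
    diff = X[:, None, :] - means[None, :, :]
    log_comp = (np.log(weights)
                - 0.5 * np.sum(diff ** 2 / variances
                               + np.log(2.0 * np.pi * variances), axis=2))
    m = log_comp.max(axis=1, keepdims=True)  # stable log-sum-exp
    return (m + np.log(np.sum(np.exp(log_comp - m), axis=1,
                              keepdims=True))).ravel()

def spoofing_score(X, gmm_human, gmm_spoof):
    # Average per-frame log-likelihood ratio; higher -> more human-like.
    return float(np.mean(gmm_loglik(X, *gmm_human))
                 - np.mean(gmm_loglik(X, *gmm_spoof)))

# Toy 2-component GMMs; in practice both are EM-trained on labelled data.
rng = np.random.default_rng(0)
dim = 4
gmm_human = (np.array([0.5, 0.5]), rng.normal(0, 1, (2, dim)), np.ones((2, dim)))
gmm_spoof = (np.array([0.5, 0.5]), rng.normal(5, 1, (2, dim)), np.ones((2, dim)))
human_like = rng.normal(0, 1, (50, dim))
score = spoofing_score(human_like, gmm_human, gmm_spoof)
```

Thresholding this score gives the accept/reject decision; the paper's finding is that under additive noise the score distributions of the two classes collapse together.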
Speech Recognition Front End Without Information Loss
Speech representation and modelling in high-dimensional spaces of acoustic
waveforms, or a linear transformation thereof, is investigated with the aim of
improving the robustness of automatic speech recognition to additive noise. The
motivation behind this approach is twofold: (i) the information in acoustic
waveforms that is usually removed in the process of extracting low-dimensional
features might aid robust recognition by virtue of structured redundancy
analogous to channel coding, (ii) linear feature domains allow for exact noise
adaptation, as opposed to representations that involve non-linear processing
which makes noise adaptation challenging. Thus, we develop a generative
framework for phoneme modelling in high-dimensional linear feature domains, and
use it in phoneme classification and recognition tasks. Results show that
classification and recognition in this framework perform better than analogous
PLP and MFCC classifiers below 18 dB SNR. A combination of the high-dimensional
and MFCC features at the likelihood level performs uniformly better than either
of the individual representations across all noise levels.
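The "exact noise adaptation" property claimed for linear feature domains can be stated concretely: if the clean-speech model and the noise are both Gaussian in a linear domain, the noisy-speech model follows in closed form. A minimal sketch with an empirical sanity check (toy 2-D parameters, chosen here for illustration):

```python
import numpy as np

def adapt_noisy_model(mu_clean, cov_clean, mu_noise, cov_noise):
    # In a linear feature domain, y = x + n with independent Gaussian x and n
    # gives an exactly Gaussian y: means and covariances simply add.
    # Log-spectral or cepstral domains lose this property, which is why
    # noise adaptation there needs approximations such as VTS.
    return mu_clean + mu_noise, cov_clean + cov_noise

# Empirical check of the exactness claim (fixed seed, large sample).
rng = np.random.default_rng(1)
mu_x, cov_x = np.array([1.0, -1.0]), np.diag([1.0, 2.0])
mu_n, cov_n = np.array([0.5, 0.5]), np.diag([0.5, 0.5])
x = rng.multivariate_normal(mu_x, cov_x, 200000)
n = rng.multivariate_normal(mu_n, cov_n, 200000)
mu_y, cov_y = adapt_noisy_model(mu_x, cov_x, mu_n, cov_n)
```

This is the mechanism that lets the high-dimensional classifiers above be adapted to a new noise level without retraining.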
GEDI: Gammachirp Envelope Distortion Index for Predicting Intelligibility of Enhanced Speech
In this study, we propose a new concept, the gammachirp envelope distortion
index (GEDI), based on the signal-to-distortion ratio in the auditory envelope,
SDRenv, to predict the intelligibility of speech enhanced by nonlinear
algorithms. The objective of GEDI is to calculate the distortion between
enhanced and clean-speech representations in the domain of a temporal envelope
extracted by the gammachirp auditory filterbank and modulation filterbank. We
also extend GEDI with multi-resolution analysis (mr-GEDI) to predict the speech
intelligibility of sounds under non-stationary noise conditions. We evaluate
GEDI in terms of speech intelligibility predictions of speech sounds enhanced
by a classic spectral subtraction and a Wiener filtering method. The
predictions are compared with human results for various signal-to-noise ratio
conditions with additive pink and babble noises. The results showed that
mr-GEDI predicted the intelligibility curves better than short-time objective
intelligibility (STOI) measure, extended-STOI (ESTOI) measure, and hearing-aid
speech perception index (HASPI) under pink-noise conditions, and better than
HASPI under babble-noise conditions. The mr-GEDI method does not present an
overestimation tendency and is considered a more conservative approach than
STOI and ESTOI. Therefore, the evaluation with mr-GEDI may provide additional
information in the development of speech enhancement algorithms. Comment: Preprint, 37 pages, 6 tables, 9 figures
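The core quantity behind GEDI, the signal-to-distortion ratio between clean and enhanced temporal envelopes, can be sketched as below. Note this is a deliberately simplified stand-in: GEDI extracts envelopes with a gammachirp auditory filterbank and a modulation filterbank, whereas here a crude rectify-and-smooth envelope is assumed for illustration:

```python
import numpy as np

def temporal_envelope(x, win=32):
    # Crude envelope: rectification followed by a moving average
    # (a stand-in for the gammachirp/modulation filterbank front end).
    return np.convolve(np.abs(x), np.ones(win) / win, mode="same")

def sdr_env(clean, enhanced, win=32):
    # Signal-to-distortion ratio in the envelope domain (dB): the
    # quantity GEDI maps to an intelligibility prediction.
    e_c = temporal_envelope(clean, win)
    e_e = temporal_envelope(enhanced, win)
    return float(10.0 * np.log10(np.sum(e_c ** 2)
                                 / (np.sum((e_c - e_e) ** 2) + 1e-12)))

# Toy signals: a modulated carrier and a noisier version of it.
rng = np.random.default_rng(2)
clean = np.sin(2 * np.pi * 0.02 * np.arange(4000)) * rng.standard_normal(4000)
noisy = clean + 0.5 * rng.standard_normal(4000)
```

Higher SDRenv means less envelope distortion, hence a higher predicted intelligibility; the multi-resolution variant repeats this at several envelope time scales.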
ASSERT: Anti-Spoofing with Squeeze-Excitation and Residual neTworks
We present JHU's system submission to the ASVspoof 2019 Challenge:
Anti-Spoofing with Squeeze-Excitation and Residual neTworks (ASSERT).
Anti-spoofing has gathered increasing attention since the inauguration of
the ASVspoof Challenges, and ASVspoof 2019 is dedicated to addressing attacks of
all three major types: text-to-speech, voice conversion, and replay. Built upon
previous research on deep neural networks (DNNs), ASSERT is a pipeline for
DNN-based anti-spoofing. ASSERT has four components: feature
engineering, DNN models, network optimization and system combination, where the
DNN models are variants of squeeze-excitation and residual networks. We
conducted an ablation study of the effectiveness of each component on the
ASVspoof 2019 corpus, and experimental results showed that ASSERT obtained more
than 93% and 17% relative improvements over the baseline systems in the two
sub-challenges of ASVspoof 2019, ranking ASSERT among the top performing
systems. Code and pretrained models will be made publicly available. Comment: Submitted to Interspeech 2019, Graz, Austria
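The squeeze-excitation mechanism named in the title reweights feature-map channels by a learned, input-dependent gate. A minimal numpy sketch (the real ASSERT models are trained deep networks; the shapes and random weights here are purely illustrative):

```python
import numpy as np

def squeeze_excitation(feature_map, w1, w2):
    # Squeeze: global average pooling collapses each channel's
    # time-frequency plane to a single descriptor.
    z = feature_map.mean(axis=(1, 2))                        # (C,)
    # Excitation: bottleneck MLP (ReLU then sigmoid) yields per-channel
    # gates in (0, 1) that rescale the original feature map.
    gates = 1.0 / (1.0 + np.exp(-(w2 @ np.maximum(w1 @ z, 0.0))))
    return feature_map * gates[:, None, None]

rng = np.random.default_rng(3)
C, r = 8, 2                                   # channels, reduction ratio
fm = rng.standard_normal((C, 16, 16))         # a (C, time, freq) feature map
w1 = rng.standard_normal((C // r, C)) * 0.1   # squeeze C -> C/r
w2 = rng.standard_normal((C, C // r)) * 0.1   # expand back to C
out = squeeze_excitation(fm, w1, w2)
```

Because the gates are computed from the whole time-frequency plane, the block lets the network emphasize channels that capture spoofing artefacts regardless of where they occur in the utterance.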
A Deep Variational Convolutional Neural Network for Robust Speech Recognition in the Waveform Domain
We investigate the potential of probabilistic neural networks for learning of
robust waveform-based acoustic models. To that end, we consider a deep
convolutional network that first decomposes speech into frequency sub-bands via
an adaptive parametric convolutional block where filters are specified by
cosine modulations of compactly supported windows. The network then employs
standard non-parametric wide-pass filters, i.e., 1D convolutions, to extract
the most relevant spectro-temporal patterns while gradually compressing the
structured high dimensional representation generated by the parametric block.
We rely on a probabilistic parametrization of the proposed architecture and
learn the model using stochastic variational inference. This requires
evaluation of an analytically intractable integral defining the
Kullback-Leibler divergence term responsible for regularization, for which we
propose an effective approximation based on the Gauss-Hermite quadrature. Our
empirical results demonstrate a superior performance of the proposed approach
over relevant waveform-based baselines and indicate that it could lead to
robustness. Moreover, the approach outperforms a recently proposed deep
convolutional network for learning of robust acoustic models with standard
filterbank features.
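The Gauss-Hermite quadrature used above to approximate the intractable KL term is a standard device for Gaussian expectations, and can be sketched in a few lines (the toy integrand below is illustrative, not the paper's actual KL integrand):

```python
import numpy as np

def gh_expectation(f, mu, sigma, order=20):
    # E_{x ~ N(mu, sigma^2)}[f(x)] by Gauss-Hermite quadrature: substitute
    # x = mu + sqrt(2)*sigma*t, so the Gaussian weight becomes exp(-t^2)
    # and the integral reduces to a weighted sum at the Hermite nodes.
    nodes, weights = np.polynomial.hermite.hermgauss(order)
    values = f(mu + np.sqrt(2.0) * sigma * nodes)
    return float(np.sum(weights * values) / np.sqrt(np.pi))

# Sanity check against a closed form: E[x^2] = mu^2 + sigma^2 = 1.25.
approx = gh_expectation(lambda x: x ** 2, mu=1.0, sigma=0.5)
```

With an order-n rule the approximation is exact for polynomial integrands up to degree 2n-1, which is why a modest number of nodes suffices inside stochastic variational inference.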
Speech Enhancement Based on Reducing the Detail Portion of Speech Spectrograms in Modulation Domain via Discrete Wavelet Transform
In this paper, we propose a novel speech enhancement (SE) method by
exploiting the discrete wavelet transform (DWT). This new method reduces the
amount of fast time-varying portion, viz. the DWT-wise detail component, in the
spectrogram of speech signals so as to highlight the speech-dominant component
and achieves better speech quality. A particularity of this new method is that
it is completely unsupervised and requires no prior information about the clean
speech and noise in the processed utterance. The presented DWT-based SE method
with various scaling factors for the detail part is evaluated with a subset of
Aurora-2 database, and the PESQ metric is used to indicate the quality of
processed speech signals. The preliminary results show that the processed
speech signals reveal a higher PESQ score in comparison with the original
counterparts. Furthermore, we show that this method can still enhance the
signal by totally discarding the detail part (setting the respective scaling
factor to zero), revealing that the spectrogram can be down-sampled and thus
compressed without the cost of lowered quality. In addition, we integrate this
new method with conventional speech enhancement algorithms, including spectral
subtraction, Wiener filtering, and spectral MMSE estimation, and show that the
resulting integration outperforms each respective component method. As a
result, this new method is quite effective in improving speech quality and
complements the other SE methods well. Comment: 4 pages, 4 figures, to appear in ISCSLP 201
Deep Scattering Spectrum
A scattering transform defines a locally translation invariant representation
which is stable to time-warping deformations. It extends MFCC representations
by computing modulation spectrum coefficients of multiple orders, through
cascades of wavelet convolutions and modulus operators. Second-order scattering
coefficients characterize transient phenomena such as attacks and amplitude
modulation. A frequency transposition invariant representation is obtained by
applying a scattering transform along log-frequency. State-of-the-art
classification results are obtained for musical genre and phone classification
on the GTZAN and TIMIT databases, respectively.
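The wavelet-modulus cascade described above can be sketched in miniature. This is a toy illustration only: it uses simple Gabor atoms and global averaging in place of a proper wavelet filterbank and lowpass, and the centre frequencies are arbitrary:

```python
import numpy as np

def gabor(center, width, n=64):
    # Complex Gabor atom: a windowed complex exponential, standing in
    # for the analytic wavelets of a real scattering transform.
    t = np.arange(-n // 2, n // 2)
    return np.exp(-0.5 * (t / width) ** 2) * np.exp(2j * np.pi * center * t)

def scatter2(x, centers=(0.05, 0.1, 0.2), width=8.0):
    # First order: modulus of wavelet-like convolutions, then averaging --
    # roughly the information carried by MFCC-type features.
    s1, s2 = [], []
    for c1 in centers:
        u1 = np.abs(np.convolve(x, gabor(c1, width), mode="same"))
        s1.append(u1.mean())
        # Second order: cascade the same operation on the first-order
        # modulus, recovering amplitude-modulation detail (attacks,
        # tremolo) that the averaging discards. Only modulation
        # frequencies below the carrier are kept.
        for c2 in centers:
            if c2 >= c1:
                continue
            u2 = np.abs(np.convolve(u1, gabor(c2, width), mode="same"))
            s2.append(u2.mean())
    return np.array(s1), np.array(s2)

rng = np.random.default_rng(4)
x = rng.standard_normal(512)
s1, s2 = scatter2(x)
```

The first-order coefficients play the role of the MFCC-like description; the second-order ones are the extra modulation-spectrum coefficients the paper credits for characterizing transients.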
Speech Recognition by Machine, A Review
This paper presents a brief survey on Automatic Speech Recognition and
discusses the major themes and advances made in the past 60 years of research,
so as to provide a technological perspective and an appreciation of the
fundamental progress that has been accomplished in this important area of
speech communication. After years of research and development, the accuracy of
automatic speech recognition remains one of the important research challenges
(e.g., variations of context, speakers, and environment). The design of a
speech recognition system requires careful attention to the following issues:
Definition of various types of speech classes, speech representation, feature
extraction techniques, speech classifiers, database and performance evaluation.
The problems existing in ASR and the various techniques developed by
researchers to solve them are presented in chronological order. The authors
hope that this work will be a contribution to the area of speech recognition.
The objective of this review paper is to
summarize and compare some of the well known methods used in various stages of
speech recognition system and identify research topic and applications which
are at the forefront of this exciting and challenging field. Comment: 25 pages, IEEE format, International Journal of Computer Science and Information Security (IJCSIS), December 2009, ISSN 1947-5500, http://sites.google.com/site/ijcsis
Adverse Conditions and ASR Techniques for Robust Speech User Interface
The main motivation for Automatic Speech Recognition (ASR) is efficient
interfaces to computers, and for the interfaces to be natural and truly useful,
they should provide coverage for a large group of users. The purpose of these
tasks is to further improve man-machine communication. ASR systems exhibit
unacceptable degradations in performance when the acoustical environments used
for training and testing the system are not the same. The goal of this research
is to increase the robustness of the speech recognition systems with respect to
changes in the environment. A system can be labeled as environment-independent
if the recognition accuracy for a new environment is the same or higher than
that obtained when the system is retrained for that environment. Attaining such
performance remains a long-standing research goal. This paper elaborates on
some of the difficulties with Automatic Speech Recognition (ASR), classifies
them into speaker characteristics and environmental conditions, and suggests
some techniques to compensate for variations in the speech signal.
It focuses on robustness with respect to speaker variations and changes in the
acoustical environment. We discuss several external factors that change the
environment and physiological differences that affect the performance of a
speech recognition system, followed by techniques that help in designing a
robust ASR system. Comment: 10 pages, 2 tables