28,901 research outputs found
Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments
Eliminating the negative effect of non-stationary environmental noise is a
long-standing research topic for automatic speech recognition that stills
remains an important challenge. Data-driven supervised approaches, including
ones based on deep neural networks, have recently emerged as potential
alternatives to traditional unsupervised approaches and with sufficient
training, can alleviate the shortcomings of the unsupervised methods in various
real-life acoustic environments. In this light, we review recently developed,
representative deep learning approaches for tackling non-stationary additive
and convolutional degradation of speech with the aim of providing guidelines
for those involved in the development of environmentally robust speech
recognition systems. We separately discuss single- and multi-channel techniques
developed for the front-end and back-end of speech recognition systems, as well
as joint front-end and back-end training frameworks
Automotive three-microphone voice activity detector and noise-canceller
This paper addresses issues in improving hands-free speech recognition performance in car
environments. A three-microphone array has been used to form a beamformer with leastmean
squares (LMS) to improve Signal to Noise Ratio (SNR). A three-microphone array
has been paralleled to a Voice Activity Detection (VAD). The VAD uses time-delay
estimation together with magnitude-squared coherence (MSC)
Block-Online Multi-Channel Speech Enhancement Using DNN-Supported Relative Transfer Function Estimates
This work addresses the problem of block-online processing for multi-channel
speech enhancement. Such processing is vital in scenarios with moving speakers
and/or when very short utterances are processed, e.g., in voice assistant
scenarios. We consider several variants of a system that performs beamforming
supported by DNN-based voice activity detection (VAD) followed by
post-filtering. The speaker is targeted through estimating relative transfer
functions between microphones. Each block of the input signals is processed
independently in order to make the method applicable in highly dynamic
environments. Owing to the short length of the processed block, the statistics
required by the beamformer are estimated less precisely. The influence of this
inaccuracy is studied and compared to the processing regime when recordings are
treated as one block (batch processing). The experimental evaluation of the
proposed method is performed on large datasets of CHiME-4 and on another
dataset featuring moving target speaker. The experiments are evaluated in terms
of objective and perceptual criteria (such as signal-to-interference ratio
(SIR) or perceptual evaluation of speech quality (PESQ), respectively).
Moreover, word error rate (WER) achieved by a baseline automatic speech
recognition system is evaluated, for which the enhancement method serves as a
front-end solution. The results indicate that the proposed method is robust
with respect to short length of the processed block. Significant improvements
in terms of the criteria and WER are observed even for the block length of 250
ms.Comment: 10 pages, 8 figures, 4 tables. Modified version of the article
accepted for publication in IET Signal Processing journal. Original results
unchanged, additional experiments presented, refined discussion and
conclusion
The Application of Blind Source Separation to Feature Decorrelation and Normalizations
We apply a Blind Source Separation BSS algorithm to the decorrelation of Mel-warped cepstra. The observed cepstra are modeled as a convolutive mixture of independent source cepstra. The algorithm aims to minimize a cross-spectral correlation at different lags to reconstruct the source cepstra. Results show that using "independent" cepstra as features leads to a reduction in the WER.Finally, we present three different enhancements to the BSS algorithm. We also present some results of these deviations of the original algorithm
BigEAR: Inferring the Ambient and Emotional Correlates from Smartphone-based Acoustic Big Data
This paper presents a novel BigEAR big data framework that employs
psychological audio processing chain (PAPC) to process smartphone-based
acoustic big data collected when the user performs social conversations in
naturalistic scenarios. The overarching goal of BigEAR is to identify moods of
the wearer from various activities such as laughing, singing, crying, arguing,
and sighing. These annotations are based on ground truth relevant for
psychologists who intend to monitor/infer the social context of individuals
coping with breast cancer. We pursued a case study on couples coping with
breast cancer to know how the conversations affect emotional and social well
being. In the state-of-the-art methods, psychologists and their team have to
hear the audio recordings for making these inferences by subjective evaluations
that not only are time-consuming and costly, but also demand manual data coding
for thousands of audio files. The BigEAR framework automates the audio
analysis. We computed the accuracy of BigEAR with respect to the ground truth
obtained from a human rater. Our approach yielded overall average accuracy of
88.76% on real-world data from couples coping with breast cancer.Comment: 6 pages, 10 equations, 1 Table, 5 Figures, IEEE International
Workshop on Big Data Analytics for Smart and Connected Health 2016, June 27,
2016, Washington DC, US
- …