2,164 research outputs found
The listening talker: A review of human and algorithmic context-induced modifications of speech
International audienceSpeech output technology is finding widespread application, including in scenarios where intelligibility might be compromised - at least for some listeners - by adverse conditions. Unlike most current algorithms, talkers continually adapt their speech patterns as a response to the immediate context of spoken communication, where the type of interlocutor and the environment are the dominant situational factors influencing speech production. Observations of talker behaviour can motivate the design of more robust speech output algorithms. Starting with a listener-oriented categorisation of possible goals for speech modification, this review article summarises the extensive set of behavioural findings related to human speech modification, identifies which factors appear to be beneficial, and goes on to examine previous computational attempts to improve intelligibility in noise. The review concludes by tabulating 46 speech modifications, many of which have yet to be perceptually or algorithmically evaluated. Consequently, the review provides a roadmap for future work in improving the robustness of speech output
An evaluation of intrusive instrumental intelligibility metrics
Instrumental intelligibility metrics are commonly used as an alternative to
listening tests. This paper evaluates 12 monaural intrusive intelligibility
metrics: SII, HEGP, CSII, HASPI, NCM, QSTI, STOI, ESTOI, MIKNN, SIMI, SIIB, and
. In addition, this paper investigates the ability of
intelligibility metrics to generalize to new types of distortions and analyzes
why the top performing metrics have high performance. The intelligibility data
were obtained from 11 listening tests described in the literature. The stimuli
included Dutch, Danish, and English speech that was distorted by additive
noise, reverberation, competing talkers, pre-processing enhancement, and
post-processing enhancement. SIIB and HASPI had the highest performance
achieving a correlation with listening test scores on average of
and , respectively. The high performance of SIIB may, in part, be
the result of SIIBs developers having access to all the intelligibility data
considered in the evaluation. The results show that intelligibility metrics
tend to perform poorly on data sets that were not used during their
development. By modifying the original implementations of SIIB and STOI, the
advantage of reducing statistical dependencies between input features is
demonstrated. Additionally, the paper presents a new version of SIIB called
, which has similar performance to SIIB and HASPI,
but takes less time to compute by two orders of magnitude.Comment: Published in IEEE/ACM Transactions on Audio, Speech, and Language
Processing, 201
Effects of noise suppression and envelope dynamic range compression on the intelligibility of vocoded sentences for a tonal language
Vocoder simulation studies have suggested that the carrier signal type employed affects the intelligibility of vocoded speech. The present work further assessed how carrier signal type interacts with additional signal processing, namely, single-channel noise suppression and envelope dynamic range compression, in determining the intelligibility of vocoder simulations. In Experiment 1, Mandarin sentences that had been corrupted by speech spectrum-shaped noise (SSN) or two-talker babble (2TB) were processed by one of four single-channel noise-suppression algorithms before undergoing tone-vocoded (TV) or noise-vocoded (NV) processing. In Experiment 2, dynamic ranges of multiband envelope waveforms were compressed by scaling of the mean-removed envelope waveforms with a compression factor before undergoing TV or NV processing. TV Mandarin sentences yielded higher intelligibility scores with normal-hearing (NH) listeners than did noise-vocoded sentences. The intelligibility advantage of noise-suppressed vocoded speech depended on the masker type (SSN vs 2TB). NV speech was more negatively influenced by envelope dynamic range compression than was TV speech. These findings suggest that an interactional effect exists between the carrier signal type employed in the vocoding process and envelope distortion caused by signal processing
A convolutional neural-network model of human cochlear mechanics and filter tuning for real-time applications
Auditory models are commonly used as feature extractors for automatic
speech-recognition systems or as front-ends for robotics, machine-hearing and
hearing-aid applications. Although auditory models can capture the biophysical
and nonlinear properties of human hearing in great detail, these biophysical
models are computationally expensive and cannot be used in real-time
applications. We present a hybrid approach where convolutional neural networks
are combined with computational neuroscience to yield a real-time end-to-end
model for human cochlear mechanics, including level-dependent filter tuning
(CoNNear). The CoNNear model was trained on acoustic speech material and its
performance and applicability were evaluated using (unseen) sound stimuli
commonly employed in cochlear mechanics research. The CoNNear model accurately
simulates human cochlear frequency selectivity and its dependence on sound
intensity, an essential quality for robust speech intelligibility at negative
speech-to-background-noise ratios. The CoNNear architecture is based on
parallel and differentiable computations and has the power to achieve real-time
human performance. These unique CoNNear features will enable the next
generation of human-like machine-hearing applications
Forward masking threshold estimation using neural networks and its application to parallel speech enhancement
Forward masking models have been used successfully in speech enhancement and audio coding. Presently, forward masking thresholds are estimated using simplified masking models which have been used for audio coding and speech enhancement applications. In this paper, an accurate approximation of forward masking threshold estimation using neural networks is proposed. A performance comparison to the other existing masking models in speech enhancement application is presented. Objective measures using PESQ demonstrates that our proposed forward masking model, provides significant improvements (5-15 %) over four existing models, when tested with speech signals corrupted by various noises at very low signal to noise ratios. Moreover, a parallel implementation of the speech enhancement algorithm was developed using Matlab parallel computing toolbox
Collaborative Deep Learning for Speech Enhancement: A Run-Time Model Selection Method Using Autoencoders
We show that a Modular Neural Network (MNN) can combine various speech
enhancement modules, each of which is a Deep Neural Network (DNN) specialized
on a particular enhancement job. Differently from an ordinary ensemble
technique that averages variations in models, the propose MNN selects the best
module for the unseen test signal to produce a greedy ensemble. We see this as
Collaborative Deep Learning (CDL), because it can reuse various already-trained
DNN models without any further refining. In the proposed MNN selecting the best
module during run time is challenging. To this end, we employ a speech
AutoEncoder (AE) as an arbitrator, whose input and output are trained to be as
similar as possible if its input is clean speech. Therefore, the AE can gauge
the quality of the module-specific denoised result by seeing its AE
reconstruction error, e.g. low error means that the module output is similar to
clean speech. We propose an MNN structure with various modules that are
specialized on dealing with a specific noise type, gender, and input
Signal-to-Noise Ratio (SNR) value, and empirically prove that it almost always
works better than an arbitrarily chosen DNN module and sometimes as good as an
oracle result
- …