Intelligibility prediction with a pretrained noise-robust automatic speech recognition model
This paper describes two intelligibility prediction systems derived from a
pretrained noise-robust automatic speech recognition (ASR) model for the second
Clarity Prediction Challenge (CPC2). One system is intrusive and leverages the
hidden representations of the ASR model. The other system is non-intrusive and
makes predictions with derived ASR uncertainty. The ASR model is only
pretrained with a simulated noisy speech corpus and does not take advantage of
the CPC2 data. For that reason, the intelligibility prediction systems are
robust to unseen scenarios, as indicated by the accurate prediction performance on the
CPC2 evaluation set.
Machine Recognition of Sounds in Mixtures
An overview of work on recognizing speech in mixtures using missing data techniques and searching across possible segmentations
Unsupervised Uncertainty Measures of Automatic Speech Recognition for Non-intrusive Speech Intelligibility Prediction
Non-intrusive intelligibility prediction is important for its application in
realistic scenarios, where a clean reference signal is difficult to access. The
construction of many non-intrusive predictors require either ground truth
intelligibility labels or clean reference signals for supervised learning. In
this work, we leverage an unsupervised uncertainty estimation method for
predicting speech intelligibility, which does not require intelligibility
labels or reference signals to train the predictor. Our experiments demonstrate
that the uncertainty from state-of-the-art end-to-end automatic speech
recognition (ASR) models is highly correlated with speech intelligibility. The
proposed method is evaluated on two databases and the results show that the
unsupervised uncertainty measures of ASR models are more correlated with speech
intelligibility from listening results than the predictions made by widely used
intrusive methods.
Comment: Submitted to INTERSPEECH202
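The core idea above, scoring intelligibility by how uncertain an ASR model is, can be sketched with a simple frame-entropy measure (a minimal illustration, not the paper's actual uncertainty estimator; the function names are hypothetical):

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax over the token dimension."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mean_frame_entropy(logits):
    """Average per-frame entropy (nats) of the ASR posterior.

    Higher entropy means the recogniser is less certain, which the
    abstract reports correlates with lower speech intelligibility.
    logits: array of shape (num_frames, vocab_size).
    """
    p = softmax(logits)
    entropy = -(p * np.log(p + 1e-12)).sum(axis=-1)
    return float(entropy.mean())

# A peaked posterior (confident recogniser) yields low entropy ...
confident = np.array([[10.0, 0.0, 0.0], [0.0, 12.0, 0.0]])
# ... while a flat posterior (uncertain recogniser) yields high entropy.
uncertain = np.zeros((2, 3))

assert mean_frame_entropy(confident) < mean_frame_entropy(uncertain)
```

Because this needs neither intelligibility labels nor a clean reference, it fits the non-intrusive, unsupervised setting the abstract describes.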
Exploiting Hidden Representations from a DNN-based Speech Recogniser for Speech Intelligibility Prediction in Hearing-impaired Listeners
An accurate objective speech intelligibility prediction algorithm is of
great interest for many applications such as speech enhancement for hearing
aids. Most algorithms measure the signal-to-noise ratios or correlations
between the acoustic features of clean reference signals and degraded signals.
However, these hand-picked acoustic features are usually not explicitly
correlated with recognition. Meanwhile, deep neural network (DNN) based
automatic speech recognisers (ASR) are approaching human performance in some
speech recognition tasks. This work leverages the hidden representations from
DNN-based ASR as features for speech intelligibility prediction in
hearing-impaired listeners. The experiments based on a hearing aid
intelligibility database show that the proposed method can make better
predictions than a widely used short-time objective intelligibility (STOI) based
binaural measure.
Comment: Submitted to INTERSPEECH202
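A minimal sketch of the intrusive idea above, comparing ASR hidden representations of a clean reference and a degraded signal (the helper is hypothetical and assumes the features are already extracted; the paper trains a predictor on such representations rather than using raw similarity):

```python
import numpy as np

def hidden_similarity(clean_feats, degraded_feats):
    """Frame-averaged cosine similarity between ASR hidden
    representations of a clean reference and a degraded signal.

    Both arrays have shape (num_frames, hidden_dim); a value near 1.0
    suggests the degradation has barely changed what the recogniser
    'hears', which should track intelligibility.
    """
    num = (clean_feats * degraded_feats).sum(axis=-1)
    den = (np.linalg.norm(clean_feats, axis=-1)
           * np.linalg.norm(degraded_feats, axis=-1) + 1e-12)
    return float((num / den).mean())

# Identical representations give a similarity of ~1.0.
feats = np.random.default_rng(0).normal(size=(50, 8))
print(hidden_similarity(feats, feats))  # ~1.0
```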
On End-to-end Multi-channel Time Domain Speech Separation in Reverberant Environments
This paper introduces a new method for multi-channel time domain speech
separation in reverberant environments. A fully-convolutional neural network
structure has been used to directly separate speech from multiple microphone
recordings, without the need for conventional spatial feature extraction. To reduce
the influence of reverberation on spatial feature extraction, a dereverberation
pre-processing method has been applied to further improve the separation
performance. A spatialized version of wsj0-2mix dataset has been simulated to
evaluate the proposed system. Both source separation and speech recognition
performance of the separated signals have been evaluated objectively.
Experiments show that the proposed fully-convolutional network improves the
source separation metric and the word error rate (WER) by more than 13% and 50%
relative, respectively, over a reference system with conventional features.
Applying dereverberation as pre-processing to the proposed system can further
reduce the WER by 29% relative using an acoustic model trained on clean and
reverberated data.
Comment: Presented at IEEE ICASSP 202
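The gains quoted above are relative reductions, and the arithmetic is worth making explicit (a small illustration, not code from the paper; the WER values are made up):

```python
def relative_reduction(baseline, improved):
    """Relative reduction of an error metric, as a percentage.

    E.g. a WER that drops from 40% to 20% is a 50% *relative*
    improvement, even though the absolute drop is 20 points.
    """
    return 100.0 * (baseline - improved) / baseline

assert relative_reduction(40.0, 20.0) == 50.0
```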
- …