Acoustics-guided evaluation (AGE): a new measure for estimating performance of speech enhancement algorithms for robust ASR
One challenging problem in robust automatic speech recognition (ASR) is how
to measure the quality of a speech enhancement algorithm (SEA) without
calculating the word error rate (WER), which is costly because it requires
manual transcription, language modeling, and decoding. Traditional measures
such as PESQ and STOI, which evaluate speech quality and intelligibility, have
been shown to correlate relatively weakly with WER. In this study, a novel
acoustics-guided evaluation (AGE) measure is proposed for estimating
performance of SEAs for robust ASR. AGE consists of three consecutive steps:
extracting low-level representations via feature extraction, obtaining
high-level representations via a nonlinear mapping with the acoustic model
(AM), and calculating the final AGE between the representations of clean speech
and degraded speech. Specifically, state posterior probabilities from a neural-network-based
AM are adopted for the high-level representations and the cross-entropy
criterion is used to calculate AGE. Experiments demonstrate that AGE can yield
consistently the highest correlations with WER and give the most accurate
estimation of ASR performance compared with PESQ, STOI, and an acoustic
confidence measure using entropy. Potentially, AGE could be adopted to guide the
parameter optimization of deep-learning-based SEAs to further improve
recognition performance.
Comment: Submitted to ICASSP 201
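The final AGE step described in this abstract, a cross-entropy between the state posteriors of clean and degraded speech, might be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `cross_entropy_age`, the frame-averaging, the choice of the clean posteriors as the target distribution, and the toy posteriors are all assumptions, and a real system would take its posteriors from a trained acoustic model.

```python
import math


def cross_entropy_age(clean_posteriors, degraded_posteriors, eps=1e-12):
    """Mean frame-level cross-entropy between two posterior sequences.

    Each argument is a list of frames; each frame is a list of state
    posterior probabilities (assumed to sum to 1). The clean posteriors
    act as the target distribution (an assumption of this sketch).
    """
    if not clean_posteriors:
        raise ValueError("at least one frame is required")
    if len(clean_posteriors) != len(degraded_posteriors):
        raise ValueError("both utterances must have the same number of frames")
    total = 0.0
    for p_clean, p_degraded in zip(clean_posteriors, degraded_posteriors):
        # H(p, q) = -sum_k p_k * log(q_k); eps guards against log(0).
        total += -sum(p * math.log(q + eps) for p, q in zip(p_clean, p_degraded))
    return total / len(clean_posteriors)


# Toy two-state posteriors for two frames (made-up numbers).
clean = [[0.9, 0.1], [0.2, 0.8]]
enhanced_close = [[0.8, 0.2], [0.3, 0.7]]  # posteriors close to the clean ones
enhanced_far = [[0.3, 0.7], [0.7, 0.3]]    # posteriors far from the clean ones

score_close = cross_entropy_age(clean, enhanced_close)
score_far = cross_entropy_age(clean, enhanced_far)
```

Under these assumptions a lower score means the degraded posteriors sit closer to the clean ones, so `score_close` comes out smaller than `score_far`.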
MetricGAN+: An Improved Version of MetricGAN for Speech Enhancement
The discrepancy between the cost function used for training a speech
enhancement model and human auditory perception usually makes the quality of
enhanced speech unsatisfactory. Objective evaluation metrics which consider
human perception can hence serve as a bridge to reduce the gap. Our previously
proposed MetricGAN was designed to optimize objective metrics by connecting the
metric with a discriminator. Because only the scores of the target evaluation
functions are needed during training, the metrics can even be
non-differentiable. In this study, we propose MetricGAN+, which introduces three
training techniques that incorporate domain knowledge of speech processing.
With these techniques, experimental results on the VoiceBank-DEMAND
dataset show that MetricGAN+ can increase the PESQ score by 0.3 compared to the
previous MetricGAN and achieve state-of-the-art results (PESQ score = 3.15).
Comment: Accepted by Interspeech 202
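The idea of connecting a metric to a discriminator can be illustrated with a small sketch. The abstract does not spell out the losses, so everything here is an assumption: the squared-error form, the normalization of the metric score to [0, 1], and the hypothetical names `discriminator_loss` and `generator_loss`. The point it shows is that only the scalar metric score enters the discriminator loss, so the metric itself never needs to be differentiated.

```python
def discriminator_loss(d_clean, d_enhanced, metric_score):
    """Squared-error loss that trains the discriminator to imitate the metric.

    d_clean      -- discriminator output for the (clean, clean) pair
    d_enhanced   -- discriminator output for the (enhanced, clean) pair
    metric_score -- metric score of the enhanced speech, assumed to be
                    normalized to [0, 1]; clean speech is assigned 1.0
    """
    return (d_clean - 1.0) ** 2 + (d_enhanced - metric_score) ** 2


def generator_loss(d_enhanced, target_score=1.0):
    """Pushes the discriminator's predicted score of enhanced speech to the target."""
    return (d_enhanced - target_score) ** 2


# The metric value is just a number here (e.g. a normalized PESQ score
# computed offline), so no gradient flows through the metric itself.
loss_d = discriminator_loss(d_clean=0.9, d_enhanced=0.4, metric_score=0.6)
loss_g = generator_loss(d_enhanced=0.4)
```

In this sketch the generator is trained through the discriminator, which acts as a learned, differentiable surrogate for the target metric.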
Fréchet Audio Distance: A Metric for Evaluating Music Enhancement Algorithms
We propose the Fréchet Audio Distance (FAD), a novel, reference-free
evaluation metric for music enhancement algorithms. We demonstrate how typical
evaluation metrics for speech enhancement and blind source separation can fail
to accurately measure the perceived effect of a wide variety of distortions. As
an alternative, we propose adapting the Fréchet Inception Distance (FID)
metric used to evaluate generative image models to the audio domain. FAD is
validated using a wide variety of artificial distortions and is compared to the
signal-based metrics signal-to-distortion ratio (SDR), cosine distance, and
magnitude L2 distance. We show that, with a correlation coefficient of 0.52,
FAD correlates more closely with human perception than SDR, cosine
distance, or magnitude L2 distance, which have correlation coefficients of 0.39,
-0.15, and -0.01, respectively.
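For context, FID is computed as the Fréchet distance between two multivariate Gaussians fitted to embedding statistics. Assuming FAD keeps the same closed form, with one Gaussian fitted to embeddings of a background set of clean music and the other to embeddings of the evaluated audio, the distance would read as below. The subscripts b and e and the symbol FD are notation chosen here for illustration, not taken from the abstract.

```latex
\[
\mathrm{FD}(\mathcal{N}_b, \mathcal{N}_e)
  = \lVert \mu_b - \mu_e \rVert^2
  + \operatorname{tr}\!\left( \Sigma_b + \Sigma_e
      - 2\,(\Sigma_b \Sigma_e)^{1/2} \right)
\]
```

Here \mu and \Sigma denote the mean vector and covariance matrix of each embedding set. The distance is zero when the two Gaussians coincide and grows as their statistics diverge, and it needs no time-aligned clean reference for each evaluated clip.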
Improving the Intelligibility of Electric and Acoustic Stimulation Speech Using Fully Convolutional Networks Based Speech Enhancement
Combined electric and acoustic stimulation (EAS) has demonstrated better
speech recognition than conventional cochlear implants (CIs) and yielded
satisfactory performance under quiet conditions. However, when noise signals
are involved, both the electric signal and the acoustic signal may be
distorted, thereby resulting in poor recognition performance. To suppress noise
effects, speech enhancement (SE) is a necessary unit in EAS devices. Recently,
a time-domain speech enhancement algorithm based on fully convolutional
neural networks (FCN) with a short-time objective intelligibility (STOI)-based
objective function (termed FCN(S) for short) has received increasing attention
due to its simple structure and effectiveness in restoring clean speech signals
from noisy counterparts. With evidence showing the benefits of FCN(S) for
normal speech, this study sets out to assess its ability to improve the
intelligibility of EAS-simulated speech. Objective evaluations and listening
tests were conducted to examine the performance of FCN(S) in improving the
speech intelligibility of normal and vocoded speech in noisy environments. The
experimental results show that, compared with the traditional minimum
mean-square-error SE method and the deep denoising autoencoder SE method, FCN(S)
can obtain larger gains in speech intelligibility for normal as well as vocoded
speech. This study, being the first to evaluate deep learning SE approaches for
EAS, confirms that FCN(S) is an effective SE approach that may potentially be
integrated into an EAS processor to benefit users in noisy environments.