Wav2code: Restore Clean Speech Representations via Codebook Lookup for Noise-Robust ASR
Automatic speech recognition (ASR) has achieved remarkable success thanks to
recent advances in deep learning, but it usually degrades significantly under
real-world noisy conditions. Recent works introduce speech enhancement (SE) as
a front-end to improve speech quality, which has proved effective but may not be
optimal for downstream ASR due to the speech distortion problem. Building on
this, the latest works combine SE with the currently popular self-supervised
learning (SSL) to alleviate distortion and improve noise robustness. Despite
their effectiveness, the speech distortion caused by conventional SE still
cannot be completely eliminated. In this paper, we propose a self-supervised
framework named Wav2code to implement a generalized SE without distortions for
noise-robust ASR. First, in the pre-training stage, the clean speech
representations from the SSL model are used to look up a discrete codebook via
nearest-neighbor feature matching; the resulting code sequence is then
exploited to reconstruct the original clean representations, in order to store
them in the codebook as a prior. Second, during fine-tuning, we propose a
Transformer-based code predictor that accurately predicts clean codes by
modeling the global dependency of the input noisy representations, which
enables discovery and restoration of high-quality clean representations
without distortions. Furthermore, we propose an interactive feature fusion
network that combines the original noisy and the restored clean
representations to balance fidelity and quality, resulting in even more
informative features for downstream ASR. Finally, experiments on both
synthetic and real noisy datasets demonstrate that Wav2code mitigates speech
distortion and improves ASR performance under various noisy conditions,
resulting in stronger robustness.
Comment: 12 pages, 7 figures, Submitted to IEEE/ACM TASL
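The nearest-neighbor codebook lookup described in the abstract is a form of vector quantization: each continuous feature frame is mapped to the index of its closest codebook entry, and the matched entries are read back out as the reconstruction. The sketch below is a minimal pure-Python illustration of that mechanism under assumed toy data; the function names, distances, and codebook are illustrative, not the paper's implementation.

```python
def nearest_code(frame, codebook):
    """Return the index of the codebook entry closest to `frame` (squared L2)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((f - c) ** 2 for f, c in zip(frame, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx

def quantize(frames, codebook):
    """Map feature frames to a discrete code sequence, then reconstruct
    by reading the matched codebook entries back out."""
    codes = [nearest_code(f, codebook) for f in frames]
    reconstruction = [codebook[c] for c in codes]
    return codes, reconstruction

# Toy 2-D codebook and two feature frames (illustrative values only).
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
frames = [(0.1, -0.1), (0.9, 1.2)]
codes, recon = quantize(frames, codebook)
```

In Wav2code the codebook is learned so that clean representations are stored as a prior; here it is fixed purely to show the lookup-and-reconstruct step.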
Adapting self-supervised models to multi-talker speech recognition using speaker embeddings
Self-supervised learning (SSL) methods which learn representations of data
without explicit supervision have gained popularity in speech-processing tasks,
particularly for single-talker applications. However, these models often have
degraded performance for multi-talker scenarios -- possibly due to the domain
mismatch -- which severely limits their use for such applications. In this
paper, we investigate the adaptation of upstream SSL models to the multi-talker
automatic speech recognition (ASR) task under two conditions. First, when
segmented utterances are given, we show that adding a target speaker extraction
(TSE) module based on enrollment embeddings is complementary to mixture-aware
pre-training. Second, for unsegmented mixtures, we propose a novel joint
speaker modeling (JSM) approach, which aggregates information from all speakers
in the mixture through their embeddings. With controlled experiments on
Libri2Mix, we show that using speaker embeddings provides relative WER
improvements of 9.1% and 42.1% over strong baselines for the segmented and
unsegmented cases, respectively. We also demonstrate the effectiveness of our
models for real conversational mixtures through experiments on the AMI dataset.
Comment: submitted to ICASSP 202
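The "relative WER improvement" figures quoted above (9.1% and 42.1%) are relative reductions, i.e. (baseline WER - system WER) / baseline WER. A minimal helper, with hypothetical WER values chosen only to reproduce the 9.1% figure:

```python
def relative_wer_improvement(baseline_wer, system_wer):
    """Relative reduction in WER, expressed as a percentage."""
    return 100.0 * (baseline_wer - system_wer) / baseline_wer

# Hypothetical example: a baseline at 20.0% WER reduced to 18.18% WER
# corresponds to about a 9.1% relative improvement.
improvement = relative_wer_improvement(20.0, 18.18)
```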
Seeing voices and hearing voices: learning discriminative embeddings using cross-modal self-supervision
The goal of this work is to train discriminative cross-modal embeddings
without access to manually annotated data. Recent advances in self-supervised
learning have shown that effective representations can be learnt from natural
cross-modal synchrony. We build on earlier work to train embeddings that are
more discriminative for uni-modal downstream tasks. To this end, we propose a
novel training strategy that not only optimises metrics across modalities, but
also enforces intra-class feature separation within each of the modalities. The
effectiveness of the method is demonstrated on two downstream tasks: lip
reading using the features trained on audio-visual synchronisation, and speaker
recognition using the features trained for cross-modal biometric matching. The
proposed method outperforms state-of-the-art self-supervised baselines by a
significant margin.
Comment: Under submission as a conference paper
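The two-part objective described above combines a cross-modal term (pull matched audio/video embeddings together) with an intra-modal term (push apart different-class embeddings within one modality). The sketch below is a hedged, pure-Python illustration under assumed names and a hinge margin; it is not the paper's loss, only an instance of the same two-term structure.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def joint_loss(audio, video, labels, margin=0.5):
    """Cross-modal pull on matched pairs plus an intra-modal hinge that
    penalises different-class audio embeddings closer than `margin`."""
    # Cross-modal term: the i-th audio and video embeddings should match.
    cross = sum(1.0 - cosine(a, v) for a, v in zip(audio, video))
    # Intra-modal term: separate embeddings of different classes.
    intra = 0.0
    for i in range(len(audio)):
        for j in range(i + 1, len(audio)):
            if labels[i] != labels[j]:
                intra += max(0.0, cosine(audio[i], audio[j]) - margin)
    return cross + intra

# Toy case: matched pairs, orthogonal classes -> both terms vanish.
audio = [(1.0, 0.0), (0.0, 1.0)]
video = [(1.0, 0.0), (0.0, 1.0)]
loss = joint_loss(audio, video, labels=[0, 1])
```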
DNN adaptation by automatic quality estimation of ASR hypotheses
In this paper we propose to exploit the automatic Quality Estimation (QE) of
ASR hypotheses to perform the unsupervised adaptation of a deep neural network
modeling acoustic probabilities. Our hypothesis is that significant
improvements can be achieved by: i) automatically transcribing the evaluation
data we are currently trying to recognise, and ii) selecting from it a subset
of "good quality" instances based on the word error rate (WER) scores predicted
by a QE component. To validate this hypothesis, we run several experiments on
the evaluation data sets released for the CHiME-3 challenge. First, we operate
in oracle conditions in which manual transcriptions of the evaluation data are
available, thus allowing us to compute the "true" sentence WER. In this
scenario, we perform the adaptation with variable amounts of data, which are
characterised by different levels of quality. Then, we move to realistic
conditions in which the manual transcriptions of the evaluation data are not
available. In this case, the adaptation is performed on data selected according
to the WER scores "predicted" by a QE component. Our results indicate that: i)
QE predictions allow us to closely approximate the adaptation results obtained
in oracle conditions, and ii) the overall ASR performance based on the proposed
QE-driven adaptation method is significantly better than the strong, most
recent, CHiME-3 baseline.
Comment: Computer Speech & Language December 201
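The selection step described above can be sketched as a simple filter: keep only the automatically transcribed hypotheses whose predicted sentence WER is low enough, and adapt on those. The threshold, names, and toy data below are assumptions for illustration, not the paper's exact procedure.

```python
def select_for_adaptation(hypotheses, predicted_wer, max_wer=0.3):
    """Return the hypotheses whose QE-predicted WER is at most `max_wer`,
    i.e. the "good quality" subset used for unsupervised adaptation."""
    return [hyp for hyp, wer in zip(hypotheses, predicted_wer)
            if wer <= max_wer]

# Toy hypotheses with illustrative predicted sentence-level WER scores.
hyps = ["the cat sat", "teh kat sad", "hello world"]
pred = [0.05, 0.60, 0.20]
selected = select_for_adaptation(hyps, pred)
```

In oracle conditions the same filter would use the "true" sentence WER computed from manual transcriptions; the paper's result is that the QE-predicted scores approximate that oracle selection closely.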