20,941 research outputs found
Confidence Score Based Speaker Adaptation of Conformer Speech Recognition Systems
Speaker adaptation techniques provide a powerful solution to customise
automatic speech recognition (ASR) systems for individual users. Practical
application of unsupervised model-based speaker adaptation techniques to data
intensive end-to-end ASR systems is hindered by the scarcity of speaker-level
data and performance sensitivity to transcription errors. To address these
issues, a set of compact and data efficient speaker-dependent (SD) parameter
representations are used to facilitate both speaker adaptive training and
test-time unsupervised speaker adaptation of state-of-the-art Conformer ASR
systems. The sensitivity to supervision quality is reduced using a confidence
score-based selection of the less erroneous subset of speaker-level adaptation
data. Two lightweight confidence score estimation modules are proposed to
produce more reliable confidence scores. The data sparsity issue, which is
exacerbated by data selection, is addressed by modelling the SD parameter
uncertainty using Bayesian learning. Experiments on the benchmark 300-hour
Switchboard and the 233-hour AMI datasets suggest that the proposed confidence
score-based adaptation schemes consistently outperformed the baseline
speaker-independent (SI) Conformer model and conventional non-Bayesian, point
estimate-based adaptation using no speaker data selection. Similar consistent
performance improvements were retained after external Transformer and LSTM
language model rescoring. In particular, on the 300-hour Switchboard corpus,
statistically significant WER reductions of 1.0%, 1.3%, and 1.4% absolute
(9.5%, 10.9%, and 11.3% relative) were obtained over the baseline SI Conformer
on the NIST Hub5'00, RT02, and RT03 evaluation sets respectively. Similar WER
reductions of 2.7% and 3.3% absolute (8.9% and 10.2% relative) were also
obtained on the AMI development and evaluation sets.Comment: IEEE/ACM Transactions on Audio, Speech, and Language Processin
DARTS-ASR: Differentiable Architecture Search for Multilingual Speech Recognition and Adaptation
In previous works, only parameter weights of ASR models are optimized under
fixed-topology architecture. However, the design of successful model
architecture has always relied on human experience and intuition. Besides, many
hyperparameters related to model architecture need to be manually tuned.
Therefore in this paper, we propose an ASR approach with efficient
gradient-based architecture search, DARTS-ASR. In order to examine the
generalizability of DARTS-ASR, we apply our approach not only on many languages
to perform monolingual ASR, but also on a multilingual ASR setting. Following
previous works, we conducted experiments on a multilingual dataset, IARPA
BABEL. The experiment results show that our approach outperformed the baseline
fixed-topology architecture by 10.2% and 10.0% relative reduction on character
error rates under monolingual and multilingual ASR settings respectively.
Furthermore, we perform some analysis on the searched architectures by
DARTS-ASR.Comment: Accepted at INTERSPEECH 202
Low-Latency Sequence-to-Sequence Speech Recognition and Translation by Partial Hypothesis Selection
Encoder-decoder models provide a generic architecture for
sequence-to-sequence tasks such as speech recognition and translation. While
offline systems are often evaluated on quality metrics like word error rates
(WER) and BLEU, latency is also a crucial factor in many practical use-cases.
We propose three latency reduction techniques for chunk-based incremental
inference and evaluate their efficiency in terms of accuracy-latency trade-off.
On the 300-hour How2 dataset, we reduce latency by 83% to 0.8 second by
sacrificing 1% WER (6% rel.) compared to offline transcription. Although our
experiments use the Transformer, the hypothesis selection strategies are
applicable to other encoder-decoder models. To avoid expensive re-computation,
we use a unidirectionally-attending encoder. After an adaptation procedure to
partial sequences, the unidirectional model performs on-par with the original
model. We further show that our approach is also applicable to low-latency
speech translation. On How2 English-Portuguese speech translation, we reduce
latency to 0.7 second (-84% rel.) while incurring a loss of 2.4 BLEU points (5%
rel.) compared to the offline system
Acoustic data-driven lexicon learning based on a greedy pronunciation selection framework
Speech recognition systems for irregularly-spelled languages like English
normally require hand-written pronunciations. In this paper, we describe a
system for automatically obtaining pronunciations of words for which
pronunciations are not available, but for which transcribed data exists. Our
method integrates information from the letter sequence and from the acoustic
evidence. The novel aspect of the problem that we address is the problem of how
to prune entries from such a lexicon (since, empirically, lexicons with too
many entries do not tend to be good for ASR performance). Experiments on
various ASR tasks show that, with the proposed framework, starting with an
initial lexicon of several thousand words, we are able to learn a lexicon which
performs close to a full expert lexicon in terms of WER performance on test
data, and is better than lexicons built using G2P alone or with a pruning
criterion based on pronunciation probability
VITALAS at TRECVID-2008
In this paper, we present our experiments in TRECVID 2008 about High-Level feature extraction task. This is the first year for our participation in TRECVID, our system adopts some popular approaches that other workgroups proposed before. We proposed 2 advanced low-level features NEW Gabor texture descriptor and the Compact-SIFT Codeword histogram. Our system applied well-known LIBSVM to train the SVM classifier for the basic classifier. In fusion step, some methods were employed such as the Voting, SVM-base, HCRF and Bootstrap Average AdaBoost(BAAB)
Leveraging run-time feedback for efficient ASR acceleration
© 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes,creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.In this work, we propose Locality-AWare-Scheme (LAWS) for an Automatic Speech Recognition (ASR) accelerator in order to significantly reduce its energy consumption and memory requirements, by leveraging the locality among consecutive segments of the speech signal. LAWS diminishes ASR's workload by up to 60% by removing most of the off-chip accesses during the ASR's decoding process. We furthermore improve LAWS's effectiveness by selectively adapting the amount of ASR's workload, based on run-time feedback. In particular, we exploit the fact that the confidence of the ASR system varies along the recognition process. When confidence is high, the ASR system can be more restrictive and reduce the amount of work. The end design provides a saving of 87% in memory requests, 2.3x reduction in energy consumption, and a speedup of 2.1x with respect to the state-of-the-art ASR accelerator.Peer ReviewedPostprint (author's final draft
A Comparison between Deep Neural Nets and Kernel Acoustic Models for Speech Recognition
We study large-scale kernel methods for acoustic modeling and compare to DNNs
on performance metrics related to both acoustic modeling and recognition.
Measuring perplexity and frame-level classification accuracy, kernel-based
acoustic models are as effective as their DNN counterparts. However, on
token-error-rates DNN models can be significantly better. We have discovered
that this might be attributed to DNN's unique strength in reducing both the
perplexity and the entropy of the predicted posterior probabilities. Motivated
by our findings, we propose a new technique, entropy regularized perplexity,
for model selection. This technique can noticeably improve the recognition
performance of both types of models, and reduces the gap between them. While
effective on Broadcast News, this technique could be also applicable to other
tasks.Comment: arXiv admin note: text overlap with arXiv:1411.400
- …