Adapting End-to-End Speech Recognition for Readable Subtitles
Automatic speech recognition (ASR) systems are primarily evaluated on
transcription accuracy. However, in some use cases such as subtitling, verbatim
transcription would reduce output readability given limited screen size and
reading time. Therefore, this work focuses on ASR with output compression, a
task challenging for supervised approaches due to the scarcity of training
data. We first investigate a cascaded system, where an unsupervised compression
model is used to post-edit the transcribed speech. We then compare several
methods of end-to-end speech recognition under output length constraints. The
experiments show that, with far less data than is needed to train a model from scratch, we can adapt a Transformer-based ASR model to incorporate both transcription and compression capabilities. Furthermore, the best performance in terms of WER and ROUGE scores is achieved by explicitly modeling the length constraints within the end-to-end ASR system.
Comment: IWSLT 202
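The abstract does not say how the length constraint is imposed; one common way to make a sequence model length-aware is to condition the decoder on a coarse target-length token. The sketch below is a hypothetical illustration of that idea, not the authors' method: the bucket thresholds and token names are assumptions.

```python
# Hypothetical sketch: conditioning an encoder-decoder ASR model on a target
# length by prepending a length-bucket token to the decoder input.
# Bucket thresholds and special-token names are assumptions, not from the paper.

LENGTH_BUCKETS = [10, 20, 40, 80]          # word-count thresholds (assumed)
BUCKET_TOKENS = ["<len_xs>", "<len_s>", "<len_m>", "<len_l>", "<len_xl>"]

def length_bucket_token(num_words: int) -> str:
    """Map a desired output length (in words) to a coarse length token."""
    for threshold, token in zip(LENGTH_BUCKETS, BUCKET_TOKENS):
        if num_words <= threshold:
            return token
    return BUCKET_TOKENS[-1]

def build_decoder_input(compressed_reference: str) -> list[str]:
    """Prepend the length token so the decoder learns to respect the budget."""
    words = compressed_reference.split()
    return [length_bucket_token(len(words))] + words

if __name__ == "__main__":
    ref = "speaker thanks the audience and introduces the topic"
    print(build_decoder_input(ref))
    # ['<len_xs>', 'speaker', 'thanks', ...]
```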
Speech Development by Imitation
The Double Cone Model (DCM) is a model of how the brain transforms sensory input to motor commands through successive stages of data compression and expansion. We have tested a subset of the DCM on speech recognition, production and imitation. The experiments show that the DCM is a good candidate for an artificial speech processing system that can develop autonomously. We show that the DCM can learn a repertoire of speech sounds by listening to speech input. It is also able to link the individual elements of speech to sequences that can be recognized or reproduced, thus allowing the system to imitate spoken language.
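The successive compression and expansion stages described here are reminiscent of a stacked bottleneck (autoencoder-like) pipeline. The following is an illustrative analogy only, not the DCM itself; the layer widths and tanh nonlinearity are assumptions.

```python
# Illustrative analogy only (not the authors' DCM): successive compression and
# expansion stages modeled as linear bottleneck layers. Layer widths are assumed.
import numpy as np

rng = np.random.default_rng(0)

def stage(x: np.ndarray, out_dim: int) -> np.ndarray:
    """One compression or expansion stage: random linear map + tanh nonlinearity."""
    w = rng.standard_normal((x.shape[-1], out_dim)) / np.sqrt(x.shape[-1])
    return np.tanh(x @ w)

# Sensory input (e.g. a frame of spectral features) is compressed to a compact
# code and then expanded back toward a motor-command representation.
sensory = rng.standard_normal(128)
code = stage(stage(sensory, 64), 16)      # compression stages
motor = stage(stage(code, 64), 128)       # expansion stages
print(code.shape, motor.shape)            # (16,) (128,)
```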
Speech recognition model compression
Speech recognition models are widely deployed in mobile and embedded devices. However, the base architectures with which these models are developed usually consist of neural networks with large size and millions of parameters. In this report, we investigate three compression schemes for these neural network architectures, trading off accuracy against compressed model size. We also perform sensitivity analysis on the network parameters under known perturbations to determine the best compression scheme for a particular layer. The first compression scheme is k-means clustering, which generates clusters used for weight sharing and hence reduces the total number of parameters required. Secondly, we employ SVD-based compression on various network layer parameters and achieve the best compression using SVD in the case of a large-vocabulary continuous speech recognition model. Finally, a two-stage compression scheme using k-means and Huffman coding is proposed. We have investigated these compression schemes on a keyword-spotting speech recognition system and on Baidu's DeepSpeech large-vocabulary continuous speech recognition model, and show a 58.3% reduction in size for only a 3.4% drop in accuracy and a 45% reduction in size for only a 1.21% drop in accuracy, respectively.
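As a rough illustration of the first scheme (k-means weight sharing), the sketch below quantizes a weight matrix to 16 shared centroids so each weight can be stored as a 4-bit code; the cluster count, the plain-NumPy k-means, and the bit accounting are assumptions rather than the report's setup. Huffman coding of the resulting codes (the third scheme) would shrink the index stream further.

```python
# Minimal sketch of k-means weight sharing for model compression (assumed
# details: 16 clusters ~ 4-bit codes, plain NumPy k-means; not the report's code).
import numpy as np

def kmeans_1d(values: np.ndarray, k: int, iters: int = 20):
    """Cluster scalar weights into k centroids; return centroids and assignments."""
    centroids = np.linspace(values.min(), values.max(), k)
    assign = np.zeros(values.size, dtype=int)
    for _ in range(iters):
        assign = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
        for j in range(k):
            members = values[assign == j]
            if members.size:
                centroids[j] = members.mean()
    return centroids, assign

rng = np.random.default_rng(0)
weights = rng.standard_normal((256, 256)).ravel()

centroids, codes = kmeans_1d(weights, k=16)        # 16 shared values -> 4-bit codes
compressed_bits = codes.size * 4 + centroids.size * 32
original_bits = weights.size * 32
print(f"compression ratio ~ {original_bits / compressed_bits:.1f}x")
```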
Learning to detect dysarthria from raw speech
Speech classifiers of paralinguistic traits traditionally learn from diverse
hand-crafted low-level features, by selecting the relevant information for the
task at hand. We explore an alternative to this selection by jointly learning the classifier and the feature extraction. Recent work on speech recognition
has shown improved performance over speech features by learning from the
waveform. We extend this approach to paralinguistic classification and propose
a neural network that can learn a filterbank, a normalization factor and a
compression power from the raw speech, jointly with the rest of the
architecture. We apply this model to dysarthria detection from sentence-level
audio recordings. Starting from a strong attention-based baseline on which
mel-filterbanks outperform standard low-level descriptors, we show that
learning the filters or the normalization and compression improves over fixed
features by 10% absolute accuracy. We also observe a gain over OpenSmile features by jointly learning the feature extraction, the normalization, and the compression factor together with the rest of the architecture. This constitutes a first attempt at jointly learning all of these operations from raw audio for a speech classification task.
Comment: 5 pages, 3 figures, submitted to ICASS
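A minimal sketch of the kind of forward computation such a learnable front end performs is given below: a convolutional filterbank over the raw waveform, a per-channel normalization factor, and a compression exponent. The filter count, window and hop sizes, and the exact compression formula are assumptions, not the paper's design, and the parameters are shown as plain arrays rather than trained weights.

```python
# Hypothetical learnable front end: filterbank over the raw waveform, followed
# by a per-channel normalization factor and a learnable compression exponent.
# Shapes and the exact formula are assumptions.
import numpy as np

rng = np.random.default_rng(0)

n_filters, win = 40, 400                   # 40 filters, 25 ms at 16 kHz (assumed)
filters = rng.standard_normal((n_filters, win)) * 1e-2    # learnable in practice
norm_gain = np.ones(n_filters)                            # learnable normalization
alpha = np.full(n_filters, 0.3)                           # learnable compression power

def frontend(wave: np.ndarray, hop: int = 160) -> np.ndarray:
    """Return a (frames, n_filters) feature matrix from a raw waveform."""
    frames = np.stack([wave[i:i + win] for i in range(0, len(wave) - win, hop)])
    energies = np.abs(frames @ filters.T)                 # filterbank responses
    normalized = energies / (norm_gain + 1e-6)            # learnable normalization
    return normalized ** alpha                            # learnable compression

features = frontend(rng.standard_normal(16000))           # 1 s of synthetic audio
print(features.shape)                                     # (frames, 40)
```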
On the Compression of Recurrent Neural Networks with an Application to LVCSR acoustic modeling for Embedded Speech Recognition
We study the problem of compressing recurrent neural networks (RNNs). In
particular, we focus on the compression of RNN acoustic models, which are
motivated by the goal of building compact and accurate speech recognition
systems which can be run efficiently on mobile devices. In this work, we
present a technique for general recurrent model compression that jointly
compresses both recurrent and non-recurrent inter-layer weight matrices. We
find that the proposed technique allows us to reduce the size of our Long
Short-Term Memory (LSTM) acoustic model to a third of its original size with
negligible loss in accuracy.
Comment: Accepted in ICASSP 201
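One standard realization of this kind of compression is truncated-SVD low-rank factorization of a weight matrix into two thin factors. The sketch below shows only that basic step; the rank and matrix size are assumptions, and the paper's joint treatment of recurrent and inter-layer matrices is not reproduced.

```python
# Minimal sketch of low-rank compression of a weight matrix via truncated SVD.
# Rank and sizes are assumptions; the paper's joint recurrent/inter-layer
# factorization is not reproduced here.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 1024))      # e.g. an LSTM inter-layer weight matrix

rank = 128
U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :rank] * s[:rank]                 # (1024, 128)
B = Vt[:rank, :]                           # (128, 1024)

params_before = W.size
params_after = A.size + B.size
error = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
print(f"params: {params_before} -> {params_after}, relative error {error:.3f}")
```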
On the Impact of Quantization and Pruning of Self-Supervised Speech Models for Downstream Speech Recognition Tasks "In-the-Wild''
Recent advances with self-supervised learning have allowed speech recognition
systems to achieve state-of-the-art (SOTA) word error rates (WER) while
requiring only a fraction of the labeled training data needed by their predecessors. Nevertheless, while such models achieve SOTA performance in
matched train/test conditions, their performance degrades substantially when
tested in unseen conditions. To overcome this problem, strategies such as data
augmentation and/or domain shift training have been explored. Available models,
however, are still too large to be considered for edge speech applications on
resource-constrained devices, thus model compression tools are needed. In this
paper, we explore the effects that train/test mismatch conditions have on
speech recognition accuracy based on compressed self-supervised speech models.
In particular, we report on the effects that parameter quantization and model
pruning have on speech recognition accuracy based on the so-called robust
wav2vec 2.0 model under noisy, reverberant, and noise-plus-reverberation
conditions.
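As a generic illustration of the two techniques studied (not the paper's toolchain or its wav2vec 2.0 configuration), PyTorch's built-in magnitude pruning and dynamic quantization can be applied to a toy model as follows; the model, the 30% sparsity, and the int8 setting are assumptions.

```python
# Generic illustration of magnitude pruning and post-training dynamic
# quantization in PyTorch; the toy model, qint8 choice, and 30% sparsity are
# assumptions, not the paper's configuration for wav2vec 2.0.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 128))

# Unstructured L1 pruning: zero out the 30% smallest-magnitude weights per layer.
for module in model:
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")     # make the pruning permanent

# Dynamic quantization: store Linear weights in int8, dequantize on the fly.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)                  # torch.Size([1, 128])
```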
CoLLD: Contrastive Layer-to-layer Distillation for Compressing Multilingual Pre-trained Speech Encoders
Large-scale self-supervised pre-trained speech encoders outperform
conventional approaches in speech recognition and translation tasks. Due to the
high cost of developing these large models, building new encoders for new tasks
and deploying them to on-device applications are infeasible. Prior studies
propose model compression methods to address this issue, but those works focus
on smaller models and less realistic tasks. Thus, we propose Contrastive
Layer-to-layer Distillation (CoLLD), a novel knowledge distillation method to
compress pre-trained speech encoders by leveraging masked prediction and
contrastive learning to train student models to copy the behavior of a large
teacher model. CoLLD outperforms prior methods and closes the gap between small
and large models on multilingual speech-to-text translation and recognition
benchmarks.
Comment: Submitted to ICASSP 202
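A minimal sketch of an InfoNCE-style layer-to-layer distillation loss, in which student frames are matched to the teacher's frames at the same positions and contrasted against other frames, is shown below; the projection-free setup, the temperature, and the use of in-batch negatives are assumptions, not the exact CoLLD formulation.

```python
# Sketch of an InfoNCE-style layer-to-layer distillation loss: student frames
# are pulled toward the teacher's frames at the same positions and pushed away
# from other frames. Temperature and in-batch negatives are assumptions.
import torch
import torch.nn.functional as F

def layer_distill_loss(student: torch.Tensor,
                       teacher: torch.Tensor,
                       temperature: float = 0.1) -> torch.Tensor:
    """student, teacher: (frames, dim) representations from matched layers."""
    s = F.normalize(student, dim=-1)
    t = F.normalize(teacher, dim=-1)
    logits = s @ t.T / temperature               # similarity to every teacher frame
    targets = torch.arange(s.size(0))            # positive = same frame index
    return F.cross_entropy(logits, targets)

student_layer = torch.randn(50, 256, requires_grad=True)   # student on masked input
teacher_layer = torch.randn(50, 256)                        # frozen teacher
loss = layer_distill_loss(student_layer, teacher_layer)
loss.backward()
print(float(loss))
```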