Adapting End-to-End Speech Recognition for Readable Subtitles
Automatic speech recognition (ASR) systems are primarily evaluated on
transcription accuracy. However, in some use cases such as subtitling, verbatim
transcription would reduce output readability given limited screen size and
reading time. Therefore, this work focuses on ASR with output compression, a
task challenging for supervised approaches due to the scarcity of training
data. We first investigate a cascaded system, where an unsupervised compression
model is used to post-edit the transcribed speech. We then compare several
methods of end-to-end speech recognition under output length constraints. The
experiments show that, with far less data than is needed to train a model from scratch, we can adapt a Transformer-based ASR model to incorporate both transcription and compression capabilities. Furthermore, the best performance in terms of WER and ROUGE scores is achieved by explicitly modeling the length constraints within the end-to-end ASR system.
Comment: IWSLT 202
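The abstract does not say how the length constraint is imposed; one common way to make a sequence model length-aware is to condition the decoder on a coarse target-length token. The sketch below is a hypothetical illustration of that idea, not the authors' method: the bucket thresholds and token names are assumptions.

```python
# Hypothetical sketch: conditioning an encoder-decoder ASR model on a target
# length by prepending a length-bucket token to the decoder input.
# Bucket thresholds and special-token names are assumptions, not from the paper.

LENGTH_BUCKETS = [10, 20, 40, 80]          # word-count thresholds (assumed)
BUCKET_TOKENS = ["<len_xs>", "<len_s>", "<len_m>", "<len_l>", "<len_xl>"]

def length_bucket_token(num_words: int) -> str:
    """Map a desired output length (in words) to a coarse length token."""
    for threshold, token in zip(LENGTH_BUCKETS, BUCKET_TOKENS):
        if num_words <= threshold:
            return token
    return BUCKET_TOKENS[-1]

def build_decoder_input(compressed_reference: str) -> list[str]:
    """Prepend the length token so the decoder learns to respect the budget."""
    words = compressed_reference.split()
    return [length_bucket_token(len(words))] + words

if __name__ == "__main__":
    ref = "speaker thanks the audience and introduces the topic"
    print(build_decoder_input(ref))
    # ['<len_xs>', 'speaker', 'thanks', ...]
```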
Speech Development by Imitation
The Double Cone Model (DCM) is a model of how the brain transforms sensory input to motor commands through successive stages of data compression and expansion. We have tested a subset of the DCM on speech recognition, production and imitation. The experiments show that the DCM is a good candidate for an artificial speech processing system that can develop autonomously. We show that the DCM can learn a repertoire of speech sounds by listening to speech input. It is also able to link the individual elements of speech to sequences that can be recognized or reproduced, thus allowing the system to imitate spoken language.
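The successive compression and expansion stages described here are reminiscent of a stacked bottleneck (autoencoder-like) pipeline. The following is an illustrative analogy only, not the DCM itself; the layer widths and tanh nonlinearity are assumptions.

```python
# Illustrative analogy only (not the authors' DCM): successive compression and
# expansion stages modeled as linear bottleneck layers. Layer widths are assumed.
import numpy as np

rng = np.random.default_rng(0)

def stage(x: np.ndarray, out_dim: int) -> np.ndarray:
    """One compression or expansion stage: random linear map + tanh nonlinearity."""
    w = rng.standard_normal((x.shape[-1], out_dim)) / np.sqrt(x.shape[-1])
    return np.tanh(x @ w)

# Sensory input (e.g. a frame of spectral features) is compressed to a compact
# code and then expanded back toward a motor-command representation.
sensory = rng.standard_normal(128)
code = stage(stage(sensory, 64), 16)      # compression stages
motor = stage(stage(code, 64), 128)       # expansion stages
print(code.shape, motor.shape)            # (16,) (128,)
```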
Speech recognition model compression
Speech recognition models are widely deployed in mobile and embedded devices. However, the base architectures with which these models are developed usually consist of neural networks with large size and millions of parameters. In this report, we investigate three compression schemes for these neural network architectures, trading off accuracy against compressed model size. We also perform sensitivity analysis on the network parameters under known perturbations to determine the best compression scheme for a particular layer. The first compression scheme is k-means clustering, which generates clusters used for weight sharing and hence reduces the total number of parameters required. Secondly, we employ SVD-based compression on various network layer parameters and achieve the best compression using SVD in the case of a large-vocabulary continuous speech recognition model. Finally, a two-stage compression scheme using k-means and Huffman coding is proposed. We have investigated these compression schemes on a keyword-spotting speech recognition system and on Baidu's DeepSpeech large-vocabulary continuous speech recognition model, and show a 58.3% reduction in size for only a 3.4% drop in accuracy and a 45% reduction in size for only a 1.21% drop in accuracy, respectively.
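As a rough illustration of the first scheme (k-means weight sharing), the sketch below quantizes a weight matrix to 16 shared centroids so each weight can be stored as a 4-bit code; the cluster count, the plain-NumPy k-means, and the bit accounting are assumptions rather than the report's setup. Huffman coding of the resulting codes (the third scheme) would shrink the index stream further.

```python
# Minimal sketch of k-means weight sharing for model compression (assumed
# details: 16 clusters ~ 4-bit codes, plain NumPy k-means; not the report's code).
import numpy as np

def kmeans_1d(values: np.ndarray, k: int, iters: int = 20):
    """Cluster scalar weights into k centroids; return centroids and assignments."""
    centroids = np.linspace(values.min(), values.max(), k)
    assign = np.zeros(values.size, dtype=int)
    for _ in range(iters):
        assign = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
        for j in range(k):
            members = values[assign == j]
            if members.size:
                centroids[j] = members.mean()
    return centroids, assign

rng = np.random.default_rng(0)
weights = rng.standard_normal((256, 256)).ravel()

centroids, codes = kmeans_1d(weights, k=16)        # 16 shared values -> 4-bit codes
compressed_bits = codes.size * 4 + centroids.size * 32
original_bits = weights.size * 32
print(f"compression ratio ~ {original_bits / compressed_bits:.1f}x")
```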
Learning to detect dysarthria from raw speech
Speech classifiers of paralinguistic traits traditionally learn from diverse
hand-crafted low-level features, by selecting the relevant information for the
task at hand. We explore an alternative to this selection by jointly learning the classifier and the feature extraction. Recent work on speech recognition
has shown improved performance over speech features by learning from the
waveform. We extend this approach to paralinguistic classification and propose
a neural network that can learn a filterbank, a normalization factor and a
compression power from the raw speech, jointly with the rest of the
architecture. We apply this model to dysarthria detection from sentence-level
audio recordings. Starting from a strong attention-based baseline on which
mel-filterbanks outperform standard low-level descriptors, we show that
learning the filters or the normalization and compression improves over fixed
features by 10% absolute accuracy. We also observe a gain over OpenSmile features by jointly learning the feature extraction, the normalization, and the compression factor together with the rest of the architecture. This constitutes a first attempt at jointly learning all of these operations from raw audio for a speech classification task.
Comment: 5 pages, 3 figures, submitted to ICASS
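A minimal sketch of the kind of forward computation such a learnable front end performs is given below: a convolutional filterbank over the raw waveform, a per-channel normalization factor, and a compression exponent. The filter count, window and hop sizes, and the exact compression formula are assumptions, not the paper's design, and the parameters are shown as plain arrays rather than trained weights.

```python
# Hypothetical learnable front end: filterbank over the raw waveform, followed
# by a per-channel normalization factor and a learnable compression exponent.
# Shapes and the exact formula are assumptions.
import numpy as np

rng = np.random.default_rng(0)

n_filters, win = 40, 400                   # 40 filters, 25 ms at 16 kHz (assumed)
filters = rng.standard_normal((n_filters, win)) * 1e-2    # learnable in practice
norm_gain = np.ones(n_filters)                            # learnable normalization
alpha = np.full(n_filters, 0.3)                           # learnable compression power

def frontend(wave: np.ndarray, hop: int = 160) -> np.ndarray:
    """Return a (frames, n_filters) feature matrix from a raw waveform."""
    frames = np.stack([wave[i:i + win] for i in range(0, len(wave) - win, hop)])
    energies = np.abs(frames @ filters.T)                 # filterbank responses
    normalized = energies / (norm_gain + 1e-6)            # learnable normalization
    return normalized ** alpha                            # learnable compression

features = frontend(rng.standard_normal(16000))           # 1 s of synthetic audio
print(features.shape)                                     # (frames, 40)
```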
On the Compression of Recurrent Neural Networks with an Application to LVCSR acoustic modeling for Embedded Speech Recognition
We study the problem of compressing recurrent neural networks (RNNs). In
particular, we focus on the compression of RNN acoustic models, which are
motivated by the goal of building compact and accurate speech recognition
systems which can be run efficiently on mobile devices. In this work, we
present a technique for general recurrent model compression that jointly
compresses both recurrent and non-recurrent inter-layer weight matrices. We
find that the proposed technique allows us to reduce the size of our Long
Short-Term Memory (LSTM) acoustic model to a third of its original size with
negligible loss in accuracy.
Comment: Accepted in ICASSP 201
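One standard realization of this kind of compression is truncated-SVD low-rank factorization of a weight matrix into two thin factors. The sketch below shows only that basic step; the rank and matrix size are assumptions, and the paper's joint treatment of recurrent and inter-layer matrices is not reproduced.

```python
# Minimal sketch of low-rank compression of a weight matrix via truncated SVD.
# Rank and sizes are assumptions; the paper's joint recurrent/inter-layer
# factorization is not reproduced here.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 1024))      # e.g. an LSTM inter-layer weight matrix

rank = 128
U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :rank] * s[:rank]                 # (1024, 128)
B = Vt[:rank, :]                           # (128, 1024)

params_before = W.size
params_after = A.size + B.size
error = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
print(f"params: {params_before} -> {params_after}, relative error {error:.3f}")
```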
On the Impact of Quantization and Pruning of Self-Supervised Speech Models for Downstream Speech Recognition Tasks "In-the-Wild''
Recent advances with self-supervised learning have allowed speech recognition
systems to achieve state-of-the-art (SOTA) word error rates (WER) while
requiring only a fraction of the labeled training data needed by their predecessors. Nevertheless, while such models achieve SOTA performance in
matched train/test conditions, their performance degrades substantially when
tested in unseen conditions. To overcome this problem, strategies such as data
augmentation and/or domain shift training have been explored. Available models,
however, are still too large to be considered for edge speech applications on
resource-constrained devices, thus model compression tools are needed. In this
paper, we explore the effects that train/test mismatch conditions have on
speech recognition accuracy based on compressed self-supervised speech models.
In particular, we report on the effects that parameter quantization and model
pruning have on speech recognition accuracy based on the so-called robust
wav2vec 2.0 model under noisy, reverberant, and noise-plus-reverberation
conditions.
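As a generic illustration of the two techniques studied (not the paper's toolchain or its wav2vec 2.0 configuration), PyTorch's built-in magnitude pruning and dynamic quantization can be applied to a toy model as follows; the model, the 30% sparsity, and the int8 setting are assumptions.

```python
# Generic illustration of magnitude pruning and post-training dynamic
# quantization in PyTorch; the toy model, qint8 choice, and 30% sparsity are
# assumptions, not the paper's configuration for wav2vec 2.0.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 128))

# Unstructured L1 pruning: zero out the 30% smallest-magnitude weights per layer.
for module in model:
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")     # make the pruning permanent

# Dynamic quantization: store Linear weights in int8, dequantize on the fly.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)                  # torch.Size([1, 128])
```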
CoLLD: Contrastive Layer-to-layer Distillation for Compressing Multilingual Pre-trained Speech Encoders
Large-scale self-supervised pre-trained speech encoders outperform
conventional approaches in speech recognition and translation tasks. Due to the
high cost of developing these large models, building new encoders for new tasks
and deploying them to on-device applications are infeasible. Prior studies
propose model compression methods to address this issue, but those works focus
on smaller models and less realistic tasks. Thus, we propose Contrastive
Layer-to-layer Distillation (CoLLD), a novel knowledge distillation method to
compress pre-trained speech encoders by leveraging masked prediction and
contrastive learning to train student models to copy the behavior of a large
teacher model. CoLLD outperforms prior methods and closes the gap between small
and large models on multilingual speech-to-text translation and recognition
benchmarks.
Comment: Submitted to ICASSP 202
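A minimal sketch of an InfoNCE-style layer-to-layer distillation loss, in which student frames are matched to the teacher's frames at the same positions and contrasted against other frames, is shown below; the projection-free setup, the temperature, and the use of in-batch negatives are assumptions, not the exact CoLLD formulation.

```python
# Sketch of an InfoNCE-style layer-to-layer distillation loss: student frames
# are pulled toward the teacher's frames at the same positions and pushed away
# from other frames. Temperature and in-batch negatives are assumptions.
import torch
import torch.nn.functional as F

def layer_distill_loss(student: torch.Tensor,
                       teacher: torch.Tensor,
                       temperature: float = 0.1) -> torch.Tensor:
    """student, teacher: (frames, dim) representations from matched layers."""
    s = F.normalize(student, dim=-1)
    t = F.normalize(teacher, dim=-1)
    logits = s @ t.T / temperature               # similarity to every teacher frame
    targets = torch.arange(s.size(0))            # positive = same frame index
    return F.cross_entropy(logits, targets)

student_layer = torch.randn(50, 256, requires_grad=True)   # student on masked input
teacher_layer = torch.randn(50, 256)                        # frozen teacher
loss = layer_distill_loss(student_layer, teacher_layer)
loss.backward()
print(float(loss))
```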