Big model only for hard audios: Sample dependent Whisper model selection for efficient inferences
Recent progress in Automatic Speech Recognition (ASR) has been coupled with a
substantial increase in the model sizes, which may now contain billions of
parameters, leading to slow inferences even with adapted hardware. In this
context, several ASR models exist in various sizes, with different inference
costs leading to different performance levels. Based on the observation that
smaller models perform optimally on large parts of testing corpora, we propose
to train a decision module, that would allow, given an audio sample, to use the
smallest sufficient model leading to a good transcription. We apply our
approach to two Whisper models with different sizes. By keeping the decision
process computationally efficient, we build a decision module that allows
substantial computational savings with only limited performance drops. Comment: Submitted to ICASSP 202
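The routing idea described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `route`, the stub models, and the length-based hardness proxy are all hypothetical stand-ins.

```python
# Minimal sketch of a sample-dependent model router (hypothetical, not the
# paper's implementation). A cheap decision module scores each audio sample;
# only samples predicted to be "hard" are escalated to the big model.

def route(audio, small_model, big_model, decider, threshold=0.5):
    """Run the small model unless the decider predicts the sample is hard."""
    if decider(audio) < threshold:          # predicted "easy" sample
        return small_model(audio), "small"
    return big_model(audio), "big"

# Toy stand-ins: models are stub transcribers; the decider uses audio
# length as a placeholder hardness proxy (a real module would be trained).
small = lambda a: f"small:{len(a)}"
big = lambda a: f"big:{len(a)}"
hardness = lambda a: len(a) / 100.0

text, used = route([0.0] * 30, small, big, hardness)
print(text, used)   # short clip scores 0.3 < 0.5 -> "small:30 small"
```

In practice the decider itself must stay far cheaper than the small model, otherwise the routing overhead erases the savings.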
Speech Sequence Embeddings using Nearest Neighbors Contrastive Learning
We introduce a simple neural encoder architecture that can be trained using an unsupervised contrastive learning objective whose positive samples come from a data-augmented k-Nearest Neighbors search. We show that, when built on top of recent self-supervised audio representations [1, 2, 3], this method can be applied iteratively and yields competitive speech sequence embeddings (SSE) as evaluated on two tasks: query-by-example of random sequences of speech, and spoken term discovery. On both tasks our method pushes the state-of-the-art by a significant margin across 5 different languages. Finally, we establish a benchmark on a query-by-example task on the LibriSpeech dataset to monitor future improvements in the field.
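The core mechanism, mining positives from nearest neighbours instead of requiring labelled pairs, can be sketched as below. This is an assumed setup, not the paper's exact pipeline: batch-level kNN mining plus an InfoNCE-style loss.

```python
import numpy as np

# Sketch of kNN positive mining for a contrastive objective (assumed setup):
# each sequence embedding takes its nearest other item in the batch as the
# positive; all remaining items act as negatives.

def knn_positives(emb):
    """Return, for each row, the index of its nearest other row (cosine)."""
    x = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = x @ x.T
    np.fill_diagonal(sim, -np.inf)          # exclude self-matches
    return sim.argmax(axis=1)

def info_nce(emb, pos_idx, tau=0.1):
    """InfoNCE-style loss using the mined positives."""
    x = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = x @ x.T / tau
    np.fill_diagonal(sim, -np.inf)          # exp(-inf) = 0: self pair ignored
    log_z = np.log(np.exp(sim).sum(axis=1))
    return float(np.mean(log_z - sim[np.arange(len(emb)), pos_idx]))

rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
pos = knn_positives(emb)
loss = info_nce(emb, pos)   # non-negative; 0 only if positives dominate
```

The iterative aspect of the paper would correspond to re-encoding the data with the trained encoder and re-running the kNN search before the next training round.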
Evaluating the reliability of acoustic speech embeddings
Speech embeddings are fixed-size acoustic representations of variable-length speech sequences. They are increasingly used for a variety of tasks ranging from information retrieval to unsupervised term discovery and speech segmentation. However, there is currently no clear methodology to compare or optimize the quality of these embeddings in a task-neutral way. Here, we systematically compare two popular metrics, ABX discrimination and Mean Average Precision (MAP), on 5 languages across 17 embedding methods, ranging from supervised to fully unsupervised, and using different loss functions (autoencoders, correspondence autoencoders, Siamese networks). Then we use ABX and MAP to predict performance on a new downstream task: the unsupervised estimation of the frequencies of speech segments in a given corpus. We find that, overall, ABX and MAP correlate with one another and with frequency estimation. However, substantial discrepancies appear in the fine-grained distinctions across languages and/or embedding methods. This makes it unrealistic at present to propose a task-independent silver-bullet method for computing the intrinsic quality of speech embeddings. There is a need for more detailed analysis of the metrics currently used to evaluate such embeddings.
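One of the two metrics compared above, MAP, admits a compact formulation. The sketch below is an assumed variant (each embedding queries all others; items sharing the query's word label count as relevant), not necessarily the exact protocol used in the paper.

```python
import numpy as np

# Hedged sketch of MAP for speech embeddings (assumed retrieval protocol):
# rank all other items by Euclidean distance to the query and average the
# precision at each relevant rank.

def average_precision(ranked_relevance):
    """AP over a binary relevance list given in ranked order."""
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return float(np.mean(precisions)) if precisions else 0.0

def map_score(emb, labels):
    emb, labels = np.asarray(emb, float), np.asarray(labels)
    dists = np.linalg.norm(emb[:, None] - emb[None, :], axis=-1)
    aps = []
    for q in range(len(emb)):
        order = np.argsort(dists[q])
        order = order[order != q]            # drop the query itself
        relevance = labels[order] == labels[q]
        if relevance.any():
            aps.append(average_precision(relevance))
    return float(np.mean(aps))

# Two well-separated word clusters give a perfect MAP.
emb = [[0, 0], [0.1, 0], [5, 5], [5, 5.1]]
print(map_score(emb, ["cat", "cat", "dog", "dog"]))  # 1.0
```

ABX, by contrast, is computed over triplets rather than full rankings, which is one source of the fine-grained discrepancies the paper reports.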
DP-Parse: Finding Word Boundaries from Raw Speech with an Instance Lexicon
Finding word boundaries in continuous speech is challenging as there is little or no equivalent of a 'space' delimiter between words. Popular Bayesian non-parametric models for text segmentation (Goldwater et al., 2006, 2009) use a Dirichlet process to jointly segment sentences and build a lexicon of word types. We introduce DP-Parse, which uses similar principles but relies only on an instance lexicon of word tokens, avoiding the clustering errors that arise with a lexicon of word types. On the Zero Resource Speech Benchmark 2017, our model sets a new speech segmentation state-of-the-art in 5 languages. The algorithm monotonically improves with better input representations, achieving yet higher scores when fed with weakly supervised inputs. Despite lacking a type lexicon, DP-Parse can be pipelined to a language model and learn semantic and syntactic representations as assessed by a new spoken word embedding benchmark.
The Zero Resource Speech Challenge 2019: TTS without T
We present the Zero Resource Speech Challenge 2019, which proposes to build a speech synthesizer without any text or phonetic labels: hence, TTS without T (text-to-speech without text). We provide raw audio for a target voice in an unknown language (the Voice dataset), but no alignment, text or labels. Participants must discover subword units in an unsupervised way (using the Unit Discovery dataset) and align them to the voice recordings in a way that works best for the purpose of synthesizing novel utterances from novel speakers, similar to the target speaker's voice. We describe the metrics used for evaluation, a baseline system consisting of unsupervised subword unit discovery plus a standard TTS system, and a topline TTS using gold phoneme transcriptions. We present an overview of the 19 submitted systems from 10 teams and discuss the main results.
The Zero Resource Speech Challenge 2020: Discovering discrete subword and word units
We present the Zero Resource Speech Challenge 2020, which aims at learning speech representations from raw audio signals without any labels. It combines the data sets and metrics from two previous benchmarks (2017 and 2019) and features two tasks which tap into two levels of speech representation. The first task is to discover low bit-rate subword representations that optimize the quality of speech synthesis; the second one is to discover word-like units from unsegmented raw speech. We present the results of the twenty submitted models and discuss the implications of the main findings for unsupervised speech learning.
STOP: A dataset for Spoken Task Oriented Semantic Parsing
End-to-end spoken language understanding (SLU) predicts intent directly from
audio using a single model. It promises to improve the performance of assistant
systems by leveraging acoustic information lost in the intermediate textual
representation and preventing cascading errors from Automatic Speech
Recognition (ASR). Further, having one unified model has efficiency advantages
when deploying assistant systems on-device. However, the limited number of
public audio datasets with semantic parse labels hinders the research progress
in this area. In this paper, we release the Spoken Task-Oriented semantic
Parsing (STOP) dataset, the largest and most complex SLU dataset to be publicly
available. Additionally, we define low-resource splits to establish a benchmark
for improving SLU when limited labeled data is available. Furthermore, in
addition to the human-recorded audio, we are releasing a TTS-generated version
to benchmark the performance for low-resource domain adaptation of end-to-end
SLU systems. Initial experiments show end-to-end SLU models performing
slightly worse than their cascaded counterparts, a gap we hope encourages
future work in this direction.
Memoria, remoción, olvido del estalinismo en la Rusia postsoviética
Stalinism, understood here as the political and governmental system established by Stalin in the Soviet Union and, chronologically, as the period of Soviet history during which Stalin exercised near-absolute power, was a deeply traumatic experience for Russian and Soviet society in general, as well as for the countries of Central and Eastern Europe brought into the Soviet sphere of influence after 1945. For Russia, Stalinism meant, on the one hand, a radical and violent transformation of society and, on the other, state terror and a set of political repressions that claimed millions of victims. The memory of this past remains an unresolved problem for Russian memory. Post-Soviet Russian society in fact remains deeply divided over this "past that does not pass," with which it has never truly come to terms: the division concerns the interpretation more than the facts themselves, and the meaning attributed to that past. An ambivalent attitude prevails, both in the government and among the majority of the population. In these pages we trace the trajectory of the memory of Stalinism in Russia from Perestroika to the present, and the phenomena of repression, forgetting, and silence that have accompanied it. "Memoria, rimozione, oblio dello stalinismo nella Russia postsovietica" was previously published in Italian in the Giornale di Storia Contemporanea, year XIX, no. 1, 2016, pp. 7-32. Facultad de Humanidades y Ciencias de la Educació
XLS-R fine-tuning on noisy word boundaries for unsupervised speech segmentation into words
Due to the absence of explicit word boundaries in the speech stream, the task of segmenting spoken sentences into word units without text supervision is particularly challenging. In this work, we leverage the most recent self-supervised speech models, which have proved to adapt quickly to new tasks through fine-tuning, even in low-resource conditions. Taking inspiration from semi-supervised learning, we fine-tune an XLS-R model to predict word boundaries that are themselves produced by top-tier speech segmentation systems: DPDP, VG-HuBERT, GradSeg and DP-Parse. Once XLS-R is fine-tuned, it is used to infer new word boundary labels that are used in turn for another fine-tuning step. Our method consistently improves the performance of each system and sets a new state-of-the-art that is, on average, 130% higher than the previous one, as measured by the F1 score on correctly discovered word tokens on five corpora featuring different languages. Finally, our system can segment speech from languages unseen during fine-tuning in a zero-shot fashion.
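The alternation between fine-tuning and re-labelling described above has a simple loop structure, sketched below. Only the structure is from the abstract; `finetune` and `predict_boundaries` are hypothetical stand-ins for the real XLS-R training and inference steps.

```python
# Schematic of the iterative pseudo-labelling loop (structure only; the
# callbacks stand in for real XLS-R fine-tuning and boundary inference).

def self_train(model, corpus, initial_boundaries, rounds,
               finetune, predict_boundaries):
    """Alternate fine-tuning on current boundary labels with re-labelling."""
    labels = initial_boundaries              # e.g. from DPDP or DP-Parse
    for _ in range(rounds):
        model = finetune(model, corpus, labels)       # supervised step
        labels = predict_boundaries(model, corpus)    # fresh pseudo-labels
    return model, labels

# Toy stand-ins that just count how often each step runs.
calls = {"finetune": 0, "predict": 0}

def toy_finetune(model, corpus, labels):
    calls["finetune"] += 1
    return model                             # training is a no-op here

def toy_predict(model, corpus):
    calls["predict"] += 1
    return ["B", "I"]                        # fixed pseudo-labels

model, labels = self_train("xlsr", ["utt1"], ["B", "B"], rounds=2,
                           finetune=toy_finetune,
                           predict_boundaries=toy_predict)
print(calls)  # each step ran once per round: {'finetune': 2, 'predict': 2}
```

The key property the paper relies on is that each round's predicted boundaries are cleaner than the labels it was trained on, so the loop improves rather than merely reproducing the initial segmentation.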
Fine-tuning strategies for faster inference using speech self-supervised models: a comparative study
Self-supervised learning (SSL) has allowed substantial progress in Automatic Speech Recognition (ASR) performance in low-resource settings. In this context, it has been demonstrated that larger self-supervised feature extractors are crucial for achieving lower downstream ASR error rates. Thus, better performance may come at the cost of longer inference times. This article explores different approaches that may be deployed during fine-tuning to reduce the computations needed in the SSL encoder, leading to faster inference. We adapt a number of existing techniques to common ASR settings and benchmark them, reporting the performance drops and inference-time gains. Interestingly, we found that, given enough downstream data, a simple downsampling of the input sequences outperforms the other methods, combining low performance drops with high computational savings: it reduces computations by 61.3% with a WER increase of only 0.81 points. Finally, we analyze the robustness of the comparison to changes in dataset conditions, revealing sensitivity to dataset size.
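The winning strategy in the study, downsampling the input sequence before the SSL encoder, can be illustrated as below. The pooling choice and factor are assumptions for illustration; the article benchmarks several variants.

```python
import numpy as np

# Sketch of input-sequence downsampling before an SSL encoder (illustrative
# choice: average pooling along time by an assumed factor of 2). Halving the
# number of frames roughly halves the encoder's sequence-length cost.

def downsample(features, factor=2):
    """Average-pool a (time, dim) feature sequence along time by `factor`."""
    t, d = features.shape
    t_trim = t - t % factor                  # drop the ragged tail frames
    pooled = features[:t_trim].reshape(t_trim // factor, factor, d)
    return pooled.mean(axis=1)

x = np.arange(12, dtype=float).reshape(6, 2)   # 6 frames, 2 feature dims
y = downsample(x, factor=2)
print(y.shape)   # (3, 2): time axis halved before entering the encoder
```

Because self-attention cost grows quadratically with sequence length, even a factor-2 reduction yields more than a factor-2 saving in the attention layers, which is consistent with the large computational savings reported.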