138 research outputs found
Some voices are too common: Building fair speech recognition systems using the Common Voice dataset
Automatic speech recognition (ASR) systems become increasingly efficient
thanks to new advances in neural network training like self-supervised
learning. However, they are known to be unfair toward certain groups, for
instance, people speaking with an accent. In this work, we use the French
Common Voice dataset to quantify the biases of a pre-trained wav2vec~2.0 model
toward several demographic groups. By fine-tuning the pre-trained model on a
variety of fixed-size, carefully crafted training sets, we demonstrate the
importance of speaker diversity. We also run an in-depth analysis of the Common
Voice corpus and identify important shortcomings that should be taken into
account by users of this dataset.Comment: 5 pages, 3 figures. Accepted to Interspeech 202
Improving Accented Speech Recognition with Multi-Domain Training
Thanks to the rise of self-supervised learning, automatic speech recognition
(ASR) systems now achieve near-human performance on a wide variety of datasets.
However, they still lack generalization capability and are not robust to domain
shifts like accent variations. In this work, we use speech audio representing
four different French accents to create fine-tuning datasets that improve the
robustness of pre-trained ASR models. By incorporating various accents in the
training set, we obtain both in-domain and out-of-domain improvements. Our
numerical experiments show that we can reduce error rates by up to 25%
(relative) on African and Belgian accents compared to single-domain training
while keeping a good performance on standard French.Comment: 5 pages, 2 figures. Accepted to ICASSP 202
Archimorphosis
International audienc
Impact of Encoding and Segmentation Strategies on End-to-End Simultaneous Speech Translation
Boosted by the simultaneous translation shared task at IWSLT 2020, promising
end-to-end online speech translation approaches were recently proposed. They
consist in incrementally encoding a speech input (in a source language) and
decoding the corresponding text (in a target language) with the best possible
trade-off between latency and translation quality. This paper investigates two
key aspects of end-to-end simultaneous speech translation: (a) how to encode
efficiently the continuous speech flow, and (b) how to segment the speech flow
in order to alternate optimally between reading (R: encoding input) and writing
(W: decoding output) operations. We extend our previously proposed end-to-end
online decoding strategy and show that while replacing BLSTM by ULSTM encoding
degrades performance in offline mode, it actually improves both efficiency and
performance in online mode. We also measure the impact of different methods to
segment the speech signal (using fixed interval boundaries, oracle word
boundaries or randomly set boundaries) and show that our best end-to-end online
decoding strategy is surprisingly the one that alternates R/W operations on
fixed size blocks on our English-German speech translation setup.Comment: Accepted for presentation at Interspeech 202
ASR error management for improving spoken language understanding
This paper addresses the problem of automatic speech recognition (ASR) error
detection and their use for improving spoken language understanding (SLU)
systems. In this study, the SLU task consists in automatically extracting, from
ASR transcriptions , semantic concepts and concept/values pairs in a e.g
touristic information system. An approach is proposed for enriching the set of
semantic labels with error specific labels and by using a recently proposed
neural approach based on word embeddings to compute well calibrated ASR
confidence measures. Experimental results are reported showing that it is
possible to decrease significantly the Concept/Value Error Rate with a state of
the art system, outperforming previously published results performance on the
same experimental data. It also shown that combining an SLU approach based on
conditional random fields with a neural encoder/decoder attention based
architecture , it is possible to effectively identifying confidence islands and
uncertain semantic output segments useful for deciding appropriate error
handling actions by the dialogue manager strategy .Comment: Interspeech 2017, Aug 2017, Stockholm, Sweden. 201
Investigating Adaptation and Transfer Learning for End-to-End Spoken Language Understanding from Speech
International audienceThis work investigates speaker adaptation and transfer learning for spoken language understanding (SLU). We focus on the direct extraction of semantic tags from the audio signal using an end-to-end neural network approach. We demonstrate that the learning performance of the target predictive function for the semantic slot filling task can be substantially improved by speaker adaptation and by various knowledge transfer approaches. First, we explore speaker adaptive training (SAT) for end-to-end SLU models and propose to use zero pseudo i-vectors for more efficient model initialization and pretraining in SAT. Second, in order to improve the learning convergence for the target semantic slot filling (SF) task, models trained for different tasks, such as automatic speech recognition and named entity extraction are used to initialize neural end-to-end models trained for the target task. In addition, we explore the impact of the knowledge transfer for SLU from a speech recognition task trained in a different language. These approaches allow to develop end-to-end SLU systems in low-resource data scenarios when there is no enough in-domain semantically labeled data, but other resources, such as word transcriptions for the same or another language or named entity annotation, are available
An Empirical Study of End-to-end Simultaneous Speech Translation Decoding Strategies
This paper proposes a decoding strategy for end-to-end simultaneous speech
translation. We leverage end-to-end models trained in offline mode and conduct
an empirical study for two language pairs (English-to-German and
English-to-Portuguese). We also investigate different output token
granularities including characters and Byte Pair Encoding (BPE) units. The
results show that the proposed decoding approach allows to control BLEU/Average
Lagging trade-off along different latency regimes. Our best decoding settings
achieve comparable results with a strong cascade model evaluated on the
simultaneous translation track of IWSLT 2020 shared task.Comment: This paper has been accepted for presentation at IEEE ICASSP 202
- …