More Speaking or More Speakers?
Self-training (ST) and self-supervised learning (SSL) methods have
demonstrated strong improvements in automatic speech recognition (ASR). In
spite of these advances, to the best of our knowledge, there has been no analysis of how the composition of the labeled and unlabeled datasets used in these methods affects the results. In this work we analyse the effect of the number of speakers in the training data on a recent SSL algorithm (wav2vec 2.0) and a recent ST algorithm (slimIPL). We perform a systematic analysis on both the labeled and unlabeled data by varying the number of speakers while keeping the number of hours fixed, and vice versa. Our findings suggest that SSL requires a large amount of unlabeled data to produce high-accuracy results, while ST requires a sufficient number of speakers in the labeled data, especially in the low-resource setting. In this way, the two approaches improve supervised learning in different regimes of dataset composition.
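To make the experimental design above concrete, here is a minimal data-selection sketch: it holds the total audio budget fixed while varying how many speakers contribute it. The utterance inventory, field layout, and sampling policy are illustrative assumptions, not the protocol used in the paper.

```python
import random
from collections import defaultdict

# Hypothetical utterance inventory: (utterance_id, speaker_id, duration_seconds).
UTTERANCES = [(f"utt{i}", f"spk{i % 50}", random.uniform(3.0, 15.0)) for i in range(20000)]

def fixed_hours_subset(utterances, num_speakers, target_hours, seed=0):
    """Draw a subset with a fixed speech budget from a fixed number of speakers."""
    rng = random.Random(seed)
    by_speaker = defaultdict(list)
    for utt in utterances:
        by_speaker[utt[1]].append(utt)
    chosen = rng.sample(sorted(by_speaker), num_speakers)
    pool = [utt for spk in chosen for utt in by_speaker[spk]]
    rng.shuffle(pool)
    subset, total = [], 0.0
    for utt in pool:
        if total >= target_hours * 3600:
            break
        subset.append(utt)
        total += utt[2]
    return subset

# Same 2-hour budget, spread over 5 speakers vs. 50 speakers.
few_speakers = fixed_hours_subset(UTTERANCES, num_speakers=5, target_hours=2.0)
many_speakers = fixed_hours_subset(UTTERANCES, num_speakers=50, target_hours=2.0)
print(len(few_speakers), len(many_speakers))
```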
Continuous Pseudo-Labeling from the Start
Self-training (ST), or pseudo-labeling, has sparked significant interest in
the automatic speech recognition (ASR) community recently because of its
success in harnessing unlabeled data. Unlike prior semi-supervised learning
approaches that relied on iteratively regenerating pseudo-labels (PLs) from a
trained model and using them to train a new model, recent state-of-the-art
methods perform 'continuous training', where PLs are generated using a very
recent version of the model being trained. Nevertheless, these approaches still
rely on bootstrapping the ST using an initial supervised learning phase where
the model is trained on labeled data alone. We believe this has the potential
for over-fitting to the labeled dataset in low-resource settings and that ST
from the start of training should reduce over-fitting. In this paper we show
how we can do this by dynamically controlling the evolution of PLs during the
training process in ASR. To the best of our knowledge, this is the first study
that shows the feasibility of generating PLs from the very start of the
training. We are able to achieve this using two techniques that avoid
instabilities which lead to degenerate models that do not generalize. Firstly,
we control the evolution of PLs through a curriculum that uses the online
changes in PLs to control the membership of the cache of PLs and improve
generalization. Secondly, we find that by sampling transcriptions from the
predictive distribution, rather than only using the best transcription, we can
stabilize training further. With these techniques, our ST models match prior
works without an external language model.
Comment: To appear in ICLR 202
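A minimal sketch of the two stabilizing ingredients described above, under simplifying assumptions: pseudo-label stability is measured here with a generic string-similarity ratio, and sampling is done token-by-token from per-frame logits rather than with a full beam search. None of the thresholds or data structures below are taken from the paper.

```python
import difflib
import torch

class PLCache:
    """Toy pseudo-label cache: an utterance is admitted for training only once
    its pseudo-label (PL) has stopped changing much between model updates."""

    def __init__(self, stability_threshold=0.8):
        self.stability_threshold = stability_threshold  # assumed value, for illustration
        self.last_pl = {}   # utterance id -> most recently generated PL (string)
        self.cache = {}     # utterance id -> PL admitted for training

    def update(self, utt_id, new_pl):
        prev = self.last_pl.get(utt_id, "")
        similarity = difflib.SequenceMatcher(None, prev, new_pl).ratio()
        self.last_pl[utt_id] = new_pl
        if similarity >= self.stability_threshold:
            self.cache[utt_id] = new_pl  # PL changed little -> considered stable

def sample_transcription(logits):
    """Sample token ids from the predictive distribution instead of taking the argmax."""
    return torch.distributions.Categorical(logits=logits).sample()

# Example: per-frame logits over a 30-symbol vocabulary for a 50-frame utterance.
logits = torch.randn(50, 30)
sampled = sample_transcription(logits)  # stochastic transcription
greedy = logits.argmax(dim=-1)          # best (argmax) transcription, for comparison
print(sampled.shape, (sampled == greedy).float().mean().item())
```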
GraphCite: Citation Intent Classification in Scientific Publications via Graph Embeddings
Citations are crucial in scientific works as they help position a new publication. Each citation carries a particular intent, for example, to highlight the importance of a problem or to compare against results provided by another method. The authors' intent when making a new citation has been studied to understand the evolution of a field over time or to make recommendations for further citations. In this work, we address the task of citation intent prediction from a new perspective. In addition to textual clues present in the citation phrase, we also consider the citation graph, leveraging high-level information of citation patterns. In this novel setting, we perform a thorough experimental evaluation of graph-based models for intent prediction. We show that our model, GraphCite, improves significantly upon models that take into consideration only the citation phrase. Our code is available online.
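The combination of textual clues and citation-graph signal can be illustrated with a small fusion sketch. The dimensions, fusion layer, and three-way intent label set below are assumptions for illustration, not the GraphCite architecture itself.

```python
import torch
import torch.nn as nn

class TextPlusGraphIntentClassifier(nn.Module):
    """Toy classifier: concatenate a citation-phrase encoding with a
    citation-graph node embedding, then predict the citation intent."""

    def __init__(self, text_dim=768, graph_dim=128, num_intents=3):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(text_dim + graph_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_intents),
        )

    def forward(self, text_emb, graph_emb):
        return self.fuse(torch.cat([text_emb, graph_emb], dim=-1))

# Example: a batch of 4 citations with precomputed embeddings.
text_emb = torch.randn(4, 768)   # e.g. from a sentence encoder over the citation phrase
graph_emb = torch.randn(4, 128)  # e.g. from node embeddings of the citing/cited papers
logits = TextPlusGraphIntentClassifier()(text_emb, graph_emb)
print(logits.shape)  # torch.Size([4, 3])
```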
Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data
Pre-training speech models on large volumes of data has achieved remarkable
success. OpenAI Whisper is a multilingual multitask model trained on 680k hours
of supervised speech data. It generalizes well to various speech recognition
and translation benchmarks even in a zero-shot setup. However, the full
pipeline for developing such models (from data collection to training) is not
publicly accessible, which makes it difficult for researchers to further
improve its performance and address training-related issues such as efficiency,
robustness, fairness, and bias. This work presents an Open Whisper-style Speech
Model (OWSM), which reproduces Whisper-style training using an open-source
toolkit and publicly available data. OWSM even supports more translation
directions and can be more efficient to train. We will publicly release all
scripts used for data preparation, training, inference, and scoring as well as
pre-trained models and training logs to promote open science.
Comment: Accepted at ASRU 202
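For reference, the zero-shot recognition and translation behaviour mentioned above can be tried with the reference openai-whisper package; the sketch below assumes a local file speech.wav and does not use the OWSM checkpoints themselves, which are released through ESPnet and may expose a different loading API.

```python
# pip install openai-whisper
import whisper

model = whisper.load_model("small")  # multilingual checkpoint

# Zero-shot speech recognition and X->English speech translation on the same file.
asr = model.transcribe("speech.wav", task="transcribe")
st = model.transcribe("speech.wav", task="translate")
print(asr["text"])
print(st["text"])
```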
ESPnet-ST-v2: Multipurpose Spoken Language Translation Toolkit
ESPnet-ST-v2 is a revamp of the open-source ESPnet-ST toolkit necessitated by
the broadening interests of the spoken language translation community.
ESPnet-ST-v2 supports 1) offline speech-to-text translation (ST), 2)
simultaneous speech-to-text translation (SST), and 3) offline speech-to-speech
translation (S2ST). Each task is supported with a wide variety of approaches,
differentiating ESPnet-ST-v2 from other open-source spoken language translation
toolkits. This toolkit offers state-of-the-art architectures such as
transducers, hybrid CTC/attention, multi-decoders with searchable
intermediates, time-synchronous blockwise CTC/attention, Translatotron models,
and direct discrete unit models. In this paper, we describe the overall design,
example models for each task, and performance benchmarking behind ESPnet-ST-v2,
which is publicly available at https://github.com/espnet/espnet.
Comment: There will be some major updates to the paper. Thus, withdraw
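Among the architectures listed above, the hybrid CTC/attention approach interpolates a CTC loss over encoder frames with an attention-decoder cross-entropy. The sketch below shows that objective only; the tensor shapes and the 0.3 weight are assumptions for illustration, not ESPnet-ST-v2 defaults.

```python
import torch
import torch.nn.functional as F

# Hybrid CTC/attention training objective (illustrative shapes and weight).
ctc_weight = 0.3
vocab, T, U, B = 500, 120, 20, 4  # vocabulary size, encoder frames, target length, batch size

enc_log_probs = F.log_softmax(torch.randn(T, B, vocab), dim=-1)  # (T, B, V) for CTC
dec_logits = torch.randn(B, U, vocab)                            # attention-decoder outputs
targets = torch.randint(1, vocab, (B, U))                        # 0 is reserved for blank

ctc_loss = F.ctc_loss(
    enc_log_probs, targets,
    input_lengths=torch.full((B,), T),
    target_lengths=torch.full((B,), U),
    blank=0,
)
att_loss = F.cross_entropy(dec_logits.reshape(-1, vocab), targets.reshape(-1))
loss = ctc_weight * ctc_loss + (1 - ctc_weight) * att_loss
print(float(loss))
```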