Scaling Speech Technology to 1,000+ Languages
Expanding the language coverage of speech technology has the potential to
improve access to information for many more people. However, current speech
technology is restricted to about one hundred languages which is a small
fraction of the over 7,000 languages spoken around the world. The Massively
Multilingual Speech (MMS) project increases the number of supported languages
by 10-40x, depending on the task. The main ingredients are a new dataset based
on readings of publicly available religious texts and effectively leveraging
self-supervised learning. We built pre-trained wav2vec 2.0 models covering
1,406 languages, a single multilingual automatic speech recognition model for
1,107 languages, speech synthesis models for the same number of languages, as
well as a language identification model for 4,017 languages. Experiments show
that our multilingual speech recognition model more than halves the word error
rate of Whisper on 54 languages of the FLEURS benchmark while being trained on
a small fraction of the labeled data.
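As an illustration of the word error rate (WER) metric referenced above — this is a minimal standalone sketch, not the authors' evaluation code — WER is the word-level edit distance between a reference transcript and a hypothesis, normalized by the reference length:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance divided by
    the number of words in the reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

For example, `wer("the cat sat", "the cat sit")` yields one substitution over three reference words, i.e. about 0.33. "Halving the WER of Whisper" means this number drops by more than 50% on the FLEURS test transcripts.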
MLS: A Large-Scale Multilingual Dataset for Speech Research
This paper introduces the Multilingual LibriSpeech (MLS) dataset, a large
multilingual corpus suitable for speech research. The dataset is derived from
read audiobooks from LibriVox and consists of 8 languages, including about
44.5K hours of English and a total of about 6K hours for other languages.
Additionally, we provide Language Models (LM) and baseline Automatic Speech
Recognition (ASR) models for all the languages in our dataset. We believe
such a large transcribed dataset will open new avenues in ASR and
Text-To-Speech (TTS) research. The dataset will be made freely available for
anyone at http://www.openslr.org.
VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation
We introduce VoxPopuli, a large-scale multilingual corpus providing 100K
hours of unlabelled speech data in 23 languages. It is the largest open dataset to
date for unsupervised representation learning as well as semi-supervised
learning. VoxPopuli also contains 1.8K hours of transcribed speeches in 16
languages and their aligned oral interpretations into 5 other languages
totaling 5.1K hours. We provide speech recognition baselines and validate the
versatility of VoxPopuli unlabelled data in semi-supervised learning under
challenging out-of-domain settings. We will release the corpus at
https://github.com/facebookresearch/voxpopuli under an open license. (Comment: accepted to ACL 2021 as a long paper.)
Libri-Light: A Benchmark for ASR with Limited or No Supervision
We introduce a new collection of spoken English audio suitable for training speech recognition systems under limited or no supervision. It is derived from open-source audio books from the LibriVox project. It contains over 60K hours of audio, which is, to our knowledge, the largest freely-available corpus of speech. The audio has been segmented using voice activity detection and is tagged with SNR, speaker ID and genre descriptions. Additionally, we provide baseline systems and evaluation metrics working under three settings: (1) the zero resource/unsupervised setting (ABX), (2) the semi-supervised setting (PER, CER) and (3) the distant supervision setting (WER). Settings (2) and (3) use limited textual resources (10 minutes to 10 hours) aligned with the speech. Setting (3) uses large amounts of unaligned text. They are evaluated on the standard LibriSpeech dev and test sets for comparison with the supervised state-of-the-art.
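The voice-activity-detection segmentation mentioned above can be illustrated with a toy energy-based detector — a stand-in sketch, not the pipeline actually used for Libri-Light. It marks a fixed-length frame as speech when its RMS energy exceeds a threshold (the frame length and threshold values here are arbitrary illustrative choices):

```python
import math

def energy_vad(samples, frame_len=400, threshold=0.01):
    """Classify consecutive frames of an audio signal as speech/non-speech
    by thresholding per-frame RMS energy.

    samples: sequence of float amplitudes in [-1, 1].
    Returns a list of booleans, one per complete frame (True = speech).
    """
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        flags.append(rms > threshold)
    return flags
```

Runs of `True` frames would then be merged into speech segments; production systems typically replace the energy threshold with a trained classifier and add hangover smoothing at segment boundaries.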
JWSign: A Highly Multilingual Corpus of Bible Translations for more Diversity in Sign Language Processing
Advancements in sign language processing have been hindered by a lack of
sufficient data, impeding progress in recognition, translation, and production
tasks. The absence of comprehensive sign language datasets across the world's
sign languages has widened the gap in this field: a few sign
languages are studied far more than others, skewing this research area heavily
towards sign languages from high-income countries. In this work
we introduce a new large and highly multilingual dataset for sign language
translation: JWSign. The dataset consists of 2,530 hours of Bible translations
in 98 sign languages, featuring more than 1,500 individual signers. On this
dataset, we report neural machine translation experiments. Apart from bilingual
baseline systems, we also train multilingual systems, including some that take
into account the typological relatedness of signed or spoken languages. Our
experiments highlight that multilingual systems are superior to bilingual
baselines, and that in higher-resource scenarios, clustering language pairs
that are related improves translation quality. (Comment: EMNLP 2023, Findings.)
ivrit.ai: A Comprehensive Dataset of Hebrew Speech for AI Research and Development
We introduce "ivrit.ai", a comprehensive Hebrew speech dataset, addressing
the distinct lack of extensive, high-quality resources for advancing Automated
Speech Recognition (ASR) technology in Hebrew. With over 3,300 speech hours and
over a thousand diverse speakers, ivrit.ai offers a substantial compilation
of Hebrew speech across various contexts. It is delivered in three forms to
cater to varying research needs: raw unprocessed audio, data post-Voice
Activity Detection, and partially transcribed data. The dataset stands out for
its legal accessibility, permitting use at no cost, thereby serving as a
crucial resource for researchers, developers, and commercial entities. ivrit.ai
opens up numerous applications, offering vast potential to enhance AI
capabilities in Hebrew. Future efforts aim to expand ivrit.ai further, thereby
advancing Hebrew's standing in AI research and technology. (Comment: 9 pages, 1 table and 3 figures.)