Scaling Speech Technology to 1,000+ Languages
Expanding the language coverage of speech technology has the potential to
improve access to information for many more people. However, current speech
technology is restricted to about one hundred languages which is a small
fraction of the over 7,000 languages spoken around the world. The Massively
Multilingual Speech (MMS) project increases the number of supported languages
by 10-40x, depending on the task. The main ingredients are a new dataset based
on readings of publicly available religious texts and effectively leveraging
self-supervised learning. We built pre-trained wav2vec 2.0 models covering
1,406 languages, a single multilingual automatic speech recognition model for
1,107 languages, speech synthesis models for the same number of languages, as
well as a language identification model for 4,017 languages. Experiments show
that our multilingual speech recognition model more than halves the word error
rate of Whisper on 54 languages of the FLEURS benchmark while being trained on
a small fraction of the labeled data.
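As an illustration of the word error rate (WER) metric referenced above — this is a minimal standalone sketch, not the authors' evaluation code — WER is the word-level edit distance between a reference transcript and a hypothesis, normalized by the reference length:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance divided by
    the number of words in the reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

For example, `wer("the cat sat", "the cat sit")` yields one substitution over three reference words, i.e. about 0.33. "Halving the WER of Whisper" means this number drops by more than 50% on the FLEURS test transcripts.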
MLS: A Large-Scale Multilingual Dataset for Speech Research
This paper introduces the Multilingual LibriSpeech (MLS) dataset, a large
multilingual corpus suitable for speech research. The dataset is derived from
read audiobooks from LibriVox and consists of 8 languages, including about
44.5K hours of English and a total of about 6K hours for other languages.
Additionally, we provide Language Models (LM) and baseline Automatic Speech
Recognition (ASR) models for all the languages in our dataset. We believe
such a large transcribed dataset will open new avenues in ASR and
Text-To-Speech (TTS) research. The dataset will be made freely available for
anyone at http://www.openslr.org.
VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation
We introduce VoxPopuli, a large-scale multilingual corpus providing 100K
hours of unlabelled speech data in 23 languages. It is the largest open dataset to
date for unsupervised representation learning as well as semi-supervised
learning. VoxPopuli also contains 1.8K hours of transcribed speeches in 16
languages and their aligned oral interpretations into 5 other languages
totaling 5.1K hours. We provide speech recognition baselines and validate the
versatility of VoxPopuli unlabelled data in semi-supervised learning under
challenging out-of-domain settings. We will release the corpus at
https://github.com/facebookresearch/voxpopuli under an open license. (Comment: accepted to ACL 2021 as a long paper.)
Libri-Light: A Benchmark for ASR with Limited or No Supervision
We introduce a new collection of spoken English audio suitable for training speech recognition systems under limited or no supervision. It is derived from open-source audio books from the LibriVox project. It contains over 60K hours of audio, which is, to our knowledge, the largest freely-available corpus of speech. The audio has been segmented using voice activity detection and is tagged with SNR, speaker ID and genre descriptions. Additionally, we provide baseline systems and evaluation metrics working under three settings: (1) the zero resource/unsupervised setting (ABX), (2) the semi-supervised setting (PER, CER) and (3) the distant supervision setting (WER). Settings (2) and (3) use limited textual resources (10 minutes to 10 hours) aligned with the speech. Setting (3) uses large amounts of unaligned text. They are evaluated on the standard LibriSpeech dev and test sets for comparison with the supervised state-of-the-art.
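The voice-activity-detection segmentation mentioned above can be illustrated with a toy energy-based detector — a stand-in sketch, not the pipeline actually used for Libri-Light. It marks a fixed-length frame as speech when its RMS energy exceeds a threshold (the frame length and threshold values here are arbitrary illustrative choices):

```python
import math

def energy_vad(samples, frame_len=400, threshold=0.01):
    """Classify consecutive frames of an audio signal as speech/non-speech
    by thresholding per-frame RMS energy.

    samples: sequence of float amplitudes in [-1, 1].
    Returns a list of booleans, one per complete frame (True = speech).
    """
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        flags.append(rms > threshold)
    return flags
```

Runs of `True` frames would then be merged into speech segments; production systems typically replace the energy threshold with a trained classifier and add hangover smoothing at segment boundaries.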
JWSign: A Highly Multilingual Corpus of Bible Translations for more Diversity in Sign Language Processing
Advancements in sign language processing have been hindered by a lack of
sufficient data, impeding progress in recognition, translation, and production
tasks. The absence of comprehensive sign language datasets across the world's
sign languages has widened the gap in this field: a few sign
languages are studied far more than others, skewing this research area heavily
towards sign languages from high-income countries. In this work
we introduce a new large and highly multilingual dataset for sign language
translation: JWSign. The dataset consists of 2,530 hours of Bible translations
in 98 sign languages, featuring more than 1,500 individual signers. On this
dataset, we report neural machine translation experiments. Apart from bilingual
baseline systems, we also train multilingual systems, including some that take
into account the typological relatedness of signed or spoken languages. Our
experiments highlight that multilingual systems are superior to bilingual
baselines, and that in higher-resource scenarios, clustering language pairs
that are related improves translation quality. (Comment: EMNLP 2023, Findings.)
ivrit.ai: A Comprehensive Dataset of Hebrew Speech for AI Research and Development
We introduce "ivrit.ai", a comprehensive Hebrew speech dataset, addressing
the distinct lack of extensive, high-quality resources for advancing Automated
Speech Recognition (ASR) technology in Hebrew. With over 3,300 speech hours and
over a thousand diverse speakers, ivrit.ai offers a substantial compilation
of Hebrew speech across various contexts. It is delivered in three forms to
cater to varying research needs: raw unprocessed audio, data post-Voice
Activity Detection, and partially transcribed data. The dataset stands out for
its legal accessibility, permitting use at no cost, thereby serving as a
crucial resource for researchers, developers, and commercial entities. ivrit.ai
opens up numerous applications, offering vast potential to enhance AI
capabilities in Hebrew. Future efforts aim to expand ivrit.ai further, thereby
advancing Hebrew's standing in AI research and technology. (Comment: 9 pages, 1 table and 3 figures.)