Audio Language Modeling using Perceptually-Guided Discrete Representations
In this work, we study the task of Audio Language Modeling, in which we aim
to learn probabilistic models for audio that can be used for generation and
completion. We use a state-of-the-art perceptually guided audio compression
model to encode audio into discrete representations. Next, we train a
transformer-based causal language model using these representations. At
inference time, we perform audio auto-completion by encoding an audio prompt as
a discrete sequence, feeding it to the audio language model, sampling from the
model, and synthesizing the corresponding time-domain signal. We evaluate the
quality of samples generated by our method on AudioSet, the largest dataset for
general audio to date, and show that it is superior to the evaluated baseline
audio encoders. We additionally provide an extensive analysis to better
understand the trade-off between audio quality and language-modeling
capabilities. Samples: link
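As a rough illustration of the auto-completion loop described above (encode a prompt to discrete tokens, sample continuation tokens from a causal LM, then synthesize audio), here is a minimal sketch in PyTorch. The `DummyLM` and the random "prompt" are placeholders standing in for the paper's compression model and trained language model, not the actual system.

```python
import torch

@torch.no_grad()
def autocomplete(lm, prompt_tokens: torch.Tensor, max_new: int = 256,
                 temperature: float = 1.0) -> torch.Tensor:
    """Ancestral sampling: extend a discrete audio-token prompt one token at a time."""
    tokens = prompt_tokens.clone()                 # shape (1, T)
    for _ in range(max_new):
        logits = lm(tokens)[:, -1, :]              # next-token logits, shape (1, vocab)
        probs = torch.softmax(logits / temperature, dim=-1)
        next_tok = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens

class DummyLM(torch.nn.Module):
    """Toy stand-in so the sketch runs; a real model is a transformer over codec tokens."""
    def __init__(self, vocab: int = 1024, dim: int = 64):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab, dim)
        self.head = torch.nn.Linear(dim, vocab)
    def forward(self, tokens):
        return self.head(self.emb(tokens))

prompt = torch.randint(0, 1024, (1, 50))           # stands in for an encoded audio prompt
continuation = autocomplete(DummyLM(), prompt, max_new=100)
print(continuation.shape)                           # (1, 150); a codec decoder would synthesize audio from this
```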
On The Robustness of Self-Supervised Representations for Spoken Language Modeling
Self-supervised representations have been extensively studied for
discriminative and generative tasks. However, their robustness has received
comparatively little attention. This work focuses on self-supervised
representations for spoken generative language models. First, we empirically
demonstrate how current state-of-the-art speech representation models lack
robustness to basic signal variations that do not alter the spoken information.
To overcome this, we propose an effective and efficient method to learn robust
self-supervised speech representation for generative spoken language modeling.
The proposed approach is based on applying a set of signal transformations to
the speech signal and optimizing the model using an iterative pseudo-labeling
scheme. Our method significantly improves over the evaluated baselines when
considering encoding metrics. We additionally evaluate our method on the
speech-to-speech translation task. We consider Spanish-English and
French-English translation and empirically demonstrate the benefits of the
proposed approach.
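A conceptual sketch of the training idea, under the assumption that it resembles standard pseudo-labeling: discrete units extracted from the clean signal serve as targets for the model run on a transformed copy of the same signal. The `encoder`, `quantizer`, and `classifier` modules are hypothetical stand-ins, not the authors' implementation, and the augmentations shown are simple placeholders for the paper's signal transformations.

```python
import torch
import torch.nn.functional as F

def augment(wav: torch.Tensor) -> torch.Tensor:
    """Placeholder signal transformations: random gain plus additive noise."""
    gain = torch.empty(1).uniform_(0.5, 1.5)
    return gain * wav + 0.01 * torch.randn_like(wav)

def pseudo_label_step(encoder, quantizer, classifier, wav, optimizer) -> float:
    """One training step: units from the clean signal act as pseudo-labels for the augmented view."""
    with torch.no_grad():
        clean_feats = encoder(wav)              # (B, T, D) features on the clean audio
        targets = quantizer(clean_feats)        # (B, T) discrete pseudo-labels
    logits = classifier(encoder(augment(wav)))  # (B, T, n_units) predictions on the transformed audio
    loss = F.cross_entropy(logits.transpose(1, 2), targets)  # cross_entropy expects (B, C, T)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```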
Simple and Controllable Music Generation
We tackle the task of conditional music generation. We introduce MusicGen, a
single Language Model (LM) that operates over several streams of compressed
discrete music representation, i.e., tokens. Unlike prior work, MusicGen
comprises a single-stage transformer LM together with efficient
token-interleaving patterns, which eliminates the need to cascade several
models, e.g., hierarchically or via upsampling. Following this approach, we
demonstrate how MusicGen can generate high-quality samples while being
conditioned on textual descriptions or melodic features, allowing better
control over the generated output. We conduct an extensive empirical
evaluation, considering both automatic and human studies, and show that the
proposed approach is superior to the evaluated baselines on a standard
text-to-music benchmark. Through ablation studies, we shed light on the
importance of each of the components comprising MusicGen.
Music samples, code, and models are available at
https://github.com/facebookresearch/audiocraft
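To make the token-interleaving idea concrete, the toy function below flattens K parallel codebook streams with a simple "delay" pattern, so that every token a given step depends on has already been generated. This is one pattern of the family studied in the paper, shown here as an illustration rather than MusicGen's exact implementation.

```python
import numpy as np

def delay_interleave(streams: np.ndarray, pad: int = -1) -> np.ndarray:
    """streams: (K, T) codebook tokens -> (K, T + K - 1) with stream k delayed by k steps."""
    K, T = streams.shape
    out = np.full((K, T + K - 1), pad, dtype=streams.dtype)
    for k in range(K):
        out[k, k:k + T] = streams[k]
    return out

tokens = np.arange(12).reshape(3, 4)   # 3 codebooks, 4 time steps
print(delay_interleave(tokens))
# With this layout, the tokens that codebook k depends on (lower codebooks at the same
# original time step) appear in earlier columns, so a single-stage causal LM can predict
# one column at a time without a hierarchical cascade of models.
```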
Speech Resynthesis from Discrete Disentangled Self-Supervised Representations
We propose using self-supervised discrete representations for the task of
speech resynthesis. To generate disentangled representation, we separately
extract low-bitrate representations for speech content, prosodic information,
and speaker identity. This allows us to synthesize speech in a controllable
manner. We analyze various state-of-the-art, self-supervised representation
learning methods and shed light on the advantages of each method while
considering reconstruction quality and disentanglement properties.
Specifically, we evaluate the F0 reconstruction, speaker identification
performance (for both resynthesis and voice conversion), recordings'
intelligibility, and overall quality using subjective human evaluation. Lastly,
we demonstrate how these representations can be used for an ultra-lightweight
speech codec. Using the obtained representations, we can reach a rate of 365
bits per second while providing better speech quality than the baseline
methods. Audio samples can be found under the following link:
speechbot.github.io/resynthesis (In Proceedings of Interspeech 2021).
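The "365 bits per second" figure follows from how few symbols per second such a decomposition needs: each discrete stream contributes roughly its frame rate times the bits per symbol. The rates and codebook sizes below are illustrative assumptions, not the paper's exact configuration; they only show the kind of accounting involved.

```python
import math

def stream_bps(frames_per_sec: float, codebook_size: int) -> float:
    """Bitrate of one discrete stream: symbols per second times bits per symbol."""
    return frames_per_sec * math.log2(codebook_size)

content = stream_bps(50, 100)    # e.g. 50 Hz phonetic-content units with a 100-entry codebook
prosody = stream_bps(6.25, 32)   # e.g. a coarse, heavily downsampled F0 stream
speaker = 0.0                    # a single per-utterance speaker code amortizes to ~0 bps
print(f"total ≈ {content + prosody + speaker:.0f} bits/sec")
```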
AudioGen: Textually Guided Audio Generation
We tackle the problem of generating audio samples conditioned on descriptive
text captions. In this work, we propose AudioGen, an auto-regressive
generative model that generates audio samples conditioned on text inputs.
AudioGen operates on a learnt discrete audio representation. The task of
text-to-audio generation poses multiple challenges. Due to the way audio
travels through a medium, differentiating "objects" can be a difficult task
(e.g., separating multiple people simultaneously speaking). This is further
complicated by real-world recording conditions (e.g., background noise,
reverberation, etc.). Scarce text annotations impose another constraint,
limiting the ability to scale models. Finally, modeling high-fidelity audio
requires encoding audio at high sampling rate, leading to extremely long
sequences. To alleviate the aforementioned challenges we propose an
augmentation technique that mixes different audio samples, driving the model to
internally learn to separate multiple sources. We curated 10 datasets
containing different types of audio and text annotations to handle the scarcity
of text-audio data points. For faster inference, we explore the use of
multi-stream modeling, allowing the use of shorter sequences while maintaining
a similar bitrate and perceptual quality. We apply classifier-free guidance to
improve adherence to text. Compared to the evaluated baselines, AudioGen
performs better on both objective and subjective metrics. Finally, we explore
the ability of the proposed method to generate audio continuations, both conditionally and
unconditionally. Samples: https://tinyurl.com/audiogen-text2audi
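Two of the ingredients mentioned above are easy to sketch in isolation: the mixing augmentation that pushes the model to learn source separation implicitly, and classifier-free guidance at sampling time. Both snippets below are generic illustrations over raw tensors, not AudioGen's actual code; the SNR-based mixing and the guidance scale are assumptions about reasonable defaults.

```python
import torch

def mix_audio(wav_a: torch.Tensor, wav_b: torch.Tensor, snr_db: float = 0.0) -> torch.Tensor:
    """Mix two waveforms at a target SNR so the model sees overlapping sound sources."""
    power_a = wav_a.pow(2).mean()
    power_b = wav_b.pow(2).mean().clamp(min=1e-8)
    scale = torch.sqrt(power_a / (power_b * 10 ** (snr_db / 10)))
    return wav_a + scale * wav_b

def cfg_logits(cond: torch.Tensor, uncond: torch.Tensor, scale: float = 3.0) -> torch.Tensor:
    """Classifier-free guidance: push sampling toward the text-conditioned prediction."""
    return uncond + scale * (cond - uncond)

mixed = mix_audio(torch.randn(16000), torch.randn(16000), snr_db=0.0)
guided = cfg_logits(torch.randn(1, 1024), torch.randn(1, 1024))
print(mixed.shape, guided.shape)
```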
Brouhaha: multi-task training for voice activity detection, speech-to-noise ratio, and C50 room acoustics estimation
Most automatic speech processing systems are sensitive to the acoustic
environment, with degraded performance when applied to noisy or reverberant
speech. But how can one tell whether speech is noisy or reverberant? We propose
Brouhaha, a pipeline to simulate audio segments recorded in noisy and
reverberant conditions. We then use the simulated audio to jointly train the
Brouhaha model for voice activity detection, signal-to-noise ratio estimation,
and C50 room acoustics prediction. We show how the predicted SNR and C50 values
can be used to investigate and help diagnose errors made by automatic speech
processing tools (such as pyannote.audio for speaker diarization or OpenAI's
Whisper for automatic speech recognition). Both our pipeline and a pretrained
model are open source and shared with the speech community.
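The multi-task part is conceptually a shared encoder with three frame-level heads trained under a summed loss, with labels supplied by the simulation pipeline (which knows the true speech activity, SNR, and C50 of every synthetic mixture). The sketch below uses placeholder dimensions and unweighted losses; it is not Brouhaha's actual architecture.

```python
import torch
import torch.nn as nn

class MultiTaskHead(nn.Module):
    """Three frame-level heads on top of a shared encoder's features."""
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.vad = nn.Linear(feat_dim, 1)   # speech / non-speech per frame
        self.snr = nn.Linear(feat_dim, 1)   # regression: speech-to-noise ratio (dB)
        self.c50 = nn.Linear(feat_dim, 1)   # regression: C50 clarity (dB)
    def forward(self, feats):               # feats: (B, T, feat_dim)
        return self.vad(feats), self.snr(feats), self.c50(feats)

head = MultiTaskHead()
feats = torch.randn(4, 200, 128)                      # features from a shared encoder (placeholder)
vad_logits, snr_pred, c50_pred = head(feats)
vad_tgt = torch.randint(0, 2, (4, 200, 1)).float()    # labels would come from the simulation pipeline
loss = (nn.functional.binary_cross_entropy_with_logits(vad_logits, vad_tgt)
        + nn.functional.mse_loss(snr_pred, torch.randn(4, 200, 1))
        + nn.functional.mse_loss(c50_pred, torch.randn(4, 200, 1)))
print(loss.item())
```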
STOP: A dataset for Spoken Task Oriented Semantic Parsing
End-to-end spoken language understanding (SLU) predicts intent directly from
audio using a single model. It promises to improve the performance of assistant
systems by leveraging acoustic information lost in the intermediate textual
representation and preventing cascading errors from Automatic Speech
Recognition (ASR). Further, having one unified model has efficiency advantages
when deploying assistant systems on-device. However, the limited number of
public audio datasets with semantic parse labels hinders the research progress
in this area. In this paper, we release the Spoken Task-Oriented semantic
Parsing (STOP) dataset, the largest and most complex publicly available SLU
dataset. Additionally, we define low-resource splits to establish a benchmark
for improving SLU when limited labeled data is available. Furthermore, in
addition to the human-recorded audio, we are releasing a TTS-generated version
to benchmark the performance for low-resource domain adaptation of end-to-end
SLU systems. Initial experiments show end-to-end SLU models performing
slightly worse than their cascaded counterparts, which we hope encourages
future work in this direction.
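The cascaded-versus-end-to-end distinction the dataset is meant to probe can be stated in a few lines: a cascade transcribes first and parses the text, so ASR errors propagate, while an end-to-end model maps audio directly to a parse. The functions and the example parse string below are illustrative placeholders, not the dataset's exact schema or any released baseline.

```python
from typing import Callable

def cascaded_slu(audio: bytes, asr: Callable[[bytes], str],
                 parser: Callable[[str], str]) -> str:
    transcript = asr(audio)      # any recognition error here propagates to the parse
    return parser(transcript)

def end_to_end_slu(audio: bytes, slu_model: Callable[[bytes], str]) -> str:
    return slu_model(audio)      # one model; can also exploit acoustic cues lost in text

# Toy stand-ins so the sketch runs end to end:
parse = cascaded_slu(b"<waveform>",
                     asr=lambda a: "set an alarm for 8 am",
                     parser=lambda t: "[IN:CREATE_ALARM [SL:DATE_TIME 8 am ] ]")
print(parse)
```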
Textless Speech Emotion Conversion using Discrete & Decomposed Representations
Speech emotion conversion is the task of modifying the perceived emotion of a speech utterance while preserving the lexical content and speaker identity. In this study, we cast the problem of emotion conversion as a spoken language translation task. We use a decomposition of the speech signal into discrete learned representations, consisting of phonetic-content units, prosodic features, speaker, and emotion. First, we modify the speech content by translating the phonetic-content units to a target emotion, and then predict the prosodic features based on these units. Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder. Such a paradigm allows us to go beyond spectral and parametric changes of the signal, and model non-verbal vocalizations, such as laughter insertion, yawning removal, etc. We demonstrate objectively and subjectively that the proposed method is vastly superior to current approaches and even beats text-based systems in terms of perceived emotion and audio quality. We rigorously evaluate all components of such a complex system and conclude with an extensive model analysis and ablation study to better emphasize the architectural choices, strengths and weaknesses of the proposed method. Samples are available under the following link: https://speechbot.github.io/emotion
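Schematically, the pipeline described above decomposes the utterance, "translates" the content units toward the target emotion, re-predicts prosody from the translated units, and vocodes the result. Every module name in the sketch below is a hypothetical placeholder for the corresponding trained component, not the authors' interface.

```python
def convert_emotion(wav, target_emotion, encoder, translator, prosody_predictor, vocoder):
    """High-level flow only; each argument stands in for a trained component."""
    units, speaker = encoder(wav)                # discrete phonetic-content units + speaker code
    units_t = translator(units, target_emotion)  # seq2seq translation of units; can insert or
                                                 # remove non-verbal vocalizations (laughter, yawns)
    f0, duration = prosody_predictor(units_t, target_emotion, speaker)
    return vocoder(units_t, f0, duration, speaker, target_emotion)
```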
Text-Free Prosody-Aware Generative Spoken Language Modeling
Speech pre-training has primarily demonstrated efficacy on classification tasks, while its capability of generating novel speech, similar to how GPT-2 can generate coherent paragraphs, has barely been explored. Generative Spoken Language Modeling (GSLM) (Lakhotia et al., 2021) is the only prior work addressing the generative aspects of speech pre-training, which replaces text with discovered phone-like units for language modeling and shows the ability to generate meaningful novel sentences. Unfortunately, despite eliminating the need for text, the units used in GSLM discard most of the prosodic information. Hence, GSLM fails to leverage prosody for better comprehension, and does not generate expressive speech. In this work, we present a prosody-aware generative spoken language model (pGSLM). It is composed of a multi-stream transformer language model (MS-TLM) of speech, represented as discovered unit and prosodic feature streams, and an adapted HiFi-GAN model converting MS-TLM outputs to waveforms. We devise a series of metrics for prosody modeling and generation, and re-use metrics from GSLM for content modeling. Experimental results show that pGSLM can utilize prosody to improve both prosody and content modeling, and also generate natural, meaningful, and coherent speech given a spoken prompt.
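A compact sketch of the multi-stream idea behind the MS-TLM: each step carries a discrete unit plus continuous prosodic features, the streams are embedded and summed, and separate heads predict the next unit and the next prosodic values. The dimensions and the two-value prosody parameterization below are assumptions made for illustration only; they are not the model's actual configuration.

```python
import torch
import torch.nn as nn

class TinyMSTLM(nn.Module):
    """Toy multi-stream LM: embed unit + prosody streams jointly, predict both for the next step."""
    def __init__(self, n_units: int = 100, dim: int = 64):
        super().__init__()
        self.unit_emb = nn.Embedding(n_units, dim)
        self.prosody_in = nn.Linear(2, dim)        # e.g. (log-F0, duration) per step -- an assumption
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=2)
        self.unit_head = nn.Linear(dim, n_units)
        self.prosody_head = nn.Linear(dim, 2)

    def forward(self, units, prosody):             # units: (B, T), prosody: (B, T, 2)
        x = self.unit_emb(units) + self.prosody_in(prosody)
        T = units.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.body(x, mask=causal)               # causal self-attention over the joint stream
        return self.unit_head(h), self.prosody_head(h)

model = TinyMSTLM()
unit_logits, prosody_pred = model(torch.randint(0, 100, (2, 20)), torch.randn(2, 20, 2))
print(unit_logits.shape, prosody_pred.shape)        # (2, 20, 100) and (2, 20, 2)
```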