Audio Language Modeling using Perceptually-Guided Discrete Representations
In this work, we study the task of Audio Language Modeling, in which we aim
to learn probabilistic models for audio that can be used for generation and
completion. We use a state-of-the-art perceptually guided audio compression
model to encode audio into discrete representations. Next, we train a
transformer-based causal language model using these representations. At
inference time, we perform audio auto-completion by encoding an audio prompt as
a discrete sequence, feeding it to the audio language model, sampling from the
model, and synthesizing the corresponding time-domain signal. We evaluate the
quality of samples generated by our method on AudioSet, the largest dataset for
general audio to date, and show that it is superior to the evaluated baseline
audio encoders. We additionally provide an extensive analysis to better
understand the trade-off between audio quality and language-modeling
capabilities. Samples: link
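As a rough illustration of the auto-completion loop described above (encode a prompt to discrete tokens, sample continuation tokens from a causal LM, then synthesize audio), here is a minimal sketch in PyTorch. The `DummyLM` and the random "prompt" are placeholders standing in for the paper's compression model and trained language model, not the actual system.

```python
import torch

@torch.no_grad()
def autocomplete(lm, prompt_tokens: torch.Tensor, max_new: int = 256,
                 temperature: float = 1.0) -> torch.Tensor:
    """Ancestral sampling: extend a discrete audio-token prompt one token at a time."""
    tokens = prompt_tokens.clone()                 # shape (1, T)
    for _ in range(max_new):
        logits = lm(tokens)[:, -1, :]              # next-token logits, shape (1, vocab)
        probs = torch.softmax(logits / temperature, dim=-1)
        next_tok = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens

class DummyLM(torch.nn.Module):
    """Toy stand-in so the sketch runs; a real model is a transformer over codec tokens."""
    def __init__(self, vocab: int = 1024, dim: int = 64):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab, dim)
        self.head = torch.nn.Linear(dim, vocab)
    def forward(self, tokens):
        return self.head(self.emb(tokens))

prompt = torch.randint(0, 1024, (1, 50))           # stands in for an encoded audio prompt
continuation = autocomplete(DummyLM(), prompt, max_new=100)
print(continuation.shape)                           # (1, 150); a codec decoder would synthesize audio from this
```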
On The Robustness of Self-Supervised Representations for Spoken Language Modeling
Self-supervised representations have been extensively studied for
discriminative and generative tasks. However, their robustness has received
comparatively little attention. This work focuses on self-supervised
representations for spoken generative language models. First, we empirically
demonstrate how current state-of-the-art speech representation models lack
robustness to basic signal variations that do not alter the spoken information.
To overcome this, we propose an effective and efficient method to learn robust
self-supervised speech representation for generative spoken language modeling.
The proposed approach is based on applying a set of signal transformations to
the speech signal and optimizing the model using an iterative pseudo-labeling
scheme. Our method significantly improves over the evaluated baselines when
considering encoding metrics. We additionally evaluate our method on the
speech-to-speech translation task. We consider Spanish-English and
French-English translation and empirically demonstrate the benefits of the
proposed approach.
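A conceptual sketch of the training idea, under the assumption that it resembles standard pseudo-labeling: discrete units extracted from the clean signal serve as targets for the model run on a transformed copy of the same signal. The `encoder`, `quantizer`, and `classifier` modules are hypothetical stand-ins, not the authors' implementation, and the augmentations shown are simple placeholders for the paper's signal transformations.

```python
import torch
import torch.nn.functional as F

def augment(wav: torch.Tensor) -> torch.Tensor:
    """Placeholder signal transformations: random gain plus additive noise."""
    gain = torch.empty(1).uniform_(0.5, 1.5)
    return gain * wav + 0.01 * torch.randn_like(wav)

def pseudo_label_step(encoder, quantizer, classifier, wav, optimizer) -> float:
    """One training step: units from the clean signal act as pseudo-labels for the augmented view."""
    with torch.no_grad():
        clean_feats = encoder(wav)              # (B, T, D) features on the clean audio
        targets = quantizer(clean_feats)        # (B, T) discrete pseudo-labels
    logits = classifier(encoder(augment(wav)))  # (B, T, n_units) predictions on the transformed audio
    loss = F.cross_entropy(logits.transpose(1, 2), targets)  # cross_entropy expects (B, C, T)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```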
Simple and Controllable Music Generation
We tackle the task of conditional music generation. We introduce MusicGen, a
single Language Model (LM) that operates over several streams of compressed
discrete music representation, i.e., tokens. Unlike prior work, MusicGen
comprises a single-stage transformer LM together with efficient
token-interleaving patterns, which eliminates the need to cascade several
models, e.g., hierarchically or via upsampling. Following this approach, we
demonstrate how MusicGen can generate high-quality samples while being
conditioned on textual descriptions or melodic features, allowing better
control over the generated output. We conduct an extensive empirical
evaluation, considering both automatic and human studies, and show that the
proposed approach is superior to the evaluated baselines on a standard
text-to-music benchmark. Through ablation studies, we shed light on the
importance of each of the components comprising MusicGen.
Music samples, code, and models are available at
https://github.com/facebookresearch/audiocraft
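To make the token-interleaving idea concrete, the toy function below flattens K parallel codebook streams with a simple "delay" pattern, so that every token a given step depends on has already been generated. This is one pattern of the family studied in the paper, shown here as an illustration rather than MusicGen's exact implementation.

```python
import numpy as np

def delay_interleave(streams: np.ndarray, pad: int = -1) -> np.ndarray:
    """streams: (K, T) codebook tokens -> (K, T + K - 1) with stream k delayed by k steps."""
    K, T = streams.shape
    out = np.full((K, T + K - 1), pad, dtype=streams.dtype)
    for k in range(K):
        out[k, k:k + T] = streams[k]
    return out

tokens = np.arange(12).reshape(3, 4)   # 3 codebooks, 4 time steps
print(delay_interleave(tokens))
# With this layout, the tokens that codebook k depends on (lower codebooks at the same
# original time step) appear in earlier columns, so a single-stage causal LM can predict
# one column at a time without a hierarchical cascade of models.
```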
Speech Resynthesis from Discrete Disentangled Self-Supervised Representations
We propose using self-supervised discrete representations for the task of
speech resynthesis. To generate disentangled representation, we separately
extract low-bitrate representations for speech content, prosodic information,
and speaker identity. This allows us to synthesize speech in a controllable
manner. We analyze various state-of-the-art, self-supervised representation
learning methods and shed light on the advantages of each method while
considering reconstruction quality and disentanglement properties.
Specifically, we evaluate the F0 reconstruction, speaker identification
performance (for both resynthesis and voice conversion), recordings'
intelligibility, and overall quality using subjective human evaluation. Lastly,
we demonstrate how these representations can be used for an ultra-lightweight
speech codec. Using the obtained representations, we can reach a rate of 365
bits per second while providing better speech quality than the baseline
methods. Audio samples can be found under the following link:
speechbot.github.io/resynthesis (In Proceedings of Interspeech 2021).
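The "365 bits per second" figure follows from how few symbols per second such a decomposition needs: each discrete stream contributes roughly its frame rate times the bits per symbol. The rates and codebook sizes below are illustrative assumptions, not the paper's exact configuration; they only show the kind of accounting involved.

```python
import math

def stream_bps(frames_per_sec: float, codebook_size: int) -> float:
    """Bitrate of one discrete stream: symbols per second times bits per symbol."""
    return frames_per_sec * math.log2(codebook_size)

content = stream_bps(50, 100)    # e.g. 50 Hz phonetic-content units with a 100-entry codebook
prosody = stream_bps(6.25, 32)   # e.g. a coarse, heavily downsampled F0 stream
speaker = 0.0                    # a single per-utterance speaker code amortizes to ~0 bps
print(f"total ≈ {content + prosody + speaker:.0f} bits/sec")
```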
AudioGen: Textually Guided Audio Generation
We tackle the problem of generating audio samples conditioned on descriptive
text captions. In this work, we propose AudioGen, an auto-regressive
generative model that generates audio samples conditioned on text inputs.
AudioGen operates on a learnt discrete audio representation. The task of
text-to-audio generation poses multiple challenges. Due to the way audio
travels through a medium, differentiating "objects" can be a difficult task
(e.g., separating multiple people simultaneously speaking). This is further
complicated by real-world recording conditions (e.g., background noise,
reverberation, etc.). Scarce text annotations impose another constraint,
limiting the ability to scale models. Finally, modeling high-fidelity audio
requires encoding audio at high sampling rate, leading to extremely long
sequences. To alleviate the aforementioned challenges we propose an
augmentation technique that mixes different audio samples, driving the model to
internally learn to separate multiple sources. We curated 10 datasets
containing different types of audio and text annotations to handle the scarcity
of text-audio data points. For faster inference, we explore the use of
multi-stream modeling, allowing the use of shorter sequences while maintaining
a similar bitrate and perceptual quality. We apply classifier-free guidance to
improve adherence to text. Compared to the evaluated baselines, AudioGen
performs better on both objective and subjective metrics. Finally, we explore
the ability of the proposed method to generate audio continuations, both conditionally and
unconditionally. Samples: https://tinyurl.com/audiogen-text2audi
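Two of the ingredients mentioned above are easy to sketch in isolation: the mixing augmentation that pushes the model to learn source separation implicitly, and classifier-free guidance at sampling time. Both snippets below are generic illustrations over raw tensors, not AudioGen's actual code; the SNR-based mixing and the guidance scale are assumptions about reasonable defaults.

```python
import torch

def mix_audio(wav_a: torch.Tensor, wav_b: torch.Tensor, snr_db: float = 0.0) -> torch.Tensor:
    """Mix two waveforms at a target SNR so the model sees overlapping sound sources."""
    power_a = wav_a.pow(2).mean()
    power_b = wav_b.pow(2).mean().clamp(min=1e-8)
    scale = torch.sqrt(power_a / (power_b * 10 ** (snr_db / 10)))
    return wav_a + scale * wav_b

def cfg_logits(cond: torch.Tensor, uncond: torch.Tensor, scale: float = 3.0) -> torch.Tensor:
    """Classifier-free guidance: push sampling toward the text-conditioned prediction."""
    return uncond + scale * (cond - uncond)

mixed = mix_audio(torch.randn(16000), torch.randn(16000), snr_db=0.0)
guided = cfg_logits(torch.randn(1, 1024), torch.randn(1, 1024))
print(mixed.shape, guided.shape)
```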
Brouhaha: multi-task training for voice activity detection, speech-to-noise ratio, and C50 room acoustics estimation
Most automatic speech processing systems are sensitive to the acoustic
environment, with degraded performance when applied to noisy or reverberant
speech. But how can one tell whether speech is noisy or reverberant? We propose
Brouhaha, a pipeline to simulate audio segments recorded in noisy and
reverberant conditions. We then use the simulated audio to jointly train the
Brouhaha model for voice activity detection, signal-to-noise ratio estimation,
and C50 room acoustics prediction. We show how the predicted SNR and C50 values
can be used to investigate and help diagnose errors made by automatic speech
processing tools (such as pyannote.audio for speaker diarization or OpenAI's
Whisper for automatic speech recognition). Both our pipeline and a pretrained
model are open source and shared with the speech community.
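The multi-task part is conceptually a shared encoder with three frame-level heads trained under a summed loss, with labels supplied by the simulation pipeline (which knows the true speech activity, SNR, and C50 of every synthetic mixture). The sketch below uses placeholder dimensions and unweighted losses; it is not Brouhaha's actual architecture.

```python
import torch
import torch.nn as nn

class MultiTaskHead(nn.Module):
    """Three frame-level heads on top of a shared encoder's features."""
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.vad = nn.Linear(feat_dim, 1)   # speech / non-speech per frame
        self.snr = nn.Linear(feat_dim, 1)   # regression: speech-to-noise ratio (dB)
        self.c50 = nn.Linear(feat_dim, 1)   # regression: C50 clarity (dB)
    def forward(self, feats):               # feats: (B, T, feat_dim)
        return self.vad(feats), self.snr(feats), self.c50(feats)

head = MultiTaskHead()
feats = torch.randn(4, 200, 128)                      # features from a shared encoder (placeholder)
vad_logits, snr_pred, c50_pred = head(feats)
vad_tgt = torch.randint(0, 2, (4, 200, 1)).float()    # labels would come from the simulation pipeline
loss = (nn.functional.binary_cross_entropy_with_logits(vad_logits, vad_tgt)
        + nn.functional.mse_loss(snr_pred, torch.randn(4, 200, 1))
        + nn.functional.mse_loss(c50_pred, torch.randn(4, 200, 1)))
print(loss.item())
```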
STOP: A dataset for Spoken Task Oriented Semantic Parsing
End-to-end spoken language understanding (SLU) predicts intent directly from
audio using a single model. It promises to improve the performance of assistant
systems by leveraging acoustic information lost in the intermediate textual
representation and preventing cascading errors from Automatic Speech
Recognition (ASR). Further, having one unified model has efficiency advantages
when deploying assistant systems on-device. However, the limited number of
public audio datasets with semantic parse labels hinders the research progress
in this area. In this paper, we release the Spoken Task-Oriented semantic
Parsing (STOP) dataset, the largest and most complex publicly available SLU
dataset. Additionally, we define low-resource splits to establish a benchmark
for improving SLU when limited labeled data is available. Furthermore, in
addition to the human-recorded audio, we are releasing a TTS-generated version
to benchmark the performance for low-resource domain adaptation of end-to-end
SLU systems. Initial experiments show end-to-end SLU models performing
slightly worse than their cascaded counterparts, which we hope encourages
future work in this direction.
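The cascaded-versus-end-to-end distinction the dataset is meant to probe can be stated in a few lines: a cascade transcribes first and parses the text, so ASR errors propagate, while an end-to-end model maps audio directly to a parse. The functions and the example parse string below are illustrative placeholders, not the dataset's exact schema or any released baseline.

```python
from typing import Callable

def cascaded_slu(audio: bytes, asr: Callable[[bytes], str],
                 parser: Callable[[str], str]) -> str:
    transcript = asr(audio)      # any recognition error here propagates to the parse
    return parser(transcript)

def end_to_end_slu(audio: bytes, slu_model: Callable[[bytes], str]) -> str:
    return slu_model(audio)      # one model; can also exploit acoustic cues lost in text

# Toy stand-ins so the sketch runs end to end:
parse = cascaded_slu(b"<waveform>",
                     asr=lambda a: "set an alarm for 8 am",
                     parser=lambda t: "[IN:CREATE_ALARM [SL:DATE_TIME 8 am ] ]")
print(parse)
```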
Textless Speech Emotion Conversion using Discrete & Decomposed Representations
Speech emotion conversion is the task of modifying the perceived emotion of a speech utterance while preserving the lexical content and speaker identity. In this study, we cast the problem of emotion conversion as a spoken language translation task. We use a decomposition of the speech signal into discrete learned representations, consisting of phonetic-content units, prosodic features, speaker, and emotion. First, we modify the speech content by translating the phonetic-content units to a target emotion, and then predict the prosodic features based on these units. Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder. Such a paradigm allows us to go beyond spectral and parametric changes of the signal, and model non-verbal vocalizations, such as laughter insertion, yawning removal, etc. We demonstrate objectively and subjectively that the proposed method is vastly superior to current approaches and even beats text-based systems in terms of perceived emotion and audio quality. We rigorously evaluate all components of such a complex system and conclude with an extensive model analysis and ablation study to better emphasize the architectural choices, strengths and weaknesses of the proposed method. Samples are available under the following link: https://speechbot.github.io/emotion
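Schematically, the pipeline described above decomposes the utterance, "translates" the content units toward the target emotion, re-predicts prosody from the translated units, and vocodes the result. Every module name in the sketch below is a hypothetical placeholder for the corresponding trained component, not the authors' interface.

```python
def convert_emotion(wav, target_emotion, encoder, translator, prosody_predictor, vocoder):
    """High-level flow only; each argument stands in for a trained component."""
    units, speaker = encoder(wav)                # discrete phonetic-content units + speaker code
    units_t = translator(units, target_emotion)  # seq2seq translation of units; can insert or
                                                 # remove non-verbal vocalizations (laughter, yawns)
    f0, duration = prosody_predictor(units_t, target_emotion, speaker)
    return vocoder(units_t, f0, duration, speaker, target_emotion)
```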
Text-Free Prosody-Aware Generative Spoken Language Modeling
Speech pre-training has primarily demonstrated efficacy on classification tasks, while its capability of generating novel speech, similar to how GPT-2 can generate coherent paragraphs, has barely been explored. Generative Spoken Language Modeling (GSLM) (Lakhotia et al., 2021) is the only prior work addressing the generative aspects of speech pre-training, which replaces text with discovered phone-like units for language modeling and shows the ability to generate meaningful novel sentences. Unfortunately, despite eliminating the need for text, the units used in GSLM discard most of the prosodic information. Hence, GSLM fails to leverage prosody for better comprehension, and does not generate expressive speech. In this work, we present a prosody-aware generative spoken language model (pGSLM). It is composed of a multi-stream transformer language model (MS-TLM) of speech, represented as discovered unit and prosodic feature streams, and an adapted HiFi-GAN model converting MS-TLM outputs to waveforms. We devise a series of metrics for prosody modeling and generation, and re-use metrics from GSLM for content modeling. Experimental results show that pGSLM can utilize prosody to improve both prosody and content modeling, and also generate natural, meaningful, and coherent speech given a spoken prompt.
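A compact sketch of the multi-stream idea behind the MS-TLM: each step carries a discrete unit plus continuous prosodic features, the streams are embedded and summed, and separate heads predict the next unit and the next prosodic values. The dimensions and the two-value prosody parameterization below are assumptions made for illustration only; they are not the model's actual configuration.

```python
import torch
import torch.nn as nn

class TinyMSTLM(nn.Module):
    """Toy multi-stream LM: embed unit + prosody streams jointly, predict both for the next step."""
    def __init__(self, n_units: int = 100, dim: int = 64):
        super().__init__()
        self.unit_emb = nn.Embedding(n_units, dim)
        self.prosody_in = nn.Linear(2, dim)        # e.g. (log-F0, duration) per step -- an assumption
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=2)
        self.unit_head = nn.Linear(dim, n_units)
        self.prosody_head = nn.Linear(dim, 2)

    def forward(self, units, prosody):             # units: (B, T), prosody: (B, T, 2)
        x = self.unit_emb(units) + self.prosody_in(prosody)
        T = units.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.body(x, mask=causal)               # causal self-attention over the joint stream
        return self.unit_head(h), self.prosody_head(h)

model = TinyMSTLM()
unit_logits, prosody_pred = model(torch.randint(0, 100, (2, 20)), torch.randn(2, 20, 2))
print(unit_logits.shape, prosody_pred.shape)        # (2, 20, 100) and (2, 20, 2)
```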