14 research outputs found
Self-Supervised Contrastive Learning for Unsupervised Phoneme Segmentation
We propose a self-supervised representation learning model for the task of
unsupervised phoneme boundary detection. The model is a convolutional neural
network that operates directly on the raw waveform. It is optimized to identify
spectral changes in the signal using the Noise-Contrastive Estimation
principle. At test time, a peak detection algorithm is applied over the model
outputs to produce the final boundaries. As such, the proposed model is trained
in a fully unsupervised manner with no manual annotations in the form of target
boundaries or phonetic transcriptions. We compare the proposed approach to
several unsupervised baselines using both TIMIT and Buckeye corpora. Results
suggest that our approach surpasses the baseline models and reaches
state-of-the-art performance on both data sets. Furthermore, we experimented
with expanding the training set with additional examples from the Librispeech
corpus. We evaluated the resulting model on distributions and languages that
were not seen during the training phase (English, Hebrew and German) and showed
that utilizing additional untranscribed data is beneficial for model
performance.
Comment: Interspeech 2020 paper
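The abstract above only sketches the inference pipeline in words. Below is a minimal, hypothetical sketch of that pipeline, not the authors' code: a strided convolutional encoder over the raw waveform, a frame-level spectral-change score between adjacent frames, and a peak detector over the scores that emits boundaries. The `ConvEncoder` configuration, the cosine-dissimilarity score, and the threshold value are illustrative assumptions.

```python
import torch
import torch.nn as nn


class ConvEncoder(nn.Module):
    """Strided 1-D convolutions over the raw waveform (hypothetical configuration)."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=10, stride=5), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=4, stride=2), nn.ReLU(),
        )

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples) -> features: (batch, frames, dim)
        return self.net(wav.unsqueeze(1)).transpose(1, 2)


def boundary_scores(feats: torch.Tensor) -> torch.Tensor:
    # High cosine dissimilarity between adjacent frames signals a spectral change.
    a, b = feats[:, :-1], feats[:, 1:]
    return 1.0 - torch.cosine_similarity(a, b, dim=-1)  # (batch, frames - 1)


def peak_detect(scores: torch.Tensor, threshold: float = 0.3) -> list:
    # A frame is a boundary candidate if it is a local maximum above the threshold.
    s = scores[0]
    return [
        t for t in range(1, len(s) - 1)
        if s[t] > s[t - 1] and s[t] > s[t + 1] and s[t] > threshold
    ]


encoder = ConvEncoder()
wav = torch.randn(1, 16000)  # one second of 16 kHz audio
boundaries = peak_detect(boundary_scores(encoder(wav)))
```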
Self-supervised Speaker Diarization
Over the last few years, deep learning has grown in popularity for speaker
verification, identification, and diarization. Inarguably, a significant part
of this success is due to the demonstrated effectiveness of the learned
speaker representations. These, however, are heavily dependent on large amounts of
annotated data and can be sensitive to new domains. This study proposes an
entirely unsupervised deep-learning model for speaker diarization.
Specifically, the study focuses on generating high-quality neural speaker
representations without any annotated data, as well as on estimating secondary
hyperparameters of the model without annotations.
The speaker embeddings are produced by an encoder trained in a
self-supervised fashion using pairs of adjacent segments assumed to be of the
same speaker. The trained encoder model is then used to self-generate
pseudo-labels to subsequently train a similarity score between different
segments of the same call using probabilistic linear discriminant analysis
(PLDA) and further to learn a clustering stopping threshold. We compared our
model to state-of-the-art unsupervised as well as supervised baselines on the
CallHome benchmarks. According to empirical results, our approach outperforms
unsupervised methods when only two speakers are present in the call, and is
only slightly worse than recent supervised models.
Comment: Submitted to Interspeech 202
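As a rough illustration of the self-supervised step described above, the sketch below (an assumption, not the authors' implementation) treats two adjacent segments from the same call as a same-speaker positive pair and uses the other segments in the batch as negatives in a contrastive objective. The temperature value and batch construction are illustrative.

```python
import torch
import torch.nn.functional as F


def contrastive_loss(emb_a: torch.Tensor, emb_b: torch.Tensor, temp: float = 0.1):
    """emb_a[i] and emb_b[i] are embeddings of two adjacent segments (positives)."""
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    logits = emb_a @ emb_b.t() / temp       # (batch, batch) similarity matrix
    targets = torch.arange(emb_a.size(0))   # the diagonal holds the positive pairs
    return F.cross_entropy(logits, targets)
```

After this stage, the frozen encoder self-generates pseudo-labels (e.g. by clustering its embeddings), which are then used to fit the PLDA similarity score and to tune the clustering stopping threshold mentioned in the abstract.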
Audio Language Modeling using Perceptually-Guided Discrete Representations
In this work, we study the task of Audio Language Modeling, in which we aim
to learn probabilistic models for audio that can be used for generation and
completion. We use a state-of-the-art perceptually-guided audio compression
model to encode audio into discrete representations. Next, we train a
transformer-based causal language model using these representations. At
inference time, we perform audio auto-completion by encoding an audio prompt as
a discrete sequence, feeding it to the audio language model, sampling from the
model, and synthesizing the corresponding time-domain signal. We evaluate the
quality of samples generated by our method on Audioset, the largest dataset for
general audio to date, and show that it is superior to the evaluated baseline
audio encoders. We additionally provide an extensive analysis to better
understand the trade-off between audio quality and language-modeling
capabilities.
Samples: link
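The auto-completion procedure described above can be summarized in a few lines. The `codec.encode` / `codec.decode` interface and the `lm` callable below are illustrative placeholders for the perceptually-guided compression model and the transformer-based causal LM; they are not a specific library API.

```python
import torch


@torch.no_grad()
def autocomplete(codec, lm, prompt_wav: torch.Tensor, n_new_tokens: int = 200):
    tokens = codec.encode(prompt_wav)             # waveform -> discrete token ids
    for _ in range(n_new_tokens):
        logits = lm(tokens)[:, -1]                # causal LM: predict the next token
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, 1)  # sample the continuation
        tokens = torch.cat([tokens, next_token], dim=1)
    return codec.decode(tokens)                   # token ids -> time-domain signal
```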
On The Robustness of Self-Supervised Representations for Spoken Language Modeling
Self-supervised representations have been extensively studied for
discriminative and generative tasks. However, their robustness has not been
thoroughly investigated. This work focuses on self-supervised
representations for spoken generative language models. First, we empirically
demonstrate how current state-of-the-art speech representation models lack
robustness to basic signal variations that do not alter the spoken information.
To overcome this, we propose an effective and efficient method to learn robust
self-supervised speech representation for generative spoken language modeling.
The proposed approach is based on applying a set of signal transformations to
the speech signal and optimizing the model using an iterative pseudo-labeling
scheme. Our method significantly improves over the evaluated baselines when
considering encoding metrics. We additionally evaluate our method on the
speech-to-speech translation task. We consider Spanish-English and
French-English conversions and empirically demonstrate the benefits of
following the proposed approach.
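The sketch below is one way the training scheme described above could look; it is an assumption about the loop, not the released code. Discrete pseudo-labels are extracted from the clean signal, and the model is trained to predict the same labels from a transformed version of that signal; the `quantize`, `transform`, and `model.classify` interfaces are hypothetical.

```python
import torch
import torch.nn.functional as F


def robustness_step(model, quantize, transform, wav: torch.Tensor, optimizer):
    with torch.no_grad():
        targets = quantize(model(wav))       # pseudo-labels from the clean signal, (B, T)
    logits = model.classify(transform(wav))  # e.g. pitch shift, time stretch, added noise
    loss = F.cross_entropy(logits.transpose(1, 2), targets)  # (B, C, T) vs (B, T)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Periodically, the pseudo-labels are re-extracted with the improved model and the process repeats, which is the "iterative" part of the pseudo-labeling scheme.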
Simple and Controllable Music Generation
We tackle the task of conditional music generation. We introduce MusicGen, a
single Language Model (LM) that operates over several streams of compressed
discrete music representation, i.e., tokens. Unlike prior work, MusicGen
comprises a single-stage transformer LM together with efficient token
interleaving patterns, which eliminates the need to cascade several models,
e.g., hierarchically or via upsampling. Following this approach, we demonstrate
how MusicGen can generate high-quality samples while being conditioned on
textual descriptions or melodic features, allowing better control over the
generated output. We conduct an extensive empirical evaluation, considering
both automatic and human studies, and show that the proposed approach is
superior to the evaluated baselines on a standard text-to-music benchmark.
Through ablation studies, we shed light on the importance of each of the
components comprising MusicGen.
Music samples, code, and models are available at
https://github.com/facebookresearch/audiocraft
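To illustrate the kind of token interleaving the abstract refers to, the sketch below implements a "delay"-style pattern: the parallel codebook streams produced by the audio tokenizer are offset by one step each, so a single-stage causal LM can model them jointly without cascading several models. The pad id and the exact pattern are illustrative assumptions; the actual patterns used by MusicGen are in the audiocraft repository linked above.

```python
import torch


def delay_interleave(codes: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    """codes: (K, T) token ids, one row per codebook -> (K, T + K - 1) delayed grid."""
    K, T = codes.shape
    out = torch.full((K, T + K - 1), pad_id, dtype=codes.dtype)
    for k in range(K):
        out[k, k:k + T] = codes[k]  # codebook k is shifted right by k steps
    return out


def delay_deinterleave(grid: torch.Tensor, T: int) -> torch.Tensor:
    """Invert the pattern: recover the (K, T) codes from the delayed grid."""
    K = grid.shape[0]
    return torch.stack([grid[k, k:k + T] for k in range(K)])


codes = torch.randint(0, 1024, (4, 10))  # 4 codebooks, 10 time steps
assert torch.equal(delay_deinterleave(delay_interleave(codes), 10), codes)
```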