Aligned Contrastive Predictive Coding
We investigate the possibility of forcing a self-supervised model trained
using a contrastive predictive loss to extract slowly varying latent
representations. Rather than producing individual predictions for each of the
future representations, the model emits a sequence of predictions shorter than
that of the upcoming representations to which they will be aligned. In this
way, the prediction network solves a simpler task of predicting the next
symbols, but not their exact timing, while the encoding network is trained to
produce piece-wise constant latent codes. We evaluate the model on a speech
coding task and demonstrate that the proposed Aligned Contrastive Predictive
Coding (ACPC) leads to higher linear phone prediction accuracy and lower ABX
error rates, while being slightly faster to train due to the reduced number of
prediction heads.
Comment: Published in Interspeech 202
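For readers unfamiliar with the contrastive predictive loss the abstract builds on, the snippet below is a minimal sketch of a standard CPC-style InfoNCE objective with one prediction head per future step. All tensor shapes and names (e.g. cpc_infonce_loss) are my own illustrative assumptions, not the authors' code; per the abstract, ACPC differs by emitting a shorter prediction sequence that is aligned to the upcoming representations instead of keeping a head per step.

```python
# Minimal CPC-style InfoNCE sketch (hypothetical shapes and names); ACPC itself
# replaces the K per-step heads with a shorter, aligned prediction sequence.
import torch
import torch.nn.functional as F

def cpc_infonce_loss(context, future, prediction_heads):
    """context: (B, D) summary vector at time t; future: (B, K, D) encoder
    outputs for the next K steps; prediction_heads: K linear layers D -> D."""
    B, K, D = future.shape
    loss = 0.0
    for k in range(K):
        pred = prediction_heads[k](context)   # (B, D) prediction for step t+k+1
        logits = pred @ future[:, k, :].T     # (B, B): positives on the diagonal,
                                              # other batch items act as negatives
        loss = loss + F.cross_entropy(logits, torch.arange(B))
    return loss / K

# Toy usage with random tensors.
B, K, D = 8, 4, 16
heads = torch.nn.ModuleList([torch.nn.Linear(D, D) for _ in range(K)])
loss = cpc_infonce_loss(torch.randn(B, D), torch.randn(B, K, D), heads)
```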
Contrastive Learning-Based Audio to Lyrics Alignment for Multiple Languages
Lyrics alignment gained considerable attention in recent years.
State-of-the-art systems either re-use established speech recognition toolkits,
or design end-to-end solutions involving a Connectionist Temporal
Classification (CTC) loss. However, both approaches suffer from specific
weaknesses: toolkits are known for their complexity, and CTC systems use a loss
designed for transcription which can limit alignment accuracy. In this paper,
we instead use a contrastive learning procedure that derives cross-modal
embeddings linking the audio and text domains. This way, we obtain a novel
system that is simple to train end-to-end, can make use of weakly annotated
training data, jointly learns a powerful text model, and is tailored to
alignment. The system is not only the first to yield an average absolute error
below 0.2 seconds on the standard Jamendo dataset, but is also robust to
other languages, even when trained only on English data. Finally, we release
word-level alignments for the JamendoLyrics Multi-Lang dataset.
Comment: 5 pages, accepted at the International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 202
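As a rough illustration of the cross-modal contrastive idea described above (learning a shared embedding space that links the audio and text domains), here is a minimal symmetric InfoNCE sketch under my own assumptions about shapes and names; the paper's actual training procedure, weak supervision, and alignment decoding are not reproduced here.

```python
# Sketch of a symmetric cross-modal contrastive (InfoNCE) loss between paired
# audio and text embeddings; hypothetical shapes and names, not the paper's code.
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """audio_emb, text_emb: (N, D) embeddings of N paired audio/text segments."""
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature        # (N, N) cosine-similarity logits
    targets = torch.arange(a.size(0))     # matching pairs sit on the diagonal
    # Symmetric loss: audio-to-text and text-to-audio retrieval directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

# Toy usage with random embeddings.
loss = cross_modal_contrastive_loss(torch.randn(32, 64), torch.randn(32, 64))
```

At alignment time, a similarity matrix between per-frame audio embeddings and per-token text embeddings can then be decoded monotonically (for example with dynamic time warping) to obtain word-level timings.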
Audio self-supervised learning: a survey
Inspired by humans' cognitive ability to generalise knowledge and skills,
Self-Supervised Learning (SSL) aims to discover general representations
from large-scale data without requiring human annotations, which are
expensive and time-consuming to obtain. Its success in the fields of computer
vision and natural language processing has prompted its recent adoption in the
field of audio and speech processing. Comprehensive reviews summarising the
knowledge in audio SSL are currently missing. To fill this gap, in the present
work, we provide an overview of the SSL methods used for audio and speech
processing applications. We also summarise the empirical works that
exploit the audio modality in multi-modal SSL frameworks, as well as the
existing benchmarks suitable for evaluating SSL in the computer audition
domain. Finally, we discuss open problems and point out future
directions for the development of audio SSL.