Aligned Contrastive Predictive Coding
We investigate the possibility of forcing a self-supervised model trained
using a contrastive predictive loss to extract slowly varying latent
representations. Rather than producing individual predictions for each of the
future representations, the model emits a sequence of predictions shorter than
that of the upcoming representations to which they will be aligned. In this
way, the prediction network solves a simpler task of predicting the next
symbols, but not their exact timing, while the encoding network is trained to
produce piece-wise constant latent codes. We evaluate the model on a speech
coding task and demonstrate that the proposed Aligned Contrastive Predictive
Coding (ACPC) leads to higher linear phone prediction accuracy and lower ABX
error rates, while being slightly faster to train due to the reduced number of
prediction heads.
Comment: Published in Interspeech 202
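For readers unfamiliar with the contrastive predictive loss the abstract builds on, the snippet below is a minimal sketch of a standard CPC-style InfoNCE objective with one prediction head per future step. All tensor shapes and names (e.g. cpc_infonce_loss) are my own illustrative assumptions, not the authors' code; per the abstract, ACPC differs by emitting a shorter prediction sequence that is aligned to the upcoming representations instead of keeping a head per step.

```python
# Minimal CPC-style InfoNCE sketch (hypothetical shapes and names); ACPC itself
# replaces the K per-step heads with a shorter, aligned prediction sequence.
import torch
import torch.nn.functional as F

def cpc_infonce_loss(context, future, prediction_heads):
    """context: (B, D) summary vector at time t; future: (B, K, D) encoder
    outputs for the next K steps; prediction_heads: K linear layers D -> D."""
    B, K, D = future.shape
    loss = 0.0
    for k in range(K):
        pred = prediction_heads[k](context)   # (B, D) prediction for step t+k+1
        logits = pred @ future[:, k, :].T     # (B, B): positives on the diagonal,
                                              # other batch items act as negatives
        loss = loss + F.cross_entropy(logits, torch.arange(B))
    return loss / K

# Toy usage with random tensors.
B, K, D = 8, 4, 16
heads = torch.nn.ModuleList([torch.nn.Linear(D, D) for _ in range(K)])
loss = cpc_infonce_loss(torch.randn(B, D), torch.randn(B, K, D), heads)
```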
Contrastive Learning-Based Audio to Lyrics Alignment for Multiple Languages
Lyrics alignment gained considerable attention in recent years.
State-of-the-art systems either re-use established speech recognition toolkits,
or design end-to-end solutions involving a Connectionist Temporal
Classification (CTC) loss. However, both approaches suffer from specific
weaknesses: toolkits are known for their complexity, and CTC systems use a loss
designed for transcription which can limit alignment accuracy. In this paper,
we instead use a contrastive learning procedure that derives cross-modal
embeddings linking the audio and text domains. This way, we obtain a novel
system that is simple to train end-to-end, can make use of weakly annotated
training data, jointly learns a powerful text model, and is tailored to
alignment. The system is not only the first to yield an average absolute error
below 0.2 seconds on the standard Jamendo dataset, but is also robust to
other languages, even when trained only on English data. Finally, we release
word-level alignments for the JamendoLyrics Multi-Lang dataset.
Comment: 5 pages, accepted at the International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 202
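As a rough illustration of the cross-modal contrastive idea described above (learning a shared embedding space that links the audio and text domains), here is a minimal symmetric InfoNCE sketch under my own assumptions about shapes and names; the paper's actual training procedure, weak supervision, and alignment decoding are not reproduced here.

```python
# Sketch of a symmetric cross-modal contrastive (InfoNCE) loss between paired
# audio and text embeddings; hypothetical shapes and names, not the paper's code.
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """audio_emb, text_emb: (N, D) embeddings of N paired audio/text segments."""
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature        # (N, N) cosine-similarity logits
    targets = torch.arange(a.size(0))     # matching pairs sit on the diagonal
    # Symmetric loss: audio-to-text and text-to-audio retrieval directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

# Toy usage with random embeddings.
loss = cross_modal_contrastive_loss(torch.randn(32, 64), torch.randn(32, 64))
```

At alignment time, a similarity matrix between per-frame audio embeddings and per-token text embeddings can then be decoded monotonically (for example with dynamic time warping) to obtain word-level timings.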
Audio self-supervised learning: a survey
Inspired by humans' cognitive ability to generalise knowledge and skills,
Self-Supervised Learning (SSL) aims to discover general representations
from large-scale data without requiring human annotations, which are
expensive and time-consuming to obtain. Its success in the fields of computer
vision and natural language processing has prompted its recent adoption in the
field of audio and speech processing. Comprehensive reviews summarising the
knowledge in audio SSL are currently missing. To fill this gap, in the present
work, we provide an overview of the SSL methods used for audio and speech
processing applications. We also summarise the empirical works that
exploit the audio modality in multi-modal SSL frameworks, as well as the
existing benchmarks suitable for evaluating SSL in the computer audition
domain. Finally, we discuss open problems and point out future
directions for the development of audio SSL.