Contrastive Learning-Based Audio to Lyrics Alignment for Multiple Languages
Lyrics alignment has gained considerable attention in recent years.
State-of-the-art systems either re-use established speech recognition toolkits,
or design end-to-end solutions involving a Connectionist Temporal
Classification (CTC) loss. However, both approaches suffer from specific
weaknesses: toolkits are known for their complexity, and CTC systems use a loss
designed for transcription, which can limit alignment accuracy. In this paper,
we instead use a contrastive learning procedure that derives cross-modal
embeddings linking the audio and text domains. This way, we obtain a novel
system that is simple to train end-to-end, can make use of weakly annotated
training data, jointly learns a powerful text model, and is tailored to
alignment. The system is not only the first to yield an average absolute error
below 0.2 seconds on the standard Jamendo dataset but it is also robust to
other languages, even when trained on English data only. Finally, we release
word-level alignments for the JamendoLyrics Multi-Lang dataset.
Comment: 5 pages, accepted at the International Conference on Acoustics,
Speech, and Signal Processing (ICASSP) 202
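The abstract above says only that contrastive learning yields cross-modal embeddings linking audio and text. As a rough illustration of how such embeddings could be turned into an alignment, the sketch below compares every audio frame with every lyrics token and then picks the best monotonic path by dynamic programming. The decoding step, the function names, and the toy embeddings are assumptions made for illustration, not the authors' method.

```python
import numpy as np

def cosine_similarity_matrix(audio_emb, text_emb):
    """Cosine similarity between every audio frame and every text token."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return a @ t.T  # shape: (n_frames, n_tokens)

def monotonic_alignment(sim):
    """Assign one token to each frame, in order, maximizing total similarity.

    Assumes n_frames >= n_tokens, so that every token gets at least one frame.
    """
    n_frames, n_tokens = sim.shape
    score = np.full((n_frames, n_tokens), -np.inf)
    advanced = np.zeros((n_frames, n_tokens), dtype=bool)
    score[0, 0] = sim[0, 0]
    for f in range(1, n_frames):
        for k in range(min(f + 1, n_tokens)):
            stay = score[f - 1, k]                            # same token as previous frame
            move = score[f - 1, k - 1] if k > 0 else -np.inf  # step to the next token
            advanced[f, k] = move > stay
            score[f, k] = sim[f, k] + max(stay, move)
    # Backtrack from the last frame and last token to the first frame.
    token_of_frame = np.zeros(n_frames, dtype=int)
    k = n_tokens - 1
    for f in range(n_frames - 1, -1, -1):
        token_of_frame[f] = k
        if advanced[f, k]:
            k -= 1
    return token_of_frame

def token_start_times(token_of_frame, hop_seconds):
    """Start time of each token: its first assigned frame times the frame hop."""
    n_tokens = int(token_of_frame[-1]) + 1
    return [float(np.argmax(token_of_frame == k)) * hop_seconds
            for k in range(n_tokens)]

# Toy data (hypothetical): 3 tokens with one-hot embeddings, 6 matching frames.
text = np.eye(3)
audio = np.eye(3)[[0, 0, 1, 1, 1, 2]]
path = monotonic_alignment(cosine_similarity_matrix(audio, text))
print(path.tolist())  # [0, 0, 1, 1, 1, 2]
```

In a real system the embeddings would come from trained audio and text encoders, and the frame hop passed to `token_start_times` would depend on the audio front end.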
PoLyScriber: Integrated Training of Extractor and Lyrics Transcriber for Polyphonic Music
Lyrics transcription of polyphonic music is challenging as the background
music affects lyrics intelligibility. Typically, lyrics transcription can be
performed by a two-step pipeline, i.e., a singing vocal extraction frontend
followed by a lyrics transcriber backend, where the frontend and backend are
trained separately. Such a two-step pipeline suffers from both imperfect vocal
extraction and mismatch between frontend and backend. In this work, we propose
a novel end-to-end integrated training framework, which we call PoLyScriber, to
globally optimize the vocal extractor frontend and lyrics transcriber backend
for lyrics transcription in polyphonic music. The experimental results show
that our proposed integrated training model achieves substantial improvements
over the existing approaches on publicly available test datasets.
Comment: 13 page
HCLAS-X: Hierarchical and Cascaded Lyrics Alignment System Using Multimodal Cross-Correlation
In this work, we address the challenge of lyrics alignment, which involves
aligning the lyrics and vocal components of songs. This problem requires the
alignment of two distinct modalities, namely text and audio. To overcome this
challenge, we propose a model that is trained in a supervised manner, utilizing
the cross-correlation matrix of latent representations between vocals and
lyrics. Our system is designed in a hierarchical and cascaded manner. It
first predicts synced times at the sentence level and subsequently at the
word level. This design enables the system to process long sequences, as the
cross-correlation uses quadratic memory with respect to sequence length. In our
experiments, we demonstrate that our proposed system achieves a significant
improvement in mean average error, showcasing its robustness in comparison to
the previous state-of-the-art model. Additionally, we conduct a qualitative
analysis of the system after successfully deploying it in several music
streaming services.
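The remark about quadratic memory can be made concrete with a little arithmetic. The sketch below is a simplification under stated assumptions: it counts only the bytes of one dense audio-by-text correlation matrix, assumes 4-byte values, and assumes a coarse pass has already split the song into equal segments. The function names and the numbers are hypothetical and do not come from the paper.

```python
def correlation_matrix_bytes(n_audio_frames, n_text_tokens, bytes_per_value=4):
    """Memory for one dense (audio frames x text tokens) correlation matrix."""
    return n_audio_frames * n_text_tokens * bytes_per_value

def peak_bytes_when_segmented(n_audio_frames, n_text_tokens, n_segments,
                              bytes_per_value=4):
    """Peak memory if each segment is correlated on its own (assumed equal split)."""
    frames_per_segment = -(-n_audio_frames // n_segments)  # ceiling division
    tokens_per_segment = -(-n_text_tokens // n_segments)
    return correlation_matrix_bytes(frames_per_segment, tokens_per_segment,
                                    bytes_per_value)

# Hypothetical sizes: 1000 frames, 1000 tokens, split into 10 segments.
whole = correlation_matrix_bytes(1000, 1000)
segmented = peak_bytes_when_segmented(1000, 1000, 10)
print(whole, segmented)  # 4000000 40000
```

Under these assumptions, splitting both sequences into 10 segments cuts the peak matrix size by a factor of 100, which is the kind of saving a sentence-then-word cascade would aim for.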
Proceedings of the 6th International Workshop on Folk Music Analysis, 15-17 June, 2016
The Folk Music Analysis Workshop brings together computational music analysis and ethnomusicology. Both symbolic and audio representations of music are considered, with a broad range of scientific approaches being applied (signal processing, graph theory, deep learning). The workshop features a range of interesting talks from international researchers in areas such as Indian classical music, Iranian singing, Ottoman-Turkish Makam music scores, Flamenco singing, Irish traditional music, Georgian traditional music and Dutch folk songs. Invited guest speakers were Anja Volk (Utrecht University) and Peter Browne (Technological University Dublin).
And what if two musical versions don't share melody, harmony, rhythm, or lyrics?
Version identification (VI) has seen substantial progress over the past few
years. On the one hand, the introduction of the metric learning paradigm has
favored the emergence of scalable yet accurate VI systems. On the other hand,
using features focusing on specific aspects of musical pieces, such as melody,
harmony, or lyrics, yielded interpretable and promising performances. In this
work, we build upon these recent advances and propose a metric learning-based
system systematically leveraging four dimensions commonly admitted to convey
musical similarity between versions: melodic line, harmonic structure, rhythmic
patterns, and lyrics. We describe our deliberately simple model architecture,
and we show in particular that an approximated representation of the lyrics is
an efficient proxy to discriminate between versions and non-versions. We then
describe how these features complement each other and yield new
state-of-the-art performances on two publicly available datasets. We finally
suggest that a VI system using a combination of melodic, harmonic, rhythmic and
lyrics features could theoretically reach the optimal performances obtainable
on these datasets.
Linking Sheet Music and Audio - Challenges and New Approaches
Score and audio files are the two most important ways to represent, convey, record, store, and experience music. While a score describes a piece of music on an abstract level using symbols such as notes, keys, and measures, audio files allow for reproducing a specific acoustic realization of the piece. Each of these representations reflects different facets of music, yielding insights into aspects ranging from structural elements (e.g., motives, themes, musical form) to specific performance aspects (e.g., artistic shaping, sound). Therefore, simultaneous access to score and audio representations is of great importance.
In this paper, we address the problem of automatically generating musically relevant linking structures between the various data sources that are available for a given piece of music. In particular, we discuss the task of sheet music-audio synchronization with the aim of linking regions in images of scanned scores to musically corresponding sections in an audio recording of the same piece. Such linking structures form the basis for novel interfaces that allow users to access and explore multimodal sources of music within a single framework.
As our main contributions, we give an overview of the state of the art for this kind of synchronization task, present some novel approaches, and indicate future research directions. In particular, we address problems that arise in the presence of structural differences and discuss challenges when applying optical music recognition to complex orchestral scores. Finally, potential applications of the synchronization results are presented.
LyricWhiz: Robust Multilingual Zero-shot Lyrics Transcription by Whispering to ChatGPT
We introduce LyricWhiz, a robust, multilingual, and zero-shot automatic
lyrics transcription method achieving state-of-the-art performance on various
lyrics transcription datasets, even in challenging genres such as rock and
metal. Our novel, training-free approach utilizes Whisper, a weakly supervised
robust speech recognition model, and GPT-4, today's most performant chat-based
large language model. In the proposed method, Whisper functions as the "ear" by
transcribing the audio, while GPT-4 serves as the "brain," acting as an
annotator with a strong performance for contextualized output selection and
correction. Our experiments show that LyricWhiz significantly reduces Word
Error Rate compared to existing methods in English and can effectively
transcribe lyrics across multiple languages. Furthermore, we use LyricWhiz to
create the first publicly available, large-scale, multilingual lyrics
transcription dataset with a CC-BY-NC-SA copyright license, based on
MTG-Jamendo, and offer a human-annotated subset for noise level estimation and
evaluation. We anticipate that our proposed method and dataset will advance the
development of multilingual lyrics transcription, a challenging and emerging
task.
Comment: 9 pages, 2 figures, 5 tables, accepted by ISMIR 202