Score-Informed Source Separation for Musical Audio Recordings [An overview]
(c) 2014 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Towards Bridging the Gap between Sheet Music and Audio
Sheet music and audio recordings represent and describe music on different semantic levels. Sheet music describes abstract high-level parameters such as notes, keys, measures, or repeats in a visual form. Because of its explicitness and compactness, most musicologists discuss and analyze the meaning of music on the basis of sheet music. By contrast, most people enjoy music by listening to audio recordings, which represent music in an acoustic form. In particular, the nuances and subtleties of musical performances, which are generally not written down in the score, make the music come alive. In this paper, we address the problem of bridging the gap between the sheet music domain and the audio domain. In particular, we discuss aspects of music representations, music synchronization, and optical music recognition, while indicating various strategies and open research problems.
Linking Sheet Music and Audio - Challenges and New Approaches
Score and audio files are the two most important ways to represent, convey, record, store, and experience music. While a score describes a piece of music on an abstract level using symbols such as notes, keys, and measures, audio files allow for reproducing a specific acoustic realization of the piece. Each of these representations reflects different facets of music, yielding insights into aspects ranging from structural elements (e.g., motives, themes, musical form) to specific performance aspects (e.g., artistic shaping, sound). Therefore, simultaneous access to score and audio representations is of great importance.
In this paper, we address the problem of automatically generating
musically relevant linking structures between the various data sources
that are available for a given piece of music. In particular, we discuss the task of sheet music-audio synchronization, with the aim of linking regions in images of scanned scores to musically corresponding sections in an audio recording of the same piece. Such linking structures form the basis for novel interfaces that allow users to access and explore multimodal sources of music within a single framework.
As our main contributions, we give an overview of the state of the art for this kind of synchronization task, present some novel approaches, and indicate future research directions. In particular, we address problems that arise in the presence of structural differences and discuss challenges in applying optical music recognition to complex orchestral scores. Finally, potential applications of the synchronization results are presented.
Music Information Retrieval: An Inspirational Guide to Transfer from Related Disciplines
The emerging field of Music Information Retrieval (MIR) has been influenced by neighboring domains in signal processing and machine learning, including automatic speech recognition, image processing, and text information retrieval. In this contribution, we start with concrete examples of methodology transfer between speech and music processing, oriented on the building blocks of pattern recognition: preprocessing, feature extraction, and classification/decoding. We then assume a higher-level viewpoint when describing sources of mutual inspiration derived from text and image information retrieval. We conclude that dealing with the peculiarities of music in MIR research has contributed to advancing the state of the art in other fields, and that many future challenges in MIR are strikingly similar to those that other research areas have been facing.
Case Study "Beatles Songs" — What can be Learned from Unreliable Music Alignments?
As a result of massive digitization efforts and the World Wide Web, there is an exploding amount of available digital data describing and representing music at various semantic levels and in diverse formats. For example, in the case of the Beatles songs, there are numerous recordings, including an increasing number of cover songs and arrangements, as well as MIDI data and other symbolic music representations. The general goal of music synchronization is to align the multiple information sources related to a given piece of music. This becomes a difficult problem when the various representations reveal significant differences in structure and polyphony, while exhibiting various types of artifacts. In this paper, we address the issue of how music synchronization techniques can be used to automatically reveal critical passages with significant differences between the two versions to be aligned. Using the corpus of the Beatles songs as a test bed, we analyze the kinds of differences occurring in the audio and MIDI versions available for the songs.
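The alignment at the core of such synchronization work is typically computed with dynamic time warping (DTW). As an illustrative sketch (not the authors' implementation), assuming two chroma-like feature matrices with one row per frame:

```python
import numpy as np

def dtw_path(X, Y):
    """Align two feature sequences (rows = frames) with classic DTW.

    Returns the accumulated cost matrix and the optimal warping path
    as a list of (i, j) index pairs.
    """
    n, m = len(X), len(Y)
    # Pairwise cosine distances as the local cost measure.
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-9)
    Yn = Y / (np.linalg.norm(Y, axis=1, keepdims=True) + 1e-9)
    C = 1.0 - Xn @ Yn.T

    # Accumulated cost with an infinite border to enforce the start point.
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = C[i - 1, j - 1] + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])

    # Backtrack from the end to recover the optimal path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return D[1:, 1:], path[::-1]
```

The quadratic time and memory cost of this textbook version is why practical systems, including those discussed here, resort to multiscale or windowed variants for long recordings.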
Music Synchronization, Audio Matching, Pattern Detection, and User Interfaces for a Digital Music Library System
Over the last two decades, growing efforts to digitize our cultural heritage have been observed. Most of these digitization initiatives pursue one or both of the following goals: to conserve the documents, especially those threatened by decay, and to provide remote access on a grand scale. For music documents these trends are observable as well, and by now several digital music libraries are in existence. An important characteristic of these music libraries is an inherent multimodality resulting from the large variety of available digital music representations, such as scanned scores, symbolic scores, audio recordings, and videos. In addition, for each piece of music there exists not only one document of each type, but many. Considering and exploiting this multimodality and multiplicity, the DFG-funded digital library initiative PROBADO MUSIC aimed at developing a novel user-friendly interface for content-based retrieval, document access, navigation, and browsing in large music collections.

The implementation of such a front end requires the multimodal linking and indexing of the music documents during preprocessing. As the considered music collections can be very large, the automated or at least semi-automated calculation of these structures is desirable. The field of music information retrieval (MIR) is particularly concerned with the development of suitable procedures, and it was the goal of PROBADO MUSIC to include existing and newly developed MIR techniques to realize the envisioned digital music library system. In this context, the present thesis discusses the following three MIR tasks: music synchronization, audio matching, and pattern detection. We identify particular issues in these fields and provide algorithmic solutions as well as prototypical implementations. In music synchronization, for each position in one representation of a piece of music the corresponding position in another representation is calculated.
This thesis focuses on the task of aligning scanned score pages of orchestral music with audio recordings. Here, a previously unconsidered piece of information is the textual specification of transposing instruments provided in the score. Our evaluations show that neglecting such information can result in a measurable loss of synchronization accuracy. Therefore, we propose an OCR-based approach for detecting and interpreting the transposition information in orchestral scores.

For a given audio snippet, audio matching methods automatically calculate all musically similar excerpts within a collection of audio recordings. In this context, subsequence dynamic time warping (SSDTW) is a well-established approach, as it allows for local and global tempo variations between the query and the retrieved matches. Moving to real-life digital music libraries with larger audio collections, however, the quadratic runtime of SSDTW results in untenable response times. To improve the response time, this thesis introduces a novel index-based approach to SSDTW-based audio matching. We combine the idea of inverted file lists introduced by Kurth and Müller (Efficient index-based audio matching, 2008) with the shingling techniques often used in the audio identification scenario.

In pattern detection, all repeating patterns within one piece of music are determined. Usually, pattern detection operates on symbolic score documents and is often used in the context of computer-aided motivic analysis. Envisioned as a new feature of the PROBADO MUSIC system, this thesis proposes a string-based approach to pattern detection and a novel interactive front end for result visualization and analysis.
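Subsequence DTW differs from standard DTW only in its boundary conditions: the query may start and end anywhere in the database sequence. A minimal sketch of this idea, given a precomputed cost matrix (the index-based acceleration with inverted lists and shingles is not shown):

```python
import numpy as np

def subsequence_dtw(C):
    """Subsequence DTW on a cost matrix C of shape (query_len, database_len).

    The query may start and end anywhere in the database: the first
    accumulated row is copied from C (free start), and the match cost is
    read from the minimum of the last row (free end).
    Returns (best_cost, end_index) of the best-matching subsequence.
    """
    n, m = C.shape
    D = np.zeros_like(C, dtype=float)
    D[0] = C[0]                      # free start along the database axis
    for i in range(1, n):
        D[i, 0] = D[i - 1, 0] + C[i, 0]
        for j in range(1, m):
            D[i, j] = C[i, j] + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    end = int(np.argmin(D[-1]))      # free end: cheapest column in last row
    return float(D[-1, end]), end
```

Backtracking from the returned end column yields the matched region; repeating the search after masking a found match retrieves further occurrences.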
Signal Processing Methods for Music Synchronization, Audio Matching, and Source Separation
The field of music information retrieval (MIR) aims at developing techniques and tools for organizing, understanding, and searching multimodal information in large music collections in a robust, efficient, and intelligent manner. In this context, this thesis presents novel, content-based methods for music synchronization, audio matching, and source separation.

In general, music synchronization denotes a procedure which, for a given position in one representation of a piece of music, determines the corresponding position within another representation. Here, the thesis presents three complementary synchronization approaches, which improve upon previous methods in terms of robustness, reliability, and accuracy. The first approach employs a late-fusion strategy based on multiple, conceptually different alignment techniques to identify those music passages that allow for reliable alignment results. The second approach is based on the idea of employing musical structure analysis methods in the context of synchronization to derive reliable synchronization results even in the presence of structural differences between the versions to be aligned. Finally, the third approach employs several complementary strategies for increasing the accuracy and time resolution of synchronization results.

Given a short query audio clip, the goal of audio matching is to automatically retrieve all musically similar excerpts in different versions and arrangements of the same underlying piece of music. In this context, chroma-based audio features are a well-established tool, as they possess a high degree of invariance to variations in timbre. This thesis describes a novel procedure for making chroma features even more robust to changes in timbre while keeping their discriminative power. Here, the idea is to identify and discard timbre-related information using techniques inspired by the well-known MFCC features, which are usually employed in speech processing.
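The MFCC-inspired idea of discarding timbre-related information can be illustrated roughly as follows: log-compress pitch-band features, move them to a cepstral domain with a DCT, zero out the lower (timbre-related) coefficients, transform back, and fold the result into twelve chroma bins. This is a simplified sketch of that processing chain with illustrative parameter values, not the thesis's exact procedure:

```python
import numpy as np

def timbre_robust_chroma(pitch_feats, n_discard=55):
    """Sketch of CRP-style chroma: pitch_feats is a (120, frames) array
    of non-negative pitch-band energies; n_discard lower cepstral
    coefficients are dropped as timbre-related (value is illustrative).
    """
    P, T = pitch_feats.shape
    V = np.log(100.0 * pitch_feats + 1.0)          # logarithmic compression
    # Orthonormal DCT-II matrix applied along the pitch axis.
    k = np.arange(P)[:, None]
    n = np.arange(P)[None, :]
    Dct = np.sqrt(2.0 / P) * np.cos(np.pi * k * (2 * n + 1) / (2 * P))
    Dct[0] *= 1.0 / np.sqrt(2.0)
    cep = Dct @ V
    cep[:n_discard] = 0.0                          # discard timbre-related part
    Vr = Dct.T @ cep                               # inverse DCT (orthonormal)
    chroma = np.zeros((12, T))
    for p in range(P):
        chroma[p % 12] += Vr[p]                    # fold pitches onto 12 bins
    return chroma / (np.linalg.norm(chroma, axis=0) + 1e-9)
```

The resulting features keep pitch-class information while being less sensitive to the spectral envelope, at the cost of no longer being non-negative.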
Given a monaural music recording, the goal of source separation is to extract musically meaningful sound sources corresponding, for example, to a melody, an instrument, or a drum track from the recording. To facilitate this complex task, one can exploit additional information provided by a musical score. Based on this idea, this thesis presents two novel, conceptually different approaches to source separation. Using score information provided by a given MIDI file, the first approach employs a parametric model to describe a given audio recording of a piece of music. The resulting model is then used to extract sound sources as specified by the score. As a computationally less demanding and easier-to-implement alternative, the second approach employs the additional score information to guide a decomposition based on non-negative matrix factorization (NMF).
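A score-guided NMF decomposition of the kind referred to here can be sketched as follows: binary masks derived from an aligned score constrain which frequency bins each note template may use and when each note may be active, and multiplicative updates preserve those zero constraints. A toy sketch under these assumptions (Euclidean cost; the thesis's actual constraints and cost function may differ):

```python
import numpy as np

def score_informed_nmf(V, W_mask, H_mask, n_iter=500, eps=1e-9, seed=0):
    """Factorize a magnitude spectrogram V (freq x frames) as V ~ W @ H.

    W_mask (freq x K) marks which frequency bins each note template may
    use; H_mask (K x frames) marks when each note may be active (both
    would come from a score-to-audio alignment).  Multiplicative updates
    never turn a zero entry nonzero, so the score constraints survive.
    """
    rng = np.random.default_rng(seed)
    W = W_mask * rng.random(W_mask.shape)
    H = H_mask * rng.random(H_mask.shape)
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update activations
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update templates
    return W, H
```

After convergence, the spectrogram contribution of one source is reconstructed from its subset of templates and activations and inverted with the mixture phase.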
Real-Time Alignment of Singing to an Audio Reference
With the goal of building a karaoke system that modifies an a cappella sung performance in real time, the performer must be located relative to a reference recording in order to determine the target of a voice-modification algorithm. For such a system to work well, the alignment algorithm must exploit the specific characteristics of the voice as much as possible, rely on the information carried by the pronounced text rather than on the artistic aspects of the singing, run in real time, and offer the lowest possible latency. To meet these goals, an alignment system based on Dynamic Time Warping (DTW) was developed. A simple real-time adaptation of the standard DTW algorithm that meets the stated objectives is proposed and compared with other approaches reported in the literature; this adaptation achieved better results than the other techniques tested. A comparative study of three spectral analyses commonly used in automatic speech recognition systems was carried out in the specific setting of a singing-voice alignment algorithm. The coefficients evaluated are the Mel-Frequency Cepstrum Coefficients (MFCC), the Warped Discrete Cosine Transform Coefficients (WDCTC), and the coefficients of Perceptual Linear Prediction (PLP) analysis; the results indicate that the PLP analysis performs best. Applying a piecewise-linear transformation to the instantaneous cost matrices makes the alignment stand out most clearly in the computed accumulated cost matrices. The parameters of the transformation function can be obtained by closed-loop optimization using direct pattern search, and an objective function that avoids discontinuities of the mean squared alignment error is developed.

Several cost matrices can be combined by taking a weighted sum of the transformed instantaneous cost matrices of the features considered, with the weights likewise obtained by optimization. Several combinations are compared: the best results are obtained with a combination of the PLP analysis and the energy level, together with their derivatives. The mean deviation from the reference alignment is on the order of 50 ms, with a standard deviation of about 75 ms, for the sequences tested. Finally, perspectives are given for improving the convergence of the algorithm on pairs of audio sequences that are difficult to align, and for obtaining better cost matrices by using other local constraints, by integrating new features such as pitch, or by using a segmented singing-voice database to optimize a distance measure.
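The cost-matrix combination described here can be sketched as follows: each feature's instantaneous cost matrix is passed through a piecewise-linear transform, and the results are summed with optimised weights. Parameter names and values below are illustrative, not those of the thesis:

```python
import numpy as np

def piecewise_linear(C, knee, low_slope, high_slope):
    """Piecewise-linear transform that spreads costs around a knee value,
    making the true alignment path stand out in the accumulated matrix."""
    return np.where(C < knee,
                    low_slope * C,
                    low_slope * knee + high_slope * (C - knee))

def combine_costs(cost_mats, weights, transforms):
    """Weighted sum of individually transformed instantaneous cost
    matrices (e.g. for PLP, energy, and their derivatives)."""
    total = np.zeros_like(cost_mats[0], dtype=float)
    for C, w, params in zip(cost_mats, weights, transforms):
        total += w * piecewise_linear(C, *params)
    return total
```

In the thesis's setup, both the transform parameters and the weights would be tuned by direct pattern search against a reference alignment; here they are simply passed in.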
Linking Music Metadata
The internet has facilitated music metadata production and distribution on an unprecedented scale. A contributing factor to this data deluge is a change in the authorship of this data from the expert few to the untrained crowd. The resulting unordered flood of imperfect annotations provides challenges and opportunities in identifying accurate metadata and linking it to the music audio in order to provide a richer listening experience. We advocate novel adaptations of Dynamic Programming for music metadata synchronisation, ranking, and comparison. This thesis introduces Windowed Time Warping, Greedy, and Constrained On-Line Time Warping for synchronisation, and the Concurrence Factor for automatically ranking metadata.
We begin by examining the availability of various music metadata on the web. We then review Dynamic Programming methods for aligning and comparing two source sequences, whilst presenting novel, specialised adaptations for efficient, real-time synchronisation of music and metadata that improve in speed and accuracy over existing algorithms. The Concurrence Factor, which measures the degree to which an annotation of a song agrees with its peers, is proposed in order to utilise the wisdom of the crowds to establish a ranking system. This attribute uses a combination of the standard Dynamic Programming methods Levenshtein Edit Distance, Dynamic Time Warping, and Longest Common Subsequence to compare annotations.
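A simplified stand-in for such a concurrence score can be built from two of the named methods, averaging a Levenshtein-based and an LCS-based similarity over all peers (the thesis additionally uses Dynamic Time Warping and its own weighting):

```python
def levenshtein(a, b):
    """Edit distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def lcs_len(a, b):
    """Longest-common-subsequence length via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def concurrence(annotations):
    """Score each annotation by its mean pairwise similarity to its peers
    (a simplified stand-in for the thesis's Concurrence Factor)."""
    scores = []
    for i, a in enumerate(annotations):
        sims = []
        for j, b in enumerate(annotations):
            if j == i:
                continue
            n = max(len(a), len(b)) or 1
            sims.append(0.5 * (1 - levenshtein(a, b) / n) + 0.5 * lcs_len(a, b) / n)
        scores.append(sum(sims) / len(sims))
    return scores
```

Annotations that agree with the crowd score close to 1; outliers score lower, giving a popularity-free ranking signal.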
We present a synchronisation application for applying the aforementioned methods, as well as a tablature-parsing application for mining and analysing guitar tablatures from the web. We evaluate the Concurrence Factor as a ranking system on a large-scale collection of guitar tablatures and lyrics, showing a correlation with accuracy that is superior to existing methods currently used in internet search engines, which are based on popularity and human ratings.
Engineering and Physical Sciences Research Council; travel grant from the Royal Engineering Society.