Language of music: a computational model of music interpretation
Automatic music transcription (AMT) is commonly defined as the process of converting
an acoustic musical signal into some form of musical notation, and can be split
into two separate phases: (1) multi-pitch detection, the conversion of an audio signal
into a time-frequency representation similar to a MIDI file; and (2) the conversion of
this time-frequency representation into a musical score. A substantial amount of AMT
research in recent years has concentrated on multi-pitch detection, and yet, in the case
of the transcription of polyphonic music, there has been little progress.
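As a deliberately naive illustration of phase (1), the sketch below thresholds a constant-Q spectrogram to obtain a binary piano-roll-like representation. It assumes librosa; the file path and threshold are placeholders, and this is a toy stand-in, not any published system.

```python
# A naive sketch of phase (1): thresholded CQT magnitudes as a piano-roll-like
# (MIDI-style) representation. Real multi-pitch detectors are far more
# sophisticated; the threshold here is illustrative only.
import numpy as np
import librosa

def naive_multipitch(path, threshold_db=-30.0):
    y, sr = librosa.load(path)
    # One bin per semitone across the 88 piano keys, starting at A0.
    C = np.abs(librosa.cqt(y, sr=sr, fmin=librosa.note_to_hz("A0"),
                           n_bins=88, bins_per_octave=12))
    C_db = librosa.amplitude_to_db(C, ref=np.max)
    # Binary piano roll: a pitch is "on" wherever its bin exceeds the threshold.
    return C_db > threshold_db  # shape: (88 pitches, n_frames)

# Phase (2), the conversion to an engraved score (voices, metre, note values),
# is the part this thesis argues needs a music language model.
```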
There are many potential reasons for this slow progress, but this thesis concentrates
on the (lack of) use of music language models during the transcription process. In particular,
a music language model would impart to a transcription system the background
knowledge of music theory upon which a human transcriber relies. In the related field
of automatic speech recognition, it has been shown that the use of a language model
drawn from the field of natural language processing (NLP) is an essential component
of a system for transcribing the spoken word into text, and there is no reason to believe
that music should be any different.
This thesis will show that a music language model inspired by NLP techniques can
be used successfully for transcription. In fact, this thesis will create the blueprint for
such a music language model. We begin with a brief overview of existing multi-pitch
detection systems, in particular noting four key properties which any music language
model should have to be useful for integration into a joint system for AMT: it should
(1) be probabilistic, (2) not use any data a priori, (3) be able to run on live performance
data, and (4) be incremental.
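A hypothetical interface makes these four properties concrete; the class and method names below are invented for illustration, not an API from the thesis.

```python
# A hypothetical interface capturing the four properties named above: the
# model is probabilistic (returns a probability), needs no piece-specific data
# up front, and consumes a live stream one frame at a time (incremental).
from abc import ABC, abstractmethod
from typing import Tuple

class MusicLanguageModel(ABC):
    @abstractmethod
    def update(self, frame_pitches: Tuple[int, ...]) -> None:
        """Consume one new frame of detected pitches (live, incremental)."""

    @abstractmethod
    def prior(self, pitch: int) -> float:
        """Probability of `pitch` sounding in the next frame, given only
        the frames seen so far (probabilistic, no a priori piece data)."""
```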
We then investigate voice separation, creating a model which achieves state-of-the-art
performance on the task, and show that, used as a simple music language model, it
improves multi-pitch detection performance significantly. This is followed by an investigation
of metrical detection and alignment, where we introduce a grammar crafted for
the task which, combined with a beat-tracking model, achieves state-of-the-art results
on metrical alignment. This system’s success lends further evidence to the long-standing
hypothesis that music and language are built from strikingly similar structures.
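Returning to the voice-separation step above: a minimal pitch-proximity baseline (an assumption for illustration, not the thesis's model) assigns each incoming note to the voice whose most recent pitch is closest, opening a new voice when none is close enough.

```python
# Not the thesis's model: a minimal pitch-proximity baseline for voice
# separation. Notes arrive in onset order; simultaneous notes and voice
# crossings are handled only crudely, which is why real models do better.
def separate_voices(notes, max_leap=7):
    """notes: list of (onset, midi_pitch) pairs, sorted by onset."""
    voices = []  # each voice is a list of (onset, pitch)
    for onset, pitch in notes:
        best, best_dist = None, max_leap + 1
        for v in voices:
            dist = abs(v[-1][1] - pitch)
            if dist < best_dist:
                best, best_dist = v, dist
        if best is None:
            voices.append([(onset, pitch)])   # no voice close enough: new voice
        else:
            best.append((onset, pitch))
    return voices
```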
We end by investigating the joint analysis of music, in particular showing that a
combination of our two models running jointly outperforms each running independently.
We also introduce a new joint, automatic, quantitative metric for the complete
transcription of an audio recording into an annotated musical score, something which
the field currently lacks.
Anchoring Knowledge in Interaction: Towards a Harmonic Subsymbolic/Symbolic Framework and Architecture of Computational Cognition
We outline a proposal for a research program leading to a new paradigm, architectural framework, and prototypical implementation, for the cognitively inspired anchoring of an agent’s learning, knowledge formation, and higher reasoning abilities in real-world interactions: Learning through interaction in real-time in a real environment triggers the incremental accumulation and repair of knowledge that leads to the formation of theories at a higher level of abstraction. The transformations at this higher level filter down and inform the learning process as part of a permanent cycle of learning through experience, higher-order deliberation, theory formation and revision.
The envisioned framework will provide a precise computational theory, algorithmic descriptions, and an implementation in cyber-physical systems, addressing the lifting of action patterns from the subsymbolic to the symbolic knowledge level, effective methods for theory formation, adaptation, and evolution, the anchoring of knowledge-level objects, real-world interactions and manipulations, and the realization and evaluation of such a system in different scenarios. The expected results can provide new foundations for future agent architectures, multi-agent systems, robotics, and cognitive systems, and can facilitate a deeper understanding of the development and interaction in human-technological settings.
Towards the automated analysis of simple polyphonic music : a knowledge-based approach
Music understanding is a process closely related to the knowledge and experience
of the listener. The amount of knowledge required is relative to the
complexity of the task at hand.
This dissertation is concerned with the problem of automatically decomposing
musical signals into a score-like representation. It proposes that, as
with humans, an automatic system requires knowledge about the signal and
its expected behaviour to correctly analyse music.
The proposed system uses the blackboard architecture to combine knowledge
with data provided by bottom-up processing of the
signal. Methods are proposed for the estimation of pitches,
onset times and durations of notes in simple polyphonic music.
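A skeletal version of the blackboard pattern named above might look as follows; the hypothesis categories and control loop are illustrative only, not the thesis's implementation.

```python
# A skeletal blackboard: knowledge sources inspect a shared workspace and
# post new hypotheses (e.g. onsets -> pitches -> notes) until quiescence.
from abc import ABC, abstractmethod

class Blackboard:
    """Shared workspace holding partial hypotheses."""
    def __init__(self):
        self.hypotheses = {"onsets": [], "pitches": [], "notes": []}

class KnowledgeSource(ABC):
    @abstractmethod
    def applies(self, bb: Blackboard) -> bool: ...
    @abstractmethod
    def act(self, bb: Blackboard) -> None: ...

def run_blackboard(bb: Blackboard, sources: list) -> Blackboard:
    # Keep firing applicable sources until none applies; each source must
    # eventually stop applying, or this control loop never terminates.
    while True:
        fired = [ks for ks in sources if ks.applies(bb)]
        if not fired:
            return bb
        for ks in fired:
            ks.act(bb)
```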
A method for onset detection is presented. It provides an alternative to
conventional energy-based algorithms by using phase information. Statistical
analysis is used to create a detection function that evaluates the expected
behaviour of the signal regarding onsets.
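In the spirit of that description (though not the thesis's exact statistical formulation), a basic phase-deviation detection function compares each bin's phase against a linear extrapolation from the two preceding frames:

```python
# Phase-deviation onset detection: a steady sinusoid's phase advances
# linearly, so large departures from a linear phase prediction suggest an
# onset. Frame and hop sizes are illustrative.
import numpy as np
import librosa

def phase_deviation(y, sr, n_fft=1024, hop_length=512):
    S = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
    phase = np.angle(S)
    # Predicted phase: linear extrapolation from the two preceding frames.
    predicted = 2.0 * phase[:, 1:-1] - phase[:, :-2]
    # Principal argument of the prediction error, wrapped to (-pi, pi].
    dev = np.angle(np.exp(1j * (phase[:, 2:] - predicted)))
    # One detection-function value per frame: mean absolute deviation.
    return np.abs(dev).mean(axis=0)
```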
Two methods for multi-pitch estimation are introduced. The first concentrates
on the grouping of harmonic information in the frequency domain.
Its performance and limitations emphasise the case for the use of high-level
knowledge.
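A toy harmonic-grouping score of this general kind (an assumed form, not the thesis's method) ranks candidate fundamentals by the spectral magnitude found at their integer multiples:

```python
# Illustrative harmonic grouping: a candidate fundamental is supported by
# spectral magnitude at its integer multiples, with higher harmonics
# weighted less.
import numpy as np

def harmonic_salience(mag, freqs, f0_candidates, n_harmonics=6):
    """mag, freqs: magnitude spectrum of one frame and its bin frequencies."""
    saliences = []
    for f0 in f0_candidates:
        s = 0.0
        for h in range(1, n_harmonics + 1):
            bin_idx = np.argmin(np.abs(freqs - h * f0))  # nearest bin
            s += mag[bin_idx] / h
        saliences.append(s)
    return np.asarray(saliences)
```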
Such high-level knowledge, in the form of the individual waveforms of a single
instrument, is used in the second proposed approach. The method is based
on a time-domain linear additive model and it presents an alternative to
common frequency-domain approaches.
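A least-squares toy of such a time-domain additive model (an assumed formulation, not the thesis's implementation): each signal frame is approximated as a weighted sum of known per-note waveforms, and large weights indicate active notes.

```python
# Toy time-domain additive model: solve for the mixing weights that best
# explain a frame as a linear combination of known note waveforms.
import numpy as np

def decompose_frame(frame, note_waveforms):
    """frame: (n_samples,); note_waveforms: (n_notes, n_samples)."""
    W = note_waveforms.T                           # (n_samples, n_notes)
    weights, *_ = np.linalg.lstsq(W, frame, rcond=None)
    return weights                                 # large entries: active notes
```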
Results are presented and discussed for all methods, showing that, if
reliably generated, the use of knowledge can significantly improve the quality
of the analysis.
Funded by the Joint Information Systems Committee (JISC) in the UK, the National Science Foundation (NSF) in the United States, and the Fundacion Gran Mariscal Ayacucho in Venezuela.
Unsupervised Learning from Narrated Instruction Videos
We address the problem of automatically learning the main steps to complete a
certain task, such as changing a car tire, from a set of narrated instruction
videos. The contributions of this paper are three-fold. First, we develop a new
unsupervised learning approach that takes advantage of the complementary nature
of the input video and the associated narration. The method solves two
clustering problems, one in text and one in video, applied one after the other
and linked by joint constraints to obtain a single coherent sequence of steps
in both modalities. Second, we collect and annotate a new challenging dataset
of real-world instruction videos from the Internet. The dataset contains about
800,000 frames for five different tasks that include complex interactions
between people and objects, and are captured in a variety of indoor and outdoor
settings. Third, we experimentally demonstrate that the proposed method can
automatically discover, in an unsupervised manner, the main steps to achieve
the task and locate the steps in the input videos.
Comment: Appears in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016). 21 pages.
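As a loose analogy to the two linked clustering problems (not the paper's discriminative-clustering formulation), one can cluster each modality separately and then match the two labelings by temporal co-occurrence; the feature arrays and step count k below are placeholders.

```python
# Toy linkage of two clusterings: cluster text and video features into k
# steps each, then match the labelings by maximising co-occurrence.
import numpy as np
from sklearn.cluster import KMeans
from scipy.optimize import linear_sum_assignment

def link_modalities(text_feats, video_feats, k=5):
    """Assumes both feature arrays are aligned per time point."""
    t = KMeans(n_clusters=k, n_init=10).fit_predict(text_feats)
    v = KMeans(n_clusters=k, n_init=10).fit_predict(video_feats)
    # co[i, j] counts time points where text step i and video step j co-occur.
    co = np.zeros((k, k))
    for ti, vi in zip(t, v):
        co[ti, vi] += 1
    rows, cols = linear_sum_assignment(-co)   # maximise agreement
    return dict(zip(rows, cols))              # text step -> video step
```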
Musicians and Machines: Bridging the Semantic Gap In Live Performance
This thesis explores the automatic extraction of musical information from
live performances – with the intention of using that information to create
novel, responsive and adaptive performance tools for musicians.
We focus specifically on two forms of musical analysis – harmonic analysis
and beat tracking. We present two harmonic analysis algorithms –
specifically we present a novel chroma vector analysis technique which
we later use as the input for a chord recognition algorithm. We also
present a real-time beat tracker, based upon an extension of state-of-the-art
non-causal models, which is computationally efficient and performs
strongly compared with other models. Furthermore, through a
modular study of several beat tracking algorithms we attempt to establish
methods to improve beat tracking and apply these lessons to our model.
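For the chord-recognition stage, a standard template-matching baseline over chroma (the general technique, not the thesis's novel chroma analysis) correlates each frame with 24 major and minor triad templates:

```python
# Template-based chord recognition over chroma: score each frame against
# normalised major/minor triad templates and pick the best.
import numpy as np
import librosa

NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def chord_templates():
    templates, names = [], []
    for root in range(12):
        for kind, intervals in (("maj", (0, 4, 7)), ("min", (0, 3, 7))):
            t = np.zeros(12)
            t[[(root + i) % 12 for i in intervals]] = 1.0
            templates.append(t / np.linalg.norm(t))
            names.append(f"{NOTES[root]}:{kind}")
    return np.array(templates), names

def recognise(y, sr):
    chroma = librosa.feature.chroma_cqt(y=y, sr=sr)   # (12, n_frames)
    T, names = chord_templates()
    scores = T @ chroma                               # (24, n_frames)
    return [names[i] for i in scores.argmax(axis=0)]  # one label per frame
```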
Building upon this work, we show that these analyses can be combined
to create a beat-synchronous musical representation, with harmonic information
segmented at the level of the beat. We present a number of ways
of calculating these representations and discuss their relative merits.
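One such representation can be computed directly with off-the-shelf tools: beat-track the audio, then aggregate the chroma frames within each inter-beat interval. The median aggregation below is one of several reasonable choices, not necessarily the thesis's.

```python
# Beat-synchronous chroma: one harmonic summary vector per beat.
import numpy as np
import librosa

def beat_synchronous_chroma(y, sr):
    tempo, beats = librosa.beat.beat_track(y=y, sr=sr)
    chroma = librosa.feature.chroma_cqt(y=y, sr=sr)
    # Median within each inter-beat segment is robust to transients.
    return librosa.util.sync(chroma, beats, aggregate=np.median)
```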
We proceed by introducing a technique, which we call Performance
Following, for recognising repeated patterns in live musical performances.
Through examining the real-time beat-synchronous musical representation,
this technique makes predictions of future harmonic content in musical
performances with no prior knowledge in the form of a score.
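A hypothetical repetition-based predictor conveys the idea: find the past window of beat-synchronous chroma that best matches the current context, and predict that the beat which followed it will recur. This is a sketch of the concept, not the thesis's algorithm.

```python
# Repetition-based prediction over beat-synchronous chroma: match the current
# context against the past and return the beat that followed the best match.
import numpy as np

def predict_next(beat_chroma, context=8):
    """beat_chroma: (12, n_beats). Returns a predicted chroma vector for the
    next beat, or None until enough history has accumulated."""
    n = beat_chroma.shape[1]
    if n < 2 * context:
        return None
    now = beat_chroma[:, -context:]
    best_score, prediction = -np.inf, None
    # Compare against every earlier, non-overlapping window.
    for end in range(context, n - context + 1):
        past = beat_chroma[:, end - context:end]
        score = float((past * now).sum())     # simple inner-product similarity
        if score > best_score:
            best_score, prediction = score, beat_chroma[:, end]
    return prediction
```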
Finally, we present a number of potential applications for live performances
that incorporate the real-time musical analysis techniques outlined
previously. The applications presented include audio effects informed by
beat tracking, a technique for synchronising video to a live performance,
the use of harmonic information to control visual displays and an automatic
accompaniment system based upon our performance following
technique.
A new user interface for musical timbre design
This thesis characterises and addresses problems and issues associated with the design of intuitive user interfaces for timbral control. The usability of a range of synthesis methods and representative implementations of these methods is assessed, and three interface architectures are identified: fixed architecture, architecture specification, and direct specification. The characteristics of each of these architectures, as well as the usability problems inherent to each of them, are discussed; it is argued that none of them provides intuitive tools for the manipulation and control of timbre.
The study examines the nature of timbre and the notion of timbre space; different kinds of timbre space are considered and criteria are proposed for the selection of suitable timbre spaces as vehicles for synthesis.
A number of listening tests, designed to demonstrate the feasibility of subsequent work, were devised and carried out; the results of these tests provide evidence that, where Euclidean distances between sounds located in a given timbre space are reflected in perceptual distances, the ability of subjects to detect relative distances in different parts of the space varies with the perceptual granularity of the space.
Three contrasting timbre spaces conforming to the proposed criteria for use in synthesis are constructed; the purpose of these spaces is to provide an environment for a novel user interaction approach for timbral design which incorporates a search strategy based on weighted centroid localization. Two prototypes which exemplify the proposed approach in alternative ways are designed, implemented and tested with potential users in order to validate the approach; a third prototype, representing a simple contrasting alternative, is tested for purposes of comparison. The results of these tests are evaluated and discussed, and areas of further work are identified.
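The weighted-centroid search strategy admits a very small sketch (parameterisation assumed, not taken from the thesis): candidate sounds at known positions in the timbre space are rated by the user, and the next probe point is the ratings-weighted centroid of those positions.

```python
# Weighted centroid localization in a timbre space: the next search point is
# the ratings-weighted average of the candidates' positions.
import numpy as np

def weighted_centroid(positions, ratings):
    """positions: (n_candidates, n_dims) points in the timbre space;
    ratings: (n_candidates,) non-negative user similarity judgements
    (assumed not all zero)."""
    w = np.asarray(ratings, dtype=float)
    w = w / w.sum()
    return w @ np.asarray(positions)
```

Iterating (synthesise at the centroid, rate a fresh set of nearby candidates, repeat) homes the search in on the intended timbre.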
Audio source separation for music in low-latency and high-latency scenarios
This thesis proposes specific methods to address the limitations of current music source separation methods in low-latency and high-latency scenarios. First, we focus on methods with low computational cost and low latency. We propose the use of Tikhonov regularization as a method for spectrum decomposition in the low-latency context. We compare it to existing techniques in pitch estimation and tracking tasks, crucial steps in many separation methods. We then use the proposed spectrum decomposition method in low-latency separation tasks targeting singing voice, bass and drums. Second, we propose several high-latency methods that improve the separation of singing voice by modeling components that are often not accounted for, such as breathiness and consonants. Finally, we explore using temporal correlations and human annotations to enhance the separation of drums and complex polyphonic music signals.
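The low-latency appeal of Tikhonov regularization here is its closed-form solution: unlike iterative decompositions such as NMF, each incoming spectral frame costs a single linear solve. A minimal sketch, with an assumed fixed basis W of source or pitch spectra:

```python
# Tikhonov-regularised (ridge) spectrum decomposition: explain a spectral
# frame v as W @ h, with a closed-form solution for the activations h.
import numpy as np

def tikhonov_decompose(v, W, lam=0.1):
    """v: (n_bins,) magnitude spectrum; W: (n_bins, n_components) basis.
    Returns h minimising ||v - W h||^2 + lam * ||h||^2."""
    n = W.shape[1]
    return np.linalg.solve(W.T @ W + lam * np.eye(n), W.T @ v)
```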