    Automatic music transcription: challenges and future directions

    Automatic music transcription is considered by many to be a key enabling technology in music signal processing. However, the performance of transcription systems is still significantly below that of a human expert, and accuracies reported in recent years seem to have reached a limit, although the field is still very active. In this paper we analyse the limitations of current methods and identify promising directions for future research. Current transcription methods use general-purpose models which are unable to capture the rich diversity found in music signals. One way to overcome the limited performance of transcription systems is to tailor algorithms to specific use-cases. Semi-automatic approaches are another way of achieving a more reliable transcription. Also, the wealth of musical scores and corresponding audio data now available is a rich potential source of training data, via forced alignment of audio to scores, but large-scale utilisation of such data has yet to be attempted. Other promising approaches include the integration of information from multiple algorithms and different musical aspects.

    A User-assisted Approach to Multiple Instrument Music Transcription

    PhD thesis. The task of automatic music transcription has been studied for several decades and is regarded as an enabling technology for a multitude of applications such as music retrieval and discovery, intelligent music processing and large-scale musicological analyses. It refers to the process of identifying the musical content of a performance and representing it in a symbolic format. Despite its long research history, fully automatic music transcription systems are still error-prone and often fail when more complex polyphonic music is analysed. This gives rise to the question of how human knowledge can be incorporated in the transcription process. This thesis investigates ways to involve a human user in the transcription process. More specifically, it investigates how user input can be employed to derive timbre models for the instruments in a music recording, which are then used to obtain instrument-specific (parts-based) transcriptions. A first investigation studies different types of user input in order to derive instrument models by means of a non-negative matrix factorisation framework. The transcription accuracy of the different models is evaluated and a method is proposed that refines the models by allowing each pitch of each instrument to be represented by multiple basis functions. A second study aims at limiting the amount of user input to make the method more applicable in practice. Different methods are considered to estimate missing non-negative basis functions when only a subset of basis functions can be extracted based on the user information. A method is proposed to track the pitches of individual instruments over time by means of a Viterbi framework in which the states at each time frame contain several candidate instrument-pitch combinations.
A transition probability is employed that combines three different criteria: the frame-wise reconstruction error of each combination, a pitch continuity measure that favours similar pitches in consecutive frames, and an explicit activity model for each instrument. The method is shown to outperform other state-of-the-art multi-instrument tracking methods. Finally, the extraction of instrument models that include phase information is investigated as a step towards complex matrix decomposition. The phase relations between the partials of harmonic sounds are explored as a time-invariant property that can be employed to form complex-valued basis functions. The application of the model for a user-assisted transcription task is illustrated with a saxophone example.
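The user-derived timbre models described above amount to a fixed non-negative basis against which activations are estimated. The sketch below shows this in generic NMF terms, with the basis held fixed and a standard multiplicative update for the KL divergence; the function name and details are illustrative, not the thesis's exact algorithm.

```python
import numpy as np

def activations_from_templates(V, W, n_iter=500, eps=1e-12):
    """Estimate non-negative activations H such that V ~ W @ H, keeping
    the basis W fixed (e.g. spectral templates derived from user input).
    Uses the standard multiplicative update for the KL divergence."""
    rng = np.random.default_rng(0)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    denom = W.sum(axis=0)[:, None] + eps  # column sums of W, shape (K, 1)
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / denom
    return H
```

Freezing W and only adapting H mirrors the user-assisted setting, where the templates come from labelled material rather than blind factorisation.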

    From heuristics-based to data-driven audio melody extraction

    The identification of the melody from a music recording is a relatively easy task for humans, but very challenging for computational systems. This task is known as "audio melody extraction", more formally defined as the automatic estimation of the pitch sequence of the melody directly from the audio signal of a polyphonic music recording. This thesis investigates the benefits of exploiting knowledge automatically derived from data for audio melody extraction, by combining digital signal processing and machine learning methods. We extend the scope of melody extraction research by working with a varied dataset and multiple definitions of melody. We first present an overview of the state of the art, and perform an evaluation focused on a novel symphonic music dataset. We then propose melody extraction methods based on a source-filter model and pitch contour characterisation and evaluate them on a wide range of music genres. Finally, we explore novel timbre, tonal and spatial features for contour characterisation, and propose a method for estimating multiple melodic lines. The combination of supervised and unsupervised approaches leads to advancements in melody extraction and shows a promising path for future research and applications.
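A simple way to see the contour idea described above: given a (pitch x time) salience matrix, a Viterbi pass that trades salience against pitch jumps yields a smooth melodic trajectory. This is a generic sketch under illustrative names and parameters, not the thesis's method.

```python
import numpy as np

def decode_melody(salience, jump_penalty=0.1):
    """Decode one melodic pitch trajectory from a (pitch x time) salience
    matrix: a Viterbi pass rewards salient bins and penalises large pitch
    jumps between consecutive frames."""
    P, T = salience.shape
    pitches = np.arange(P)
    log_sal = np.log(salience + 1e-9)
    # trans[new, old]: penalty proportional to the size of the pitch jump
    trans = -jump_penalty * np.abs(pitches[:, None] - pitches[None, :])
    back = np.zeros((T, P), dtype=int)
    acc = log_sal[:, 0].copy()
    for t in range(1, T):
        scores = acc[None, :] + trans      # scores[new, old]
        back[t] = scores.argmax(axis=1)    # best predecessor per pitch
        acc = scores.max(axis=1) + log_sal[:, t]
    path = np.empty(T, dtype=int)
    path[-1] = acc.argmax()
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path
```

A full system would also decide when no melody is present (voicing detection); this sketch assumes the melody is always active.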

    Neural Networks for Analysing Music and Environmental Audio

    PhD thesis. In this thesis, we consider the analysis of music and environmental audio recordings with neural networks. Recently, neural networks have been shown to be an effective family of models for speech recognition, computer vision, natural language processing and a number of other statistical modelling problems. The composite layer-wise structure of neural networks allows for flexible model design, where prior knowledge about the domain of application can be used to inform the design and architecture of the neural network models. Additionally, it has been shown that when trained on sufficient quantities of data, neural networks can be directly applied to low-level features to learn mappings to high-level concepts like phonemes in speech and object classes in computer vision. In this thesis we investigate whether neural network models can be usefully applied to processing music and environmental audio. With regards to music signal analysis, we investigate two different problems. The first problem, automatic music transcription, aims to identify the score or the sequence of musical notes that comprise an audio recording. We also consider the problem of automatic chord transcription, where the aim is to identify the sequence of chords in a given audio recording. For both problems, we design neural network acoustic models which are applied to low-level time-frequency features in order to detect the presence of notes or chords. Our results demonstrate that the neural network acoustic models perform similarly to state-of-the-art acoustic models, without the need for any feature engineering. The networks are able to learn complex transformations from time-frequency features to the desired outputs, given sufficient amounts of training data. Additionally, we use recurrent neural networks to model the temporal structure of sequences of notes or chords, similar to language modelling in speech.
Our results demonstrate that the combination of the acoustic and language model predictions yields improved performance over the acoustic models alone. We also observe that convolutional neural networks yield better performance compared to other neural network architectures for acoustic modelling. For the analysis of environmental audio recordings, we consider the problem of acoustic event detection. Acoustic event detection has a similar structure to automatic music and chord transcription, where the system is required to output the correct sequence of semantic labels along with onset and offset times. We compare the performance of neural network architectures against Gaussian mixture models and support vector machines. In order to account for the fact that such systems are typically deployed on embedded devices, we compare performance as a function of the computational cost of each model. We evaluate the models on two large datasets of real-world recordings of baby cries and smoke alarms. Our results demonstrate that the neural networks clearly outperform the other models, and they are able to do so without incurring a heavy computational cost.
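To illustrate the acoustic-model idea described above (mapping low-level time-frequency frames to per-note probabilities), here is a toy NumPy forward pass. The shapes, weights and function names are purely illustrative; a real system would learn the weights and, as the thesis notes, typically use convolutional layers.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def note_probabilities(frames, W1, b1, W2, b2):
    """Toy acoustic model: map each time-frequency frame (a row of
    `frames`) through one hidden layer to per-pitch note probabilities.
    Sigmoid outputs act as independent note detectors, so polyphony
    (several notes at once) is handled naturally."""
    hidden = np.maximum(0.0, frames @ W1 + b1)  # ReLU hidden layer
    return sigmoid(hidden @ W2 + b2)
```

Each output column corresponds to one pitch (e.g. 88 piano keys); thresholding the per-frame probabilities gives a piano-roll-style transcription.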

    Analysis of an efficient parallel implementation of active-set Newton algorithm

    This paper presents an analysis of an efficient parallel implementation of the active-set Newton algorithm (ASNA), which is used to estimate the nonnegative weights of linear combinations of the atoms in a large-scale dictionary to approximate an observation vector by minimizing the Kullback–Leibler divergence between the observation vector and the approximation. The performance of ASNA has been demonstrated in previous work against other state-of-the-art methods. The implementations analysed in this paper have been developed in C, using parallel programming techniques to obtain better performance on multicore architectures than the original MATLAB implementation. A hardware analysis is also performed to assess the influence of CPU frequency and the number of CPU cores on the different implementations proposed. The new implementations allow the ASNA algorithm to tackle real-time problems thanks to the reduction in execution time. This work has been partially supported by the Programa de FPU del MECD, by MINECO and FEDER (Spain) under project TEC2015-67387-C4-1-R, and by project PROMETEO FASE II 2014/003 of the Generalitat Valenciana. The authors thank Dr. Konstantinos Drossos for some very useful mind-changing discussions. This work was conducted in the Laboratory of Signal Processing, Tampere University of Technology. San Juan-Sebastian, P.; Virtanen, T.; García Mollá, V. M.; Vidal Maciá, A. M. (2018). Analysis of an efficient parallel implementation of active-set Newton algorithm. The Journal of Supercomputing 75(3):1298–1309. https://doi.org/10.1007/s11227-018-2423-5
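For orientation, the objective that ASNA minimises is the generalised Kullback–Leibler divergence between the observation vector and its non-negative approximation. The helper below only evaluates that objective; it is not the active-set Newton solver itself, and the function name is illustrative.

```python
import numpy as np

def kl_divergence(v, approx, eps=1e-12):
    """Generalised Kullback-Leibler divergence D(v || approx), the
    objective minimised by ASNA when approx = W @ h for a dictionary W
    and non-negative weights h. Values are clipped at eps to avoid
    log-of-zero issues."""
    v = np.maximum(v, eps)
    approx = np.maximum(approx, eps)
    return float(np.sum(v * np.log(v / approx) - v + approx))
```

The divergence is zero exactly when the approximation matches the observation and positive otherwise, which is what makes it usable as a minimisation target.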

    Active-set Newton algorithm for overcomplete non-negative representations of audio

    Accepted version. Peer reviewed.

    Computational Methods for the Alignment and Score-Informed Transcription of Piano Music

    PhD thesis. This thesis is concerned with computational methods for alignment and score-informed transcription of piano music. Firstly, several methods are proposed to improve alignment robustness and accuracy when various versions of one piece of music show complex differences with respect to acoustic conditions or musical interpretation. Secondly, score-to-performance alignment is applied to enable score-informed transcription. Although music alignment methods have considerably improved in accuracy in recent years, the task remains challenging. The research in this thesis aims to improve robustness in cases where there are substantial differences between versions and state-of-the-art methods may fail to identify a correct alignment. This thesis first exploits the availability of multiple versions of the piece to be aligned. By processing these jointly, the alignment process can be stabilised by exploiting additional examples of how a section might be interpreted or which acoustic conditions may arise. Two methods are proposed, progressive alignment and profile HMM, both adapted from the multiple biological sequence alignment task. Experiments demonstrate that these methods can indeed improve alignment accuracy and robustness over comparable pairwise methods. Secondly, this thesis presents a score-to-performance alignment method that can improve robustness in cases where some musical voices, such as the melody, are played asynchronously to others – a stylistic device used in musical expression. The asynchronies between the melody and the accompaniment are handled by treating the voices as separate timelines in a multi-dimensional variant of dynamic time warping (DTW). The method measurably improves the alignment accuracy for pieces with asynchronous voices and preserves the accuracy otherwise.
Once an accurate alignment between a score and an audio recording is available, the score information can be exploited as prior knowledge in automatic music transcription (AMT), in scenarios where the score is available, such as music tutoring. Score-informed dictionary learning is used to learn, for each pitch, a spectral pattern that describes the energy distribution of the associated notes in the recording. More precisely, the dictionary learning process in non-negative matrix factorisation (NMF) is constrained using the aligned score. By adapting the dictionary to a given recording in this way, the proposed method improves the accuracy over the state-of-the-art. Funded by the China Scholarship Council.
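The alignment step described above rests on dynamic time warping. Below is a minimal sketch of the classic DTW recursion and backtracking over a score-to-audio pairwise cost matrix; it is the standard textbook algorithm, not the thesis's multi-dimensional variant for asynchronous voices.

```python
import numpy as np

def dtw_path(cost):
    """Dynamic time warping over a pairwise cost matrix (score frames x
    audio frames). Returns the minimum-cost monotonic alignment path as
    a list of (score_index, audio_index) pairs."""
    n, m = cost.shape
    D = np.full((n + 1, m + 1), np.inf)   # accumulated cost, with borders
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = cost[i - 1, j - 1] + min(D[i - 1, j - 1],
                                               D[i - 1, j], D[i, j - 1])
    # Backtrack from the end, always taking the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

In practice the cost matrix would hold frame-wise distances between score-synthesised and recorded features (e.g. chroma vectors).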

    Robust speech recognition with spectrogram factorisation

    Communication by speech is intrinsic for humans. Since the breakthrough of mobile devices and wireless communication, digital transmission of speech has become ubiquitous. Similarly, the distribution and storage of audio and video data have increased rapidly. However, despite being technically capable of recording and processing audio signals, only a fraction of digital systems and services are actually able to work with spoken input, that is, to operate on the lexical content of speech. One persistent obstacle to the practical deployment of automatic speech recognition systems is inadequate robustness against noise and other interferences, which regularly corrupt signals recorded in real-world environments. Speech and diverse noises are both complex signals, which are not trivially separable. Despite decades of research and a multitude of different approaches, the problem has not been solved to a sufficient extent. In particular, the mathematically ill-posed problem of separating multiple sources from a single-channel input requires advanced models and algorithms to be solvable. One promising path is using a composite model of long-context atoms to represent a mixture of non-stationary sources based on their spectro-temporal behaviour. Algorithms derived from the family of non-negative matrix factorisations have been applied to such problems to separate and recognise individual sources like speech. This thesis describes a set of tools developed for non-negative modelling of audio spectrograms, especially involving speech and real-world noise sources. An overview is provided of the complete framework, starting from model and feature definitions, advancing to factorisation algorithms, and finally describing different routes for separation, enhancement, and recognition tasks. Current issues and their potential solutions are discussed both theoretically and from a practical point of view.
The included publications describe factorisation-based recognition systems, which have been evaluated on publicly available speech corpora in order to determine the efficiency of various separation and recognition algorithms. Several variants and system combinations that have been proposed in the literature are also discussed. The work covers a broad span of factorisation-based system components, which together aim at providing a practically viable solution to robust processing and recognition of speech in everyday situations.
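A common enhancement step in factorisation-based systems like those described above is a soft, Wiener-style mask built from the speech and noise parts of the non-negative model. The sketch below assumes the dictionaries and activations have already been estimated; all names are illustrative.

```python
import numpy as np

def separate_speech(V, W_speech, W_noise, H_speech, H_noise, eps=1e-12):
    """Wiener-style soft masking: given non-negative dictionaries (W) and
    activations (H) for speech and noise, estimate the speech magnitude
    spectrogram from the mixture spectrogram V."""
    S = W_speech @ H_speech        # modelled speech magnitudes
    N = W_noise @ H_noise          # modelled noise magnitudes
    mask = S / (S + N + eps)       # per-bin speech proportion in [0, 1]
    return mask * V                # apply the soft mask to the mixture
```

The ratio mask distributes each mixture bin between the sources in proportion to their modelled energy, which avoids the artefacts of hard binary masking.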

    An efficient musical accompaniment parallel system for mobile devices

    This work presents a software system designed to track the performance of a musical piece with the aim of matching the score position in its symbolic representation on a digital sheet. In this system, known as an automated musical accompaniment system, score alignment is carried out in real time. Real-time score alignment, also known as score following, poses an important challenge due to the large amount of computation needed to process each digital frame and the very small time slot in which to process it. Moreover, the challenge is even greater since we are interested in handheld devices, i.e. devices characterized by both low power consumption and mobility. The results presented here show that it is possible to efficiently exploit several cores of an ARM processor, or a GPU accelerator (present in some SoCs from NVIDIA), reducing the processing time per frame to under 10 ms in most cases. This work was supported by the Ministry of Economy and Competitiveness of Spain (FEDER) under projects TEC2015-67387-C4-1-R, TEC2015-67387-C4-2-R and TEC2015-67387-C4-3-R, the Andalusian Business, Science and Innovation Council under project P2010-TIC-6762 (FEDER), and the Generalitat Valenciana under project PROMETEOII/2014/003. Alonso-Jordá, P.; Vera-Candeas, P.; Cortina, R.; Ranilla, J. (2017). An efficient musical accompaniment parallel system for mobile devices. The Journal of Supercomputing 73(1):343–353. https://doi.org/10.1007/s11227-016-1865-x