Robust Multipitch Analyzer against Initialization based on Latent Harmonic Allocation using Overtone Corpus
We present a Bayesian analysis method that estimates the harmonic structure of musical instruments in music signals on the basis of psychoacoustic evidence. Since the main objective of multipitch analysis is joint estimation of the fundamental frequencies and their harmonic structures, the performance of harmonic structure estimation significantly affects fundamental frequency estimation accuracy. Many methods have been proposed for estimating the harmonic structure accurately, but none satisfies all of the following requirements: robust against initialization, optimization-free, and psychoacoustically appropriate and thus easy to develop further. Our method satisfies these requirements by explicitly incorporating Terhardt's virtual pitch theory within a Bayesian framework. It does this by automatically learning the valid weight range of the harmonic components using a MIDI synthesizer. These bounds are termed the "overtone corpus." Modeling demonstrated that the proposed overtone corpus method can stably estimate the harmonic structure of 40 musical pieces for a wide variety of initial settings.
Neural Networks for Analysing Music and Environmental Audio
PhD
In this thesis, we consider the analysis of music and environmental audio
recordings with neural networks. Recently, neural networks have been
shown to be an effective family of models for speech recognition, computer
vision, natural language processing and a number of other statistical modelling
problems. The composite layer-wise structure of neural networks
allows for flexible model design, where prior knowledge about the domain
of application can be used to inform the design and architecture of the
neural network models. Additionally, it has been shown that when trained
on sufficient quantities of data, neural networks can be directly applied to
low-level features to learn mappings to high level concepts like phonemes
in speech and object classes in computer vision. In this thesis we investigate
whether neural network models can be usefully applied to processing
music and environmental audio.
With regard to music signal analysis, we investigate two different problems.
The first problem, automatic music transcription, aims to identify the
score or the sequence of musical notes that comprise an audio recording.
We also consider the problem of automatic chord transcription, where the
aim is to identify the sequence of chords in a given audio recording. For
both problems, we design neural network acoustic models which are applied
to low-level time-frequency features in order to detect the presence of
notes or chords. Our results demonstrate that the neural network acoustic
models perform similarly to state-of-the-art acoustic models, without the
need for any feature engineering. The networks are able to learn complex
transformations from time-frequency features to the desired outputs, given
sufficient amounts of training data. Additionally, we use recurrent neural
networks to model the temporal structure of sequences of notes or chords,
similar to language modelling in speech. Our results demonstrate that
the combination of the acoustic and language model predictions yields
improved performance over the acoustic models alone. We also observe
that convolutional neural networks yield better performance compared to
other neural network architectures for acoustic modelling.
For the analysis of environmental audio recordings, we consider the problem
of acoustic event detection. Acoustic event detection has a similar
structure to automatic music and chord transcription, where the system
is required to output the correct sequence of semantic labels along with
onset and offset times. We compare the performance of neural network
architectures against Gaussian mixture models and support vector machines.
In order to account for the fact that such systems are typically
deployed on embedded devices, we compare performance as a function of
the computational cost of each model. We evaluate the models on two large
datasets of real-world recordings of baby cries and smoke alarms. Our results
demonstrate that the neural networks clearly outperform the other
models and they are able to do so without incurring a heavy computational
cost.