    Robust Multipitch Analyzer against Initialization based on Latent Harmonic Allocation using Overtone Corpus

    We present a Bayesian analysis method that estimates the harmonic structure of musical instruments in music signals on the basis of psychoacoustic evidence. Since the main objective of multipitch analysis is the joint estimation of fundamental frequencies and their harmonic structures, the accuracy of harmonic structure estimation strongly affects the accuracy of fundamental frequency estimation. Many methods have been proposed for estimating the harmonic structure accurately, but none satisfies all of the following requirements: robustness against initialization, being optimization-free, and being psychoacoustically appropriate and thus easy to develop further. Our method satisfies these requirements by explicitly incorporating Terhardt's virtual pitch theory into a Bayesian framework: it automatically learns the valid weight range of the harmonic components using a MIDI synthesizer, and we term these learned bounds the "overtone corpus." Experiments demonstrated that the proposed overtone corpus method stably estimates the harmonic structures of 40 musical pieces under a wide variety of initial settings.
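    The abstract gives no implementation details, so the following is only a minimal sketch of the core idea, not the authors' method: relative overtone weights estimated from a spectrum are clipped to per-harmonic bounds learned offline (playing the role of the "overtone corpus"), so the estimate cannot drift to psychoacoustically implausible values regardless of initialization. The bound values, the naive peak-reading update, and all names are hypothetical.

```python
import numpy as np

N_HARMONICS = 10

# Hypothetical per-harmonic bounds on relative overtone weight, as might be
# learned offline from MIDI-synthesized instrument tones (the "overtone corpus").
CORPUS_LO = np.array([0.30, 0.10, 0.05] + [0.01] * (N_HARMONICS - 3))
CORPUS_HI = np.array([1.00, 0.80, 0.60] + [0.40] * (N_HARMONICS - 3))

def constrain_weights(w):
    """Clip estimated harmonic weights to the corpus bounds, then renormalize."""
    w = np.clip(w, CORPUS_LO, CORPUS_HI)
    return w / w.sum()

def estimate_harmonic_weights(spectrum, freqs, f0):
    """One naive re-estimation step: read the spectral energy at each harmonic
    of a candidate F0, normalize, then apply the corpus bounds so the estimate
    stays in a psychoacoustically plausible range."""
    energies = np.array([
        spectrum[np.argmin(np.abs(freqs - h * f0))]
        for h in range(1, N_HARMONICS + 1)
    ])
    w = energies / max(energies.sum(), 1e-12)
    return constrain_weights(w)

# Example: a synthetic spectrum with a 220 Hz fundamental and decaying overtones.
freqs = np.linspace(0, 5000, 4096)
spectrum = sum(np.exp(-0.5 * ((freqs - 220 * h) / 5.0) ** 2) / h
               for h in range(1, 8))
print(estimate_harmonic_weights(spectrum, freqs, 220.0))
```

    In the actual method, such bounds would presumably constrain the posterior over harmonic weights inside the Bayesian inference rather than clipping a point estimate; the clipping above is only meant to illustrate why learned bounds remove sensitivity to the initial weight values.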

    Neural Networks for Analysing Music and Environmental Audio

    In this PhD thesis, we consider the analysis of music and environmental audio recordings with neural networks. Recently, neural networks have been shown to be an effective family of models for speech recognition, computer vision, natural language processing, and a number of other statistical modelling problems. The composite layer-wise structure of neural networks allows for flexible model design, where prior knowledge about the domain of application can inform the design and architecture of the models. Additionally, it has been shown that, when trained on sufficient quantities of data, neural networks can be applied directly to low-level features to learn mappings to high-level concepts such as phonemes in speech and object classes in computer vision. In this thesis we investigate whether neural network models can be usefully applied to processing music and environmental audio.

    With regard to music signal analysis, we investigate two problems. The first, automatic music transcription, aims to identify the score, i.e. the sequence of musical notes, that comprises an audio recording. We also consider automatic chord transcription, where the aim is to identify the sequence of chords in a given recording. For both problems, we design neural network acoustic models that are applied to low-level time-frequency features in order to detect the presence of notes or chords. Our results demonstrate that the neural network acoustic models perform similarly to state-of-the-art acoustic models, without the need for any feature engineering; given sufficient amounts of training data, the networks learn complex transformations from time-frequency features to the desired outputs. Additionally, we use recurrent neural networks to model the temporal structure of sequences of notes or chords, analogous to language modelling in speech. Combining the acoustic and language model predictions yields improved performance over the acoustic models alone, and we observe that convolutional neural networks outperform the other architectures we consider for acoustic modelling.

    For the analysis of environmental audio recordings, we consider the problem of acoustic event detection, which has a structure similar to automatic music and chord transcription: the system must output the correct sequence of semantic labels along with onset and offset times. We compare the performance of neural network architectures against Gaussian mixture models and support vector machines, and, because such systems are typically deployed on embedded devices, we compare performance as a function of each model's computational cost. We evaluate the models on two large datasets of real-world recordings of baby cries and smoke alarms. Our results demonstrate that the neural networks clearly outperform the other models, and do so without incurring a heavy computational cost.
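    As a concrete illustration of the acoustic-modelling setup described above, here is a minimal PyTorch sketch; it is not the thesis's architecture, and the layer sizes, the 229-bin input, and the 88-note output are assumptions made for the example. A convolutional network maps log-spectrogram frames to independent per-note probabilities (sigmoid rather than softmax, since multiple notes can sound simultaneously), which would be trained framewise with binary cross-entropy against a piano-roll target.

```python
import torch
import torch.nn as nn

class ConvAcousticModel(nn.Module):
    """Hypothetical convolutional acoustic model for framewise note detection."""

    def __init__(self, n_bins=229, n_notes=88):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d((1, 2)),   # pool along frequency only; keep time resolution
            nn.Conv2d(32, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d((1, 2)),
        )
        self.fc = nn.Linear(32 * (n_bins // 4), n_notes)

    def forward(self, x):
        # x: (batch, 1, frames, n_bins) log-spectrogram excerpt
        h = self.conv(x)                      # (batch, 32, frames, n_bins // 4)
        h = h.permute(0, 2, 1, 3).flatten(2)  # (batch, frames, 32 * (n_bins // 4))
        return torch.sigmoid(self.fc(h))      # per-frame, per-note probabilities

model = ConvAcousticModel()
x = torch.randn(1, 1, 100, 229)  # 100 frames, 229 frequency bins
probs = model(x)                 # (1, 100, 88): independent note activations
```

    The thesis additionally combines such acoustic predictions with a recurrent music language model over note or chord sequences, analogous to combining acoustic and language models in speech recognition; that combination step is omitted from this sketch.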