1,641 research outputs found

    Multiscale approaches to music audio feature learning

    Get PDF
    Content-based music information retrieval tasks are typically solved with a two-stage approach: features are extracted from music audio signals, and are then used as input to a regressor or classifier. These features can be engineered or learned from data. Although the former approach was dominant in the past, feature learning has started to receive more attention from the MIR community in recent years. Recent results in feature learning indicate that simple algorithms such as K-means can be very effective, sometimes surpassing more complicated approaches based on restricted Boltzmann machines, autoencoders or sparse coding. Furthermore, there has been increased interest in multiscale representations of music audio recently. Such representations are more versatile because music audio exhibits structure on multiple timescales, which are relevant for different MIR tasks to varying degrees. We develop and compare three approaches to multiscale audio feature learning using the spherical K-means algorithm. We evaluate them in an automatic tagging task and a similarity metric learning task on the Magnatagatune dataset

    Towards Building Deep Networks with Bayesian Factor Graphs

    Full text link
    We propose a Multi-Layer Network based on the Bayesian framework of the Factor Graphs in Reduced Normal Form (FGrn) applied to a two-dimensional lattice. The Latent Variable Model (LVM) is the basic building block of a quadtree hierarchy built on top of a bottom layer of random variables that represent pixels of an image, a feature map, or more generally a collection of spatially distributed discrete variables. The multi-layer architecture implements a hierarchical data representation that, via belief propagation, can be used for learning and inference. Typical uses are pattern completion, correction and classification. The FGrn paradigm provides great flexibility and modularity and appears as a promising candidate for building deep networks: the system can be easily extended by introducing new and different (in cardinality and in type) variables. Prior knowledge, or supervised information, can be introduced at different scales. The FGrn paradigm provides a handy way for building all kinds of architectures by interconnecting only three types of units: Single Input Single Output (SISO) blocks, Sources and Replicators. The network is designed like a circuit diagram and the belief messages flow bidirectionally in the whole system. The learning algorithms operate only locally within each block. The framework is demonstrated in this paper in a three-layer structure applied to images extracted from a standard data set.Comment: Submitted for journal publicatio

    Multi-Resolution Fully Convolutional Neural Networks for Monaural Audio Source Separation

    Get PDF
    In deep neural networks with convolutional layers, each layer typically has fixed-size/single-resolution receptive field (RF). Convolutional layers with a large RF capture global information from the input features, while layers with small RF size capture local details with high resolution from the input features. In this work, we introduce novel deep multi-resolution fully convolutional neural networks (MR-FCNN), where each layer has different RF sizes to extract multi-resolution features that capture the global and local details information from its input features. The proposed MR-FCNN is applied to separate a target audio source from a mixture of many audio sources. Experimental results show that using MR-FCNN improves the performance compared to feedforward deep neural networks (DNNs) and single resolution deep fully convolutional neural networks (FCNNs) on the audio source separation problem.Comment: arXiv admin note: text overlap with arXiv:1703.0801

    A Convolutional Neural Network model based on Neutrosophy for Noisy Speech Recognition

    Full text link
    Convolutional neural networks are sensitive to unknown noisy condition in the test phase and so their performance degrades for the noisy data classification task including noisy speech recognition. In this research, a new convolutional neural network (CNN) model with data uncertainty handling; referred as NCNN (Neutrosophic Convolutional Neural Network); is proposed for classification task. Here, speech signals are used as input data and their noise is modeled as uncertainty. In this task, using speech spectrogram, a definition of uncertainty is proposed in neutrosophic (NS) domain. Uncertainty is computed for each Time-frequency point of speech spectrogram as like a pixel. Therefore, uncertainty matrix with the same size of spectrogram is created in NS domain. In the next step, a two parallel paths CNN classification model is proposed. Speech spectrogram is used as input of the first path and uncertainty matrix for the second path. The outputs of two paths are combined to compute the final output of the classifier. To show the effectiveness of the proposed method, it has been compared with conventional CNN on the isolated words of Aurora2 dataset. The proposed method achieves the average accuracy of 85.96 in noisy train data. It is more robust against Car, Airport and Subway noises with accuracies 90, 88 and 81 in test sets A, B and C, respectively. Results show that the proposed method outperforms conventional CNN with the improvement of 6, 5 and 2 percentage in test set A, test set B and test sets C, respectively. It means that the proposed method is more robust against noisy data and handle these data effectively.Comment: International conference on Pattern Recognition and Image Analysis (IPRIA 2019
    corecore