Sample-level CNN Architectures for Music Auto-tagging Using Raw Waveforms
Recent work has shown that the end-to-end approach using convolutional neural
networks (CNNs) is effective in various types of machine learning tasks. For
audio signals, the approach takes raw waveforms as input using a 1-D
convolution layer. In this paper, we improve the 1-D CNN architecture for music
auto-tagging by adopting building blocks from state-of-the-art image
classification models, ResNets and SENets, and adding multi-level feature
aggregation to it. We compare different combinations of the modules in building
CNN architectures. The results show that they achieve significant improvements
over previous state-of-the-art models on the MagnaTagATune dataset and
comparable results on the Million Song Dataset. Furthermore, we analyze and
visualize our model to show how the 1-D CNN operates.
Comment: Accepted for publication at ICASSP 201
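As a rough illustration of the SENet-style building block this abstract mentions, the sketch below applies squeeze-and-excitation gating to the output of a 1-D convolution over a raw waveform. It is not the authors' architecture: the filter length, stride, channel counts, bottleneck size, and the random weights are all assumptions chosen only to keep the example small.

```python
import numpy as np

def conv1d(x, w, stride=1):
    """Valid 1-D cross-correlation. x: (c_in, t), w: (c_out, c_in, k)."""
    c_out, c_in, k = w.shape
    t_out = (x.shape[1] - k) // stride + 1
    out = np.empty((c_out, t_out))
    for i in range(t_out):
        window = x[:, i * stride : i * stride + k]            # (c_in, k)
        out[:, i] = np.tensordot(w, window, axes=([1, 2], [0, 1]))
    return out

def squeeze_excite(feat, w1, w2):
    """Rescale the channels of feat (c, t) with gates from a two-layer bottleneck."""
    squeezed = feat.mean(axis=1)                     # squeeze: average over time -> (c,)
    hidden = np.maximum(w1 @ squeezed, 0.0)          # ReLU bottleneck
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))     # sigmoid gates in (0, 1)
    return feat * gates[:, None], gates

rng = np.random.default_rng(0)
waveform = rng.standard_normal((1, 729))             # 1 channel, 729 samples (assumed)
w_conv = rng.standard_normal((8, 1, 3)) * 0.1        # 8 filters of length 3 (assumed)
features = np.maximum(conv1d(waveform, w_conv, stride=3), 0.0)
w1 = rng.standard_normal((2, 8)) * 0.1               # bottleneck of 2 units (assumed)
w2 = rng.standard_normal((8, 2)) * 0.1
rescaled, gates = squeeze_excite(features, w1, w2)
print(features.shape, rescaled.shape)
```

Each channel is multiplied by a single gate computed from its time-averaged activation, so the block can emphasize or suppress whole filters without changing the temporal resolution.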
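Music Artist Classification Using Audio Waveforms with Deep Convolutional Neural Networks
Graduation thesis abstract, academic year 2018, Faculty of Information Science, Information Science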
Generative Autoregressive Networks for 3D Dancing Move Synthesis from Music
This paper proposes a framework which is able to generate a sequence of
three-dimensional human dance poses for a given piece of music. The proposed framework
consists of three components: a music feature encoder, a pose generator, and a
music genre classifier. We focus on integrating these components for generating
realistic 3D human dance moves from music, which can be applied to
artificial agents and humanoid robots. The trained dance pose generator, which
is a generative autoregressive model, is able to synthesize a dance sequence
longer than 5,000 pose frames. Experimental results of generated dance
sequences from various songs show how the proposed method generates human-like
dance moves for a given piece of music. In addition, a generated 3D dance sequence is
applied to a humanoid robot, showing that the proposed framework can make a
robot dance just by listening to music.
Comment: 8 pages, 10 figure
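To make the term "generative autoregressive model" concrete, here is a minimal sketch of an autoregressive rollout in which each generated pose is fed back as input for the next frame. The one-step model (a single tanh layer), the dimensions, and the random weights are hypothetical stand-ins; the abstract does not specify the actual pose generator or music encoder.

```python
import numpy as np

def generate_poses(music_feats, seed_pose, w_pose, w_music):
    """Roll out one pose per music frame, feeding each output back as the next input."""
    poses = [seed_pose]
    for feat in music_feats:
        prev = poses[-1]
        nxt = np.tanh(w_pose @ prev + w_music @ feat)   # hypothetical one-step model
        poses.append(nxt)
    return np.stack(poses[1:])                          # drop the seed pose

rng = np.random.default_rng(0)
n_frames, pose_dim, feat_dim = 100, 6, 4                # assumed toy sizes
music_feats = rng.standard_normal((n_frames, feat_dim))
w_pose = rng.standard_normal((pose_dim, pose_dim)) * 0.3
w_music = rng.standard_normal((pose_dim, feat_dim)) * 0.3
dance = generate_poses(music_feats, np.zeros(pose_dim), w_pose, w_music)
print(dance.shape)
```

Because every frame depends only on the previous pose and the current music features, the loop can in principle run for as many frames as there is music.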
Motivic Pattern Classification of Music Audio Signals Combining Residual and LSTM Networks
Motivic pattern classification from music audio recordings is a challenging task, more so in the case of a cappella flamenco cantes, which are characterized by complex melodic variations, pitch instability, timbre changes, extreme vibrato oscillations, microtonal ornamentations, and noisy recording conditions. Convolutional Neural Networks (CNNs) have proven to be very effective algorithms in image classification. Recent work in large-scale audio classification has shown that CNN architectures, originally developed for image problems, can be applied successfully to audio event recognition and classification with little or no modification to the networks. In this paper, CNN architectures are tested on a more nuanced problem: flamenco cantes intra-style classification using small motivic patterns. A new architecture is proposed that uses residual CNNs as feature extractors and a bidirectional LSTM layer to exploit the sequential nature of musical audio data. We present a full end-to-end pipeline for audio music classification that includes a sequential pattern mining technique and a contour simplification method to extract relevant motifs from audio recordings. Mel-spectrograms of the extracted motifs are then used as the input for the different architectures tested. We investigate the usefulness of motivic patterns for the automatic classification of music recordings and the effect of the length of the audio and corpus size on the overall classification accuracy. Results show a relative accuracy improvement of up to 20.4% when CNN architectures are trained using acoustic representations from motivic patterns.
Deep Learning for Black-Box Modeling of Audio Effects
Virtual analog modeling of audio effects consists of emulating the sound of an audio processor reference device. This digital simulation is normally done by designing mathematical models of these systems. This is often difficult because it requires accurately modeling all components within the effect unit, which usually contains various nonlinearities and time-varying components. Most existing methods for audio effects modeling are either simplified or optimized to a very specific circuit or type of audio effect and cannot be efficiently translated to other types of audio effects. Recently, deep neural networks have been explored as black-box modeling strategies to solve this task, i.e., by using only input–output measurements. We analyse different state-of-the-art deep learning models based on convolutional and recurrent neural networks and feedforward WaveNet architectures, and we also introduce a new model based on the combination of the aforementioned models. Through objective perceptual-based metrics and subjective listening tests we explore the performance of these models when modeling various analog audio effects. Thus, we show virtual analog models of nonlinear effects, such as a tube preamplifier; nonlinear effects with memory, such as a transistor-based limiter; and nonlinear time-varying effects, such as the rotating horn and rotating woofer of a Leslie speaker cabinet.
Neural content-aware collaborative filtering for cold-start music recommendation
State-of-the-art music recommender systems are based on collaborative filtering, which builds upon learning similarities between users and songs from the available listening data. These approaches inherently face the cold-start problem, as they cannot recommend novel songs with no listening history. Content-aware recommendation addresses this issue by incorporating content information about the songs on top of collaborative filtering. However, methods falling in this category rely on a shallow user/item interaction that originates from a matrix factorization framework. In this work, we introduce neural content-aware collaborative filtering, a unified framework which alleviates these limits, and extends the recently introduced neural collaborative filtering to its content-aware counterpart. We propose a generative model which leverages deep learning both for extracting content information from low-level acoustic features and for modeling the interaction between user and song embeddings. The deep content feature extractor can either directly predict the item embedding, or serve as a regularization prior, yielding two variants (strict and relaxed) of our model. Experimental results show that the proposed method reaches state-of-the-art results for a cold-start music recommendation task. We notably observe that exploiting deep neural networks for learning refined user/item interactions outperforms approaches using a simpler interaction model in a content-aware framework.
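The strict and relaxed variants described in this abstract can be contrasted with a small sketch. Everything below is an assumption made for illustration: the one-layer content extractor, the squared-distance penalty used for the relaxed variant, the concatenation-based interaction network, and all sizes and weights. The abstract itself gives no such details.

```python
import numpy as np

def content_embedding(acoustic, w_content):
    """Hypothetical content extractor: map acoustic features to the embedding space."""
    return np.tanh(w_content @ acoustic)

def strict_item_embedding(acoustic, w_content):
    """Strict variant: the item embedding is the content prediction itself."""
    return content_embedding(acoustic, w_content)

def relaxed_penalty(item_emb, acoustic, w_content, weight=1.0):
    """Relaxed variant: a free item embedding is pulled toward the content prediction."""
    diff = item_emb - content_embedding(acoustic, w_content)
    return weight * float(diff @ diff)

def interaction_score(user_emb, item_emb, w_hidden, w_out):
    """Small network on concatenated embeddings instead of a plain dot product."""
    hidden = np.maximum(w_hidden @ np.concatenate([user_emb, item_emb]), 0.0)
    return float(1.0 / (1.0 + np.exp(-(w_out @ hidden))))

rng = np.random.default_rng(0)
acoustic = rng.standard_normal(5)                # made-up features of a new song
w_content = rng.standard_normal((3, 5)) * 0.5
user_emb = rng.standard_normal(3)
w_hidden = rng.standard_normal((4, 6)) * 0.5
w_out = rng.standard_normal(4) * 0.5

# A song with no listening history still gets an embedding from its audio content.
cold_item = strict_item_embedding(acoustic, w_content)
score = interaction_score(user_emb, cold_item, w_hidden, w_out)
penalty = relaxed_penalty(cold_item, acoustic, w_content)
print(score, penalty)
```

In the strict case the penalty is zero by construction, since the embedding equals the content prediction; in the relaxed case the embedding is a separate parameter and the penalty only discourages it from drifting far from that prediction.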