Deep Learning for Audio Signal Processing
Given the recent surge in developments of deep learning, this article
provides a review of the state-of-the-art deep learning techniques for audio
signal processing. Speech, music, and environmental sound processing are
considered side-by-side, in order to point out similarities and differences
between the domains, highlighting general methods, problems, key references,
and potential for cross-fertilization between areas. The dominant feature
representations (in particular, log-mel spectra and raw waveform) and deep
learning models are reviewed, including convolutional neural networks, variants
of the long short-term memory architecture, as well as more audio-specific
neural network models. Subsequently, prominent deep learning application areas
are covered, i.e. audio recognition (automatic speech recognition, music
information retrieval, environmental sound detection, localization and
tracking) and synthesis and transformation (source separation, audio
enhancement, generative models for speech, sound, and music synthesis).
Finally, key issues and future questions regarding deep learning applied to
audio signal processing are identified.
Comment: 15 pages, 2 pdf figures
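Log-mel spectra, cited above as a dominant feature representation, are straightforward to compute. The sketch below is a minimal illustration using librosa; the library choice and all parameter values are assumptions for illustration, not taken from the article.

```python
import librosa
import numpy as np

# Load a bundled example clip; librosa resamples to the requested rate.
y, sr = librosa.load(librosa.example("trumpet"), sr=22050)

# Mel spectrogram: STFT power mapped onto a mel filter bank.
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=128
)

# Log compression stabilizes the dynamic range for neural network input.
log_mel = librosa.power_to_db(mel, ref=np.max)
print(log_mel.shape)  # (n_mels, n_frames)
```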
How Low Can You Go? Reducing Frequency and Time Resolution in Current CNN Architectures for Music Auto-tagging
Automatic tagging of music is an important research topic in Music
Information Retrieval and audio analysis algorithms proposed for this task have
achieved improvements with advances in deep learning. In particular, many
state-of-the-art systems use Convolutional Neural Networks and operate on
mel-spectrogram representations of the audio. In this paper, we compare
commonly used mel-spectrogram representations and evaluate the model
performance that can be achieved when the input size is reduced, both through
fewer frequency bands and through lower time resolution. We use the
MagnaTagaTune dataset for
comprehensive performance comparisons and then compare selected configurations
on the larger Million Song Dataset. The results of this study can help
researchers and practitioners make trade-off decisions between model accuracy,
data storage size, and training and inference times.
Comment: The 28th European Signal Processing Conference (EUSIPCO)
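To make the kind of input-size reduction studied here concrete, the sketch below computes two log-mel variants: a higher-resolution baseline and a reduced one with fewer bands and a larger hop. The specific n_mels and hop_length values are illustrative assumptions, not the configurations evaluated in the paper.

```python
import librosa

# Any audio clip works; this uses a bundled librosa example.
y, sr = librosa.load(librosa.example("trumpet"), sr=16000)

def log_mel(y, sr, n_mels, hop_length):
    m = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=1024, hop_length=hop_length, n_mels=n_mels
    )
    return librosa.power_to_db(m)

baseline = log_mel(y, sr, n_mels=96, hop_length=256)  # higher resolution
reduced = log_mel(y, sr, n_mels=32, hop_length=1024)  # 3x fewer bands, 4x larger hop
print(baseline.shape, reduced.shape)  # reduced input is ~12x smaller
```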
A Feature Learning Siamese Model for Intelligent Control of the Dynamic Range Compressor
In this paper, a siamese DNN model is proposed to learn the characteristics
of the audio dynamic range compressor (DRC). This facilitates an intelligent
control system that uses audio examples to configure the DRC, a widely used
non-linear audio signal conditioning technique in the areas of music
production, speech communication and broadcasting. Several alternative siamese
DNN architectures are proposed to learn feature embeddings that can
characterise subtle effects due to dynamic range compression. These models are
compared with each other as well as with handcrafted features proposed in
previous work. An evaluation of the relations between the DNN hyperparameters
and the DRC parameters is also provided. The best model is able to produce a universal
feature embedding that is capable of predicting multiple DRC parameters
simultaneously, which is a significant improvement over our previous research.
The feature embedding shows better performance than handcrafted audio features
when predicting DRC parameters for both mono-instrument audio loops and
polyphonic music pieces.
Comment: 8 pages, accepted in IJCNN 201
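As a rough sketch of the siamese idea, two inputs pass through one shared-weight embedding network and are compared by distance in the embedding space. The PyTorch architecture and sizes below are illustrative assumptions, not the models proposed in the paper.

```python
import torch
import torch.nn as nn

class EmbeddingNet(nn.Module):
    """Maps a (batch, 1, mel, frames) spectrogram to a fixed-size embedding."""
    def __init__(self, dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(32, dim)

    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))

# Siamese usage: both branches share the same network and weights.
net = EmbeddingNet()
a, b = torch.randn(8, 1, 64, 128), torch.randn(8, 1, 64, 128)
dist = torch.pairwise_distance(net(a), net(b))  # per-pair embedding distance
```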
Deep Neural Networks for Music Tagging
In this PhD thesis, I present my hypothesis, experimental results, and discussion related
to various aspects of deep neural networks for music tagging.
Music tagging is the task of automatically predicting suitable semantic labels for a
given piece of music. Generally speaking, the input of a music tagging system can be
any entity that constitutes music, e.g., audio content, lyrics, or metadata, but only
the audio content is considered in this thesis. My hypothesis is that we can find
effective deep learning practices for the task of music tagging that improve the
classification performance.
As a computational model to realise a music tagging system, I use deep neural networks.
Combined with the research problem, the scope of this thesis is the understanding,
interpretation, optimisation, and application of deep neural networks in the context of
music tagging systems.
The ultimate goal of this thesis is to provide insight that can help to improve deep
learning-based music tagging systems. There are many smaller goals in this regard.
Since using deep neural networks is a data-driven approach, it is crucial to understand the
dataset. Selecting and designing a better architecture is the next topic to discuss. Since
the tagging is done with audio input, preprocessing the audio signal becomes one of the
important research topics. After building (or training) a music tagging system, finding
a suitable way to re-use it for other music information retrieval tasks is a compelling
topic, in addition to interpreting the trained system.
The evidence presented in the thesis supports the conclusion that deep neural networks
are powerful and credible methods for building music tagging systems.
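As a generic illustration of the kind of audio-based tagging model the thesis studies, the sketch below is a toy mel-spectrogram CNN with a multi-label sigmoid output. Its architecture is an assumption for illustration and not one of the networks examined in the thesis.

```python
import torch
import torch.nn as nn

class Tagger(nn.Module):
    """Toy CNN tagger: log-mel input -> probabilities over n_tags labels."""
    def __init__(self, n_tags=50):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, n_tags)

    def forward(self, x):             # x: (batch, 1, mel, frames)
        logits = self.head(self.features(x).flatten(1))
        return torch.sigmoid(logits)  # independent per-tag probabilities

tags = Tagger()(torch.randn(4, 1, 96, 256))  # (4, 50)
```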
Deep Attention-based Representation Learning for Heart Sound Classification
Cardiovascular diseases are the leading cause of death and severely threaten
human health in daily life. On the one hand, there are dramatically increasing
demands from both clinical practice and smart home applications for monitoring
the heart status of subjects suffering from chronic cardiovascular diseases. On
the other hand, there is still a shortage of experienced physicians who can
perform efficient auscultation.
Automatic heart sound classification leveraging the power of advanced signal
processing and machine learning technologies has shown encouraging results.
Nevertheless, hand-crafting features is expensive and time-consuming. To
this end, we propose a novel deep representation learning method with an
attention mechanism for heart sound classification. In this paradigm,
high-level representations are learnt automatically from the recorded heart
sound data. In particular, a global attention pooling layer improves the
performance of the learnt representations by estimating the contribution of
each unit in feature maps. The Heart Sounds Shenzhen (HSS) corpus (170 subjects
involved) is used to validate the proposed method. Experimental results show
that our approach can achieve an unweighted average recall of 51.2% for
classifying three categories of heart sounds, i.e., normal, mild, and
moderate/severe, as annotated by cardiologists with the help of echocardiography.
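A global attention pooling layer of the kind described can be sketched as a learned score per time-frequency position, softmax-normalized and used to weight the sum over the feature map. The PyTorch formulation below is one common interpretation, not necessarily the paper's exact layer.

```python
import torch
import torch.nn as nn

class GlobalAttentionPool(nn.Module):
    """Pools a (batch, channels, H, W) feature map to (batch, channels),
    weighting each position by a learned, softmax-normalized score."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)  # one score per position

    def forward(self, x):
        b, c, h, w = x.shape
        w_att = torch.softmax(self.score(x).view(b, 1, h * w), dim=-1)
        return (x.view(b, c, h * w) * w_att).sum(dim=-1)

pooled = GlobalAttentionPool(64)(torch.randn(2, 64, 12, 30))  # (2, 64)
```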