Deep Learning for Audio Signal Processing
Given the recent surge in developments of deep learning, this article
provides a review of the state-of-the-art deep learning techniques for audio
signal processing. Speech, music, and environmental sound processing are
considered side-by-side, in order to point out similarities and differences
between the domains, highlighting general methods, problems, key references,
and potential for cross-fertilization between areas. The dominant feature
representations (in particular, log-mel spectra and raw waveform) and deep
learning models are reviewed, including convolutional neural networks, variants
of the long short-term memory architecture, as well as more audio-specific
neural network models. Subsequently, prominent deep learning application areas
are covered, i.e. audio recognition (automatic speech recognition, music
information retrieval, environmental sound detection, localization and
tracking) and synthesis and transformation (source separation, audio
enhancement, generative models for speech, sound, and music synthesis).
Finally, key issues and future questions regarding deep learning applied to
audio signal processing are identified.
Comment: 15 pages, 2 pdf figures
EMD-based filtering (EMDF) of low-frequency noise for speech enhancement
An Empirical Mode Decomposition based filtering (EMDF) approach is presented as a post-processing stage for speech enhancement. This method is particularly effective in low-frequency noise environments. Unlike previous EMD-based denoising methods, this approach does not assume that the contaminating noise signal is fractional Gaussian noise. An adaptive method is developed to select the IMF index for separating the noise components from the speech based on the second-order IMF statistics. The low-frequency noise components are then separated by a partial reconstruction from the IMFs. It is shown that the proposed EMDF technique is able to suppress residual noise from speech signals that were enhanced by the conventional optimally modified log-spectral amplitude approach, which uses a minimum-statistics-based noise estimate. A comparative performance study is included that demonstrates the effectiveness of the EMDF system in various noise environments, such as car interior noise, military vehicle noise and babble noise. In particular, improvements of up to 10 dB are obtained in car noise environments. Listening tests were performed that confirm the results.
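The partial reconstruction step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `partial_reconstruction` is hypothetical, the IMFs are assumed to be ordered from fastest to slowest oscillation (the usual output order of EMD sifting), and the cut index is passed in by hand because the abstract does not spell out the adaptive selection rule.

```python
from typing import List, Sequence


def partial_reconstruction(imfs: Sequence[Sequence[float]], cut_index: int) -> List[float]:
    """Sum the first `cut_index` IMFs sample by sample.

    Assumption: `imfs` is ordered from fastest to slowest oscillation,
    so the IMFs that are left out carry the low-frequency content.
    """
    if not 0 <= cut_index <= len(imfs):
        raise ValueError("cut_index must lie between 0 and the number of IMFs")
    n_samples = len(imfs[0]) if imfs else 0
    out = [0.0] * n_samples
    for imf in imfs[:cut_index]:
        if len(imf) != n_samples:
            raise ValueError("all IMFs must have the same length")
        for t, value in enumerate(imf):
            out[t] += value
    return out


# Toy IMFs (made-up numbers): two oscillating components and one slow offset.
toy_imfs = [
    [1.0, -1.0, 1.0, -1.0],
    [0.5, 0.5, -0.5, -0.5],
    [2.0, 2.0, 2.0, 2.0],
]
# Keep the first two IMFs and drop the slow third one.
kept = partial_reconstruction(toy_imfs, cut_index=2)
```

In practice the cut index would come from the adaptive, statistics-based selection that the abstract mentions; here it is simply a parameter.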
Nonparametric estimation of the dynamic range of music signals
The dynamic range is an important parameter which measures the spread of
sound power, and for music signals it is a measure of recording quality. There
are various descriptive measures of sound power, none of which has strong
statistical foundations. We start from a nonparametric model for sound waves
in which an additive stochastic term captures transient energy. This
component is recovered by a simple rate-optimal kernel estimator that requires
a single data-driven tuning. The distribution of its variance is approximated
by a consistent random subsampling method that is able to cope with the massive
size of the typical dataset. Based on the latter, we propose a statistic, and
an estimation method that is able to represent the dynamic range concept
consistently. The behavior of the statistic is assessed based on a large
numerical experiment where we simulate dynamic compression on a selection of
real music signals. Application of the method to real data also shows how the
proposed method can predict subjective experts' opinions about the hifi quality
of a recording.
Data-driven multivariate and multiscale methods for brain computer interface
This thesis focuses on the development of data-driven multivariate and multiscale methods
for brain computer interface (BCI) systems. The electroencephalogram (EEG), the
most convenient means to measure neurophysiological activity due to its noninvasive nature,
is mainly considered. The nonlinearity and nonstationarity inherent in EEG and its
multichannel recording nature require a new set of data-driven multivariate techniques to
estimate more accurately features for enhanced BCI operation. Also, a long term goal
is to enable an alternative EEG recording strategy for achieving long-term and portable
monitoring.
Empirical mode decomposition (EMD) and local mean decomposition (LMD), fully
data-driven adaptive tools, are considered to decompose the nonlinear and nonstationary
EEG signal into a set of components which are highly localised in time and frequency. It
is shown that the complex and multivariate extensions of EMD, which can exploit common
oscillatory modes within multivariate (multichannel) data, can be used to accurately
estimate and compare the amplitude and phase information among multiple sources, which is
key to feature extraction in BCI systems. A complex extension of local mean decomposition
is also introduced and its operation is illustrated on two-channel neuronal
spike streams. Common spatial pattern (CSP), a standard feature extraction technique
for BCI applications, is also extended to the complex domain using the augmented complex
statistics. Depending on the circularity/noncircularity of a complex signal, one of the
complex CSP algorithms can be chosen to produce the best classification performance
between two different EEG classes.
Using these complex and multivariate algorithms, two cognitive brain studies are
investigated for more natural and intuitive design of advanced BCI systems. Firstly, a
Yarbus-style auditory selective attention experiment is introduced to measure the user's
attention to a sound source among a mixture of sound stimuli, which is aimed at improving
the usefulness of hearing instruments such as hearing aids. Secondly, emotion experiments
elicited by taste and taste recall are examined to determine the pleasure and displeasure
of a food for the implementation of affective computing. The separation between two
emotional responses is examined using real and complex-valued common spatial pattern
methods.
Finally, we introduce a novel approach to brain monitoring based on EEG recordings
from within the ear canal, embedded on a custom-made hearing aid earplug. The new
platform promises the possibility of both short- and long-term continuous use for standard
brain monitoring and interfacing applications.
A new approach to onset detection: towards an empirical grounding of theoretical and speculative ideologies of musical performance
This article assesses aspects of the current state of a project which aims, with the help of computers
and computer software, to segment soundfiles of vocal melodies into their component notes, identifying
precisely when the onset of each note occurs, and then tracking the pitch trajectory of each
note, especially in melodies employing a variety of non-standard temperaments, in which musical
intervals smaller than 100 cents are ubiquitous. From there, we may proceed further, to describe
many other “micro-features” of each of the notes, but for now our focus is on the onset times and
pitch trajectories.
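For readers unfamiliar with the unit used in this abstract, interval size in cents follows from the standard definition of 1200 cents per octave. The sketch below is only an illustration of that definition; the function name `interval_cents` and the example frequencies are assumptions and are not taken from the project's software.

```python
import math


def interval_cents(f_low_hz: float, f_high_hz: float) -> float:
    """Size of the interval between two frequencies, in cents (1200 per octave)."""
    if f_low_hz <= 0 or f_high_hz <= 0:
        raise ValueError("frequencies must be positive")
    return 1200.0 * math.log2(f_high_hz / f_low_hz)


# An octave (frequency ratio 2:1) spans 1200 cents.
octave = interval_cents(440.0, 880.0)
# A frequency ratio of 81:80 gives an interval well below 100 cents.
small_step = interval_cents(80.0, 81.0)
```

An equal-tempered semitone is 100 cents, so intervals like `small_step` are the kind of sub-semitone steps the abstract refers to.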
Single-Channel Signal Separation Using Spectral Basis Correlation with Sparse Nonnegative Tensor Factorization
A novel approach to single-channel signal separation is presented, based on the proposed sparse nonnegative tensor factorization under the framework of maximum a posteriori probability and adaptively fine-tuned using the hierarchical Bayesian approach with a new mixing mixture model. The mixing mixture is analogous to a stereo signal captured by one real and one virtual microphone. An “imitated-stereo” mixture model is thus developed by weighting and time-shifting the original single-channel mixture. This leads to an artificial dual-channel mixing system, which gives rise to a new form of spectral basis correlation diversity of the sources. Underlying all factorization algorithms is the principal difficulty in estimating the adequate number of latent components for each signal. This paper addresses these issues by developing a framework for pruning unnecessary components and incorporating modified multivariate rectified Gaussian prior information into the spectral basis features. The parameters of the imitated-stereo model are estimated via the proposed sparse nonnegative tensor factorization with Itakura–Saito divergence. In addition, the separability conditions of the proposed mixture model are derived, and it is demonstrated that the proposed method can separate real-time captured mixtures. Experimental testing on real audio sources has been conducted to verify the capability of the proposed method.
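The idea of building a virtual second channel by weighting and time-shifting a mono mixture can be sketched as below. This is a hedged illustration only: the abstract does not give the exact formula, so the normalized form used here, the function name `imitated_stereo`, and the parameters `gamma` (weight) and `delay` (shift in samples) are all assumptions.

```python
from typing import List, Sequence


def imitated_stereo(x: Sequence[float], gamma: float, delay: int) -> List[float]:
    """Build a hypothetical virtual second channel from a mono mixture.

    Assumed form (not taken from the abstract):
        x2[t] = (x[t] + gamma * x[t - delay]) / (1 + |gamma|)
    where x[t - delay] is treated as zero before the start of the signal.
    """
    if delay < 0:
        raise ValueError("delay must be non-negative")
    scale = 1.0 + abs(gamma)
    out = []
    for t, sample in enumerate(x):
        delayed = x[t - delay] if t >= delay else 0.0
        out.append((sample + gamma * delayed) / scale)
    return out


# Toy mono signal: a single impulse followed by silence.
mono = [1.0, 0.0, 0.0, 0.0]
virtual = imitated_stereo(mono, gamma=0.5, delay=2)
# The pair (mono, virtual) plays the role of the two-channel input.
```

The impulse reappears in `virtual` two samples later at reduced amplitude, which is the weighted, time-shifted copy that makes the two channels differ.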
Evaluating Ground Truth for ADRess as a Preprocess for Automatic Musical Instrument Identification
Most research in musical instrument identification has focused on labeling isolated samples or solo phrases. A robust instrument identification system capable of dealing with polytimbral recordings of instruments remains a necessity in music information retrieval. Experiments are described which evaluate the ground truth of ADRess as a sound source separation technique used as a preprocess to automatic musical instrument identification. The ground truth experiments are based on a number of basic acoustic features, while using a Gaussian Mixture Model as the classification algorithm. Using all 44 acoustic feature dimensions, successful identification rates are achieved.
TimbreTron: A WaveNet(CycleGAN(CQT(Audio))) Pipeline for Musical Timbre Transfer
In this work, we address the problem of musical timbre transfer, where the
goal is to manipulate the timbre of a sound sample from one instrument to match
another instrument while preserving other musical content, such as pitch,
rhythm, and loudness. In principle, one could apply image-based style transfer
techniques to a time-frequency representation of an audio signal, but this
depends on having a representation that allows independent manipulation of
timbre as well as high-quality waveform generation. We introduce TimbreTron, a
method for musical timbre transfer which applies "image" domain style transfer
to a time-frequency representation of the audio signal, and then produces a
high-quality waveform using a conditional WaveNet synthesizer. We show that the
Constant Q Transform (CQT) representation is particularly well-suited to
convolutional architectures due to its approximate pitch equivariance. Based on
human perceptual evaluations, we confirmed that TimbreTron recognizably
transferred the timbre while otherwise preserving the musical content, for both
monophonic and polyphonic samples.
Comment: 17 pages, published as a conference paper at ICLR 201
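The pitch equivariance mentioned in this abstract can be illustrated with the geometric spacing of CQT bin center frequencies: on an idealized log-frequency axis, transposing by a fixed interval is a shift by a fixed number of bins. The sketch below only demonstrates that spacing; the function names, the minimum frequency and the bin counts are illustrative assumptions, not values from the paper.

```python
import math


def cqt_center_frequencies(f_min_hz: float, n_bins: int, bins_per_octave: int = 12):
    """Geometrically spaced center frequencies: f_k = f_min * 2**(k / bins_per_octave)."""
    return [f_min_hz * 2.0 ** (k / bins_per_octave) for k in range(n_bins)]


def bin_shift_for_transposition(semitones: int, bins_per_octave: int = 12) -> float:
    """Number of bins that a transposition by `semitones` corresponds to."""
    return semitones * bins_per_octave / 12.0


freqs = cqt_center_frequencies(32.70, n_bins=48, bins_per_octave=12)
shift = int(bin_shift_for_transposition(7))  # seven semitones up

# The ratio between bin k + shift and bin k does not depend on k,
# so a transposition is a translation along the bin axis.
ratios = [freqs[k + shift] / freqs[k] for k in range(len(freqs) - shift)]
```

Because the same translation applies at every pitch, a convolutional filter sees a transposed note as the same pattern moved along the frequency axis; for real signals this holds only approximately, as the abstract notes.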