193 research outputs found
TimbreTron: A WaveNet(CycleGAN(CQT(Audio))) Pipeline for Musical Timbre Transfer
In this work, we address the problem of musical timbre transfer, where the
goal is to manipulate the timbre of a sound sample from one instrument to match
another instrument while preserving other musical content, such as pitch,
rhythm, and loudness. In principle, one could apply image-based style transfer
techniques to a time-frequency representation of an audio signal, but this
depends on having a representation that allows independent manipulation of
timbre as well as high-quality waveform generation. We introduce TimbreTron, a
method for musical timbre transfer which applies "image" domain style transfer
to a time-frequency representation of the audio signal, and then produces a
high-quality waveform using a conditional WaveNet synthesizer. We show that the
Constant Q Transform (CQT) representation is particularly well-suited to
convolutional architectures due to its approximate pitch equivariance. Based on
human perceptual evaluations, we confirmed that TimbreTron recognizably
transferred the timbre while otherwise preserving the musical content, for both
monophonic and polyphonic samples. Comment: 17 pages, published as a conference paper at ICLR 201
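The pitch-equivariance claim in the abstract above can be illustrated with a small sketch. It is not taken from the TimbreTron paper: the function name `harmonic_bins`, the lowest frequency `f_min`, and the 12-bins-per-octave resolution are assumptions chosen for illustration. On a log-spaced (CQT-style) frequency axis, transposing an idealized harmonic tone moves every harmonic by the same number of bins, whereas on a linear axis the displacement grows with harmonic number.

```python
import numpy as np

def harmonic_bins(f0, n_harmonics=5, f_min=32.70, bins_per_octave=12):
    """Fractional bin index of each harmonic on a log-spaced (CQT-style) axis.

    f_min and bins_per_octave are illustrative assumptions, not values
    from the paper.
    """
    harmonics = f0 * np.arange(1, n_harmonics + 1)
    return bins_per_octave * np.log2(harmonics / f_min)

semitones = 3
ratio = 2.0 ** (semitones / 12)  # frequency ratio of a 3-semitone transposition

low = harmonic_bins(220.0)
high = harmonic_bins(220.0 * ratio)

# Log-spaced axis: every harmonic moves by the same number of bins.
log_axis_shift = high - low

# Linear axis: the displacement in Hz grows with the harmonic number.
linear_axis_shift = 220.0 * (ratio - 1.0) * np.arange(1, 6)
```

The equivariance is only approximate for real instruments, as the abstract says, because features such as resonances do not move with the played pitch.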
Complex Neural Networks for Audio
Audio is represented in two mathematically equivalent ways: the real-valued time domain (i.e., waveform) and the complex-valued frequency domain (i.e., spectrum). There are advantages to the frequency-domain representation, e.g., the human auditory system is known to process sound in the frequency domain. Furthermore, linear time-invariant systems are convolved with sources in the time domain, whereas they may be factorized in the frequency domain. Neural networks have become rather useful when applied to audio tasks such as machine listening and audio synthesis, which are related by their dependencies on high-quality acoustic models. They ideally encapsulate fine-scale temporal structure, such as that encoded in the phase of frequency-domain audio, yet there are no authoritative deep learning methods for complex audio. This manuscript is dedicated to addressing the shortcoming. Chapter 2 motivates complex networks by their affinity with complex-domain audio, while Chapter 3 contributes methods for building and optimizing complex networks. We show that the naive implementation of Adam optimization is incorrect for complex random variables and show that selection of input and output representation has a significant impact on the performance of a complex network. Experimental results with novel complex neural architectures are provided in the second half of this manuscript. Chapter 4 introduces a complex model for binaural audio source localization. We show that, like humans, the complex model can generalize to different anatomical filters, which is important in the context of machine listening. The complex model's performance is better than that of the real-valued models, as well as real- and complex-valued baselines. Chapter 5 proposes a two-stage method for speech enhancement. In the first stage, a complex-valued stochastic autoencoder projects complex vectors to a discrete space. 
In the second stage, long-term temporal dependencies are modeled in the discrete space. The autoencoder raises the performance ceiling for state-of-the-art speech enhancement, but the dynamic enhancement model does not outperform other baselines. We discuss areas for improvement and note that the complex Adam optimizer improves training convergence over the naive implementation.
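The abstract above does not say how the naive Adam implementation fails for complex variables. One plausible reading, offered here as an assumption and not as the thesis's derivation, is that the second-moment estimate should use the squared magnitude g * conj(g) in place of the complex square g * g. The function `adam_step` and its `complex_aware` flag are hypothetical names for this sketch.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, complex_aware=True):
    """One Adam update for a complex parameter (illustrative sketch only)."""
    m = beta1 * m + (1.0 - beta1) * grad
    if complex_aware:
        # Assumed fix: second moment as E[|g|^2], a real, non-negative number.
        second = (grad * np.conj(grad)).real
    else:
        # Naive port of real-valued Adam: complex square, which is complex.
        second = grad * grad
    v = beta2 * v + (1.0 - beta2) * second
    m_hat = m / (1.0 - beta1 ** t)  # bias correction, as in real-valued Adam
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

grad = 3.0 + 4.0j
p_aware, _, _ = adam_step(0j, grad, 0j, 0.0, t=1, complex_aware=True)
p_naive, _, _ = adam_step(0j, grad, 0j, 0j, t=1, complex_aware=False)
```

In this toy step the magnitude-based variant moves the parameter along the direction of the gradient, while the naive variant divides the gradient by the square root of its own complex square and so discards the gradient's phase.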
Differentiable WORLD Synthesizer-based Neural Vocoder With Application To End-To-End Audio Style Transfer
In this paper, we propose a differentiable WORLD synthesizer and demonstrate
its use in end-to-end audio style transfer tasks such as (singing) voice
conversion and the DDSP timbre transfer task. Accordingly, our baseline
differentiable synthesizer has no model parameters, yet it yields adequate
synthesis quality. We can extend the baseline synthesizer by appending
lightweight black-box postnets which apply further processing to the baseline
output in order to improve fidelity. An alternative differentiable approach
considers extraction of the source excitation spectrum directly, which can
improve naturalness albeit for a narrower class of style transfer applications.
The acoustic feature parameterization used by our approaches has the added
benefit that it naturally disentangles pitch and timbral information so that
they can be modeled separately. Moreover, as there exists a robust means of
estimating these acoustic features from monophonic audio sources, it allows for
parameter loss terms to be added to an end-to-end objective function, which can
help convergence and/or further stabilize (adversarial) training. Comment: A revised version of this work has been accepted to the 154th AES
Convention; 12 pages, 4 figure
Neural System for Answering Auditory Scene Understanding Questions
This project introduces the task of Acoustic Question Answering (AQA), in which an intelligent agent must answer a question about the content of an auditory scene. First, a dataset (CLEAR) comprising auditory scenes and question-answer pairs for each of them is built to enable the training of neural-network-based systems. Because this task is analogous to Visual Question Answering (VQA), a preliminary study is carried out using a neural network (FiLM) originally developed for VQA. The auditory scenes are first transformed into a spectro-temporal representation so that the FiLM network can process them as images. This study aims to quantify the performance of a system originally designed for visual scenes in an acoustic context. In the same vein, the effectiveness of the visual technique of convolutional coordinate maps (CoordConv) is studied when applied in an acoustic context. Finally, a new neural network adapted to the acoustic context (NAAQA) is introduced.
NAAQA outperforms FiLM on the CLEAR dataset while being approximately 7 times less complex.
Social Influences on Songbird Behavior: From Song Learning to Motion Coordination
Social animals learn during development how to integrate successfully into their group. How do social interactions combine to maintain group cohesion? We first review how social environments can influence the development of vocal learners, such as songbirds and humans (Chapter 1). To bypass the complexity of natural social interactions and gain experimental control, we developed Virtual Social Environments, surrounding the bird with videos of manipulated playbacks. This way we were able to design sensory and social scenarios and test how social zebra finches adjust their behavior (Chapters 2 & 3). A serious challenge is that the color output of a video monitor does not match the color vision of zebra finches. To minimize chromatic distortion, we eliminated all of the colors from the videos, except in the beak and cheeks where we superimposed colors that match the sensitivity of zebra finch photoreceptors (Chapter 2). Birds strongly preferred to watch these manipulated ‘bird appropriate’ videos. We also designed Virtual Social Environments for assessing how observing movement patterns might affect behavior in real-time (Chapter 3). We found that presenting birds with manipulated movement patterns of virtual males promptly affects the mobility of birds watching the videos: birds move more when virtual males increase their movements, and they decrease their movements and ‘cuddle’ next to virtual males that stop moving. These results suggest that individuals adjust their activity levels to the statistical patterns of observed conspecific movements, which can explain zebra finch group cohesion. Finally, we studied the song development process in the absence of social input to determine how intrinsic biases and external stimuli shape song from undifferentiated syllables into well-defined categorical signals of adult song (Chapter 4). Do juveniles learn the statistics of early sub-song to guide vocal development? 
We trained juvenile zebra finches with playbacks of their own, highly variable, developing song and showed that these self-tutored birds developed distinct syllable types (categories) as fast as birds that were trained with a categorical, adult song template. Therefore, the statistical structure of early input seems to have no bearing on the development of phonetic categories. Overall, our results uncover social forces that influence individual behaviors, from motion coordination to vocal development, which have implications for how group structures and vocal culture are maintained
Probabilistic characterization and synthesis of complex driven systems
Thesis (Ph.D.)--Massachusetts Institute of Technology, School of Architecture and Planning, Program in Media Arts and Sciences, 2000. Includes bibliographical references (leaves 194-204). Real-world systems that have characteristic input-output patterns but don't provide access to their internal states are as numerous as they are difficult to model. This dissertation introduces a modeling language for estimating and emulating the behavior of such systems given time series data. As a benchmark test, a digital violin is designed from observing the performance of an instrument. Cluster-weighted modeling (CWM), a mixture density estimator around local models, is presented as a framework for function approximation and for the prediction and characterization of nonlinear time series. The general model architecture and estimation algorithm are presented and extended to system characterization tools such as estimator uncertainty, predictor uncertainty and the correlation dimension of the data set. Furthermore, a real-time implementation, a Hidden-Markov architecture, and function approximation under constraints are derived within the framework. CWM is then applied in the context of different problems and data sets, leading to architectures such as cluster-weighted classification, cluster-weighted estimation, and cluster-weighted sampling. Each application relies on a specific data representation, specific pre- and post-processing algorithms, and a specific hybrid of CWM. The third part of this thesis introduces data-driven modeling of acoustic instruments, a novel technique for audio synthesis. CWM is applied along with new sensor technology and various audio representations to estimate models of violin-family instruments. The approach is demonstrated by synthesizing highly accurate violin sounds given off-line input data as well as cello sounds given real-time input data from a cello player. by Bernd Schoner. Ph.D
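A rough sketch of the "mixture density estimator around local models" idea may help. It assumes one-dimensional inputs, Gaussian clusters in input space, and linear local models whose parameters are already known. The function `cwm_predict` and its argument names are hypothetical, and the fitting procedure (for example expectation-maximization) is not shown.

```python
import numpy as np

def cwm_predict(x, means, variances, priors, slopes, intercepts):
    """Predict y as a responsibility-weighted sum of local linear models.

    Each cluster k has a Gaussian input density N(means[k], variances[k]),
    a prior weight priors[k], and a local model slopes[k] * x + intercepts[k].
    """
    x = np.atleast_1d(np.asarray(x, dtype=float))[:, None]
    lik = np.exp(-0.5 * (x - means) ** 2 / variances) / np.sqrt(2.0 * np.pi * variances)
    resp = lik * priors
    resp = resp / resp.sum(axis=1, keepdims=True)  # posterior weight of each cluster given x
    local = slopes * x + intercepts                # output of each local model at x
    return (resp * local).sum(axis=1)

# Toy parameters: two well-separated clusters with constant local models.
means = np.array([-5.0, 5.0])
variances = np.array([1.0, 1.0])
priors = np.array([0.5, 0.5])
slopes = np.array([0.0, 0.0])
intercepts = np.array([1.0, 3.0])

y = cwm_predict([-5.0, 0.0, 5.0], means, variances, priors, slopes, intercepts)
```

Near a cluster centre the prediction follows that cluster's local model, and between clusters it blends the two smoothly.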
Acoustically Inspired Probabilistic Time-domain Music Transcription and Source Separation.
PhD Thesis. Automatic music transcription (AMT) and source separation are important
computational tasks, which can help to understand, analyse and process music
recordings. The main purpose of AMT is to estimate, from an observed
audio recording, a latent symbolic representation of a piece of music (piano-roll).
In this sense, in AMT the duration and location of every note played is
reconstructed from a mixture recording. The related task of source separation
aims to estimate the latent functions or source signals that were mixed
together in an audio recording. This task requires not only the duration and
location of every event present in the mixture, but also the reconstruction
of the waveform of all the individual sounds. Most methods for AMT and
source separation rely on the magnitude of time-frequency representations
of the analysed recording, i.e., spectrograms, and often arbitrarily discard
phase information. On the one hand, this decreases the time resolution in AMT.
On the other hand, discarding phase information corrupts the reconstruction
in source separation, because the phase of each source-spectrogram must
be approximated. There is thus a need for models that circumvent phase
approximation, while operating at sample-rate resolution.
This thesis intends to solve AMT and source separation together from
a unified perspective. For this purpose, Bayesian non-parametric signal
processing, covariance kernels designed for audio, and scalable variational
inference are integrated to form efficient and acoustically-inspired probabilistic
models. To circumvent phase approximation while keeping sample-rate
resolution, AMT and source separation are addressed from a Bayesian time-domain
viewpoint. That is, the posterior distribution over the waveform of
each sound event in the mixture is computed directly from the observed data.
For this purpose, Gaussian processes (GPs) are used to define priors over the
sources/pitches. GPs are probability distributions over functions, and the
kernel (covariance function) of a GP determines the properties of the functions sampled from
it. Finally, the GP priors and the available data (mixture recording) are
combined using Bayes' theorem in order to compute the posterior distributions
over the sources/pitches.
Although the proposed paradigm is elegant, it introduces two main challenges.
First, as mentioned before, the kernel of the GP priors determines the
properties of each source/pitch function, that is, its smoothness, stationarity,
and, more importantly, its spectrum. Consequently, the proposed model
requires the design of flexible kernels able to learn the rich frequency content
and intricate properties of audio sources. To this end, spectral mixture
(SM) kernels are studied, and the Matérn spectral mixture (MSM) kernel,
a modified version of the SM covariance function, is introduced. The
MSM kernel imposes weaker smoothness, so it is more suitable for
modelling physical processes. Second, the computational complexity of GP
inference scales cubically with the number of audio samples. Therefore, the
application of GP models to large audio signals becomes intractable. To
overcome this limitation, variational inference is used to make the proposed
model scalable and suitable for signals on the order of hundreds of thousands
of data points.
The integration of GP priors, kernels intended for audio, and variational
inference could enable AMT and source separation time-domain methods to
reconstruct sources and transcribe music in an efficient and informed manner.
In addition, AMT and source separation remain open challenges, because
the spectra of the sources/pitches overlap with each other in intricate
ways. Thus, the development of probabilistic models capable of differentiating
sources/pitches in the time domain, despite the high similarity between
their spectra, is a step towards solving source separation
and automatic music transcription. We demonstrate the utility of our
methods using real and synthesized music audio datasets for various types of
musical instruments.
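The time-domain GP idea in the abstract above can be sketched on a toy signal. This example uses the standard spectral mixture kernel with an exact Cholesky solve. It does not use the MSM kernel or the variational scheme from the thesis, neither of which the abstract specifies. The function names and all hyperparameter values are assumptions made for illustration.

```python
import numpy as np

def spectral_mixture_kernel(t1, t2, weights, freqs, variances):
    """Standard 1-D spectral mixture kernel: a sum of Gaussian-windowed cosines."""
    tau = t1[:, None] - t2[None, :]
    k = np.zeros_like(tau)
    for w, mu, v in zip(weights, freqs, variances):
        k += w * np.exp(-2.0 * np.pi ** 2 * tau ** 2 * v) * np.cos(2.0 * np.pi * mu * tau)
    return k

def gp_posterior_mean(t_train, y_train, t_test, noise_var, weights, freqs, variances):
    """Exact GP regression mean; the Cholesky factorization costs O(n^3)."""
    K = spectral_mixture_kernel(t_train, t_train, weights, freqs, variances)
    K = K + noise_var * np.eye(len(t_train))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    K_star = spectral_mixture_kernel(t_test, t_train, weights, freqs, variances)
    return K_star @ alpha

# Toy data: 20 ms of a 440 Hz sinusoid, modelled directly as a waveform.
t = np.linspace(0.0, 0.02, 200)
y = np.sin(2.0 * np.pi * 440.0 * t)
mean = gp_posterior_mean(t, y, t, noise_var=1e-4,
                         weights=[1.0], freqs=[440.0], variances=[1.0])
```

The cubic cost of the factorization is the bottleneck the abstract refers to: it is manageable for a few hundred samples, but not for recordings of hundreds of thousands of samples without an approximation such as variational inference.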
Bayesian methods in music modelling
This thesis presents several hierarchical generative Bayesian models of musical signals designed to improve the accuracy of existing multiple pitch detection systems and other musical signal processing applications whilst remaining feasible for real-time computation. At the lowest level the signal is modelled as a set of overlapping sinusoidal basis functions. The parameters of these basis functions are built into a prior framework based on principles known from musical theory and the physics of musical instruments. The model of a musical note optionally includes phenomena such as frequency and amplitude modulations, damping, volume, timbre and inharmonicity. The occurrence of note onsets in a performance of a piece of music is controlled by an underlying tempo process and the alignment of the timings to the underlying score of the music.
A variety of applications are presented for these models under differing inference constraints. Where full Bayesian inference is possible, reversible-jump Markov Chain Monte Carlo is employed to estimate the number of notes and partial frequency components in each frame of music. We also use approximate techniques such as model selection criteria and variational Bayes methods for inference in situations where computation time is limited or the amount of data to be processed is large. For the higher level score parameters, greedy search and conditional modes algorithms are found to be sufficiently accurate.
We emphasize the links between the models and inference algorithms developed in this thesis and those in existing and parallel work, and demonstrate the effects of making modifications to these models both theoretically and by means of experimental results.
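The lowest level of the note model described above, a sum of sinusoidal partials with damping and inharmonicity, can be pictured with a short deterministic sketch. It is not the thesis's probabilistic model. The function `synth_note`, the stiff-string inharmonicity formula, the exponential decay, and the 1/k partial amplitudes are all assumptions chosen for illustration.

```python
import numpy as np

def synth_note(f0, duration, sample_rate=16000, n_partials=8,
               inharmonicity=1e-4, decay=3.0):
    """Synthesize a damped note as a sum of slightly inharmonic partials."""
    t = np.arange(int(duration * sample_rate)) / sample_rate
    k = np.arange(1, n_partials + 1)
    # Stiff-string approximation: partials are stretched above exact multiples of f0.
    freqs = k * f0 * np.sqrt(1.0 + inharmonicity * k ** 2)
    amplitudes = 1.0 / k  # assumed spectral roll-off, standing in for timbre
    partials = amplitudes[:, None] * np.sin(2.0 * np.pi * freqs[:, None] * t[None, :])
    envelope = np.exp(-decay * t)  # damping
    return envelope * partials.sum(axis=0), freqs

note, partial_freqs = synth_note(220.0, 0.5)
```

In a Bayesian treatment such as the one the abstract describes, quantities like `f0`, the inharmonicity coefficient, and the partial amplitudes would be unknowns with priors, to be inferred from audio instead of being fixed in advance.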