193 research outputs found

    TimbreTron: A WaveNet(CycleGAN(CQT(Audio))) Pipeline for Musical Timbre Transfer

    In this work, we address the problem of musical timbre transfer, where the goal is to manipulate the timbre of a sound sample from one instrument to match another instrument while preserving other musical content, such as pitch, rhythm, and loudness. In principle, one could apply image-based style transfer techniques to a time-frequency representation of an audio signal, but this depends on having a representation that allows independent manipulation of timbre as well as high-quality waveform generation. We introduce TimbreTron, a method for musical timbre transfer which applies "image" domain style transfer to a time-frequency representation of the audio signal, and then produces a high-quality waveform using a conditional WaveNet synthesizer. We show that the Constant Q Transform (CQT) representation is particularly well-suited to convolutional architectures due to its approximate pitch equivariance. Based on human perceptual evaluations, we confirmed that TimbreTron recognizably transferred the timbre while otherwise preserving the musical content, for both monophonic and polyphonic samples. Comment: 17 pages, published as a conference paper at ICLR 201
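
The pitch-equivariance remark in the abstract above can be illustrated with a small sketch. It assumes the standard geometric spacing of CQT center frequencies; the function names, the 32.70 Hz starting frequency, and the bin counts are illustrative choices and are not taken from the paper. Under that spacing, transposing by a musical interval amounts to a translation along the frequency-bin axis, which is the kind of structure convolutions handle well.

```python
import numpy as np

def cqt_center_frequencies(f_min, n_bins, bins_per_octave):
    """Geometrically spaced center frequencies: f_k = f_min * 2^(k / B)."""
    k = np.arange(n_bins)
    return f_min * 2.0 ** (k / bins_per_octave)

def bin_shift_for_semitones(semitones, bins_per_octave):
    """Number of CQT bins that corresponds to a transposition in semitones."""
    return semitones * bins_per_octave / 12.0

# Hypothetical settings: 7 octaves starting near C1, one bin per semitone.
freqs = cqt_center_frequencies(f_min=32.70, n_bins=84, bins_per_octave=12)

# Transposing up a perfect fifth (7 semitones) scales every frequency by 2^(7/12).
shift = int(bin_shift_for_semitones(7, bins_per_octave=12))
transposed = freqs[:-shift] * 2.0 ** (7 / 12)

# The transposed frequencies coincide with the bins `shift` positions higher.
print(np.allclose(transposed, freqs[shift:]))
```

On a linear-frequency spectrogram the same transposition would stretch the spacing between harmonics instead of shifting them, so no single translation would reproduce it.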

    Complex Neural Networks for Audio

    Audio is represented in two mathematically equivalent ways: the real-valued time domain (i.e., waveform) and the complex-valued frequency domain (i.e., spectrum). There are advantages to the frequency-domain representation, e.g., the human auditory system is known to process sound in the frequency domain. Furthermore, linear time-invariant systems are convolved with sources in the time domain, whereas they may be factorized in the frequency domain. Neural networks have become rather useful when applied to audio tasks such as machine listening and audio synthesis, which are related by their dependencies on high-quality acoustic models. They ideally encapsulate fine-scale temporal structure, such as that encoded in the phase of frequency-domain audio, yet there are no authoritative deep learning methods for complex audio. This manuscript is dedicated to addressing this shortcoming. Chapter 2 motivates complex networks by their affinity with complex-domain audio, while Chapter 3 contributes methods for building and optimizing complex networks. We show that the naive implementation of Adam optimization is incorrect for complex random variables and show that selection of input and output representation has a significant impact on the performance of a complex network. Experimental results with novel complex neural architectures are provided in the second half of this manuscript. Chapter 4 introduces a complex model for binaural audio source localization. We show that, like humans, the complex model can generalize to different anatomical filters, which is important in the context of machine listening. The complex model's performance is better than that of the real-valued models, as well as real- and complex-valued baselines. Chapter 5 proposes a two-stage method for speech enhancement. In the first stage, a complex-valued stochastic autoencoder projects complex vectors to a discrete space. In the second stage, long-term temporal dependencies are modeled in the discrete space. The autoencoder raises the performance ceiling for state-of-the-art speech enhancement, but the dynamic enhancement model does not outperform other baselines. We discuss areas for improvement and note that the complex Adam optimizer improves training convergence over the naive implementation.
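
The abstract above does not say how its complex Adam optimizer is defined, so the following is only a sketch of one commonly used correction, not the thesis's method. It assumes the "naive" variant squares the complex gradient directly, while the corrected variant accumulates the squared magnitude `g * conj(g)`. The function `adam_step` and its `magnitude_second_moment` flag are hypothetical names introduced for illustration.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, magnitude_second_moment=True):
    """One Adam update for a complex parameter (illustrative sketch)."""
    m = beta1 * m + (1 - beta1) * grad
    if magnitude_second_moment:
        # Assumed correction: second moment tracks |g|^2, which is real and >= 0.
        sq = (grad * np.conj(grad)).real
    else:
        # Assumed naive variant: g * g is complex and can be negative.
        sq = grad * grad
    v = beta2 * v + (1 - beta2) * sq
    m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)   # bias-corrected second moment
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# A purely imaginary gradient makes the difference visible in a single step.
grad = 1j
p_fixed, _, v_fixed = adam_step(0j, grad, 0j, 0.0, t=1)
p_naive, _, v_naive = adam_step(0j, grad, 0j, 0.0, t=1,
                                magnitude_second_moment=False)
print(p_fixed, p_naive)
```

With the squared magnitude, the step stays aligned with the gradient (along the imaginary axis). In the naive variant the square root of a negative second moment is itself imaginary, which rotates the step onto the real axis.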

    Differentiable WORLD Synthesizer-based Neural Vocoder With Application To End-To-End Audio Style Transfer

    In this paper, we propose a differentiable WORLD synthesizer and demonstrate its use in end-to-end audio style transfer tasks such as (singing) voice conversion and the DDSP timbre transfer task. Our baseline differentiable synthesizer has no model parameters, yet it yields adequate synthesis quality. We can extend the baseline synthesizer by appending lightweight black-box postnets which apply further processing to the baseline output in order to improve fidelity. An alternative differentiable approach considers extraction of the source excitation spectrum directly, which can improve naturalness albeit for a narrower class of style transfer applications. The acoustic feature parameterization used by our approaches has the added benefit that it naturally disentangles pitch and timbral information so that they can be modeled separately. Moreover, as there exists a robust means of estimating these acoustic features from monophonic audio sources, it allows for parameter loss terms to be added to an end-to-end objective function, which can help convergence and/or further stabilize (adversarial) training. Comment: A revised version of this work has been accepted to the 154th AES Convention; 12 pages, 4 figures

    Système neuronal pour réponses à des questions de compréhension de scène auditives [Neural System for Answering Auditory Scene Understanding Questions]

    This project introduces the task of Acoustic Question Answering (AQA), in which an intelligent agent must answer a question about the content of an auditory scene. First, a dataset (CLEAR) comprising auditory scenes together with question-answer pairs for each of them is built to enable the training of neural-network-based systems. Because this task is analogous to Visual Question Answering (VQA), a preliminary study is carried out using a neural network (FiLM) originally developed for the VQA task. The auditory scenes are first converted to a spectro-temporal representation so that the FiLM network can process them as images. The aim of this study is to quantify the performance of a system originally designed for visual scenes in an acoustic context. Along the same lines, a study is carried out of the effectiveness of the visual technique of convolutional coordinate maps (CoordConv) when applied in an acoustic context. Finally, a new neural network adapted to the acoustic context (NAAQA) is introduced. NAAQA outperforms FiLM on the CLEAR dataset while being about 7 times less complex.

    Social Influences on Songbird Behavior: From Song Learning to Motion Coordination

    Social animals learn during development how to integrate successfully into their group. How do social interactions combine to maintain group cohesion? We first review how social environments can influence the development of vocal learners, such as songbirds and humans (Chapter 1). To bypass the complexity of natural social interactions and gain experimental control, we developed Virtual Social Environments, surrounding the bird with videos of manipulated playbacks. This way we were able to design sensory and social scenarios and test how social zebra finches adjust their behavior (Chapters 2 & 3). A serious challenge is that the color output of a video monitor does not match the color vision of zebra finches. To minimize chromatic distortion, we eliminated all of the colors from the videos, except in the beak and cheeks where we superimposed colors that match the sensitivity of zebra finch photoreceptors (Chapter 2). Birds strongly preferred to watch these manipulated ‘bird appropriate’ videos. We also designed Virtual Social Environments for assessing how observing movement patterns might affect behavior in real-time (Chapter 3). We found that presenting birds with manipulated movement patterns of virtual males promptly affects the mobility of birds watching the videos: birds move more when virtual males increase their movements, and they decrease their movements and ‘cuddle’ next to virtual males that stop moving. These results suggest that individuals adjust their activity levels to the statistical patterns of observed conspecific movements, which can explain zebra finch group cohesion. Finally, we studied the song development process in the absence of social input to determine how intrinsic biases and external stimuli shape song from undifferentiated syllables into well-defined categorical signals of adult song (Chapter 4). Do juveniles learn the statistics of early sub-song to guide vocal development? 
We trained juvenile zebra finches with playbacks of their own, highly variable, developing song and showed that these self-tutored birds developed distinct syllable types (categories) as fast as birds that were trained with a categorical, adult song template. Therefore, the statistical structure of early input seems to have no bearing on the development of phonetic categories. Overall, our results uncover social forces that influence individual behaviors, from motion coordination to vocal development, which have implications for how group structures and vocal culture are maintained

    Probabilistic characterization and synthesis of complex driven systems

    Thesis (Ph.D.)--Massachusetts Institute of Technology, School of Architecture and Planning, Program in Media Arts and Sciences, 2000. Includes bibliographical references (leaves 194-204). Real-world systems that have characteristic input-output patterns but don't provide access to their internal states are as numerous as they are difficult to model. This dissertation introduces a modeling language for estimating and emulating the behavior of such systems given time series data. As a benchmark test, a digital violin is designed from observing the performance of an instrument. Cluster-weighted modeling (CWM), a mixture density estimator around local models, is presented as a framework for function approximation and for the prediction and characterization of nonlinear time series. The general model architecture and estimation algorithm are presented and extended to system characterization tools such as estimator uncertainty, predictor uncertainty and the correlation dimension of the data set. Furthermore, a real-time implementation, a Hidden-Markov architecture, and function approximation under constraints are derived within the framework. CWM is then applied in the context of different problems and data sets, leading to architectures such as cluster-weighted classification, cluster-weighted estimation, and cluster-weighted sampling. Each application relies on a specific data representation, specific pre- and post-processing algorithms, and a specific hybrid of CWM. The third part of this thesis introduces data-driven modeling of acoustic instruments, a novel technique for audio synthesis. CWM is applied along with new sensor technology and various audio representations to estimate models of violin-family instruments. The approach is demonstrated by synthesizing highly accurate violin sounds given off-line input data as well as cello sounds given real-time input data from a cello player. By Bernd Schoner. Ph.D.

    Acoustically Inspired Probabilistic Time-domain Music Transcription and Source Separation.

    PhD Thesis. Automatic music transcription (AMT) and source separation are important computational tasks, which can help to understand, analyse and process music recordings. The main purpose of AMT is to estimate, from an observed audio recording, a latent symbolic representation of a piece of music (piano-roll). In this sense, in AMT the duration and location of every note played is reconstructed from a mixture recording. The related task of source separation aims to estimate the latent functions or source signals that were mixed together in an audio recording. This task requires not only the duration and location of every event present in the mixture, but also the reconstruction of the waveform of all the individual sounds. Most methods for AMT and source separation rely on the magnitude of time-frequency representations of the analysed recording, i.e., spectrograms, and often arbitrarily discard phase information. On the one hand, this decreases the time resolution in AMT. On the other hand, discarding phase information corrupts the reconstruction in source separation, because the phase of each source-spectrogram must be approximated. There is thus a need for models that circumvent phase approximation, while operating at sample-rate resolution. This thesis intends to solve AMT and source separation together from a unified perspective. For this purpose, Bayesian non-parametric signal processing, covariance kernels designed for audio, and scalable variational inference are integrated to form efficient and acoustically-inspired probabilistic models. To circumvent phase approximation while keeping sample-rate resolution, AMT and source separation are addressed from a Bayesian time-domain viewpoint. That is, the posterior distribution over the waveform of each sound event in the mixture is computed directly from the observed data. For this purpose, Gaussian processes (GPs) are used to define priors over the sources/pitches. 
GPs are probability distributions over functions, and the kernel (covariance function) of a GP determines the properties of the functions sampled from it. Finally, the GP priors and the available data (mixture recording) are combined using Bayes' theorem in order to compute the posterior distributions over the sources/pitches. Although the proposed paradigm is elegant, it introduces two main challenges. First, as mentioned before, the kernel of the GP priors determines the properties of each source/pitch function, that is, its smoothness, stationarity, and more importantly its spectrum. Consequently, the proposed model requires the design of flexible kernels, able to learn the rich frequency content and intricate properties of audio sources. To this end, spectral mixture (SM) kernels are studied, and the Matérn spectral mixture (MSM) kernel is introduced, i.e. a modified version of the SM covariance function. The MSM kernel imposes weaker smoothness, thus it is more suitable for modelling physical processes. Second, the computational complexity of GP inference scales cubically with the number of audio samples. Therefore, the application of GP models to large audio signals becomes intractable. To overcome this limitation, variational inference is used to make the proposed model scalable and suitable for signals on the order of hundreds of thousands of data points. The integration of GP priors, kernels intended for audio, and variational inference could enable AMT and source separation time-domain methods to reconstruct sources and transcribe music in an efficient and informed manner. In addition, AMT and source separation are current challenges, because the spectra of the sources/pitches overlap with each other in intricate ways. 
Thus, the development of probabilistic models capable of differentiating sources/pitches in the time domain, despite the high similarity between their spectra, opens the possibility to take a step towards solving source separation and automatic music transcription. We demonstrate the utility of our methods using real and synthesized music audio datasets for various types of musical instruments
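
As a rough illustration of the kernel families named in the abstract above, the sketch below compares a spectral mixture (SM) kernel, which has a Gaussian envelope, with a Matérn-style variant. The abstract does not give the MSM parameterization, so the Matérn-1/2 (exponential) envelope used here is an assumption, and the function names and all parameter values are hypothetical.

```python
import numpy as np

def sm_kernel(tau, weights, freqs, variances):
    """Spectral mixture kernel: sum_q w_q exp(-2 pi^2 tau^2 v_q) cos(2 pi f_q tau)."""
    tau = np.asarray(tau, dtype=float)[..., None]
    envelope = np.exp(-2.0 * np.pi ** 2 * tau ** 2 * variances)
    return np.sum(weights * envelope * np.cos(2.0 * np.pi * freqs * tau), axis=-1)

def matern12_sm_kernel(tau, weights, freqs, lengthscales):
    """Assumed Matern-1/2 variant: sum_q w_q exp(-|tau| / l_q) cos(2 pi f_q tau)."""
    tau = np.asarray(tau, dtype=float)[..., None]
    envelope = np.exp(-np.abs(tau) / lengthscales)
    return np.sum(weights * envelope * np.cos(2.0 * np.pi * freqs * tau), axis=-1)

# Hypothetical two-partial "note": fundamental at 440 Hz plus its octave.
weights = np.array([1.0, 0.5])
freqs = np.array([440.0, 880.0])
variances = np.array([25.0, 25.0])      # spectral variances in Hz^2
lengthscales = np.array([0.01, 0.01])   # envelope decay in seconds

t = np.arange(200) / 8000.0             # 200 samples at 8 kHz
tau = t[:, None] - t[None, :]           # pairwise time differences

K_sm = sm_kernel(tau, weights, freqs, variances)
K_msm = matern12_sm_kernel(tau, weights, freqs, lengthscales)
print(K_sm.shape, K_msm.shape)
```

Both matrices are valid covariances because each term is a product of positive semi-definite kernels. Exact GP inference would factorize such an N x N matrix at cubic cost in N, which is the scaling problem the abstract says variational inference is used to address.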