A Generative Product-of-Filters Model of Audio
We propose the product-of-filters (PoF) model, a generative model that
decomposes audio spectra as sparse linear combinations of "filters" in the
log-spectral domain. PoF makes similar assumptions to those used in the classic
homomorphic filtering approach to signal processing, but replaces hand-designed
decompositions built of basic signal processing operations with a learned
decomposition based on statistical inference. This paper formulates the PoF
model and derives a mean-field method for posterior inference and a variational
EM algorithm to estimate the model's free parameters. We demonstrate PoF's
potential for audio processing on a bandwidth expansion task, and show that PoF
can serve as an effective unsupervised feature extractor for a speaker
identification task. Comment: ICLR 2014 conference-track submission. Added link to the source cod
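As a rough illustration of the modeling assumption in this abstract (not the authors' implementation), the sketch below builds a spectrum whose logarithm is a sparse linear combination of filters, which makes the spectrum itself a product of exponentiated filters. The function name `pof_spectrum` and the toy filter values are hypothetical, and the noise model, posterior inference, and variational EM steps are deliberately left out.

```python
import math


def pof_spectrum(filters, activations):
    """Return a spectrum whose log is a weighted sum of log-domain filters.

    filters: list of L filters, each a list of F log-spectral values
             (hypothetical toy data, not learned parameters).
    activations: list of L weights; in a sparse setting most are zero.
    """
    n_bins = len(filters[0])
    log_spec = [0.0] * n_bins
    for filt, weight in zip(filters, activations):
        for f in range(n_bins):
            # Adding in the log domain corresponds to multiplying spectra.
            log_spec[f] += weight * filt[f]
    return [math.exp(value) for value in log_spec]


# Two toy filters over two frequency bins; only the weights change per frame.
toy_filters = [[0.0, math.log(2.0)], [math.log(3.0), 0.0]]
print(pof_spectrum(toy_filters, [1.0, 2.0]))  # roughly [9.0, 2.0]
print(pof_spectrum(toy_filters, [1.0, 0.0]))  # a sparse frame: second filter unused
```

Setting an activation to zero removes that filter's contribution entirely, which is what makes a sparse set of activations a compact description of a frame.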
F0-consistent many-to-many non-parallel voice conversion via conditional autoencoder
Non-parallel many-to-many voice conversion remains an interesting but
challenging speech processing task. Many style-transfer-inspired methods such
as generative adversarial networks (GANs) and variational autoencoders (VAEs)
have been proposed. Recently, AutoVC, a conditional autoencoder (CAE) based
method, achieved state-of-the-art results by disentangling the speaker identity
and speech content using information-constraining bottlenecks, and it achieves
zero-shot conversion by swapping in a different speaker's identity embedding to
synthesize a new voice. However, we found that while speaker identity is
disentangled from speech content, a significant amount of prosodic information,
such as source F0, leaks through the bottleneck, causing target F0 to fluctuate
unnaturally. Furthermore, AutoVC has no control over the converted F0 and is thus
unsuitable for many applications. In this paper, we modify and improve
autoencoder-based voice conversion to disentangle content, F0, and speaker
identity at the same time. Therefore, we can control the F0 contour, generate
speech with F0 consistent with the target speaker, and significantly improve
quality and similarity. We support our improvement through quantitative and
qualitative analysis.
Audio Imputation Using the Non-negative Hidden Markov Model
Missing data in corrupted audio recordings poses a challenging problem for audio signal processing. In this paper we present an approach that allows us to estimate missing values in the time-frequency domain of audio signals. The proposed approach, based on the Non-negative Hidden Markov Model, enables more temporally coherent estimation for the missing data by taking into account both the spectral and temporal information of the audio signal. This approach is able to reconstruct highly corrupted audio signals with large parts of the spectrogram missing. We demonstrate this approach on real-world polyphonic music signals. The initial experimental results show that our approach has advantages over a previous missing data imputation method.
The visual microphone: Passive recovery of sound from video
When sound hits an object, it causes small vibrations of the object's surface. We show how, using only high-speed video of the object, we can extract those minute vibrations and partially recover the sound that produced them, allowing us to turn everyday objects (a glass of water, a potted plant, a box of tissues, or a bag of chips) into visual microphones. We recover sounds from high-speed footage of a variety of objects with different properties, and use both real and simulated data to examine some of the factors that affect our ability to visually recover sound. We evaluate the quality of recovered sounds using intelligibility and SNR metrics and provide input and recovered audio samples for direct comparison. We also explore how to leverage the rolling shutter in regular consumer cameras to recover audio from standard frame-rate videos, and use the spatial resolution of our method to visualize how sound-related vibrations vary over an object's surface, which we can use to recover the vibration modes of an object. Funding: Qatar Computing Research Institute; National Science Foundation (U.S.) (CGV-1111415); National Science Foundation (U.S.). Graduate Research Fellowship (Grant 1122374); Massachusetts Institute of Technology. Department of Mathematics; Microsoft Research (PhD Fellowship)
Universal speech models for speaker independent single channel source separation
Supervised and semi-supervised source separation algorithms based on non-negative matrix factorization have been shown to be quite effective. However, they require isolated training examples of one or more sources, which are often difficult to obtain. This limits the practical applicability of these algorithms. We examine the problem of efficiently utilizing general training data in the absence of specific training examples. Specifically, we propose a method to learn a universal speech model from a general corpus of speech and show how to use this model to separate speech from other sound sources. This model is used in lieu of a speech model trained on speaker-dependent training examples, and thus circumvents the aforementioned problem. Our experimental results show that our method achieves nearly the same performance as when speaker-dependent training examples are used. Furthermore, we show that our method improves performance when training data of the non-speech source is available.
Sound Recognition in Mixtures
In this paper, we describe a method for recognizing sound sources in a mixture. While many audio-based content analysis methods focus on detecting or classifying target sounds in a discriminative manner, we approach this as a regression problem, in which we estimate the relative proportions of sound sources in the given mixture. Using certain source separation ideas, we directly estimate these proportions from the mixture without actually separating the sources. We also introduce a method for learning a transition matrix to temporally constrain the problem. We demonstrate the proposed method on a mixture of five classes of sounds and show that it is quite effective in correctly estimating the relative proportions of the sounds in the mixture.
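The abstract does not say which algorithm is used, so the following is only one possible reading of "estimating proportions without separating the sources": fit non-negative activations of fixed, class-labeled spectral bases to a mixture spectrum using the standard KL-divergence multiplicative update from non-negative matrix factorization, then sum the activations per class. The function `estimate_proportions`, the toy bases, and the class names are hypothetical, and the learned transition matrix mentioned in the abstract is not modeled.

```python
def estimate_proportions(spectrum, bases, labels, n_iter=200, eps=1e-12):
    """Estimate per-class proportions of a non-negative mixture spectrum.

    spectrum: list of F non-negative magnitudes for one frame.
    bases: list of K basis spectra (each of length F, assumed to sum to 1).
    labels: list of K class names, one per basis (hypothetical labels).
    """
    n_bins = len(spectrum)
    n_comp = len(bases)
    h = [1.0] * n_comp  # non-negative activations, one per basis
    for _ in range(n_iter):
        # Current reconstruction of the mixture from all bases.
        approx = [
            sum(bases[k][f] * h[k] for k in range(n_comp)) + eps
            for f in range(n_bins)
        ]
        # KL-divergence multiplicative update; keeps h non-negative.
        for k in range(n_comp):
            num = sum(bases[k][f] * spectrum[f] / approx[f] for f in range(n_bins))
            den = sum(bases[k]) + eps
            h[k] *= num / den
    # Pool activations by class; no source signal is ever reconstructed.
    totals = {}
    for label, weight in zip(labels, h):
        totals[label] = totals.get(label, 0.0) + weight
    grand_total = sum(totals.values()) + eps
    return {label: value / grand_total for label, value in totals.items()}


# Toy mixture: three parts of the first basis plus one part of the second.
toy_bases = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
mixture = [1.5, 1.5, 0.5, 0.5]
print(estimate_proportions(mixture, toy_bases, ["speech", "music"]))
```

Because each basis is normalized to sum to one, an activation can be read as the amount of energy attributed to that basis, which is what makes the per-class sums comparable as proportions.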