Adversarially Trained Autoencoders for Parallel-Data-Free Voice Conversion
We present a method for converting voices between a set of speakers. Our
method is based on training multiple autoencoder paths, where there is a single
speaker-independent encoder and multiple speaker-dependent decoders. The
autoencoders are trained with the addition of an adversarial loss, which is
provided by an auxiliary classifier in order to guide the output of the encoder
to be speaker independent. The training of the model is unsupervised in the
sense that it does not require collecting the same utterances from the speakers
nor does it require time aligning over phonemes. Due to the use of a single
encoder, our method can generalize to converting the voice of out-of-training
speakers to speakers in the training dataset. We present subjective tests
corroborating the performance of our method.
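
The training setup described above can be sketched roughly as follows. This is an illustration only: the abstract does not specify the loss functions, so the mean-squared reconstruction error, the cross-entropy speaker classifier, and the sign-flipped adversarial term are assumptions, and the names `autoencoder_objective`, `convert`, and `adv_weight` are hypothetical.

```python
import math


def mse(x, y):
    """Mean squared error between two equal-length feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def cross_entropy(probs, label):
    """Negative log-probability the classifier assigns to the true speaker."""
    return -math.log(probs[label])


def autoencoder_objective(x, x_hat, speaker_probs, speaker_id, adv_weight=1.0):
    """Objective minimized by the encoder/decoder path (assumed form).

    The reconstruction term keeps the decoded frame close to the input.
    The adversarial term is subtracted, so the objective decreases when
    the auxiliary classifier fails to identify the speaker from the
    encoder output, which pushes the encoding to be speaker independent.
    """
    return mse(x, x_hat) - adv_weight * cross_entropy(speaker_probs, speaker_id)


def convert(frame, encoder, decoders, target_speaker):
    """Encode with the single shared encoder, decode with the target's decoder."""
    return decoders[target_speaker](encoder(frame))
```

Because `convert` only needs the shared encoder on the input side, the source speaker does not have to be one of the keys of `decoders`; only the target speaker does.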
ALCAP: Alignment-Augmented Music Captioner
Music captioning has gained significant attention in the wake of the rising
prominence of streaming media platforms. Traditional approaches often
prioritize either the audio or lyrics aspect of the music, inadvertently
ignoring the intricate interplay between the two. However, a comprehensive
understanding of music necessitates the integration of both these elements. In
this study, we delve into this overlooked realm by introducing a method to
systematically learn multimodal alignment between audio and lyrics through
contrastive learning. This not only recognizes and emphasizes the synergy
between audio and lyrics but also paves the way for models to achieve deeper
cross-modal coherence, thereby producing high-quality captions. We provide both
theoretical and empirical results demonstrating the advantage of the proposed
method, which achieves a new state of the art on two music captioning datasets.
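
As an illustration of contrastive alignment between audio and lyrics, the sketch below computes a symmetric InfoNCE-style loss over a batch of paired embeddings. The abstract only says "contrastive learning", so the InfoNCE form, the cosine similarity, the `temperature` value, and the function names are all assumptions and not the authors' implementation. Embeddings are assumed to be nonzero vectors.

```python
import math


def normalize(v):
    """Scale a nonzero vector to unit length so dot products are cosines."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def info_nce(audio, lyrics, temperature=0.07):
    """Symmetric contrastive loss; audio[i] and lyrics[i] form a matched pair."""
    a = [normalize(v) for v in audio]
    t = [normalize(v) for v in lyrics]
    n = len(a)
    # logits[i][j]: scaled cosine similarity of audio clip i and lyric j.
    logits = [
        [sum(p * q for p, q in zip(a[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def matched_pair_loss(rows):
        # Softmax cross-entropy where the correct class of row i is column i,
        # computed with the log-sum-exp trick for numerical stability.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(s - m) for s in row))
            total += log_z - row[i]
        return total / len(rows)

    columns = [list(c) for c in zip(*logits)]
    # Average the audio-to-lyrics and lyrics-to-audio directions.
    return 0.5 * (matched_pair_loss(logits) + matched_pair_loss(columns))
```

The loss is small when each audio embedding is most similar to its own lyric embedding and grows when the pairs are mismatched, which is the property a contrastive objective relies on to align the two modalities.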
Visual Feature Attribution using Wasserstein GANs
Attributing the pixels of an input image to a certain category is an
important and well-studied problem in computer vision, with applications
ranging from weakly supervised localisation to understanding hidden effects in
the data. In recent years, approaches based on interpreting a previously
trained neural network classifier have become the de facto state-of-the-art and
are commonly used on medical as well as natural image datasets. In this paper,
we discuss a limitation of these approaches which may lead to only a subset of
the category specific features being detected. To address this problem we
develop a novel feature attribution technique based on Wasserstein Generative
Adversarial Networks (WGAN), which does not suffer from this limitation. We
show that our proposed method performs substantially better than the
state-of-the-art for visual attribution on a synthetic dataset and on real 3D
neuroimaging data from patients with mild cognitive impairment (MCI) and
Alzheimer's disease (AD). For AD patients the method produces compellingly
realistic disease effect maps which are very close to the observed effects.
Comment: Accepted to CVPR 201