23 research outputs found
Learning Emotional Representations from Imbalanced Speech Data for Speech Emotion Recognition and Emotional Text-to-Speech
Effective speech emotional representations play a key role in Speech Emotion
Recognition (SER) and Emotional Text-To-Speech (TTS) tasks. However, emotional
speech samples are more difficult and expensive to acquire compared with
Neutral style speech, which causes one issue that most related works
unfortunately neglect: imbalanced datasets. Models might overfit to the
majority Neutral class and fail to produce robust and effective emotional
representations. In this paper, we propose an Emotion Extractor to address this
issue. We use augmentation approaches to train the model and enable it to
extract effective and generalizable emotional representations from imbalanced
datasets. Our empirical results show that (1) for the SER task, the proposed
Emotion Extractor surpasses the state-of-the-art baseline on three imbalanced
datasets; (2) the produced representations from our Emotion Extractor benefit
the TTS model and enable it to synthesize more expressive speech. Comment: Accepted by INTERSPEECH202
Kernel-convoluted Deep Neural Networks with Data Augmentation
The Mixup method (Zhang et al. 2018), which uses linearly interpolated data,
has emerged as an effective data augmentation tool to improve generalization
performance and the robustness to adversarial examples. The motivation is to
curtail undesirable oscillations by its implicit model constraint to behave
linearly at in-between observed data points and promote smoothness. In this
work, we formally investigate this premise, propose a way to explicitly impose
smoothness constraints, and extend it to incorporate implicit model
constraints. First, we derive a new function class composed of
kernel-convoluted models (KCM) where the smoothness constraint is directly
imposed by locally averaging the original functions with a kernel function.
Second, we propose to incorporate the Mixup method into KCM to expand the
domains of smoothness. For both the KCM and the KCM adapted with the Mixup,
we provide a risk analysis under some conditions on the kernels. We
show that the rate of the upper bound of the excess risk is no slower than that of the
original function class. The upper bound of the KCM with the Mixup remains
dominated by that of the KCM if the perturbation of the Mixup vanishes faster
than $1/\sqrt{n}$, where $n$ is the sample size. Using CIFAR-10 and CIFAR-100
datasets, our experiments demonstrate that the KCM with the Mixup outperforms
the Mixup method in terms of generalization and robustness to adversarial
examples.
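The two ingredients named in this abstract can be illustrated with a minimal sketch. It assumes the standard Mixup formulation of Zhang et al. (2018), in which inputs and labels are interpolated with a weight drawn from Beta(alpha, alpha), and it approximates "locally averaging with a kernel function" by a Monte Carlo average under a Gaussian kernel. The function names `mixup_pair` and `kernel_smoothed`, the Gaussian kernel, and the `bandwidth` and `n_samples` parameters are illustrative assumptions, not the paper's implementation.

```python
import random


def mixup_pair(x_i, y_i, x_j, y_j, alpha=1.0, rng=random):
    """Build one Mixup sample from two labelled examples.

    x_* are feature lists and y_* are one-hot label lists.
    The interpolation weight lam is drawn from Beta(alpha, alpha).
    """
    lam = rng.betavariate(alpha, alpha)
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x_mix, y_mix, lam


def kernel_smoothed(f, x, bandwidth=0.1, n_samples=256, rng=random):
    """Monte Carlo estimate of E[f(x + u)] with u ~ N(0, bandwidth^2).

    Assumption: a Gaussian kernel on a scalar input stands in for the
    kernel function; the paper's kernel conditions are not modelled here.
    """
    total = 0.0
    for _ in range(n_samples):
        total += f(x + rng.gauss(0.0, bandwidth))
    return total / n_samples


if __name__ == "__main__":
    rng = random.Random(0)
    x_mix, y_mix, lam = mixup_pair([0.0, 0.0], [1.0, 0.0],
                                   [1.0, 2.0], [0.0, 1.0], rng=rng)
    print("lambda:", lam, "mixed input:", x_mix, "mixed label:", y_mix)

    # A hard step becomes a smooth transition after local averaging.
    step = lambda t: 1.0 if t > 0 else 0.0
    print("smoothed step at 0:", kernel_smoothed(step, 0.0, rng=rng))
```

Averaging a step function this way yields a value near 0.5 at the discontinuity, which is the sense in which local averaging suppresses abrupt oscillations.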
On the benefits of defining vicinal distributions in latent space
The vicinal risk minimization (VRM) principle is an empirical risk
minimization (ERM) variant that replaces Dirac masses with vicinal functions.
There is strong numerical and theoretical evidence showing that VRM outperforms
ERM in terms of generalization if appropriate vicinal functions are chosen.
Mixup Training (MT), a popular choice of vicinal distribution, improves the
generalization performance of models by introducing globally linear behavior in
between training examples. Apart from generalization, recent works have shown
that mixup-trained models are relatively robust to input
perturbations/corruptions and at the same time are calibrated better than their
non-mixup counterparts. In this work, we investigate the benefits of defining
these vicinal distributions, such as mixup, in the latent space of generative models
rather than in input space itself. We propose a new approach, VarMixup
(Variational Mixup), to better sample mixup images by using the latent
manifold underlying the data. Our empirical studies on CIFAR-10, CIFAR-100, and
Tiny-ImageNet demonstrate that models trained by performing mixup in the latent
manifold learned by VAEs are inherently more robust to various input
corruptions/perturbations, are significantly better calibrated, and exhibit
more local-linear loss landscapes. Comment: Accepted at Elsevier Pattern Recognition Letters (2021), Best Paper
Award at CVPR 2021 Workshop on Adversarial Machine Learning in Real-World
Computer Vision (AML-CV), Also accepted at ICLR 2021 Workshops on
Robust-Reliable Machine Learning (Oral) and Generalization beyond the
training distribution (Abstract)
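The difference between mixing in input space and mixing in a latent space can be seen in a toy sketch. It assumes data lying on the unit circle and uses a hand-written encoder and decoder (angle and its inverse) as a stand-in for a learned generative model. The names `encode`, `decode`, `input_mixup`, and `latent_mixup` are hypothetical, and no VAE is trained here, so this is not the VarMixup procedure itself.

```python
import math


def encode(point):
    """Hypothetical encoder: map a point on the unit circle to its angle."""
    x, y = point
    return math.atan2(y, x)


def decode(angle):
    """Hypothetical decoder: map an angle back to a point on the unit circle."""
    return (math.cos(angle), math.sin(angle))


def input_mixup(p, q, lam):
    """Interpolate two points directly in input space."""
    return tuple(lam * a + (1.0 - lam) * b for a, b in zip(p, q))


def latent_mixup(p, q, lam):
    """Interpolate the latent codes, then decode back to input space.

    Assumption: angles are interpolated naively, so wrap-around at +/- pi
    is not handled in this toy example.
    """
    z = lam * encode(p) + (1.0 - lam) * encode(q)
    return decode(z)


def radius(point):
    """Distance from the origin; equals 1 for points on the data manifold."""
    return math.hypot(*point)


p, q = (1.0, 0.0), (0.0, 1.0)
r_input = radius(input_mixup(p, q, 0.5))    # chord midpoint, inside the circle
r_latent = radius(latent_mixup(p, q, 0.5))  # arc midpoint, on the circle
print("input-space mix radius:", r_input)
print("latent-space mix radius:", r_latent)
```

In this toy setting the input-space mixture falls off the data manifold (radius about 0.71), while the latent-space mixture stays on it (radius 1), which is the intuition behind sampling mixup images through the latent manifold.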
Smooth image-to-image translations with latent space interpolations
Multi-domain image-to-image (I2I) translations can transform a source image
according to the style of a target domain. One important, desired
characteristic of these transformations is their graduality, which corresponds
to a smooth change between the source and the target image when their
respective latent-space representations are linearly interpolated. However,
state-of-the-art methods usually perform poorly when evaluated using
inter-domain interpolations, often producing abrupt changes in the appearance
or non-realistic intermediate images. In this paper, we argue that one of the
main reasons behind this problem is the lack of sufficient inter-domain
training data and we propose two different regularization methods to alleviate
this issue: a new shrinkage loss, which compacts the latent space, and a Mixup
data-augmentation strategy, which flattens the style representations between
domains. We also propose a new metric to quantitatively evaluate the degree of
the interpolation smoothness, an aspect which is not sufficiently covered by
the existing I2I translation metrics. Using both our proposed metric and
standard evaluation protocols, we show that our regularization techniques can
improve the state-of-the-art multi-domain I2I translations by a large margin.
Our code will be made publicly available upon the acceptance of this article.