Learning to detect dysarthria from raw speech
Speech classifiers of paralinguistic traits traditionally learn from diverse
hand-crafted low-level features, selecting the information relevant to the
task at hand. We explore an alternative to this selection by learning the
classifier and the feature extraction jointly. Recent work on speech recognition
has shown improved performance over speech features by learning from the
waveform. We extend this approach to paralinguistic classification and propose
a neural network that can learn a filterbank, a normalization factor and a
compression power from the raw speech, jointly with the rest of the
architecture. We apply this model to dysarthria detection from sentence-level
audio recordings. Starting from a strong attention-based baseline on which
mel-filterbanks outperform standard low-level descriptors, we show that
learning the filters or the normalization and compression improves over fixed
features by 10% absolute accuracy. We also observe a gain over OpenSmile
features by jointly learning the feature extraction, the normalization, and the
compression factor with the rest of the architecture. This constitutes a first
attempt at learning all these operations jointly from raw audio for a speech
classification task.
Comment: 5 pages, 3 figures, submitted to ICASSP
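A minimal sketch, in PyTorch, of the kind of learnable front-end the abstract describes: a convolutional filterbank applied to the raw waveform, a per-channel normalization factor, and a compression power, all trained jointly with a downstream classifier. The module names, layer sizes, and the pooling classifier are illustrative assumptions, not the paper's exact architecture.

```python
# Sketch of a jointly learned front-end (filterbank, normalization,
# compression) feeding a simple classifier. Sizes are illustrative.
import torch
import torch.nn as nn


class LearnableFrontEnd(nn.Module):
    def __init__(self, n_filters=40, kernel_size=400, stride=160):
        super().__init__()
        # Learnable filterbank applied directly to the raw waveform.
        self.filters = nn.Conv1d(1, n_filters, kernel_size, stride=stride)
        # Learnable per-channel normalization factor and compression power.
        self.norm = nn.Parameter(torch.ones(n_filters, 1))
        self.power = nn.Parameter(torch.tensor(0.5))

    def forward(self, waveform):                  # waveform: (batch, 1, samples)
        energy = self.filters(waveform) ** 2       # filterbank energies
        energy = energy * self.norm                 # learned normalization
        return energy.clamp(min=1e-6) ** self.power  # learned compression


class DysarthriaClassifier(nn.Module):
    def __init__(self, n_filters=40):
        super().__init__()
        self.frontend = LearnableFrontEnd(n_filters)
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(n_filters, 2)
        )

    def forward(self, waveform):
        return self.classifier(self.frontend(waveform))
```

Because the front-end is an ordinary module, the filterbank, normalization, and compression parameters receive gradients from the classification loss like any other weight.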
Learning weakly supervised multimodal phoneme embeddings
Recent works have explored deep architectures for learning multimodal speech
representations (e.g. audio and images, articulation and audio) in a supervised
way. Here we investigate the role of combining different speech modalities,
i.e. audio and visual information representing lip movements, in a weakly
supervised way using Siamese networks and lexical same-different side
information. In particular, we ask whether one modality can benefit from the
other to provide a richer representation for phone recognition in a weakly
supervised setting. We introduce mono-task and multi-task methods for merging
speech and visual modalities for phone recognition. The mono-task method
applies a Siamese network to the concatenation of the two modalities, while
the multi-task method receives several different combinations of modalities
at train time. We show that multi-task learning
enhances discriminability for visual and multimodal inputs while minimally
impacting auditory inputs. Furthermore, we present a qualitative analysis of
the obtained phone embeddings, and show that cross-modal visual input can
improve the discriminability of phonological features which are visually
discernable (rounding, open/close, labial place of articulation), resulting in
representations that are closer to abstract linguistic features than those
based on audio only.
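A minimal PyTorch sketch of the mono-task setting described above: a Siamese embedding network over concatenated audio and visual (lip) features, trained with weak same/different pair supervision. The feature dimensions, network sizes, and the margin-based cosine loss are illustrative assumptions rather than the paper's exact setup.

```python
# Siamese embedder over concatenated audio + visual features, trained
# with lexical same/different side information (weak supervision).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SiameseEmbedder(nn.Module):
    def __init__(self, audio_dim=40, visual_dim=20, embed_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim + visual_dim, 128), nn.ReLU(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, audio, visual):
        # Mono-task setting: both modalities are concatenated before embedding.
        return self.net(torch.cat([audio, visual], dim=-1))


def same_different_loss(emb_a, emb_b, same, margin=0.5):
    # same: bool tensor of shape (batch,). Pull "same" pairs together and
    # push "different" pairs below a cosine margin (one common Siamese loss).
    cos = F.cosine_similarity(emb_a, emb_b, dim=-1)
    return torch.where(same, 1.0 - cos, F.relu(cos - margin)).mean()
```

The multi-task variant would reuse the same loss while feeding different modality combinations (audio only, visual only, both) through shared or partially shared encoders at train time.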
DNArch: Learning Convolutional Neural Architectures by Backpropagation
We present Differentiable Neural Architectures (DNArch), a method that
jointly learns the weights and the architecture of Convolutional Neural
Networks (CNNs) by backpropagation. In particular, DNArch allows learning (i)
the size of convolutional kernels at each layer, (ii) the number of channels at
each layer, (iii) the position and values of downsampling layers, and (iv) the
depth of the network. To this end, DNArch views neural architectures as
continuous multidimensional entities, and uses learnable differentiable masks
along each dimension to control their size. Unlike existing methods, DNArch is
not limited to a predefined set of possible neural components, but instead it
is able to discover entire CNN architectures across all feasible combinations
of kernel sizes, widths, depths and downsampling. Empirically, DNArch finds
performant CNN architectures for several classification and dense prediction
tasks on sequential and image data. When combined with a loss term that
controls the network complexity, DNArch constrains its search to architectures
that respect a predefined computational budget during training.
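A minimal PyTorch sketch of the core mechanism of a learnable differentiable mask along one architectural dimension (here, channel width only). The exact mask parameterization used by DNArch is not reproduced; this merely illustrates how a soft mask over channel indices makes the effective width a quantity that backpropagation can shrink or grow.

```python
# Toy example: a convolution whose effective number of output channels is
# controlled by a learnable, differentiable cut-off along the channel axis.
import torch
import torch.nn as nn


class MaskedWidthConv(nn.Module):
    def __init__(self, in_ch, max_out_ch, kernel_size=3, temperature=10.0):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, max_out_ch, kernel_size,
                              padding=kernel_size // 2)
        # Learnable cut-off over the channel index; channels beyond it are
        # smoothly suppressed, so the effective width stays differentiable.
        self.width = nn.Parameter(torch.tensor(float(max_out_ch)))
        self.register_buffer("index", torch.arange(max_out_ch).float())
        self.temperature = temperature

    def forward(self, x):                        # x: (batch, in_ch, length)
        mask = torch.sigmoid(self.temperature * (self.width - self.index))
        return self.conv(x) * mask.view(1, -1, 1)
```

A complexity penalty on the mask (e.g. its sum) is what would let a budget-aware loss term pull the learned width down during training.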
Fader Networks: Manipulating Images by Sliding Attributes
This paper introduces a new encoder-decoder architecture that is trained to
reconstruct images by disentangling the salient information of the image and
the values of attributes directly in the latent space. As a result, after
training, our model can generate different realistic versions of an input image
by varying the attribute values. By using continuous attribute values, we can
choose how much a specific attribute is perceivable in the generated image.
This property could allow for applications where users can modify an image
using sliding knobs, like faders on a mixing console, to change the facial
expression of a portrait, or to update the color of some objects. Compared to
the state-of-the-art which mostly relies on training adversarial networks in
pixel space by altering attribute values at train time, our approach results in
much simpler training schemes and nicely scales to multiple attributes. We
present evidence that our model can significantly change the perceived value of
the attributes while preserving the naturalness of images.
Comment: NIPS 2017
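A minimal PyTorch sketch of the idea: an encoder-decoder reconstructs the image from a latent code plus the attribute values, while a discriminator tries to recover the attributes from the latent code alone; training the encoder adversarially against this discriminator pushes the code toward attribute invariance, so attributes can be varied continuously at generation time. Network sizes and layer choices are illustrative assumptions, not the paper's architecture.

```python
# Encoder-decoder with a latent-space attribute discriminator.
import torch
import torch.nn as nn


class FaderNet(nn.Module):
    def __init__(self, latent_dim=256, n_attrs=1):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, latent_dim, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_dim + n_attrs, 32, 4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )
        # Discriminator predicts the attributes from the latent code alone;
        # the encoder is trained to make this prediction fail.
        self.discriminator = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(latent_dim, n_attrs)
        )

    def forward(self, image, attrs):             # attrs: (batch, n_attrs)
        z = self.encoder(image)
        a = attrs.view(attrs.size(0), -1, 1, 1).expand(-1, -1, z.size(2), z.size(3))
        recon = self.decoder(torch.cat([z, a], dim=1))
        attr_pred = self.discriminator(z)
        return recon, attr_pred
```

At inference time, feeding the same latent code with different continuous `attrs` values acts like the sliding fader described above, changing how strongly the attribute appears in the output.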