Learning An Invariant Speech Representation
Recognition of speech, and in particular the ability to generalize and learn
from small sets of labelled examples like humans do, depends on an appropriate
representation of the acoustic input. We formulate the problem of finding
robust speech features for supervised learning with small sample complexity as
a problem of learning representations of the signal that are maximally
invariant to intraclass transformations and deformations. We propose an
extension of a theory for unsupervised learning of invariant visual
representations to the auditory domain and empirically evaluate its validity
for voiced speech sound classification. Our version of the theory requires the
memory-based, unsupervised storage of acoustic templates -- such as specific
phones or words -- together with all the transformations of each that normally
occur. A quasi-invariant representation for a speech segment can be obtained by
projecting it to each template orbit, i.e., the set of transformed signals, and
computing the associated one-dimensional empirical probability distributions.
The computations can be performed by modules of filtering and pooling, and
extended to hierarchical architectures. In this paper, we apply a single-layer,
multicomponent representation for phonemes and demonstrate improved accuracy
and decreased sample complexity for vowel classification compared to standard
spectral, cepstral and perceptual features.
Comment: CBMM Memo No. 022, 5 pages, 2 figures
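As a rough illustration of the filtering-and-pooling computation described above, the sketch below projects a segment onto stored template orbits and pools the projections into per-template histograms. It assumes segments and templates are fixed-length feature vectors; the function name, the normalization, and the histogram parameters are illustrative assumptions rather than details from the paper.

```python
import numpy as np

def invariant_signature(x, template_orbits, n_bins=20, lo=-1.0, hi=1.0):
    """Hypothetical sketch of an orbit-projection signature.

    x               : 1-D feature vector for a speech segment
    template_orbits : list of arrays, each of shape (n_transforms, dim);
                      row g of orbit k is one stored transformation of template t_k
    Returns the concatenation of per-template histograms of the normalized
    dot products <x, g.t_k>, i.e. the one-dimensional empirical
    distributions mentioned in the abstract.
    """
    x = x / (np.linalg.norm(x) + 1e-12)
    signature = []
    for orbit in template_orbits:
        # "filtering": project the segment onto every stored transformation
        orbit = orbit / (np.linalg.norm(orbit, axis=1, keepdims=True) + 1e-12)
        projections = orbit @ x                      # shape (n_transforms,)
        # "pooling": summarize the projections by their empirical distribution
        hist, _ = np.histogram(projections, bins=n_bins, range=(lo, hi), density=True)
        signature.append(hist)
    return np.concatenate(signature)
```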
Topic Identification for Speech without ASR
Modern topic identification (topic ID) systems for speech use automatic
speech recognition (ASR) to produce speech transcripts, and perform supervised
classification on such ASR outputs. However, under resource-limited conditions,
the manually transcribed speech required to develop standard ASR systems can be
severely limited or unavailable. In this paper, we investigate alternative
unsupervised solutions to obtaining tokenizations of speech in terms of a
vocabulary of automatically discovered word-like or phoneme-like units, without
depending on the supervised training of ASR systems. Moreover, using automatic
phoneme-like tokenizations, we demonstrate that a convolutional neural network
based framework for learning spoken document representations provides
competitive performance compared to a standard bag-of-words representation, as
evidenced by comprehensive topic ID evaluations on both single-label and
multi-label classification tasks.
Comment: 5 pages, 2 figures; accepted for publication at Interspeech 201
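For concreteness, here is a hedged sketch of the kind of convolutional document model the abstract alludes to, operating on a sequence of automatically discovered unit IDs. The class name, layer sizes, and filter widths are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class TokenCNNTopicID(nn.Module):
    """Illustrative sketch: a CNN that maps a sequence of automatically
    discovered word-like or phoneme-like unit IDs to topic scores."""

    def __init__(self, vocab_size, n_topics, emb_dim=128, n_filters=256, widths=(3, 4, 5)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, n_filters, w) for w in widths]
        )
        self.classifier = nn.Linear(n_filters * len(widths), n_topics)

    def forward(self, token_ids):                  # (batch, seq_len) of unit IDs
        h = self.embed(token_ids).transpose(1, 2)  # (batch, emb_dim, seq_len)
        # max-pool each convolutional feature map over time -> document vector
        pooled = [conv(h).relu().max(dim=2).values for conv in self.convs]
        doc_vec = torch.cat(pooled, dim=1)
        return self.classifier(doc_vec)            # logits; apply sigmoid for multi-label
```

Max-pooling the convolutional feature maps over time yields a fixed-length document vector regardless of recording length, which is what makes such a model a drop-in alternative to a bag-of-words representation.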
ModDrop: adaptive multi-modal gesture recognition
We present a method for gesture detection and localisation based on
multi-scale and multi-modal deep learning. Each visual modality captures
spatial information at a particular spatial scale (such as motion of the upper
body or a hand), and the whole system operates at three temporal scales. Key to
our technique is a training strategy which exploits: i) careful initialization
of individual modalities; and ii) gradual fusion involving random dropping of
separate channels (dubbed ModDrop) for learning cross-modality correlations
while preserving uniqueness of each modality-specific representation. We
present experiments on the ChaLearn 2014 Looking at People Challenge gesture
recognition track, in which we placed first out of 17 teams. Fusing multiple
modalities at several spatial and temporal scales leads to a significant
increase in recognition rates, allowing the model to compensate for errors of
the individual classifiers as well as noise in the separate channels.
Furthermore, the proposed ModDrop training technique makes the classifier robust
to missing signals in one or several channels, so that it produces meaningful
predictions from any number of available modalities. In addition, we
demonstrate the applicability of the proposed fusion scheme to modalities of
arbitrary nature by experiments on the same dataset augmented with audio.
Comment: 14 pages, 7 figures
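The core of the ModDrop idea, dropping whole modality channels at random during fusion training, can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the function name and drop probability are assumptions.

```python
import torch

def mod_drop(modality_feats, drop_prob=0.3, training=True):
    """Sketch of modality dropping during fusion training.

    modality_feats : list of tensors, one per modality, each (batch, dim_m)
    During training, each modality is independently zeroed per sample with
    probability drop_prob, forcing the fusion layers to learn cross-modality
    correlations that survive missing or noisy channels.
    """
    if not training:
        return modality_feats
    dropped = []
    for feats in modality_feats:
        # keep mask per sample: 1 keeps the whole channel, 0 drops it entirely
        keep = (torch.rand(feats.size(0), 1, device=feats.device) > drop_prob).float()
        dropped.append(feats * keep)
    return dropped
```

At test time nothing is dropped, but because the fusion layers were trained on many subsets of modalities, the classifier can still produce sensible predictions when a channel is genuinely missing.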
Robust ASR using Support Vector Machines
The improved theoretical properties that Support Vector Machines derive from their max-margin training paradigm, compared with other machine learning alternatives, have led us to suggest them as a good technique for robust speech recognition. However, important shortcomings have had to be circumvented, the most important being the normalisation of the time duration of different realisations of the acoustic speech units.
In this paper, we have compared two approaches in noisy environments: first, a hybrid HMM–SVM solution where a fixed number of frames is selected by means of an HMM segmentation and second, a normalisation kernel called Dynamic Time Alignment Kernel (DTAK) first introduced in Shimodaira et al. [Shimodaira, H., Noma, K., Nakai, M., Sagayama, S., 2001. Support vector machine with dynamic time-alignment kernel for speech recognition. In: Proc. Eurospeech, Aalborg, Denmark, pp. 1841–1844] and based on DTW (Dynamic Time Warping). Special attention has been paid to the adaptation of both alternatives to noisy environments, comparing two types of parameterisations and performing suitable feature normalisation operations. The results show that the DTA Kernel provides important advantages over the baseline HMM system in medium to bad noise conditions, also outperforming the results of the hybrid system.
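For readers unfamiliar with DTAK, the following sketch shows a DTW-style dynamic program that accumulates frame similarities rather than distances and normalizes by the path weight, in the spirit of Shimodaira et al. (2001). The symmetric step weights and the inner-product frame similarity are simplifying assumptions, not necessarily the exact formulation compared in the paper.

```python
import numpy as np

def dtak(X, Y, frame_sim=None):
    """Rough sketch of a Dynamic Time Alignment Kernel between two utterances.

    X, Y : sequences of acoustic frames, shapes (n, d) and (m, d)
    The kernel value is the best weight-normalized accumulated frame
    similarity over DTW-style alignment paths.
    """
    if frame_sim is None:
        frame_sim = lambda a, b: float(a @ b)        # inner-product similarity
    n, m = len(X), len(Y)
    G = np.full((n + 1, m + 1), -np.inf)             # accumulated similarity
    W = np.zeros((n + 1, m + 1))                     # accumulated path weight
    G[0, 0], W[0, 0] = 0.0, 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = frame_sim(X[i - 1], Y[j - 1])
            candidates = [
                (G[i - 1, j] + s,         W[i - 1, j] + 1),      # vertical step
                (G[i - 1, j - 1] + 2 * s, W[i - 1, j - 1] + 2),  # diagonal step (weight 2)
                (G[i, j - 1] + s,         W[i, j - 1] + 1),      # horizontal step
            ]
            G[i, j], W[i, j] = max(candidates, key=lambda c: c[0])
    return G[n, m] / W[n, m]                         # length-normalized kernel score
```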
Personalized Acoustic Modeling by Weakly Supervised Multi-Task Deep Learning using Acoustic Tokens Discovered from Unlabeled Data
It is well known that recognizers personalized to each user are much more
effective than user-independent recognizers. With the popularity of smartphones
today, although it is not difficult to collect a large set of audio data for
each user, it is difficult to transcribe it. However, it is now possible to
automatically discover acoustic tokens from unlabeled personal data in an
unsupervised way. We therefore propose a multi-task deep learning framework
called a phoneme-token deep neural network (PTDNN), jointly trained from
unsupervised acoustic tokens discovered from unlabeled data and very limited
transcribed data for personalized acoustic modeling. We term this scenario
"weakly supervised". The underlying intuition is that the high degree of
similarity between the HMM states of acoustic token models and phoneme models
may help them learn from each other in this multi-task learning framework.
Initial experiments performed over a personalized audio data set recorded from
Facebook posts demonstrated that very good improvements can be achieved in both
frame accuracy and word accuracy over popularly-considered baselines such as
fDLR, speaker code and lightly supervised adaptation. This approach complements
existing speaker adaptation approaches and can be used jointly with such
techniques to yield improved results.
Comment: 5 pages, 5 figures, published in IEEE ICASSP 201
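A minimal sketch of the multi-task arrangement the abstract describes, with shared hidden layers feeding one output layer for phoneme states and another for automatically discovered acoustic-token states. The class name, depth, and layer sizes are illustrative assumptions, not the published PTDNN configuration.

```python
import torch
import torch.nn as nn

class MultiTaskAcousticDNN(nn.Module):
    """Sketch of a phoneme-token multi-task DNN for personalized acoustic modeling."""

    def __init__(self, feat_dim, n_phoneme_states, n_token_states, hidden=1024, depth=4):
        super().__init__()
        layers, d = [], feat_dim
        for _ in range(depth):                       # shared hidden layers
            layers += [nn.Linear(d, hidden), nn.ReLU()]
            d = hidden
        self.shared = nn.Sequential(*layers)
        self.phoneme_head = nn.Linear(hidden, n_phoneme_states)  # supervised task
        self.token_head = nn.Linear(hidden, n_token_states)      # weakly supervised task

    def forward(self, frames):                       # frames: (batch, feat_dim)
        h = self.shared(frames)
        return self.phoneme_head(h), self.token_head(h)

# Joint training idea: the small transcribed set contributes a cross-entropy
# loss on the phoneme head, while the much larger set of frames aligned to
# automatically discovered tokens contributes a loss on the token head, so the
# shared layers learn from both sources of supervision.
```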