Search CORE

6,200 research outputs found

Learning An Invariant Speech Representation

Author: Evangelopoulos Georgios
Poggio Tomaso
Rosasco Lorenzo
Voinea Stephen
Zhang Chiyuan
Publication venue
Publication date: 01/01/2014
Field of study

Recognition of speech, and in particular the ability to generalize and learn from small sets of labelled examples like humans do, depends on an appropriate representation of the acoustic input. We formulate the problem of finding robust speech features for supervised learning with small sample complexity as a problem of learning representations of the signal that are maximally invariant to intraclass transformations and deformations. We propose an extension of a theory for unsupervised learning of invariant visual representations to the auditory domain and empirically evaluate its validity for voiced speech sound classification. Our version of the theory requires the memory-based, unsupervised storage of acoustic templates -- such as specific phones or words -- together with all the transformations of each that normally occur. A quasi-invariant representation for a speech segment can be obtained by projecting it to each template orbit, i.e., the set of transformed signals, and computing the associated one-dimensional empirical probability distributions. The computations can be performed by modules of filtering and pooling, and extended to hierarchical architectures. In this paper, we apply a single-layer, multicomponent representation for phonemes and demonstrate improved accuracy and decreased sample complexity for vowel classification compared to standard spectral, cepstral and perceptual features.Comment: CBMM Memo No. 022, 5 pages, 2 figure

arXiv.org e-Print Archive

CiteSeerX

DSpace@MIT

A Subband-Based SVM Front-End for Robust ASR

Author: Ager Matthew
Cvetkovic Zoran
Sollich Peter
Yousafzai Jibran
Publication venue
Publication date: 24/12/2013
Field of study

This work proposes a novel support vector machine (SVM) based robust automatic speech recognition (ASR) front-end that operates on an ensemble of the subband components of high-dimensional acoustic waveforms. The key issues of selecting the appropriate SVM kernels for classification in frequency subbands and the combination of individual subband classifiers using ensemble methods are addressed. The proposed front-end is compared with state-of-the-art ASR front-ends in terms of robustness to additive noise and linear filtering. Experiments performed on the TIMIT phoneme classification task demonstrate the benefits of the proposed subband based SVM front-end: it outperforms the standard cepstral front-end in the presence of noise and linear filtering for signal-to-noise ratio (SNR) below 12-dB. A combination of the proposed front-end with a conventional front-end such as MFCC yields further improvements over the individual front ends across the full range of noise levels

arXiv.org e-Print Archive

King's Research Portal

Time-Domain Isolated Phoneme Classification Using Reconstructed Phase Spaces

Author: Indrebo Kevin M
Johnson Michael T
Lindgren Andrew C.
Liu Xiaolin
Povinelli Richard J.
Ye Jinjin
Publication venue: e-Publications@Marquette
Publication date: 01/07/2005
Field of study

This paper introduces a novel time-domain approach to modeling and classifying speech phoneme waveforms. The approach is based on statistical models of reconstructed phase spaces, which offer significant theoretical benefits as representations that are known to be topologically equivalent to the state dynamics of the underlying production system. The lag and dimension parameters of the reconstruction process for speech are examined in detail, comparing common estimation heuristics for these parameters with corresponding maximum likelihood recognition accuracy over the TIMIT data set. Overall accuracies are compared with a Mel-frequency cepstral baseline system across five different phonetic classes within TIMIT, and a composite classifier using both cepstral and phase space features is developed. Results indicate that although the accuracy of the phase space approach by itself is still currently below that of baseline cepstral methods, a combined approach is capable of increasing speaker independent phoneme accuracy

epublications@Marquette

Blind Normalization of Speech From Different Channels

Author: Boll S. F.
Cox R. V.
David N. Levin
Gales M. J. F.
Levin D. N.
Levin D. N.
Levin D. N.
Roweis S. T.
Tenenbaum J. B.
Young S.
Publication venue: 'Acoustical Society of America (ASA)'
Publication date: 02/04/2002
Field of study

We show how to construct a channel-independent representation of speech that has propagated through a noisy reverberant channel. This is done by blindly rescaling the cepstral time series by a non-linear function, with the form of this scale function being determined by previously encountered cepstra from that channel. The rescaled form of the time series is an invariant property of it in the following sense: it is unaffected if the time series is transformed by any time-independent invertible distortion. Because a linear channel with stationary noise and impulse response transforms cepstra in this way, the new technique can be used to remove the channel dependence of a cepstral time series. In experiments, the method achieved greater channel-independence than cepstral mean normalization, and it was comparable to the combination of cepstral mean normalization and spectral subtraction, despite the fact that no measurements of channel noise or reverberations were required (unlike spectral subtraction).Comment: 25 pages, 7 figure

arXiv.org e-Print Archive

Crossref