4,554 research outputs found

Exploiting models intrinsic robustness for noisy speech recognition

Author: Cerisara Christophe
Fohr Dominique
Illina Irina
Mella Odile
Publication venue: HAL CCSD
Publication date: 01/01/2004
Field of study

Conference paper with proceedings and peer review; international audience. We propose in this paper an original approach to building masks in the framework of missing-data recognition. The proposed soft masks are estimated from the models themselves, and not from the test signal, as is usually the case. They represent the intrinsic robustness of the models' log-spectral coefficients. The method is validated with cepstral models on two synthetic and two real-life noises at different signal-to-noise ratios. We further discuss how such masks can be combined with other signal-based masks and with noise compensation techniques.
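The abstract does not give the scoring rule, so the following is only a minimal sketch of how soft masks typically enter recognition, assuming a common soft missing-data formulation rather than the authors' own. Each log-spectral dimension mixes a "reliable" Gaussian density with an "unreliable" term that averages the Gaussian over [floor, x], weighted by a mask value in [0, 1]. The mask values are plain inputs here (the paper estimates them from the models), the function names are hypothetical, `floor` is an assumed lower bound on log-spectral values, and the sketch scores a diagonal Gaussian directly in the log-spectral domain, ignoring the cepstral models used in the paper. Using x as the upper bound relies on additive noise only raising log energy: log(S + N) ≥ log S for N ≥ 0.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def soft_mask_loglik(x, mu, sigma, mask, floor):
    """Log-likelihood of one frame under a diagonal Gaussian with soft masks.

    For each dimension d, mask[d] in [0, 1] weights two hypotheses:
      reliable:   x[d] is clean speech, scored with the Gaussian density;
      unreliable: clean speech lies somewhere in [floor, x[d]], scored with
                  the Gaussian mass on that interval divided by its width
                  (i.e. a uniform prior over the hidden clean value).
    """
    total = 0.0
    for xd, m, s, w in zip(x, mu, sigma, mask):
        reliable = normal_pdf(xd, m, s)
        width = xd - floor
        if width > 0:
            mass = normal_cdf((xd - m) / s) - normal_cdf((floor - m) / s)
            unreliable = mass / width
        else:
            # Empty interval (x at or below the floor): fall back to the density.
            unreliable = reliable
        # Tiny constant guards against log(0) for extreme outliers.
        total += math.log(w * reliable + (1.0 - w) * unreliable + 1e-300)
    return total


# Hypothetical 3-dimensional frame, model means/std-devs and soft mask.
frame = [2.0, 5.0, 1.0]
mu = [1.8, 1.0, 1.2]
sigma = [0.5, 0.5, 0.5]
mask = [1.0, 0.1, 0.9]  # dimension 1 is judged mostly unreliable
print(soft_mask_loglik(frame, mu, sigma, mask, floor=-2.0))
```

With a mask of all ones this reduces to the ordinary Gaussian log-likelihood; lowering a mask value makes a noise-dominated dimension cost far less.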

INRIA a CCSD electronic archive server

Exploiting Low-dimensional Structures to Enhance DNN Based Acoustic Modeling in Speech Recognition

Author: Asaei Afsaneh
Bourlard Herve
Dighe Pranay
Luyet Gil
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 22/01/2016
Field of study

We propose to model the acoustic space of deep neural network (DNN) class-conditional posterior probabilities as a union of low-dimensional subspaces. To that end, the training posteriors are used for dictionary learning and sparse coding. Sparse representation of the test posteriors using this dictionary enables projection onto the space of the training data. Relying on the fact that the intrinsic dimensions of the posterior subspaces are indeed very small and that the matrix of all posteriors belonging to a class has a very low rank, we demonstrate how low-dimensional structures enable further enhancement of the posteriors and rectify the spurious errors due to mismatched conditions. The enhanced acoustic modeling method improves a continuous speech recognition task using the hybrid DNN-HMM (hidden Markov model) framework in both clean and noisy conditions, with up to a 15.4% relative reduction in word error rate (WER).
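As a minimal sketch of the sparse-coding step only, assume a dictionary whose atoms are simply training posteriors (the paper learns its dictionary, which is skipped here) and use ISTA for an L1-regularized least-squares fit (a swapped-in solver; the abstract does not name one). A test posterior is re-expressed as a sparse combination of atoms and then mapped back to a valid probability vector. The function names, the penalty `lam`, and the clip-and-renormalize step are hypothetical choices for illustration.

```python
def reconstruct(D, a):
    """Return sum_k a[k] * D[k] for a dictionary D given as a list of atoms."""
    C = len(D[0])
    out = [0.0] * C
    for coef, atom in zip(a, D):
        for c in range(C):
            out[c] += coef * atom[c]
    return out


def sparse_code_ista(D, y, lam=0.05, n_iter=500):
    """Approximately minimise 0.5*||y - D a||^2 + lam*||a||_1 with ISTA."""
    K = len(D)
    # Step size 1/L, where L bounds the largest eigenvalue of D^T D;
    # the squared Frobenius norm is a safe (if loose) bound.
    L = sum(v * v for atom in D for v in atom)
    step = 1.0 / L
    a = [0.0] * K
    for _ in range(n_iter):
        # Residual D a - y, computed once so every coordinate uses the same gradient.
        resid = [r - t for r, t in zip(reconstruct(D, a), y)]
        for k in range(K):
            grad = sum(dk * rk for dk, rk in zip(D[k], resid))
            z = a[k] - step * grad
            # Soft-thresholding: the proximal step for the L1 penalty.
            a[k] = max(abs(z) - step * lam, 0.0) * (1.0 if z >= 0 else -1.0)
    return a


def enhance_posterior(D, y, lam=0.05):
    """Project a test posterior y onto the span of the training posteriors D."""
    a = sparse_code_ista(D, y, lam)
    recon = [max(v, 0.0) for v in reconstruct(D, a)]  # probabilities are non-negative
    total = sum(recon)
    if total == 0.0:
        return list(y)  # nothing survived the sparsity penalty: keep the input
    return [v / total for v in recon]


# Two hypothetical training posteriors over three classes, and a noisy test frame.
D = [[0.9, 0.1, 0.0], [0.0, 0.1, 0.9]]
y = [0.7, 0.2, 0.1]
print(enhance_posterior(D, y))
```

In this toy case the reconstruction pulls the test frame toward the first atom, which is the intuition behind using training-data structure to correct mismatched posteriors.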

arXiv.org e-Print Archive

Infoscience - École polytechnique fédérale de Lausanne

Crossref

Representation of Time-Varying Stimuli by a Network Exhibiting Oscillations on a Faster Time Scale

Author: A Bragin
A Bruns
A Delorme
A Gunawardana
A Rokem
AL Giraud
BJ Rhodes
C Börgers
C Tallon-Baudry
CM Glaze
CM Gray
DV Buonomano
EM Izhikevich
FE Theunissen
G Buzsáki
G Laurent
GB Christianson
GB Ermentrout
H Bourlard
HK Hartline
I Nelken
J Beshel
J Fritz
JB Kruskal
JJ Hopfield
JM Palva
JP Donoghue
KJ de Jong
KJ Maloney
LC Osborne
M Bastiaansen
M Bazhenov
M Shamir
Maoz Shamir
MS Olufsen
N Brunel
Nancy Kopell
O Ghitza
O Jensen
Oded Ghitza
P Lakatos
P Tass
Peter E. Latham
R Gütig
R Van Rullen
R VanRullen
RC deCharms
RD Traub
RT Canolty
S Furukawa
S Greenberg
S Panzeri
SK Kuffler
SL Hooper
SM Chase
Steven Epstein
T Gruber
V Digilakis
W Maass
Y Loewenstein
ZN Aldworth
Publication venue: Public Library of Science (PLoS)
Publication date: 01/05/2009
Field of study

Sensory processing is associated with gamma frequency oscillations (30–80 Hz) in sensory cortices. This raises the question of whether gamma oscillations can be directly involved in the representation of time-varying stimuli, including stimuli whose time scale is longer than a gamma cycle. We are interested in the ability of the system to reliably distinguish different stimuli while being robust to stimulus variations such as uniform time warping. We address this issue with a dynamical model of spiking neurons and study the response to an asymmetric sawtooth input current over a range of shape parameters. These parameters describe how fast the input current rises and falls in time. Our network consists of inhibitory and excitatory populations that are sufficient for generating oscillations in the gamma range. The oscillation period is about one-third of the stimulus duration. Embedded in this network are a subpopulation of excitatory cells that respond to the sawtooth stimulus and a subpopulation of cells that respond to an onset cue. The intrinsic gamma oscillations generate a temporally sparse code for the external stimuli. In this code, an excitatory cell may fire a single spike during a gamma cycle, depending on its tuning properties and on the temporal structure of the specific input; the identity of the stimulus is coded by the list of excitatory cells that fire during each cycle. We quantify the properties of this representation in a series of simulations and show that the sparseness of the code makes it robust to uniform warping of the time scale. We find that resetting the oscillation phase at stimulus onset is important for a reliable representation of the stimulus and that there is a trade-off between the resolution of the neural representation of the stimulus and robustness to time warping.
Author Summary: Sensory processing of time-varying stimuli, such as speech, is associated with high-frequency oscillatory cortical activity, the functional significance of which is still unknown. One possibility is that the oscillations are part of a stimulus-encoding mechanism. Here, we investigate a computational model of such a mechanism: a spiking neuronal network whose intrinsic oscillations interact with external input (waveforms simulating short speech segments in a single acoustic frequency band) to encode stimuli that extend over a time interval longer than the oscillation period. The network implements a temporally sparse encoding, whose robustness to time warping and neuronal noise we quantify. To our knowledge, this study is the first to demonstrate that a biophysically plausible model of oscillations occurring in the processing of auditory input may generate a representation of signals that span multiple oscillation cycles.
Funding: National Science Foundation (DMS-0211505); Burroughs Wellcome Fund; U.S. Air Force Office of Scientific Research.
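The readout described in this abstract, where the stimulus identity is the list of excitatory cells firing in each gamma cycle, can be sketched as follows. This illustrates only the code format, not the spiking network: spike times are assumed to be measured from stimulus onset (mirroring the phase reset at onset), the 25 ms period (40 Hz, inside the 30–80 Hz gamma range) and all spike lists are made up, and the function names are hypothetical.

```python
def cycle_code(spikes, period, n_cycles):
    """Map spikes to the set of cells that fire in each oscillation cycle.

    spikes: iterable of (cell_id, time) pairs, with time measured from
    stimulus onset, so cycle k covers [k * period, (k + 1) * period).
    Spikes falling outside the first n_cycles cycles are ignored.
    """
    code = [set() for _ in range(n_cycles)]
    for cell, t in spikes:
        k = int(t // period)
        if 0 <= k < n_cycles:
            code[k].add(cell)
    return [frozenset(c) for c in code]


def code_distance(a, b):
    """Count the (cycle, cell) memberships that differ between two codes."""
    return sum(len(x ^ y) for x, y in zip(a, b))


PERIOD_MS = 25.0  # hypothetical 40 Hz gamma cycle
N_CYCLES = 3      # stimulus lasting about three periods (75 ms)

# Hypothetical (cell_id, spike_time_ms) pairs evoked by one stimulus.
stimulus_a = [(1, 10.0), (3, 35.0), (5, 60.0)]
# The same stimulus uniformly time-warped by 10%.
warped_a = [(cell, t * 1.1) for cell, t in stimulus_a]
# A different stimulus recruiting different cells.
stimulus_b = [(2, 12.0), (3, 30.0), (4, 55.0)]

code_a = cycle_code(stimulus_a, PERIOD_MS, N_CYCLES)
print(code_distance(code_a, cycle_code(warped_a, PERIOD_MS, N_CYCLES)))    # → 0
print(code_distance(code_a, cycle_code(stimulus_b, PERIOD_MS, N_CYCLES)))  # → 4
```

With these made-up numbers the 10% warp leaves every spike in its original cycle, so the warped code matches exactly; spikes close to a cycle boundary would not be so robust.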

CiteSeerX

Crossref

Boston University Institutional Repository (OpenBU)

Directory of Open Access Journals

PubMed Central

End-to-end Recurrent Denoising Autoencoder Embeddings for Speaker Identification

Author: Peláez-Moreno Carmen
Rituerto-González Esther
Publication venue
Publication date: 20/07/2020
Field of study

Speech "in the wild" is a handicap for speaker recognition systems due to the variability induced by real-life conditions, such as environmental noise and the speaker's emotions. Taking advantage of representation learning, in this paper we aim to design a recurrent denoising autoencoder that extracts robust speaker embeddings from noisy spectrograms to perform speaker identification. The proposed end-to-end architecture uses a feedback loop to encode information regarding the speaker into low-dimensional representations extracted by a spectrogram denoising autoencoder. We employ data augmentation by additively corrupting clean speech with real-life environmental noise, and we make use of a database of real stressed speech. We show that jointly optimizing the denoiser and the speaker identification module outperforms both independent optimization of the two modules and hand-crafted features under stress and noise distortions.
Comment: 8 pages + 2 of references + 5 of images. Submitted on Monday 20th of July to Elsevier Signal Processing Short Communication
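The abstract does not detail the augmentation recipe. A minimal sketch of additive corruption at a chosen signal-to-noise ratio follows, assuming SNR is defined on mean power and that the noise is tiled or truncated to the length of the clean signal; the function name and both conventions are assumptions, not the authors' pipeline.

```python
import math


def mix_at_snr(clean, noise, snr_db):
    """Add noise to clean speech, scaled so the mixture has the requested SNR.

    SNR_dB = 10 * log10(P_clean / P_noise_scaled), with P the mean power.
    The noise is tiled (repeated) or truncated to the clean signal's length.
    """
    n = len(clean)
    noise = [noise[i % len(noise)] for i in range(n)]
    p_clean = sum(s * s for s in clean) / n
    p_noise = sum(v * v for v in noise) / n
    if p_noise == 0:
        raise ValueError("noise signal has zero power")
    # Gain chosen so that gain^2 * p_noise == p_clean / 10^(snr_db / 10).
    gain = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * v for s, v in zip(clean, noise)]


# Hypothetical signals: a 440 Hz tone at 16 kHz and a short deterministic "noise".
clean = [math.sin(2 * math.pi * 440 * i / 16000) for i in range(160)]
noise = [((i * 7919) % 13) / 13.0 - 0.5 for i in range(50)]
noisy = mix_at_snr(clean, noise, snr_db=5.0)
```

Sweeping `snr_db` over several values for each clean utterance is one straightforward way to build a noisy training set of the kind the abstract mentions.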

arXiv.org e-Print Archive

Universidad Carlos III de Madrid e-Archivo