4,554 research outputs found
Exploiting models intrinsic robustness for noisy speech recognition
Colloque avec actes et comité de lecture. internationale.International audienceWe propose in this paper an original approach to build masks in the framework of missing data recognition. The proposed soft masks are estimated from the models themselves, and not from the test signal as it is usually the case. They represent the intrinsic robustness of model's log-spectral coefficients. The method is validated with cepstral models, on two synthetic and two real-life noises, at different signal-to-noise ratios. We further discuss how such masks can be combined with other signal-based masks and noise compensation techniques
Exploiting Low-dimensional Structures to Enhance DNN Based Acoustic Modeling in Speech Recognition
We propose to model the acoustic space of deep neural network (DNN)
class-conditional posterior probabilities as a union of low-dimensional
subspaces. To that end, the training posteriors are used for dictionary
learning and sparse coding. Sparse representation of the test posteriors using
this dictionary enables projection to the space of training data. Relying on
the fact that the intrinsic dimensions of the posterior subspaces are indeed
very small and the matrix of all posteriors belonging to a class has a very low
rank, we demonstrate how low-dimensional structures enable further enhancement
of the posteriors and rectify the spurious errors due to mismatch conditions.
The enhanced acoustic modeling method leads to improvements in continuous
speech recognition task using hybrid DNN-HMM (hidden Markov model) framework in
both clean and noisy conditions, where upto 15.4% relative reduction in word
error rate (WER) is achieved
Representation of Time-Varying Stimuli by a Network Exhibiting Oscillations on a Faster Time Scale
Sensory processing is associated with gamma frequency oscillations (30–80 Hz) in sensory cortices. This raises the question whether gamma oscillations can be directly involved in the representation of time-varying stimuli, including stimuli whose time scale is longer than a gamma cycle. We are interested in the ability of the system to reliably distinguish different stimuli while being robust to stimulus variations such as uniform time-warp. We address this issue with a dynamical model of spiking neurons and study the response to an asymmetric sawtooth input current over a range of shape parameters. These parameters describe how fast the input current rises and falls in time. Our network consists of inhibitory and excitatory populations that are sufficient for generating oscillations in the gamma range. The oscillations period is about one-third of the stimulus duration. Embedded in this network is a subpopulation of excitatory cells that respond to the sawtooth stimulus and a subpopulation of cells that respond to an onset cue. The intrinsic gamma oscillations generate a temporally sparse code for the external stimuli. In this code, an excitatory cell may fire a single spike during a gamma cycle, depending on its tuning properties and on the temporal structure of the specific input; the identity of the stimulus is coded by the list of excitatory cells that fire during each cycle. We quantify the properties of this representation in a series of simulations and show that the sparseness of the code makes it robust to uniform warping of the time scale. We find that resetting of the oscillation phase at stimulus onset is important for a reliable representation of the stimulus and that there is a tradeoff between the resolution of the neural representation of the stimulus and robustness to time-warp.
Author Summary
Sensory processing of time-varying stimuli, such as speech, is associated with high-frequency oscillatory cortical activity, the functional significance of which is still unknown. One possibility is that the oscillations are part of a stimulus-encoding mechanism. Here, we investigate a computational model of such a mechanism, a spiking neuronal network whose intrinsic oscillations interact with external input (waveforms simulating short speech segments in a single acoustic frequency band) to encode stimuli that extend over a time interval longer than the oscillation's period. The network implements a temporally sparse encoding, whose robustness to time warping and neuronal noise we quantify. To our knowledge, this study is the first to demonstrate that a biophysically plausible model of oscillations occurring in the processing of auditory input may generate a representation of signals that span multiple oscillation cycles.National Science Foundation (DMS-0211505); Burroughs Wellcome Fund; U.S. Air Force Office of Scientific Researc
End-to-end Recurrent Denoising Autoencoder Embeddings for Speaker Identification
Speech 'in-the-wild' is a handicap for speaker recognition systems due to the
variability induced by real-life conditions, such as environmental noise and
emotions in the speaker. Taking advantage of representation learning, on this
paper we aim to design a recurrent denoising autoencoder that extracts robust
speaker embeddings from noisy spectrograms to perform speaker identification.
The end-to-end proposed architecture uses a feedback loop to encode information
regarding the speaker into low-dimensional representations extracted by a
spectrogram denoising autoencoder. We employ data augmentation techniques by
additively corrupting clean speech with real life environmental noise and make
use of a database with real stressed speech. We prove that the joint
optimization of both the denoiser and the speaker identification module
outperforms independent optimization of both modules under stress and noise
distortions as well as hand-crafted features.Comment: 8 pages + 2 of references + 5 of images. Submitted on Monday 20th of
July to Elsevier Signal Processing Short Communication
- …