Wavenet based low rate speech coding
Traditional parametric coding of speech facilitates low bit rates but provides
poor reconstruction quality because of the inadequacy of the model used. We
describe how a WaveNet generative speech model can be used to generate high
quality speech from the bit stream of a standard parametric coder operating at
2.4 kb/s. We compare this parametric coder with a waveform coder based on the
same generative model and show that approximating the signal waveform incurs a
large rate penalty. Our experiments confirm the high performance of the WaveNet
based coder and show that the system additionally performs implicit bandwidth
extension and does not significantly impair recognition of the original speaker
by human listeners, even when that speaker was not included in the training of
the generative model.
Comment: 5 pages, 2 figures
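The decoding idea described above can be sketched in miniature. This is an illustrative stand-in, not the paper's system: the tiny random-weight network below merely plays the structural role of a trained WaveNet, generating samples one at a time conditioned on per-frame parametric coder features.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_conditional_net(history, cond, weights):
    """Stand-in for a trained WaveNet: logits over 256 mu-law levels."""
    h = np.tanh(history @ weights["w_hist"] + cond @ weights["w_cond"])
    return h @ weights["w_out"]

def decode(cond_frames, frame_len=80, receptive_field=64, n_levels=256):
    """Autoregressive decoding conditioned on parametric coder features."""
    weights = {  # random weights for illustration only
        "w_hist": rng.standard_normal((receptive_field, 32)) * 0.1,
        "w_cond": rng.standard_normal((cond_frames.shape[1], 32)) * 0.1,
        "w_out": rng.standard_normal((32, n_levels)) * 0.1,
    }
    samples = np.zeros(receptive_field, dtype=int)  # zero-padded history
    out = []
    for cond in cond_frames:            # one feature vector per frame
        for _ in range(frame_len):      # generate frame_len samples
            hist = samples[-receptive_field:] / n_levels - 0.5
            logits = toy_conditional_net(hist, cond, weights)
            p = np.exp(logits - logits.max())
            p /= p.sum()
            s = rng.choice(n_levels, p=p)   # sample the next mu-law level
            samples = np.append(samples, s)
            out.append(s)
    return np.array(out)

wave = decode(rng.standard_normal((3, 4)))  # 3 frames of 4-dim features
```

The key structural point is that the bitstream's decoded features condition every sample, so the generative model, rather than the 2.4 kb/s parameters alone, determines waveform quality.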
A Generative Product-of-Filters Model of Audio
We propose the product-of-filters (PoF) model, a generative model that
decomposes audio spectra as sparse linear combinations of "filters" in the
log-spectral domain. PoF makes similar assumptions to those used in the classic
homomorphic filtering approach to signal processing, but replaces hand-designed
decompositions built of basic signal processing operations with a learned
decomposition based on statistical inference. This paper formulates the PoF
model and derives a mean-field method for posterior inference and a variational
EM algorithm to estimate the model's free parameters. We demonstrate PoF's
potential for audio processing on a bandwidth expansion task, and show that PoF
can serve as an effective unsupervised feature extractor for a speaker
identification task.
Comment: ICLR 2014 conference-track submission. Added link to the source code.
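The core generative assumption can be shown in a few lines. This is a hedged sketch of the decomposition only, not the paper's mean-field inference or variational EM; the dictionary here is random where the real model learns it, and recovery is done by plain least squares rather than posterior inference.

```python
import numpy as np

rng = np.random.default_rng(1)
F, L = 64, 8                      # frequency bins, number of filters
W = rng.standard_normal((F, L))   # filter dictionary (assumed learned)
a_true = np.zeros(L)
a_true[[1, 5]] = [0.8, 1.3]       # sparse activations

log_spectrum = W @ a_true         # linear combination in the log domain
spectrum = np.exp(log_spectrum)   # equivalently, a product of filters

# Recover the activations by least squares in the log domain
a_hat, *_ = np.linalg.lstsq(W, np.log(spectrum), rcond=None)
```

The `np.exp` line is the point of the name: a sparse sum in the log-spectral domain is a product of (exponentiated) filters in the linear domain, mirroring homomorphic filtering with a learned decomposition.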
The Study of Correlation Structures of DNA Sequences: A Critical Review
The study of correlation structure in the primary sequences of DNA is
reviewed. The issues reviewed include: symmetries among 16 base-base
correlation functions, accurate estimation of correlation measures, the
relationship between 1/f and Lorentzian spectra, heterogeneity in DNA
sequences, different modeling strategies of the correlation structure of DNA
sequences, the difference of correlation structure between coding and
non-coding regions (besides the period-3 pattern), and the source of the broad
distribution of domain sizes. Although some of the results remain
controversial, a body of work on this topic constitutes a good starting point
for future studies.
Comment: LaTeX, two figures; postscript is expected to be 46 pages. To appear
in the special issue of Computers & Chemistry (1997).
A Subband-Based SVM Front-End for Robust ASR
This work proposes a novel support vector machine (SVM) based robust
automatic speech recognition (ASR) front-end that operates on an ensemble of
the subband components of high-dimensional acoustic waveforms. The key issues
of selecting the appropriate SVM kernels for classification in frequency
subbands and the combination of individual subband classifiers using ensemble
methods are addressed. The proposed front-end is compared with state-of-the-art
ASR front-ends in terms of robustness to additive noise and linear filtering.
Experiments performed on the TIMIT phoneme classification task demonstrate the
benefits of the proposed subband based SVM front-end: it outperforms the
standard cepstral front-end in the presence of noise and linear filtering for
signal-to-noise ratios (SNR) below 12 dB. A combination of the proposed
front-end with a conventional front-end such as MFCC yields further
improvements over the individual front-ends across the full range of noise
levels.
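The subband-ensemble idea can be sketched on synthetic data. This toy uses invented two-class waveforms (not TIMIT) and a nearest-centroid stand-in for each subband SVM so the example stays dependency-free; the structure — one classifier per frequency subband, combined by majority vote — is the point.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two toy classes that differ in overall energy, so every subband is
# informative; a real front-end would rely on subtler spectral cues.
def make_wave(cls, n=256):
    return (1.0 if cls == 0 else 2.0) * rng.standard_normal(n)

y = np.array([0, 1] * 40)
X = np.array([make_wave(c) for c in y])
spec = np.abs(np.fft.rfft(X, axis=1))      # magnitude spectra
bands = np.array_split(spec, 4, axis=1)    # 4 frequency subbands

train, test = slice(0, 60), slice(60, 80)
centroids = [(b[train][y[train] == 0].mean(axis=0),
              b[train][y[train] == 1].mean(axis=0)) for b in bands]

# Each subband classifier votes; the ensemble takes the majority.
votes = np.array([
    (np.linalg.norm(b[test] - c1, axis=1)   # closer to class-1 centroid?
     < np.linalg.norm(b[test] - c0, axis=1)).astype(int)
    for b, (c0, c1) in zip(bands, centroids)])
pred = (votes.mean(axis=0) > 0.5).astype(int)
acc = (pred == y[test]).mean()
```

The robustness argument in the abstract rests on this structure: narrowband noise or linear filtering corrupts some subbands while leaving others intact, and the vote lets the clean subbands dominate.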
Using a low-bit rate speech enhancement variable post-filter as a speech recognition system pre-filter to improve robustness to GSM speech
The performance of speech recognition systems degrades when they are used to recognize speech that has been transmitted through GSM (Global System for Mobile Communications) voice communication channels (GSM speech). This degradation is mainly due to GSM speech coding and GSM channel noise on speech signals transmitted through the network. This poor recognition of GSM channel speech limits the use of speech recognition applications over GSM networks. If speech recognition technology is to be used widely over GSM networks, the recognition accuracy of GSM channel speech has to be improved. Different channel normalization techniques have been developed in an attempt to improve recognition accuracy of voice-channel-modified speech in general (not specifically for GSM channel speech). These techniques can be classified into three broad categories: model modification, signal pre-processing and feature processing techniques. In this work, as a contribution toward improving the robustness of speech recognition systems to GSM speech, the use of a low-bit rate speech enhancement post-filter as a speech recognition system pre-filter is proposed. This filter is to be used in recognition systems in combination with channel normalization techniques.
Information Loss in the Human Auditory System
From the eardrum to the auditory cortex, where acoustic stimuli are decoded,
there are several stages of auditory processing and transmission where
information may potentially get lost. In this paper, we aim at quantifying the
information loss in the human auditory system by using information theoretic
tools.
To do so, we consider a speech communication model, where words are uttered
and sent through a noisy channel, and then received and processed by a human
listener.
We define a notion of information loss that is related to the human word
recognition rate. To assess the word recognition rate of humans, we conduct a
closed-vocabulary intelligibility test. We derive upper and lower bounds on the
information loss. Simulations reveal that the bounds are tight and we observe
that the information loss in the human auditory system increases as the signal
to noise ratio (SNR) decreases. Our framework also allows us to study whether
humans are optimal in terms of speech perception in a noisy environment.
Towards that end, we derive optimal classifiers and compare the human and
machine performance in terms of information loss and word recognition rate. We
observe a higher information loss and lower word recognition rate for humans
compared to the optimal classifiers. In fact, depending on the SNR, the machine
classifier may outperform humans by as much as 8 dB. This implies that for the
speech-in-stationary-noise setup considered here, the human auditory system is
sub-optimal for recognizing noisy words.
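A Fano-style bound makes the link between word recognition rate and information loss concrete. This is a textbook inequality consistent with the setup above, not necessarily the paper's exact bounds, and the vocabulary size below is hypothetical.

```python
import numpy as np

def binary_entropy(p):
    """h(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def fano_upper_bound(pe, M):
    """Fano's inequality: for M equiprobable words and word error
    probability pe, the equivocation H(W | guess) -- one way to
    quantify information loss -- satisfies
        H(W | guess) <= h(pe) + pe * log2(M - 1)   (bits)."""
    return binary_entropy(pe) + pe * np.log2(M - 1)

M = 50                       # hypothetical closed-vocabulary size
for pe in (0.05, 0.2, 0.5):  # word error rate grows as SNR drops
    print(f"pe={pe:.2f}: information loss <= "
          f"{fano_upper_bound(pe, M):.2f} bits")
```

The bound rises monotonically in the error rate, matching the observation above that information loss grows as SNR decreases; comparing human and optimal-classifier error rates through such bounds is what allows the 8 dB gap to be stated in information-loss terms.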