Some applications of a priori knowledge in multi-stream HMM and HMM/ANN based ASR
Multi-band ASR was largely inspired by the extremely high level of redundancy in the spectral signal representation which can be inferred from Fletcher's product-of-errors rule for human speech perception. Indeed, the main aim of the multi-band approach is to exploit this redundancy in order to overcome the problem of data mismatch (while making no assumptions about noise type) by focusing recognition on sub-bands estimated to contain reliable, or "clean speech like", data. However, multi-band processing also presents the opportunity to introduce a number of other ideas from phonetics, non-linear phonology and auditory processing into the recognition process. In particular, we can weight sub-bands, or sub-band combinations, according to the most likely frequency range of characteristic features for the phoneme whose presence we are testing for; we can allow some degree of asynchrony between sub-bands; and we can preprocess each sub-band according to the kind of acoustic features which we expect to find there. Besides combining sub-band experts, we can also combine multiple full-band experts, where each expert is perhaps suited to extracting complementary sources of speech information, or is robust to different kinds of noise. In this article we present an outline of some of the recent work at IDIAP, and cooperating institutions, in bringing together ideas from different areas of speech science within the framework of multi-stream HMM and HMM/ANN based ASR.
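The sub-band weighting idea in the abstract above can be illustrated with a minimal sketch. The abstract does not state which combination rule is used, so the weighted log-linear (geometric) combination below is an assumption, chosen only because it is a common way to merge expert posteriors. The function names, the `eps` floor and the example values are all hypothetical. `product_of_errors` shows Fletcher's rule in its usual form: the full-band error rate is the product of the per-band error rates.

```python
import math


def product_of_errors(band_errors):
    """Fletcher's product-of-errors rule: the full-band error rate is
    the product of the error rates of the individual sub-bands."""
    return math.prod(band_errors)


def combine_subband_posteriors(posteriors, weights, eps=1e-12):
    """Weighted log-linear combination of per-band phoneme posteriors.

    posteriors[b][k] is the posterior of class k from sub-band expert b.
    weights[b] is a non-negative reliability weight for sub-band b
    (assumed to be supplied by some external reliability estimate).
    Returns a normalised posterior over the classes.
    """
    n_classes = len(posteriors[0])
    # Weighted sum of log posteriors per class; eps avoids log(0).
    log_scores = [
        sum(w * math.log(band[k] + eps) for band, w in zip(posteriors, weights))
        for k in range(n_classes)
    ]
    # Normalise in a numerically stable way (softmax over log scores).
    peak = max(log_scores)
    unnorm = [math.exp(s - peak) for s in log_scores]
    total = sum(unnorm)
    return [u / total for u in unnorm]


if __name__ == "__main__":
    # Two sub-band experts, two phoneme classes (illustrative values only).
    band_posteriors = [[0.7, 0.3], [0.2, 0.8]]
    # Trust the first band more, e.g. because the second is judged noisy.
    print(combine_subband_posteriors(band_posteriors, [0.9, 0.1]))
    print(product_of_errors([0.5, 0.5, 0.2]))
```

Setting a weight to zero removes that sub-band from the decision entirely, which corresponds to the hard case of ignoring a band estimated to be unreliable.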
Neural Models of Subcortical Auditory Processing
An important feature of the auditory system is its ability to distinguish many simultaneous
sound sources. The primary goal of this work was to understand how a robust, preattentive
analysis of the auditory scene is accomplished by the subcortical auditory system.
Reasonably accurate modelling of the morphology and organisation of the relevant auditory
nuclei was seen as being of great importance. The formulation of plausible models and their
subsequent simulation was found to be invaluable in elucidating biological processes and in
highlighting areas of uncertainty.
In the thesis, a review of important aspects of mammalian auditory processing is presented
and used as a basis for the subsequent modelling work. For each aspect of auditory
processing modelled, psychophysical results are described and existing models reviewed,
before the models used here are described and simulated. Auditory processes which are
modelled include the peripheral system, and the production of tonotopic maps of the
spectral content of complex acoustic stimuli, and of modulation frequency or periodicity. A
model of the formation of sequential associations between successive sounds is described,
and the model is shown to be capable of emulating a wide range of psychophysical
behaviour. The grouping of related spectral components and the development of pitch
perception are also investigated. Finally, a critical assessment of the work and ideas for future
developments are presented.
The principal contributions of this work are the further development of a model for pitch
perception and the development of a novel architecture for the sequential association of
those groups. In the process of developing these ideas, further insights into subcortical
auditory processing were gained, and explanations for a number of puzzling psychophysical
characteristics suggested.
Royal Naval Engineering College, Manadon, Plymout
Data utility modelling for mismatch reduction
In the "missing data" (MD) approach to noise-robust automatic speech recognition (ASR), speech models are trained on clean data, and during recognition, sections of spectral data dominated by noise are detected and treated as "missing". However, this all-or-nothing hard decision about which data is missing does not accurately reflect the probabilistic nature of missing data detection. Recent work has shown greatly improved performance by the "soft missing data" (SMD) approach, in which the "missing" status of each data value is represented by a continuous probability rather than a 0/1 value. This probability is then used to weight between the different likelihood contributions which the MD model normally assigns to each spectral observation according to its "missing" status. This article presents an analysis which shows that the SMD approach effectively implements a Maximum A-Posteriori (MAP) decoding strategy with missing or uncertain data, subject to the interpretation that the missing/not-missing probabilities are weights for a mixture pdf which models the pdf for each hidden clean data input, after conditioning by the noisy data input, a local noise estimate, and any information which may be available. An important feature of this "soft data" model is that control over the "evidence pdf" can provide a principled framework not only for ignoring unreliable data, but also for focusing attention on more discriminative features, and for data enhancement.
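The weighting described in the abstract above can be sketched for a single spectral value and a single Gaussian clean-speech model. The abstract does not give the two likelihood contributions explicitly, so the forms used here are assumptions: the "present" term evaluates the clean-speech density at the observed value, and the "missing" term uses bounded marginalisation, averaging the density over clean values between 0 and the observed value. All function and parameter names are hypothetical.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def gaussian_cdf(x, mean, std):
    """Cumulative distribution of a univariate Gaussian at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def soft_missing_likelihood(x, p_present, mean, std):
    """Soft missing data likelihood for one spectral observation x > 0.

    p_present is the probability that x is reliable (not "missing").
    The result interpolates between the two contributions that a hard
    missing data decision would choose between.
    """
    if x <= 0.0:
        raise ValueError("x must be positive for the bounded integral")
    # Reliable data: evaluate the clean-speech density at the observation.
    present_term = gaussian_pdf(x, mean, std)
    # Missing data (assumed form): the clean value lies somewhere in
    # [0, x], so average the density over that interval.
    missing_term = (gaussian_cdf(x, mean, std) - gaussian_cdf(0.0, mean, std)) / x
    return p_present * present_term + (1.0 - p_present) * missing_term


if __name__ == "__main__":
    # Illustrative values only: observation 2.0, model mean 1.0, std 0.5.
    for p in (0.0, 0.5, 1.0):
        print(p, soft_missing_likelihood(2.0, p, 1.0, 0.5))
```

With `p_present` fixed at 0 or 1 the function reduces to the hard missing data decision. Intermediate values give the mixture weighting that the abstract interprets as a mixture pdf over the hidden clean data.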
A Computational Model of Auditory Feature Extraction and Sound Classification
This thesis introduces a computer model that incorporates responses similar to
those found in the cochlea, in sub-cortical auditory processing, and in auditory
cortex. The principal aim of this work is to show that this can form the basis
for a biologically plausible mechanism of auditory stimulus classification. We will
show that this classification is robust to stimulus variation and time compression.
In addition, the response of the system is shown to support multiple, concurrent,
behaviourally relevant classifications of natural stimuli (speech).
The model incorporates transient enhancement, an ensemble of spectro-temporal
filters, and a simple measure analogous to the idea of visual salience
to produce a quasi-static description of the stimulus suitable either for classification
with an analogue artificial neural network or, using appropriate rate coding,
a classifier based on artificial spiking neurons. We also show that the spectro-temporal
ensemble can be derived from a limited class of 'formative' stimuli, consistent
with a developmental interpretation of ensemble formation. In addition,
ensembles chosen on information theoretic grounds consist of filters with relatively
simple geometries, which is consistent with reports of responses in mammalian
thalamus and auditory cortex.
A powerful feature of this approach is that the ensemble response, from
which salient auditory events are identified, amounts to a stimulus-ensemble driven
method of segmentation which respects the envelope of the stimulus, and leads
to a quasi-static representation of auditory events which is suitable for spike rate
coding.
We also present evidence that the encoded auditory events may form the
basis of a representation-of-similarity, or second order isomorphism, which implies
a representational space that respects similarity relationships between stimuli
including novel stimuli.