132 research outputs found

    Some applications of a priori knowledge in multi-stream HMM and HMM/ANN based ASR

    Multi-band ASR was largely inspired by the extremely high level of redundancy in the spectral signal representation which can be inferred from Fletcher's product-of-errors rule for human speech perception. Indeed, the main aim of the multi-band approach is to exploit this redundancy in order to overcome the problem of data mismatch (while making no assumptions about noise type) by focusing recognition on sub-bands estimated to contain reliable, or "clean speech like", data. However, multi-band processing also presents the opportunity to introduce a number of other ideas from phonetics, non-linear phonology and auditory processing into the recognition process. In particular: we can weight sub-bands, or sub-band combinations, according to the most likely frequency range of characteristic features for the phoneme whose presence we are testing for; we can allow some degree of asynchrony between sub-bands; and we can preprocess each sub-band according to the kind of acoustic features which we expect to find there. Besides combining sub-band experts, we can also combine multiple full-band experts, where each expert is perhaps suited to extracting complementary sources of speech information, or is robust to different kinds of noise. In this article we present an outline of some of the recent work at IDIAP, and cooperating institutions, in bringing together ideas from different areas of speech science within the framework of multi-stream HMM and HMM/ANN based ASR.
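As a rough illustration of the redundancy argument, Fletcher's product-of-errors rule says that the full-band error rate is approximately the product of the error rates obtained from each sub-band alone, under the assumption that the sub-bands behave as independent channels. The sketch below is not taken from the article: the function name `full_band_error` and the per-band error rates are hypothetical values chosen only to show the arithmetic.

```python
import math


def full_band_error(sub_band_errors):
    """Product-of-errors rule: the full-band error is the product of the
    per-sub-band error rates (assumes independent sub-bands)."""
    for e in sub_band_errors:
        if not 0.0 <= e <= 1.0:
            raise ValueError("each error rate must lie in [0, 1]")
    return math.prod(sub_band_errors)


# Hypothetical error rates for four sub-bands, each heard in isolation.
errors = [0.5, 0.4, 0.6, 0.5]
combined = full_band_error(errors)
print(f"full-band error is roughly {combined:.3f}")
```

Even though every individual band here is wrong at least 40% of the time, the combined error is far lower than the best single band, which is the sense in which the spectral representation is redundant.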

    Neural Models of Subcortical Auditory Processing

    An important feature of the auditory system is its ability to distinguish many simultaneous sound sources. The primary goal of this work was to understand how a robust, preattentive analysis of the auditory scene is accomplished by the subcortical auditory system. Reasonably accurate modelling of the morphology and organisation of the relevant auditory nuclei was seen as being of great importance. The formulation of plausible models and their subsequent simulation was found to be invaluable in elucidating biological processes and in highlighting areas of uncertainty. In the thesis, a review of important aspects of mammalian auditory processing is presented and used as a basis for the subsequent modelling work. For each aspect of auditory processing modelled, psychophysical results are described and existing models reviewed, before the models used here are described and simulated. Auditory processes which are modelled include the peripheral system, and the production of tonotopic maps of the spectral content of complex acoustic stimuli, and of modulation frequency or periodicity. A model of the formation of sequential associations between successive sounds is described, and the model is shown to be capable of emulating a wide range of psychophysical behaviour. The grouping of related spectral components and the development of pitch perception are also investigated. Finally, a critical assessment of the work and ideas for future developments are presented. The principal contributions of this work are the further development of a model for pitch perception and the development of a novel architecture for the sequential association of those groups. In the process of developing these ideas, further insights into subcortical auditory processing were gained, and explanations for a number of puzzling psychophysical characteristics suggested. Royal Naval Engineering College, Manadon, Plymouth

    Data utility modelling for mismatch reduction

    In the "missing data" (MD) approach to noise-robust automatic speech recognition (ASR), speech models are trained on clean data, and during recognition, sections of spectral data dominated by noise are detected and treated as "missing". However, this all-or-nothing hard decision about which data is missing does not accurately reflect the probabilistic nature of missing data detection. Recent work has shown greatly improved performance by the "soft missing data" (SMD) approach, in which the "missing" status of each data value is represented by a continuous probability rather than a 0/1 value. This probability is then used to weight between the different likelihood contributions which the MD model normally assigns to each spectral observation according to its "missing" status. This article presents an analysis which shows that the SMD approach effectively implements a Maximum A-Posteriori (MAP) decoding strategy with missing or uncertain data, subject to the interpretation that the missing/not-missing probabilities are weights for a mixture pdf which models the pdf for each hidden clean data input, after conditioning by the noisy data input, a local noise estimate, and any information which may be available. An important feature of this "soft data" model is that control over the "evidence pdf" can provide a principled framework not only for ignoring unreliable data, but also for focusing attention on more discriminative features, and for data enhancement.
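The weighting step described in this abstract can be sketched for a single spectral channel. The abstract does not give the form of the two likelihood contributions, so the sketch assumes a single Gaussian state model per channel and, for the "missing" case, bounded marginalisation (the clean value is taken to lie between 0 and the noisy observation). The function names and all numeric values are hypothetical.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def gaussian_cdf(x, mean, std):
    """Cumulative distribution of a univariate Gaussian at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def soft_missing_likelihood(x, p_present, mean, std):
    """Blend the 'present' and 'missing' likelihood contributions.

    x         -- observed (noisy) spectral value, must be > 0
    p_present -- probability that x is reliable (1.0 = hard 'present',
                 0.0 = hard 'missing')
    mean, std -- parameters of the assumed clean-speech Gaussian
    """
    if x <= 0.0:
        raise ValueError("x must be positive for bounded marginalisation")
    # Reliable data: evaluate the clean-speech density at the observation.
    present = gaussian_pdf(x, mean, std)
    # Unreliable data (assumed form): average the density over [0, x].
    missing = (gaussian_cdf(x, mean, std) - gaussian_cdf(0.0, mean, std)) / x
    return p_present * present + (1.0 - p_present) * missing


# Hypothetical values: observation 2.0, clean-speech model N(1.0, 1.0).
for p in (0.0, 0.5, 1.0):
    print(p, soft_missing_likelihood(2.0, p, mean=1.0, std=1.0))
```

Setting `p_present` to exactly 0 or 1 recovers the hard missing-data decision, so the soft version is a strict generalisation of it.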

    A Computational Model of Auditory Feature Extraction and Sound Classification

    This thesis introduces a computer model that incorporates responses similar to those found in the cochlea, in sub-cortical auditory processing, and in auditory cortex. The principal aim of this work is to show that this can form the basis for a biologically plausible mechanism of auditory stimulus classification. We will show that this classification is robust to stimulus variation and time compression. In addition, the response of the system is shown to support multiple, concurrent, behaviourally relevant classifications of natural stimuli (speech). The model incorporates transient enhancement, an ensemble of spectro-temporal filters, and a simple measure analogous to the idea of visual salience to produce a quasi-static description of the stimulus suitable either for classification with an analogue artificial neural network or, using appropriate rate coding, a classifier based on artificial spiking neurons. We also show that the spectro-temporal ensemble can be derived from a limited class of 'formative' stimuli, consistent with a developmental interpretation of ensemble formation. In addition, ensembles chosen on information theoretic grounds consist of filters with relatively simple geometries, which is consistent with reports of responses in mammalian thalamus and auditory cortex. A powerful feature of this approach is that the ensemble response, from which salient auditory events are identified, amounts to a stimulus-ensemble-driven method of segmentation which respects the envelope of the stimulus, and leads to a quasi-static representation of auditory events which is suitable for spike rate coding. We also present evidence that the encoded auditory events may form the basis of a representation-of-similarity, or second order isomorphism, which implies a representational space that respects similarity relationships between stimuli, including novel stimuli.

    Acoustical measurements on stages of nine U.S. concert halls
