
    MAP Combination of Multi-Stream HMM or HMM/ANN Experts

    Automatic speech recognition (ASR) performance falls dramatically with the level of mismatch between training and test data. The human ability to recognise speech when a large proportion of frequencies are dominated by noise has inspired the "missing data" and "multi-band" approaches to noise robust ASR. "Missing data" ASR identifies low-SNR spectral data in each data frame and then ignores it. Multi-band ASR trains a separate model for each position of missing data, estimates a reliability weight for each model, then combines model outputs in a weighted sum. A problem with both approaches is that local data reliability estimation is inherently inaccurate; both also assume that all of the training data was clean. In this article we present a model in which adaptive multi-band expert weighting is incorporated naturally into the maximum a posteriori (MAP) decoding process.
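The weighted-sum combination of subband experts that this abstract describes can be sketched as follows. This is a minimal illustration only; the function name, array shapes, and example numbers are ours, not from the paper, and the real systems apply such a combination to HMM state posteriors inside decoding.

```python
import numpy as np

def combine_subband_posteriors(posteriors, weights):
    """Combine per-subband class posteriors with reliability weights.

    posteriors: shape (n_bands, n_classes), one posterior distribution
                per subband expert for the current frame.
    weights:    shape (n_bands,), estimated reliability of each expert.
    Returns the weighted-sum posterior over classes.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                         # normalise the weights
    combined = w @ np.asarray(posteriors)   # weighted sum per class
    return combined / combined.sum()        # renormalise to a distribution

# Example: 2 subband experts, 3 classes; band 2 is noisy, so it gets
# a low reliability weight and barely influences the result.
p = [[0.7, 0.2, 0.1],
     [0.3, 0.4, 0.3]]
print(combine_subband_posteriors(p, [0.9, 0.1]))
```

The weakness the abstract points out is visible here: the result is only as good as the reliability weights, which must be estimated from noisy local evidence.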

    Low cost duration modelling for noise robust speech recognition

    State transition matrices as used in standard HMM decoders have two widely perceived limitations. One is that the implicit geometric state duration distributions which they model do not accurately reflect true duration distributions. The other is that they impose no hard limit on maximum duration, with the result that state transition probabilities often have little influence when combined with acoustic probabilities, which are of a different order of magnitude. Explicit duration models were developed in the past to address the first problem. These were not widely taken up because their performance advantage in clean speech recognition was often not sufficiently great to offset the extra complexity which they introduced. However, duration models have much greater potential when applied to noisy speech recognition. In this paper we present a simple and generic form of explicit duration model and show that this leads to strong performance improvements when applied to connected digit recognition in noise.
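The contrast the abstract draws, between the geometric duration distribution implied by an HMM self-loop and an explicit duration model, can be sketched as follows. The function names and example durations are illustrative only, not taken from the paper.

```python
import numpy as np

def geometric_duration_pmf(self_loop_prob, max_dur):
    """Duration pmf implied by a standard HMM self-loop probability a:
    P(d) = a**(d-1) * (1 - a), for d = 1..max_dur (truncated here)."""
    a = self_loop_prob
    d = np.arange(1, max_dur + 1)
    return a ** (d - 1) * (1 - a)

def explicit_duration_pmf(durations, max_dur):
    """Explicit duration model: a histogram of state durations observed
    in training alignments, with a hard cap at max_dur."""
    counts = np.bincount(durations, minlength=max_dur + 1)[1:max_dur + 1]
    return counts / counts.sum()

# A geometric model always peaks at d = 1, whereas real state durations
# typically peak at some d > 1; the explicit histogram captures this and
# also imposes the hard maximum-duration limit the abstract mentions.
geo = geometric_duration_pmf(0.8, max_dur=10)
emp = explicit_duration_pmf([3, 4, 4, 5, 5, 5, 6, 6, 7], max_dur=10)
print(np.argmax(geo) + 1, np.argmax(emp) + 1)  # peak durations: 1 vs 5
```

In decoding, the log of such an explicit pmf would be added to the path score in place of the accumulated self-loop log-probabilities.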

    Recent advances in the multi-stream HMM/ANN hybrid approach to noise robust ASR

    In this article we review several successful extensions to the standard Hidden-Markov-Model/Artificial Neural Network (HMM/ANN) hybrid, which have recently made important contributions to the field of noise robust automatic speech recognition. The first extension to the standard hybrid was the "multi-band hybrid", in which a separate ANN is trained on each frequency subband, followed by some form of weighted combination of ANN state posterior probability outputs prior to decoding. However, due to the inaccurate assumption of subband independence, this system usually gives degraded performance, except in the case of narrow-band noise. All of the systems which we review overcome this independence assumption and give improved performance in noise, while also improving or not significantly degrading performance with clean speech. The "all-combinations multi-band" hybrid trains a separate ANN for each subband combination. This, however, typically requires a large number of ANNs. The "all-combinations multi-stream" hybrid trains an ANN expert for every combination of just a small number of complementary data streams. Multiple ANN posteriors combination using maximum a posteriori (MAP) weighting gives rise to the further successful strategy of hypothesis level combination by MAP selection. An alternative strategy for exploiting the classification capacity of ANNs is the "tandem hybrid" approach, in which one or more ANN classifiers are trained with multi-condition data to generate discriminative and noise robust features for input to a standard ASR system. The "multi-stream tandem hybrid" trains an ANN for a number of complementary feature streams, permitting multi-stream data fusion. The "narrow-band tandem hybrid" trains an ANN for a number of particularly narrow frequency subbands. This gives improved robustness to noises not seen during training.
    Of the systems presented, all of the multi-stream systems provide generic models for multi-modal data fusion. Test results for each system are presented and discussed.
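The hypothesis-level combination by MAP selection mentioned above can be sketched in miniature as follows. This is our own toy illustration, not code from the paper: the real systems select among full decoding hypotheses using MAP-weighted stream posteriors, whereas here each "hypothesis" is just a class label for a single frame.

```python
import numpy as np

def map_select(stream_posteriors):
    """Hypothesis-level combination by MAP selection (toy version).

    Each stream expert proposes the class it finds most probable; the
    winning hypothesis is the one backed by the highest posterior of
    any expert.

    stream_posteriors: shape (n_streams, n_classes).
    Returns the selected class index.
    """
    p = np.asarray(stream_posteriors)
    best_stream = np.unravel_index(np.argmax(p), p.shape)[0]
    return int(np.argmax(p[best_stream]))

# Example: stream 2 is very confident about class 2, so its hypothesis
# wins over stream 1's weaker preference for class 0.
p = [[0.5, 0.3, 0.2],
     [0.1, 0.1, 0.8]]
print(map_select(p))  # selects class 2
```

The design choice here, selection rather than averaging, means a single confident expert can override several uncertain ones, which is the point of the hypothesis-level strategy the abstract describes.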

    Activity Report 2002


    Articulatory features for conversational speech recognition
