    Speech Recognition by Indexing and Sequencing

    Recognition by Indexing and Sequencing (RISq) is a general-purpose method for classification of temporal vector sequences that was previously applied to human activity recognition. We have developed an advanced version of RISq and have adapted it to continuous speech recognition, a challenging task most commonly solved today with Hidden Markov Models (HMMs). Despite being very popular, HMMs suffer from some intrinsic problems, such as the need for a very large amount of training data in order to build a model for each class to be recognized. As opposed to HMMs, RISq consists of a non-parametric algorithm that can be trained with as little as one exemplar for each class. This makes RISq more suitable for tasks where only a limited amount of training data is available or where there is significant intra-class variation. Moreover, from a practical standpoint, even the most powerful HMM-based speech recognizers have achieved only modest performance improvements over the last two decades. This has led to renewed interest in non-parametric exemplar-based techniques such as RISq. Because of its ease of training, RISq can adapt extremely fast to a new speaker. In fact, it is possible to train RISq for speakers of different languages or with significantly different accents. The basic RISq methodology consists of a two-step classification algorithm. First, the training samples closest to each input sample are identified and weighted with a parallel algorithm (indexing). Then a maximum weighted bipartite graph matching is found between the input sequence and a training sequence, respecting an additional temporal constraint (sequencing). The adaptation of RISq to speech recognition led to several general-purpose improvements of the basic algorithm. The basic methodology was extended to use dissimilarity measures as well as similarity measures. In order to account for variability in the input data, several exemplars of each class to be recognized are used for training.
Moreover, we have devised a novel algorithm to automatically identify parts of the training sequences (i.e. words). Finally, we have developed a segmentation approach to apply RISq to continuous speech, as well as a post-processing technique to identify the timing of detected words. Experimental evaluation was conducted on the standard TIMIT speech database, as well as on a small corpus recorded in our lab. The performance of RISq was compared with that of Sphinx, a state-of-the-art HMM-based system. Results show that RISq performs at least as well as Sphinx in most of our tests, and outperforms Sphinx in the recognition of a non-native speaker of American English.
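
The two-step scheme described in the abstract (indexing, then sequencing) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not give the weighting function, the exact temporal constraint, or the parallel indexing scheme, so the sketch assumes a Gaussian weight on Euclidean distance, assumes the temporal constraint means that matched pairs must preserve temporal order, and scores one exemplar per class at a time. Under the order-preservation assumption, the constrained maximum-weight bipartite matching reduces to a dynamic program. The names `index_votes`, `sequence_score`, `classify`, `k`, and `sigma` are hypothetical.

```python
import math


def index_votes(input_frames, train_frames, k=2, sigma=1.0):
    """Indexing (assumed form): for each input frame, keep its k nearest
    training frames and weight each by a Gaussian of the Euclidean distance."""
    votes = []
    for x in input_frames:
        dists = sorted((math.dist(x, t), j) for j, t in enumerate(train_frames))
        votes.append({j: math.exp(-(d * d) / (2 * sigma ** 2)) for d, j in dists[:k]})
    return votes


def sequence_score(votes, n_train):
    """Sequencing (assumed form): weight of the best one-to-one matching between
    input and training frames whose matched pairs keep their temporal order."""
    n_in = len(votes)
    best = [[0.0] * (n_train + 1) for _ in range(n_in + 1)]
    for i in range(1, n_in + 1):
        for j in range(1, n_train + 1):
            w = votes[i - 1].get(j - 1, 0.0)  # zero if j-1 was not indexed for frame i-1
            best[i][j] = max(best[i - 1][j],          # leave input frame i-1 unmatched
                             best[i][j - 1],          # leave training frame j-1 unmatched
                             best[i - 1][j - 1] + w)  # match the two frames
    return best[n_in][n_train]


def classify(input_frames, exemplars, k=2, sigma=1.0):
    """Score the input against one exemplar per class and return the best label."""
    scores = {}
    for label, frames in exemplars.items():
        votes = index_votes(input_frames, frames, k, sigma)
        scores[label] = sequence_score(votes, len(frames))
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Toy one-dimensional "feature vectors"; real speech features would be
    # multi-dimensional frames such as cepstral coefficients.
    exemplars = {"up": [(0,), (1,), (2,), (3,)], "down": [(3,), (2,), (1,), (0,)]}
    print(classify([(0.1,), (1.1,), (2.0,), (2.9,)], exemplars))
```

Because each class here is represented by a single exemplar, the sketch also shows why such a method needs no parametric model training: adding a class or a speaker amounts to adding exemplar sequences to the dictionary.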