2,602 research outputs found

    Comparing SPHINX vs. SONIC Italian Children Speech Recognition Systems

    Our previous experience has shown that both CSLR SONIC and CMU SPHINX are versatile and powerful tools for Automatic Speech Recognition (ASR). Encouraged by those results, we compared the two systems on another important ASR challenge: the recognition of children's speech. In this work, SPHINX has been used to build from scratch a recognizer for Italian children's speech, and the results have been compared to those obtained with SONIC, both in previous experiments and in some new ones designed to give uniform experimental conditions across the two systems. This report describes the training process and the evaluation methodology for a speaker-independent phonetic-recognition task. First, we briefly describe the system architectures and their differences; then we analyze the task, the corpus and the techniques adopted to address the recognition problem. The scores of multiple tests in terms of Phonetic Error Rate (PER) and an analysis of the differences between the two systems are given in the final discussion. SONIC turned out to have the best overall performance, obtaining a minimum PER of 12.4% with VTLN and SMAPLR adaptation. SPHINX was the easiest system to train and test, and its performance (PER of 17.2% with comparable adaptations) was only a few percentage points behind that of SONIC.
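
The abstract reports results as Phonetic Error Rate but does not spell out the scoring formula. The sketch below assumes the usual definition: substitutions, deletions and insertions from a Levenshtein alignment, divided by the number of reference phones. The function names and the toy phone sequences are hypothetical and are not taken from the SONIC or SPHINX toolkits.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phone sequences (lists of labels)."""
    prev = list(range(len(hyp) + 1))  # distance from empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from reference prefix to empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phonetic_error_rate(refs, hyps):
    """PER in percent over a test set (assumed definition, see lead-in)."""
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_phones = sum(len(r) for r in refs)
    return 100.0 * total_errors / total_phones


# Toy example: one substitution and one deletion over 8 reference phones.
refs = [["k", "a", "z", "a"], ["m", "a", "r", "e"]]
hyps = [["k", "a", "s", "a"], ["m", "a", "e"]]
print(phonetic_error_rate(refs, hyps))  # 25.0
```

Under this definition, the reported 12.4% and 17.2% would correspond to roughly 12 and 17 phone errors per 100 reference phones.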

    Lexical Access Model for Italian -- Modeling human speech processing: identification of words in running speech toward lexical access based on the detection of landmarks and other acoustic cues to features

    Modelling the process by which a listener derives the words intended by a speaker requires a hypothesis about how lexical items are stored in memory. This work aims to develop a system that imitates humans when identifying words in running speech and thereby provides a framework for better understanding human speech processing. We build a speech recognizer for Italian based on the principles of Stevens' model of Lexical Access, in which words are stored as hierarchical arrangements of distinctive features (Stevens, K. N. (2002). "Toward a model for lexical access based on acoustic landmarks and distinctive features," J. Acoust. Soc. Am., 111(4):1872-1891). Over the past few decades, the Speech Communication Group at the Massachusetts Institute of Technology (MIT) developed a speech recognition system for English based on this approach. Italian will be the first language beyond English to be explored; the extension to another language provides the opportunity to test the hypothesis that words are represented in memory as a set of hierarchically arranged distinctive features, and to reveal which of the underlying mechanisms may be language-independent. This paper also introduces a new Lexical Access corpus, the LaMIT database, created and labeled specifically for this work, which will be provided freely to the speech research community. Future developments will test the hypothesis that specific acoustic discontinuities, called landmarks, that serve as cues to features are language-independent, while other cues may be language-dependent, with powerful implications for understanding how the human brain recognizes speech. Comment: Submitted to Language and Speech, 202

    Rhythmic unit extraction and modelling for automatic language identification

    This paper deals with an approach to Automatic Language Identification based on rhythmic modelling. Besides phonetics and phonotactics, rhythm is one of the most promising features to consider for language identification, even though its extraction and modelling are not straightforward. One of the main problems to address is what to model. In this paper, a rhythm extraction algorithm is described: using a vowel detection algorithm, rhythmic units related to syllables are segmented. Several parameters are extracted (consonantal and vowel duration, cluster complexity) and modelled with a Gaussian Mixture. Experiments are performed on read speech for 7 languages (English, French, German, Italian, Japanese, Mandarin and Spanish), and results reach up to 86 ± 6% correct discrimination between stress-timed, mora-timed and syllable-timed classes of languages, and 67 ± 8% correct language identification on average for the 7 languages with utterances of 21 seconds. These results are discussed and compared with those obtained with a standard acoustic Gaussian mixture modelling approach (88 ± 5% correct identification for the 7-language identification task).
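
The segmentation step described above can be illustrated with a small sketch. It assumes the input is already labeled as consonantal ("C") or vocalic ("V") segments with durations, and it assumes each rhythmic unit is a run of consonants closed by one vowel; the abstract only says the units are related to syllables, so this grouping, the handling of trailing consonants, and all names here are hypothetical.

```python
def rhythmic_units(segments):
    """Group (label, duration) segments into consonant-run + vowel units.

    Returns one feature tuple per unit:
    (total consonant duration, vowel duration, number of consonant segments).
    The last element is a simple stand-in for cluster complexity.
    """
    units = []
    cons_dur, cons_count = 0.0, 0
    for label, dur in segments:
        if label == "C":
            cons_dur += dur
            cons_count += 1
        elif label == "V":
            # A vowel closes the current unit.
            units.append((cons_dur, dur, cons_count))
            cons_dur, cons_count = 0.0, 0
        else:
            raise ValueError(f"unknown segment label: {label!r}")
    if cons_count:
        # Assumption: trailing consonants form a final unit with no vowel.
        units.append((cons_dur, 0.0, cons_count))
    return units


# Durations in milliseconds; the values are invented for illustration.
segments = [("C", 60), ("C", 40), ("V", 120), ("C", 50), ("V", 90), ("C", 70)]
for unit in rhythmic_units(segments):
    print(unit)
```

Each tuple would then serve as one observation for the Gaussian mixture model mentioned in the abstract, with one mixture trained per language or per rhythm class.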

    Adaptation of Hybrid ANN/HMM Models using Linear Hidden Transformations and Conservative Training

    A technique is proposed for the adaptation of automatic speech recognition systems using hybrid models that combine Artificial Neural Networks with Hidden Markov Models. The application of linear transformations not only to the input features, but also to the outputs of the internal layers, is investigated. The motivation is that the outputs of an internal layer represent a projection of the input pattern into a space where it should be easier to learn the classification or transformation expected at the output of the network. A new solution, called Conservative Training, is proposed that compensates for the lack of adaptation samples in certain classes. Supervised adaptation experiments with different corpora and for different adaptation types are described. The results show that the proposed approach always outperforms the use of transformations in the feature space and yields even better results when combined with linear input transformations.