    The MIT Summit Speech Recognition System: A Progress Report

    Recently, we initiated a project to develop a phonetically-based spoken language understanding system called SUMMIT. In contrast to many past efforts that rely on heuristic rules whose development requires intense knowledge engineering, our approach attempts to express the speech knowledge within a formal framework using well-defined mathematical tools. In our system, features and decision strategies are discovered and trained automatically, using a large body of speech data. This paper describes the system and documents its current performance.

    Phonologically-Informed Speech Coding for Automatic Speech Recognition-based Foreign Language Pronunciation Training

    Automatic speech recognition (ASR) and computer-assisted pronunciation training (CAPT) systems used in foreign-language educational contexts are often not developed with the specific task of second-language acquisition in mind. Systems that are built for this task are often excessively targeted to one native language (L1) or a single phonemic contrast and are therefore burdensome to train. Current algorithms have been shown to provide erroneous feedback to learners and show inconsistencies between human and computer perception. These discrepancies have thus far hindered more extensive application of ASR in educational systems. This thesis reviews computational models of the human perception of American English vowels for use in an educational context, exploring and comparing two types of acoustic representation: a low-dimensionality, linguistically-informed formant representation and more traditional Mel frequency cepstral coefficients (MFCCs). We first compare two algorithms for phoneme classification (support vector machines and long short-term memory recurrent neural networks) trained on American English vowel productions from the TIMIT corpus. We then conduct a perceptual study of non-native English vowel productions perceived by native American English speakers. We compare the results of the computational experiment and the human perception experiment to assess human/model agreement. Dissimilarities between human and model classification are explored. More phonologically-informed audio signal representations should create a more human-aligned, less L1-dependent vowel classification system with higher interpretability that can be further refined with more phonetically and/or phonologically based research. Results show that linguistically-informed speech coding produces results that better align with human classification, supporting use of the proposed coding for ASR-based CAPT.

    Speech Communication

    Contains table of contents for Part IV, table of contents for Section 1 and reports on five research projects. Apple Computer, Inc.; C.J. Lebel Fellowship; National Institutes of Health (Grant T32-NS07040); National Institutes of Health (Grant R01-NS04332); National Institutes of Health (Grant R01-NS21183); National Institutes of Health (Grant P01-NS23734); U.S. Navy / Naval Electronic Systems Command (Contract N00039-85-C-0254); U.S. Navy - Office of Naval Research (Contract N00014-82-K-0727

    Articulatory features for conversational speech recognition


    Physiologically-Motivated Feature Extraction Methods for Speaker Recognition

    Speaker recognition has received a great deal of attention from the speech community, and significant gains in robustness and accuracy have been obtained over the past decade. However, the features used for identification are still primarily representations of overall spectral characteristics, and thus the models are primarily phonetic in nature, differentiating speakers based on overall pronunciation patterns. This creates difficulties in terms of the amount of enrollment data and complexity of the models required to cover the phonetic space, especially in tasks such as identification where enrollment and testing data may not have similar phonetic coverage. This dissertation introduces new features based on vocal source characteristics intended to capture physiological information related to the laryngeal excitation energy of a speaker. These features, including RPCC, GLFCC and TPCC, represent the unique characteristics of speech production not represented in current state-of-the-art speaker identification systems. The proposed features are evaluated through three experimental paradigms: cross-lingual speaker identification, cross song-type avian speaker identification and mono-lingual speaker identification. The experimental results show that the proposed features provide information about speaker characteristics that is significantly different in nature from the phonetically-focused information present in traditional spectral features. The incorporation of the proposed glottal source features offers significant overall improvement to the robustness and accuracy of speaker identification tasks.

    Voice Analysis to Differentiate the Dopaminergic Response in People With Parkinson's Disease

    The human voice offers the widest variety of motor phenomena of any human activity. However, its clinical evaluation in people with movement disorders such as Parkinson's disease (PD) lags behind current knowledge of advanced automatic speech processing methodology. Here, we use deep learning-based speech processing to differentially analyze voice recordings in 14 people with PD before and after dopaminergic medication using personalized Convolutional Recurrent Neural Networks (p-CRNN) and Phone Attribute Codebooks (PAC). p-CRNN yields an accuracy of 82.35% in the binary classification of ON and OFF motor states at a sensitivity/specificity of 0.86/0.78. The PAC-based approach's accuracy was slightly lower at 73.08%, with a sensitivity/specificity of 0.69/0.77, but this method offers easier interpretation and understanding of the computational biomarkers. Both p-CRNN and PAC provide a differentiated view and novel insights into the distinctive components of the speech of persons with PD. Both methods detect voice qualities that are amenable to dopaminergic treatment, including active phonetic and prosodic features. Our findings may pave the way for quantitative measurements of speech in persons with PD.

    Role of Selected Spectral Attributes in the Perception of Synthetic Vowels

    This thesis is an experimental study of the identification and discrimination of vowels, studied using synthetic stimuli. The acoustic attributes of synthetic stimuli vary, which raises the question of how different spectral attributes are linked to the behaviour of the subjects. The spectral attributes used are formants and spectral moments (centre of gravity, standard deviation, skewness and kurtosis). Two types of experiments are used, related to the identification and discrimination of the stimuli, respectively. Discrimination is studied using both attentive procedures, which require a response from the subject, and preattentive procedures, which require no response. Together, the studies offer information about the identification and discrimination of synthetic vowels in 15 different languages. Furthermore, this thesis discusses the role of various spectral attributes in the speech perception processes. The thesis is divided into three studies. The first is based only on attentive methods, whereas the other two concentrate on the relationship between identification and discrimination experiments. The neurophysiological methods (EEG recordings) reveal the role of attention in processing, and are used in discrimination experiments, while the results reveal differences in perceptual processes based on the language, attention and experimental procedure.

    A detection-based pattern recognition framework and its applications

    The objective of this dissertation is to present a detection-based pattern recognition framework and demonstrate its applications in automatic speech recognition and broadcast news video story segmentation. Inspired by studies in modern cognitive psychology and by real-world pattern recognition systems, a detection-based pattern recognition framework is proposed to provide an alternative solution for some complicated pattern recognition problems. The primitive features are first detected and the task-specific knowledge hierarchy is constructed level by level; then a variety of heterogeneous information sources are combined and the high-level context is incorporated as additional information at certain stages. A detection-based framework is a "divide-and-conquer" design paradigm for pattern recognition problems, which decomposes a conceptually difficult problem into many elementary sub-problems that can be handled directly and reliably. Information fusion strategies are employed to integrate the evidence from a lower level to form the evidence at a higher level. This fusion procedure continues until the top level is reached. Generally, a detection-based framework has many advantages: (1) more flexibility in both detector design and fusion strategies, as these two parts can be optimized separately; (2) parallel and distributed computational components in primitive feature detection, since in such a component-based framework any primitive component can be replaced by a new one while other components remain unchanged; (3) incremental information integration; (4) high-level context information as additional information sources, which can be combined with bottom-up processing at any stage. This dissertation presents the basic principles, criteria, and techniques for detector design and hypothesis verification based on statistical detection and decision theory. In addition, evidence fusion strategies were investigated. Several novel detection algorithms and evidence fusion methods were proposed, and their effectiveness was demonstrated in automatic speech recognition and broadcast news video segmentation systems. We believe such a detection-based framework can be employed in more applications in the future. Ph.D. Committee Chair: Lee, Chin-Hui; Committee Member: Clements, Mark; Committee Member: Ghovanloo, Maysam; Committee Member: Romberg, Justin; Committee Member: Yuan, Min