162 research outputs found

    ARTICULATORY INFORMATION FOR ROBUST SPEECH RECOGNITION

    Get PDF
    Current Automatic Speech Recognition (ASR) systems fail to perform nearly as good as human speech recognition performance due to their lack of robustness against speech variability and noise contamination. The goal of this dissertation is to investigate these critical robustness issues, put forth different ways to address them and finally present an ASR architecture based upon these robustness criteria. Acoustic variations adversely affect the performance of current phone-based ASR systems, in which speech is modeled as `beads-on-a-string', where the beads are the individual phone units. While phone units are distinctive in cognitive domain, they are varying in the physical domain and their variation occurs due to a combination of factors including speech style, speaking rate etc.; a phenomenon commonly known as `coarticulation'. Traditional ASR systems address such coarticulatory variations by using contextualized phone-units such as triphones. Articulatory phonology accounts for coarticulatory variations by modeling speech as a constellation of constricting actions known as articulatory gestures. In such a framework, speech variations such as coarticulation and lenition are accounted for by gestural overlap in time and gestural reduction in space. To realize a gesture-based ASR system, articulatory gestures have to be inferred from the acoustic signal. At the initial stage of this research an initial study was performed using synthetically generated speech to obtain a proof-of-concept that articulatory gestures can indeed be recognized from the speech signal. It was observed that having vocal tract constriction trajectories (TVs) as intermediate representation facilitated the gesture recognition task from the speech signal. Presently no natural speech database contains articulatory gesture annotation; hence an automated iterative time-warping architecture is proposed that can annotate any natural speech database with articulatory gestures and TVs. Two natural speech databases: X-ray microbeam and Aurora-2 were annotated, where the former was used to train a TV-estimator and the latter was used to train a Dynamic Bayesian Network (DBN) based ASR architecture. The DBN architecture used two sets of observation: (a) acoustic features in the form of mel-frequency cepstral coefficients (MFCCs) and (b) TVs (estimated from the acoustic speech signal). In this setup the articulatory gestures were modeled as hidden random variables, hence eliminating the necessity for explicit gesture recognition. Word recognition results using the DBN architecture indicate that articulatory representations not only can help to account for coarticulatory variations but can also significantly improve the noise robustness of ASR system

    Phonetically transparent technique for the automatic transcription of speech

    Get PDF

    Speech Recognition

    Get PDF
    Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation for speech signals and the methods for speech-features extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems and in other speech processing applications that are able to operate in real-world environments, like mobile communication services and smart homes

    An automated lexical stress classification tool for assessing dysprosody in childhood apraxia of speech

    Get PDF
    Childhood apraxia of speech (CAS) commonly affects the production of lexical stress contrast in polysyllabic words. Automated classification tools have the potential to increase reliability and efficiency in measuring lexical stress. Here, factors affecting the accuracy of a custom-built deep neural network (DNN)-based classification tool are evaluated. Sixteen children with typical development (TD) and 26 with CAS produced 50 polysyllabic words. Words with strong–weak (SW, e.g., dinosaur) or WS (e.g., banana) stress were fed to the classification tool, and the accuracy measured (a) against expert judgment, (b) for speaker group, and (c) with/without prior knowledge of phonemic errors in the sample. The influence of segmental features and participant factors on tool accuracy was analysed. Linear mixed modelling showed significant interaction between group and stress type, surviving adjustment for age and CAS severity. For TD, agreement for SW and WS words was >80%, but CAS speech was higher for SW (>80%) than WS (~60%). Prior knowledge of segmental errors conferred no clear advantage. Automatic lexical stress classification shows promise for identifying errors in children’s speech at diagnosis or with treatment-related change, but accuracy for WS words in apraxic speech needs improvement. Further training of algorithms using larger sets of labelled data containing impaired speech and WS words may increase accuracy

    Phonologically-Informed Speech Coding for Automatic Speech Recognition-based Foreign Language Pronunciation Training

    Full text link
    Automatic speech recognition (ASR) and computer-assisted pronunciation training (CAPT) systems used in foreign-language educational contexts are often not developed with the specific task of second-language acquisition in mind. Systems that are built for this task are often excessively targeted to one native language (L1) or a single phonemic contrast and are therefore burdensome to train. Current algorithms have been shown to provide erroneous feedback to learners and show inconsistencies between human and computer perception. These discrepancies have thus far hindered more extensive application of ASR in educational systems. This thesis reviews the computational models of the human perception of American English vowels for use in an educational context; exploring and comparing two types of acoustic representation: a low-dimensionality linguistically-informed formant representation and more traditional Mel frequency cepstral coefficients (MFCCs). We first compare two algorithms for phoneme classification (support vector machines and long short-term memory recurrent neural networks) trained on American English vowel productions from the TIMIT corpus. We then conduct a perceptual study of non-native English vowel productions perceived by native American English speakers. We compare the results of the computational experiment and the human perception experiment to assess human/model agreement. Dissimilarities between human and model classification are explored. More phonologically-informed audio signal representations should create a more human-aligned, less L1-dependent vowel classification system with higher interpretability that can be further refined with more phonetic- and/or phonological-based research. Results show that linguistically-informed speech coding produces results that better align with human classification, supporting use of the proposed coding for ASR-based CAPT

    Using challenges to enhance a learning game for pronunciation training of English as a second language

    Get PDF
    Producción CientíficaLearning games have a remarkable potential for education. They provide an emergent form of social participation that deserves the assessment of their usefulness and efficiency in learning processes. This study describes a novel learning game for foreign pronunciation training in which players can challenge each other. Native Spanish speakers performed several pronunciation activities during a one-month competition using a mobile application, designed under a minimal pairs approach, to improve their pronunciation of English as a foreign language. This game took place in a competitive scenario in which students had to challenge other participants in order to get high scores and climb up a leaderboard. Results show intense practice supported by a significant number of activities and playing regularity, so the most active and motivated players in the competition achieved significant pronunciation improvement results. The integration of automatic speech recognition (ASR) and text-to-speech (TTS) technology allowed users to improve their pronunciation while being immersed in a highly motivational game.Ministerio de Economía, Industria y Competitividad - Fondo Europeo de Desarrollo Regional (grant TIN2014-59852-R)Junta de Castilla y Leon (grant VA050G18

    Automated classification of humpback whale (Megaptera novaeangliae) songs using hidden Markov models

    No full text
    Humpback whales songs have been widely investigated in the past few decades. This study proposes a new approach for the classification of the calls detected in the songs with the use of Hidden Markov Models (HMMs). HMMs have been used once before for such task but in an unsupervised algorithm with promising results. Here HMMs were trained and two models were employed to classify the calls into their component units and subunits. The results show that classification of humpback whale songs from one year to another is possible even with limited training. The classification is fully automated apart from the labelling of the training set and the input of the initial HMM prototype models. Two different models for the song structure are considered: one based on song units and one based on subunits. The latter model is shown to achieve better recognition results with a reduced need for updating when applied to a variety of recordings from different years and different geographic locations

    Assessment of severe apnoea through voice analysis, automatic speech, and speaker recognition techniques

    Full text link
    The electronic version of this article is the complete one and can be found online at: http://asp.eurasipjournals.com/content/2009/1/982531This study is part of an ongoing collaborative effort between the medical and the signal processing communities to promote research on applying standard Automatic Speech Recognition (ASR) techniques for the automatic diagnosis of patients with severe obstructive sleep apnoea (OSA). Early detection of severe apnoea cases is important so that patients can receive early treatment. Effective ASR-based detection could dramatically cut medical testing time. Working with a carefully designed speech database of healthy and apnoea subjects, we describe an acoustic search for distinctive apnoea voice characteristics. We also study abnormal nasalization in OSA patients by modelling vowels in nasal and nonnasal phonetic contexts using Gaussian Mixture Model (GMM) pattern recognition on speech spectra. Finally, we present experimental findings regarding the discriminative power of GMMs applied to severe apnoea detection. We have achieved an 81% correct classification rate, which is very promising and underpins the interest in this line of inquiry.The activities described in this paper were funded by the Spanish Ministry of Science and Technology as part of the TEC2006-13170-C02-02 Project

    The phonetics of second language learning and bilingualism

    Get PDF
    This chapter provides an overview of major theories and findings in the field of second language (L2) phonetics and phonology. Four main conceptual frameworks are discussed and compared: the Perceptual Assimilation Model-L2, the Native Language Magnet Theory, the Automatic Selection Perception Model, and the Speech Learning Model. These frameworks differ in terms of their empirical focus, including the type of learner (e.g., beginner vs. advanced) and target modality (e.g., perception vs. production), and in terms of their theoretical assumptions, such as the basic unit or window of analysis that is relevant (e.g., articulatory gestures, position-specific allophones). Despite the divergences among these theories, three recurring themes emerge from the literature reviewed. First, the learning of a target L2 structure (segment, prosodic pattern, etc.) is influenced by phonetic and/or phonological similarity to structures in the native language (L1). In particular, L1-L2 similarity exists at multiple levels and does not necessarily benefit L2 outcomes. Second, the role played by certain factors, such as acoustic phonetic similarity between close L1 and L2 sounds, changes over the course of learning, such that advanced learners may differ from novice learners with respect to the effect of a specific variable on observed L2 behavior. Third, the connection between L2 perception and production (insofar as the two are hypothesized to be linked) differs significantly from the perception-production links observed in L1 acquisition. In service of elucidating the predictive differences among these theories, this contribution discusses studies that have investigated L2 perception and/or production primarily at a segmental level. In addition to summarizing the areas in which there is broad consensus, the chapter points out a number of questions which remain a source of debate in the field today.https://drive.google.com/open?id=1uHX9K99Bl31vMZNRWL-YmU7O2p1tG2wHhttps://drive.google.com/open?id=1uHX9K99Bl31vMZNRWL-YmU7O2p1tG2wHhttps://drive.google.com/open?id=1uHX9K99Bl31vMZNRWL-YmU7O2p1tG2wHAccepted manuscriptAccepted manuscrip
    corecore