1,099 research outputs found

    Advances in Joint CTC-Attention based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM

    We present a state-of-the-art end-to-end Automatic Speech Recognition (ASR) model. We learn to listen and write characters with a joint Connectionist Temporal Classification (CTC) and attention-based encoder-decoder network. The encoder is a deep Convolutional Neural Network (CNN) based on the VGG architecture. The CTC network sits on top of the encoder and is jointly trained with the attention-based decoder. During beam search, we combine the CTC predictions, the attention-based decoder predictions, and a separately trained LSTM language model. We achieve a 5-10% error reduction over prior systems on spontaneous Japanese and Chinese speech, and our end-to-end model outperforms traditional hybrid ASR systems. (Comment: Accepted for INTERSPEECH 2017)
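    As a rough illustration of the score combination described above (a minimal sketch, not the authors' code), the following Python snippet shows how a beam-search hypothesis score might interpolate CTC, attention-decoder, and LSTM language-model log probabilities; the scorer callables and the weights `ctc_weight` and `lm_weight` are hypothetical names and values.

```python
# Minimal sketch of joint CTC/attention/LM scoring during beam search.
# Hypothetical interface: each scorer returns a log probability for
# appending `token` to the partial hypothesis `prefix`.

def hypothesis_score(prefix, token, ctc_scorer, att_scorer, lm_scorer,
                     ctc_weight=0.3, lm_weight=0.1):
    """Combine the three model scores with interpolation weights."""
    ctc_logp = ctc_scorer(prefix, token)   # CTC prefix score
    att_logp = att_scorer(prefix, token)   # attention decoder score
    lm_logp = lm_scorer(prefix, token)     # external LSTM LM score
    return (ctc_weight * ctc_logp
            + (1.0 - ctc_weight) * att_logp
            + lm_weight * lm_logp)
```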

    Pronunciation modeling for Cantonese speech recognition.

    Kam Patgi. Thesis (M.Phil.), Chinese University of Hong Kong, 2003. Includes bibliographical references (leaf 103). Abstracts in English and Chinese. Table of contents:
    Chapter 1. Introduction
        1.1 Automatic Speech Recognition
        1.2 Pronunciation Modeling in ASR
        1.3 Objectives of the Thesis
        1.4 Thesis Outline
    Chapter 2. The Cantonese Dialect
        2.1 Cantonese - A Typical Chinese Dialect
            2.1.1 Cantonese Phonology
            2.1.2 Cantonese Phonetics
        2.2 Pronunciation Variation in Cantonese
            2.2.1 Phone Change and Sound Change
            2.2.2 Notation for Different Sound Units
        2.3 Summary
    Chapter 3. Large-Vocabulary Continuous Speech Recognition for Cantonese
        3.1 Feature Representation of the Speech Signal
        3.2 Probabilistic Framework of ASR
        3.3 Hidden Markov Model for Acoustic Modeling
        3.4 Pronunciation Lexicon
        3.5 Statistical Language Model
        3.6 Decoding
        3.7 The Baseline Cantonese LVCSR System
            3.7.1 System Architecture
            3.7.2 Speech Databases
        3.8 Summary
    Chapter 4. Pronunciation Model
        4.1 Pronunciation Modeling at Different Levels
        4.2 Phone-Level Pronunciation Model and its Application
            4.2.1 IF Confusion Matrix (CM)
            4.2.2 Decision Tree Pronunciation Model (DTPM)
            4.2.3 Refinement of Confusion Matrix
        4.3 Summary
    Chapter 5. Pronunciation Modeling at Lexical Level
        5.1 Construction of PVD
        5.2 PVD Pruning by Word Unigram
        5.3 Recognition Experiments
            5.3.1 Experiment 1 - Pronunciation Modeling in LVCSR
            5.3.2 Experiment 2 - Pronunciation Modeling in a Domain-Specific Task
            5.3.3 Experiment 3 - PVD Pruning by Word Unigram
        5.4 Summary
    Chapter 6. Pronunciation Modeling at Acoustic Model Level
        6.1 Hierarchy of HMM
        6.2 Sharing of Mixture Components
        6.3 Adaptation of Mixture Components
        6.4 Combination of Mixture Component Sharing and Adaptation
        6.5 Recognition Experiments
        6.6 Result Analysis
            6.6.1 Performance of Sharing Mixture Components
            6.6.2 Performance of Mixture Component Adaptation
        6.7 Summary
    Chapter 7. Pronunciation Modeling at Decoding Level
        7.1 Search Process in Cantonese LVCSR
        7.2 Model-Level Search Space Expansion
        7.3 State-Level Output Probability Modification
        7.4 Recognition Experiments
            7.4.1 Experiment 1 - Model-Level Search Space Expansion
            7.4.2 Experiment 2 - State-Level Output Probability Modification
        7.5 Summary
    Chapter 8. Conclusions and Suggestions for Future Work
        8.1 Conclusions
        8.2 Suggestions for Future Work
    Appendices: I. Base Syllable Table; II. Cantonese Initials and Finals; III. IF Confusion Matrix; IV. Phonetic Question Set; V. CDDT and PCDT
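    The outline above names building a pronunciation variation dictionary (PVD) from an Initial/Final (IF) confusion matrix, but the thesis itself is not reproduced here, so the following Python sketch only illustrates the general idea under assumed data structures; `base_lexicon`, `confusion`, and the 0.1 probability threshold are all hypothetical, not values from the thesis.

```python
# Illustrative sketch: expand a base pronunciation lexicon into a
# pronunciation variation dictionary (PVD) using a phone confusion matrix.
# All names and the 0.1 threshold are assumptions, not the thesis's values.

base_lexicon = {"word_A": ["b", "aa", "t"]}          # canonical phone strings
confusion = {"b": {"b": 0.9, "p": 0.1},              # P(surface phone | base phone)
             "aa": {"aa": 1.0},
             "t": {"t": 0.85, "d": 0.15}}

def expand_variants(phones, threshold=0.1):
    """Enumerate surface pronunciations whose phones all exceed the threshold."""
    variants = [([], 1.0)]
    for base in phones:
        variants = [(seq + [surf], p * prob)
                    for seq, p in variants
                    for surf, prob in confusion.get(base, {base: 1.0}).items()
                    if prob >= threshold]
    return variants

pvd = {word: expand_variants(phones) for word, phones in base_lexicon.items()}
print(pvd["word_A"])   # canonical pronunciation plus likely variants with scores
```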

    Multilingual Speech Processing in the context of Under-resourced Languages


    Robust Speech Recognition for Adverse Environments


    DEVELOPING AN ONLINE CORPUS OF FORMOSAN LANGUAGES

    Information technologies have now matured to the point of enabling researchers to create repositories of language resources, especially for languages facing the crisis of endangerment. The development of an online platform of corpora, made possible by recent advances in data storage, character encoding and web technology, has profound consequences for the accessibility, quantity, quality and interoperability of linguistic field data. This is of particular significance for the Formosan languages of Taiwan, many of which are on the verge of extinction. In response to this burgeoning problem, the NTU Corpus of Formosan Languages was established to document and thus preserve valuable linguistic data, as well as relevant ethnological and cultural information. This paper introduces some of the theoretical bases behind this initiative, as well as the procedures, transcription conventions, database normalization, in-house system and three special features involved in the creation of this corpus.
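    The abstract mentions database normalization for annotated field data but gives no schema; purely as an illustration (none of these table or column names come from the paper, and the NTU corpus may be organized quite differently), a normalized layout might separate recordings, utterances, and per-morpheme glosses as follows.

```python
# Hypothetical normalized schema for a multilingual field-data corpus.
# Table and column names are illustrative only, not the NTU corpus design.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE recording (
    id INTEGER PRIMARY KEY,
    language TEXT,          -- e.g. a Formosan language name
    speaker TEXT,
    audio_path TEXT
);
CREATE TABLE utterance (
    id INTEGER PRIMARY KEY,
    recording_id INTEGER REFERENCES recording(id),
    start_sec REAL, end_sec REAL,
    transcription TEXT,     -- surface transcription
    translation TEXT        -- free translation (e.g. English or Chinese)
);
CREATE TABLE gloss (
    utterance_id INTEGER REFERENCES utterance(id),
    position INTEGER,       -- morpheme index within the utterance
    morpheme TEXT,
    gloss TEXT
);
""")
```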

    Analyzing Prosody with Legendre Polynomial Coefficients

    This investigation demonstrates the effectiveness of Legendre polynomial coefficients for representing prosodic contours in the context of two tasks: nativeness classification and sarcasm detection. By using accurate representations of prosodic contours to answer fundamental linguistic questions, we contribute to the body of research on analyzing prosody in linguistics as well as modeling prosody for machine learning tasks. Using Legendre polynomial coefficient representations of prosodic contours, we examine differences in prosody between native English speakers and non-native English speakers whose first language is Mandarin, and we characterize prosodic qualities of sarcastic speech. We also perform machine learning classification for both tasks, achieving 72.3% accuracy for nativeness classification and 81.57% for sarcasm detection. We recommend that linguists looking to analyze prosodic contours use Legendre polynomial coefficient modeling; the accuracy and quality of the resulting contour representations make them highly interpretable for linguistic analysis.
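    To make the contour representation concrete (a generic sketch, not the authors' pipeline; the contour values and polynomial order below are arbitrary), Legendre polynomial coefficients can be fit to a pitch (F0) contour with NumPy:

```python
# Fit a low-order Legendre polynomial to an F0 contour and reconstruct it.
# The contour values and the order 4 are illustrative choices only.
import numpy as np
from numpy.polynomial import legendre

f0 = np.array([110., 118., 130., 142., 150., 149., 140., 128., 120., 112.])  # Hz
x = np.linspace(-1, 1, len(f0))         # Legendre polynomials live on [-1, 1]

coeffs = legendre.legfit(x, f0, deg=4)  # coefficients summarize the contour shape
approx = legendre.legval(x, coeffs)     # reconstructed (smoothed) contour

print(coeffs)                           # e.g. features for a classifier
print(np.max(np.abs(f0 - approx)))      # reconstruction error
```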