
    Tone classification of syllable-segmented Thai speech based on multilayer perceptron

    Thai is a monosyllabic, tonal language: tone conveys lexical information, so the same base syllable carries a different meaning under each of the five distinctive tones, and each tone is well represented by a single F0 contour pattern. To fully recognize a spoken Thai syllable, a speech recognition system must therefore identify both the base syllable and its tone, making tone classification an essential part of a Thai speech recognition system.
    In this study, a tone classifier for syllable-segmented Thai speech was developed that incorporates the effects of tonal coarticulation, stress, and intonation. An automatic syllable segmentation stage, which segments the training and test utterances into syllable units, was also developed. Acoustic features, including fundamental frequency (F0), duration, and energy, extracted from the syllable being processed and from its neighboring syllables served as the main discriminating features, and a multilayer perceptron (MLP) trained by backpropagation was employed to classify them. Evaluated on 920 test utterances spoken by five male and three female Thai speakers who also provided the training speech, the system achieved an average accuracy of 91.36%.
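    For concreteness, the following is a minimal sketch, not the paper's implementation, of a five-tone MLP classifier over F0, duration, and energy features from a syllable and its neighbours; the feature dimensions, hidden size, and learning rate are all assumptions.

    # A minimal sketch (assumed dimensions, not the paper's code) of a
    # five-class Thai tone MLP over acoustic features from the current
    # syllable and its two neighbours, trained by plain backpropagation.
    import torch
    import torch.nn as nn

    N_F0_POINTS = 10        # assumed: F0 contour sampled at 10 points per syllable
    N_CONTEXT = 3           # previous, current, and next syllable
    N_FEATURES = N_CONTEXT * (N_F0_POINTS + 2)   # +2 for duration and energy
    N_TONES = 5             # mid, low, falling, high, rising

    model = nn.Sequential(
        nn.Linear(N_FEATURES, 64),
        nn.Sigmoid(),       # MLPs of this era typically used sigmoid hidden units
        nn.Linear(64, N_TONES),
    )
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    def train_step(features: torch.Tensor, tone_labels: torch.Tensor) -> float:
        """One backpropagation step on a batch of syllable feature vectors."""
        optimizer.zero_grad()
        loss = criterion(model(features), tone_labels)
        loss.backward()
        optimizer.step()
        return loss.item()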

    Large vocabulary Cantonese speech recognition using neural networks.

    Tsik Chung Wai, Benjamin. Thesis (M.Phil.)--Chinese University of Hong Kong, 1994. Includes bibliographical references (leaves 67-70).
    Contents:
    Chapter 1 Introduction
      1.1 Automatic Speech Recognition
      1.2 Cantonese Speech Recognition
      1.3 Neural Networks
      1.4 About this Thesis
    Chapter 2 The Phonology of Cantonese
      2.1 The Syllabic Structure of Cantonese Syllable
      2.2 The Tone System of Cantonese
    Chapter 3 Review of Automatic Speech Recognition Systems
      3.1 Hidden Markov Model Approach
      3.2 Neural Networks Approach
        3.2.1 Multi-Layer Perceptrons (MLP)
        3.2.2 Time-Delay Neural Networks (TDNN)
        3.2.3 Recurrent Neural Networks
      3.3 Integrated Approach
      3.4 Mandarin and Cantonese Speech Recognition Systems
    Chapter 4 The Speech Corpus and Database
      4.1 Design of the Speech Corpus
      4.2 Speech Database Acquisition
    Chapter 5 Feature Parameters Extraction
      5.1 Endpoint Detection
      5.2 Speech Processing
      5.3 Speech Segmentation
      5.4 Phoneme Feature Extraction
      5.5 Tone Feature Extraction
    Chapter 6 The Design of the System
      6.1 Towards Large Vocabulary System
      6.2 Overview of the Isolated Cantonese Syllable Recognition System
      6.3 The Primary Level: Phoneme Classifiers and Tone Classifier
      6.4 The Intermediate Level: Ending Corrector
      6.5 The Secondary Level: Syllable Classifier
        6.5.1 Concatenation with Correction Approach
        6.5.2 Fuzzy ART Approach
    Chapter 7 Computer Simulation
      7.1 Experimental Conditions
      7.2 Experimental Results of the Primary Level Classifiers
      7.3 Overall Performance of the System
      7.4 Discussions
    Chapter 8 Further Works
      8.1 Enhancement on Speech Segmentation
      8.2 Towards Speaker-Independent System
      8.3 Towards Speech-to-Text System
    Chapter 9 Conclusions
    Bibliography
    Appendix A. Cantonese Syllable Full Set List
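    Chapter 6 names a three-level design: primary phoneme and tone classifiers, an intermediate ending corrector, and a secondary syllable classifier. The sketch below shows, with entirely assumed interfaces (none of these names come from the thesis itself), how such a pipeline could be wired together.

    # A minimal sketch of a three-level Cantonese syllable recognizer:
    # all classifier and lexicon interfaces here are hypothetical.
    from dataclasses import dataclass

    @dataclass
    class SyllableHypothesis:
        initial: str   # e.g. "g"
        final: str     # e.g. "au"
        tone: int      # Cantonese tone category

    def recognize_syllable(frames, initial_clf, final_clf, tone_clf,
                           ending_corrector, lexicon) -> SyllableHypothesis:
        # Primary level: independent classifiers for onset, rhyme, and tone.
        initial = initial_clf(frames)
        final = final_clf(frames)
        tone = tone_clf(frames)
        # Intermediate level: correct easily confused syllable endings
        # (e.g. the unreleased stops -p/-t/-k).
        final = ending_corrector(final, frames)
        # Secondary level: snap the concatenated result to the nearest
        # legal syllable in the Cantonese syllabary (lexicon.closest is
        # a hypothetical nearest-match lookup).
        hyp = SyllableHypothesis(initial, final, tone)
        return hyp if (initial, final) in lexicon else lexicon.closest(hyp)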

    A review of Yorùbá Automatic Speech Recognition

    Automatic Speech Recognition (ASR) has made appreciable progress in both technology and application. Despite this progress, a wide performance gap remains between human speech recognition (HSR) and ASR, which has inhibited its full adoption in real-life settings. This paper presents a brief review of research progress on Yorùbá ASR, focusing on variability as a factor contributing to the performance gap between HSR and ASR, with a view to examining the advances recorded, identifying the major obstacles, and charting a way forward toward a Yorùbá ASR comparable to that of other tone languages and of developed nations. This is done through an extensive survey of the ASR literature, with a focus on Yorùbá. Although appreciable progress has been recorded in the developed world, the reverse is the case for most developing nations, especially those of Africa. Like most African languages, Yorùbá lacks both the human and material resources needed to develop a functional ASR system, let alone to take advantage of its potential benefits. The review concludes that attaining ASR performance comparable to the human level requires a deep understanding of the factors underlying variability.

    Internal Language Model Estimation Through Explicit Context Vector Learning for Attention-based Encoder-decoder ASR

    An end-to-end (E2E) ASR model implicitly learns a prior Internal Language Model (ILM) from its training transcripts. To fuse an external LM according to Bayes' posterior theory, the log-likelihood produced by the ILM has to be accurately estimated and subtracted. In this paper we propose two novel approaches to estimating the ILM of a Listen-Attend-Spell (LAS) model. The first replaces the context vector of the LAS decoder at every time step with a vector learned from the training transcripts. The second uses a lightweight feed-forward network to dynamically map the query vector to a context vector at each step. Because the context vectors are learned by minimizing perplexity on the training transcripts, and their estimation is independent of the encoder output, both methods learn the ILM accurately. Experiments show that the resulting ILMs achieve the lowest perplexity, indicating the efficacy of the proposed methods; they also significantly outperform shallow fusion as well as two previously proposed ILM Estimation (ILME) approaches on several datasets. Comment: Proceedings of INTERSPEECH.
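    A minimal sketch of the first proposed method, under assumed interfaces (in particular, decoder.step is a hypothetical one-step LAS decoder API, not the paper's code): the encoder-dependent context vector is replaced by a single learned parameter, the decoder is frozen, and the resulting transcript log-likelihood serves as the ILM score.

    import torch
    import torch.nn as nn

    class ILMEstimator(nn.Module):
        def __init__(self, decoder, context_dim: int):
            super().__init__()
            self.decoder = decoder                  # frozen, pretrained LAS decoder
            for p in self.decoder.parameters():
                p.requires_grad = False
            # Learned replacement for the attention context vector.
            self.context = nn.Parameter(torch.zeros(context_dim))

        def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
            # decoder.step: assumed interface taking the previous token, a
            # context vector, and a state, returning logits and the new state.
            batch, length = token_ids.shape
            ctx = self.context.expand(batch, -1)
            state, log_probs = None, []
            for t in range(length - 1):
                logits, state = self.decoder.step(token_ids[:, t], ctx, state)
                log_probs.append(torch.log_softmax(logits, dim=-1)
                                 .gather(1, token_ids[:, t + 1:t + 2]))
            # Summed token log-likelihood: the ILM score subtracted at fusion.
            # Training minimizes its negative (i.e. transcript perplexity)
            # with respect to self.context alone, so the estimate is
            # independent of the encoder output.
            return torch.cat(log_probs, dim=1).sum(dim=1)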

    End-to-end Lip-reading: A Preliminary Study

    Deep lip-reading combines computer vision and natural language processing: it uses deep neural networks to extract speech from silent video. Because of the complexity of the task, most work in lip-reading uses a multi-stage training approach. A single-stage, end-to-end, unified training approach, an ideal of machine learning, is also the goal in lip-reading; however, purely end-to-end systems have not yet matched the performance of non-end-to-end systems, with the notable exception of the very recent Temporal Convolutional Network (TCN) based architectures. This work presents a preliminary study of deep lip-reading with a special focus on end-to-end approaches, aiming to test whether a purely end-to-end approach is justifiable for a task as complex as deep lip-reading. To this end, the meaning of "pure end-to-end" is first defined, several lip-reading systems that follow the definition are analysed, and the system that most closely matches the definition is adapted for pure end-to-end experiments. Four main contributions are made: i) an analysis of 9 different end-to-end deep lip-reading systems; ii) the creation and public release of a pipeline to adapt the sentence-level Lip Reading Sentences 3 (LRS3) dataset into a word-level one; iii) pure end-to-end training of a TCN-based network and evaluation on the word-level LRS3 dataset as a proof of concept; iv) a public online portal to analyse visemes and run live end-to-end lip-reading inference. The study verifies that pure end-to-end is a sensible approach and an achievable goal for deep machine lip-reading.
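    As an illustration of the TCN family the study builds on, here is a minimal sketch of a dilated temporal convolutional block over per-frame visual features; the channel count, depth, frame count, and 500-word vocabulary are assumptions, not the study's configuration.

    # A minimal TCN-style block: dilated 1-D convolutions over the time
    # axis of per-frame visual features, with a residual connection.
    import torch
    import torch.nn as nn

    class TCNBlock(nn.Module):
        def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 1):
            super().__init__()
            pad = (kernel_size - 1) // 2 * dilation   # preserve sequence length
            self.net = nn.Sequential(
                nn.Conv1d(channels, channels, kernel_size,
                          padding=pad, dilation=dilation),
                nn.BatchNorm1d(channels),
                nn.ReLU(),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, channels, time) features from a visual frontend
            return torch.relu(x + self.net(x))

    # Stack blocks with growing dilation, then pool over time for word logits.
    tcn = nn.Sequential(*[TCNBlock(256, dilation=2 ** i) for i in range(4)])
    classifier = nn.Linear(256, 500)                  # assumed 500-word vocabulary
    feats = torch.randn(8, 256, 29)                   # e.g. 29 video frames
    word_logits = classifier(tcn(feats).mean(dim=-1))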