Word hypothesis from undifferentiated, errorful phonetic strings
This thesis investigates a dynamic programming approach to word hypothesis in the context of a speaker-independent, large-vocabulary, continuous speech recognition system. Using a method known as Dynamic Time Warping (DTW), an undifferentiated phonetic string (one without word boundaries) is parsed to produce all possible words contained in a domain-specific lexicon. DTW is a common method of sequence comparison, used to match the acoustic feature vectors representing an unknown input utterance against those of a reference utterance. The cumulative least-cost path, when compared with a threshold, can be used as a decision criterion for recognition. This thesis attempts to extend the DTW technique to strings of phonetic symbols instead. Three variables were found to affect the parsing process: (1) the minimum distance threshold, (2) the number of word candidates accepted at any given phonetic index, and (3) the lexical search space used for reference pattern comparisons. The performance of the parser as a function of these variables is discussed, as is its performance under a variety of input error conditions.
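As a rough illustration of the idea in this abstract, the sketch below hypothesizes words at each index of an unsegmented phone string by dynamic-programming alignment against lexicon entries. It is not the thesis's parser: the symbol-level edit distance stands in for DTW, and the function names, unit costs, length normalization, `max_stretch` window, and toy lexicon are all assumptions made for the example. The parameters `threshold`, `max_candidates`, and `lexicon` loosely correspond to the three variables named in the abstract.

```python
from typing import Dict, List, Tuple


def phone_distance(ref: List[str], hyp: List[str],
                   sub_cost: float = 1.0, indel_cost: float = 1.0) -> float:
    """Least-cost alignment between two phone sequences (edit distance)."""
    rows, cols = len(ref) + 1, len(hyp) + 1
    cost = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        cost[i][0] = i * indel_cost
    for j in range(1, cols):
        cost[0][j] = j * indel_cost
    for i in range(1, rows):
        for j in range(1, cols):
            match = 0.0 if ref[i - 1] == hyp[j - 1] else sub_cost
            cost[i][j] = min(cost[i - 1][j - 1] + match,    # match or substitution
                             cost[i - 1][j] + indel_cost,   # phone missing from input
                             cost[i][j - 1] + indel_cost)   # extra phone in input
    return cost[-1][-1]


def hypothesize_words(phones: List[str], lexicon: Dict[str, List[str]],
                      threshold: float = 0.34, max_candidates: int = 3,
                      max_stretch: int = 1) -> Dict[int, List[Tuple[float, str, int]]]:
    """Map each start index to up to max_candidates (distance, word, end) tuples."""
    hypotheses: Dict[int, List[Tuple[float, str, int]]] = {}
    for start in range(len(phones)):
        candidates = []
        for word, ref in lexicon.items():
            shortest = max(1, len(ref) - max_stretch)
            for length in range(shortest, len(ref) + max_stretch + 1):
                end = start + length
                if end > len(phones):
                    break
                # Normalize by reference length so long and short words compare fairly.
                dist = phone_distance(ref, phones[start:end]) / len(ref)
                if dist <= threshold:
                    candidates.append((dist, word, end))
        if candidates:
            hypotheses[start] = sorted(candidates)[:max_candidates]
    return hypotheses


# Hypothetical toy lexicon and input (phones for "the cat sat", no boundaries).
lexicon = {"the": ["dh", "ah"], "cat": ["k", "ae", "t"], "sat": ["s", "ae", "t"]}
phones = ["dh", "ah", "k", "ae", "t", "s", "ae", "t"]
exact = hypothesize_words(phones, lexicon, threshold=0.0)
loose = hypothesize_words(phones, lexicon)
```

With `threshold=0.0` only exact matches survive. Raising the threshold admits near misses, such as "sat" hypothesized over the phones of "cat", which is the kind of trade-off the abstract attributes to the distance threshold and the candidate limit.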
Utterance verification in large vocabulary spoken language understanding system
Thesis (M.Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1998. Includes bibliographical references (leaves 87-89). By Huan Yao.
Confusion modelling for lip-reading
Lip-reading is mostly used as a means of communication by people with hearing difficulties. Recent work has explored the automation of this process, with the aim of building a speech recognition system driven entirely by lip movements. However, this work has so far produced poor results because of factors such as high variability of speaker features, difficulties in mapping from visual features to speech sounds, and high co-articulation of visual features.
The work in this thesis is motivated by previous work in dysarthric speech recognition [Morales, 2009]. Dysarthric speakers have poor control over their articulators, often leading to a reduced phonemic repertoire. The premise of this thesis is that recognition of the visual speech signal is a similar problem to recognition of dysarthric speech, in that some information about the speech signal has been lost in both cases, and this brings about a systematic pattern of errors in the decoded output.
This work attempts to exploit the systematic nature of these errors by modelling them in the framework of a weighted finite-state transducer cascade. Results indicate that the technique can achieve slightly lower error rates than the conventional approach. In addition, the thesis explores some more general questions of interest for automated lip-reading.
Adjusted Viterbi training for hidden Markov models
To estimate the emission parameters in hidden Markov models one commonly uses
the EM algorithm or one of its variants. Our primary motivation, however, is the
Philips speech recognition system wherein the EM algorithm is replaced by the
Viterbi training algorithm. Viterbi training is faster and computationally less
involved than EM, but it is also biased and need not even be consistent. We
propose an alternative to the Viterbi training -- adjusted Viterbi training --
that has the same order of computational complexity as Viterbi training but
gives more accurate estimators. Elsewhere, we studied the adjusted Viterbi
training for a special case of mixtures, supporting the theory by simulations.
This paper proves that adjusted Viterbi training is also possible for more
general hidden Markov models. Comment: 45 pages, 2 figures
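For readers unfamiliar with the hard-assignment step that distinguishes Viterbi training from EM, here is a minimal sketch for the mixture special case mentioned in the abstract. It shows plain Viterbi training only, not the adjusted variant the paper proposes, whose correction is not described in the abstract. The function name, the assumption of equal weights and a shared known variance, and the toy data are all hypothetical.

```python
from statistics import fmean
from typing import List, Sequence


def viterbi_train_means(data: Sequence[float], means: Sequence[float],
                        iterations: int = 20) -> List[float]:
    """Plain Viterbi (hard-assignment) training of Gaussian mixture means.

    Assumes equal component weights and a shared, known variance, so the
    most likely component for a point is simply the one with the nearest mean.
    """
    current = list(means)
    for _ in range(iterations):
        clusters: List[List[float]] = [[] for _ in current]
        for x in data:
            # Hard decision: each point goes to its single best component.
            # EM would instead spread the point across components by posterior weight.
            best = min(range(len(current)), key=lambda j: (x - current[j]) ** 2)
            clusters[best].append(x)
        # Re-estimate each mean from its own points; keep the old mean if empty.
        updated = [fmean(c) if c else m for c, m in zip(clusters, current)]
        if updated == current:
            break
        current = updated
    return current


# Hypothetical toy data drawn around -2 and +2.
samples = [-2.1, -1.9, -2.0, 2.0, 1.9, 2.1]
estimated = viterbi_train_means(samples, means=[-1.0, 1.0])
```

Because each point contributes to one component only, every mean is estimated from a truncated sample when the components overlap. That truncation is the source of the bias the abstract refers to.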
System-independent ASR error detection and classification using Recurrent Neural Network
This paper addresses errors in continuous Automatic Speech Recognition (ASR) in two stages: error detection and error type classification. Unlike the majority of research in this field, we propose to handle recognition errors independently of the ASR decoder. We first establish an effective set of generic features, derived exclusively from the recognizer output, to compensate for the absence of ASR decoder information. We then apply models based on a variant Recurrent Neural Network (V-RNN) to error detection and error type classification. Such models use label dependency to learn information beyond the classification of each recognized word. Experiments on the Multi-Genre Broadcast Media corpus show that the proposed generic feature setup achieves competitive performance compared with state-of-the-art systems in both tasks. Furthermore, a V-RNN trained on the proposed feature set appears to be an effective classifier for ASR error detection, with an accuracy of 85.43%.
Acoustic-phonetic constraints in continuous speech recognition: a case study using the digit vocabulary.
Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1985. Includes bibliographical references (leaves 155-159). This electronic version was scanned from a copy of the thesis on file at the Speech Communication Group. The certified thesis is available in the Institute Archives and Special Collections. Supported by the Vinton-Hayes Fellowship; DARPA, monitored through the Office of Naval Research; and the System Development Foundation.
An overview of artificial intelligence and robotics. Volume 1: Artificial intelligence. Part B: Applications
Artificial Intelligence (AI) is an emerging technology that has recently attracted considerable attention. Many applications are now under development. This report, Part B of a three-part report on AI, presents overviews of the key application areas: Expert Systems, Computer Vision, Natural Language Processing, Speech Interfaces, and Problem Solving and Planning. The basic approaches to such systems, the state of the art, existing systems, and future trends and expectations are covered.
Prosodic detail in Neapolitan Italian
Recent findings on phonetic detail have been taken as supporting exemplar-based approaches to prosody. Through four experiments on the production and perception of both melodic and temporal detail in Neapolitan Italian, we show that prosodic detail is not incompatible with abstractionist approaches either. Specifically, we suggest that the exploration of prosodic detail leads to a refined understanding of the relationship between richly specified, continuously varying phonetic information on the one hand and coarse, phonologically structured contrasts on the other, thus offering insights into how pragmatic information is conveyed by prosody.