2 research outputs found

A comparison-based approach to mispronunciation detection

Author: Lee, Ann
Publication venue: Massachusetts Institute of Technology
Publication date: 01/01/2012

Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2012. Cataloged from PDF version of thesis. Includes bibliographical references (p. 89-92). This thesis focuses on the problem of detecting word-level mispronunciations in nonnative speech. Conventional automatic speech recognition-based mispronunciation detection systems have the disadvantage of requiring a large amount of language-specific, annotated training data. Some systems even require a speech recognizer in the target language and another one in the students' native language. To reduce human labeling effort and to generalize across languages, we propose a comparison-based framework which only requires word-level timing information from the native training data. With the assumption that the student is trying to enunciate the given script, dynamic time warping (DTW) is carried out between a student's utterance (nonnative speech) and a teacher's utterance (native speech), and we focus on detecting misalignment in the warping path and the distance matrix. The first stage of the system locates word boundaries in the nonnative utterance. To handle the problem that nonnative speech often contains intra-word pauses, we run DTW with a silence model which can align the two utterances while detecting and removing silences at the same time. In order to segment each word into smaller, acoustically similar units for a finer-grained analysis, we develop a phoneme-like unit segmentor which works by segmenting the self-similarity matrix into low-distance regions along the diagonal. Both phone-level and word-level features that describe the degree of misalignment between the two utterances are extracted, and the problem is formulated as a classification task. SVM classifiers are trained, and three voting schemes are considered for the cases where there is more than one matching reference utterance.
The system is evaluated on the Chinese University Chinese Learners of English (CU-CHLOE) corpus, and the TIMIT corpus is used as the native corpus. Experimental results have shown 1) the effectiveness of the silence model in guiding DTW to capture the word boundaries in nonnative speech more accurately, 2) the complementary performance of the word-level and the phone-level features, and 3) the stable performance of the system with or without phonetic unit labeling. By Ann Lee. S.M.
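The alignment step described in the abstract can be illustrated with a minimal DTW sketch. This is not the thesis's implementation: it omits the silence model, uses plain Euclidean frame distances, and the names `dtw`, `teacher`, and `student` are hypothetical. It only shows how a warping path and accumulated cost are obtained, which are the quantities the abstract says are inspected for misalignment.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw(teacher, student):
    """Align two sequences of feature vectors.

    Returns (total_cost, path), where path is a list of
    (teacher_index, student_index) pairs from start to end.
    """
    n, m = len(teacher), len(student)
    inf = float("inf")
    # acc[i][j]: minimal accumulated distance aligning
    # teacher[:i] with student[:j].
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(teacher[i - 1], student[j - 1])
            acc[i][j] = d + min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])

    # Backtrack from the end to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (acc[i - 1][j - 1], i - 1, j - 1),  # diagonal step
            (acc[i - 1][j], i - 1, j),          # advance teacher only
            (acc[i][j - 1], i, j - 1),          # advance student only
        ]
        _, i, j = min(candidates)
    path.reverse()
    return acc[n][m], path


# Toy one-dimensional "features": the student holds the first frame longer.
teacher = [[0.0], [1.0], [2.0]]
student = [[0.0], [0.0], [1.0], [2.0]]
cost, path = dtw(teacher, student)
print(cost, path)
```

In this toy case the path stays on teacher frame 0 for two student frames before moving on. Long horizontal or vertical runs of this kind are one simple sign that the two utterances do not line up frame for frame.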

DSpace@MIT

Towards robust word discovery by self similarity matrix comparison

Authors: Bimbot, Frédéric; Gravier, Guillaume; Muscariello, Armando
Publication venue: HAL CCSD
Publication date: 22/05/2011

International audience. Word discovery is the task of discovering and collecting occurrences of repeating words in the absence of prior acoustic and linguistic knowledge, or training material. The capability of extracting such patterns (or motifs) represents a preliminary step towards automatic mining of contentful information in spoken documents. The absence of modelling and training data forces the use of direct pattern matching of speech templates, which, in turn, is sensitive to speech variability, such as inter-speaker variability. In the present work, a variability-tolerant pattern recognition technique is proposed that relies on the comparison of self-similarity matrices of speech sequences. The joint use of this technique and a dynamic time warping dissimilarity measure is shown to account for more variability than the DTW-based system alone, as demonstrated on several hours of broadcast news shows.
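The idea of comparing self-similarity matrices rather than raw frames can be sketched as follows. The abstract does not specify the comparison measure, so the choices here are assumptions made for illustration: Euclidean frame distances, sequences of equal length, and a mean absolute difference between matrices. The names `self_similarity` and `ssm_distance` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def self_similarity(frames):
    """Matrix of pairwise distances between all frames of one sequence."""
    return [[euclidean(a, b) for b in frames] for a in frames]


def ssm_distance(ssm_a, ssm_b):
    """Mean absolute difference between two equal-sized matrices.

    Assumed measure for illustration only; sequences of different
    lengths would first need to be aligned or resampled.
    """
    n = len(ssm_a)
    if n != len(ssm_b):
        raise ValueError("matrices must have the same size")
    total = sum(abs(ssm_a[i][j] - ssm_b[i][j]) for i in range(n) for j in range(n))
    return total / (n * n)


# Sequence b is sequence a plus a constant offset, a crude stand-in
# for a speaker-dependent shift; sequence c has a different shape.
a = [[0.0], [1.0], [3.0]]
b = [[10.0], [11.0], [13.0]]
c = [[0.0], [2.0], [2.0]]

same_shape = ssm_distance(self_similarity(a), self_similarity(b))
diff_shape = ssm_distance(self_similarity(a), self_similarity(c))
print(same_shape, diff_shape)
```

Because each matrix depends only on distances within a sequence, the constant offset between `a` and `b` cancels out and their matrices coincide, while a direct frame-to-frame comparison of `a` and `b` would report a large distance. This is the kind of tolerance to variability the abstract refers to, shown here only on a toy case.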

HAL-CentraleSupelec

INRIA a CCSD electronic archive server

HAL Descartes

Hal-Diderot

HAL-Rennes 1