9,525 research outputs found
Pronunciation Understood : How intelligible do you think you are?
This study aims to automatically estimate the probability that individual words of Japanese English (JE) are perceived correctly by American listeners, and to clarify which kinds (and combinations) of segmental, prosodic, and linguistic errors in the words are more detrimental to their correct perception. From a JE speech database, a balanced set of 360 utterances by 90 male speakers is first selected. Then a listening experiment is conducted in which 6 Americans are asked to transcribe all the utterances. Next, using speech and language technology, values of many segmental, prosodic, and linguistic attributes of the words are extracted. Finally, the relation between the transcription rate of each word and its attribute values is analyzed with the Classification And Regression Tree (CART) method to predict the probability of each JE word being transcribed correctly. The performance of the machine prediction is compared with that of human prediction by four American teachers and three Japanese ones. The method is shown to be comparable to the best of the four American teachers. The paper also describes differences between American and Japanese teachers in perceiving the intelligibility of the pronunciation.
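As a rough illustration of the CART step described in this abstract, the sketch below finds the single best regression split: the attribute and threshold that minimize the squared error of the predicted transcription rate. The two attributes (a segmental error count and a prosodic deviation score) and all numbers are hypothetical and are not taken from the study; a full CART would apply this split recursively and prune the resulting tree.

```python
from statistics import mean

def best_split(rows, targets):
    """Return (feature_index, threshold, left_mean, right_mean) for the
    single split that minimizes the summed squared error of the two leaves."""
    best = None
    for f in range(len(rows[0])):
        values = sorted(set(r[f] for r in rows))
        # Candidate thresholds are midpoints between consecutive distinct values,
        # so neither side of a split is ever empty.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [t for r, t in zip(rows, targets) if r[f] <= thr]
            right = [t for r, t in zip(rows, targets) if r[f] > thr]
            lm, rm = mean(left), mean(right)
            sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
            if best is None or sse < best[0]:
                best = (sse, f, thr, lm, rm)
    return best[1:]

# Hypothetical word attributes: [segmental_error_count, prosodic_deviation]
rows = [[0, 0.1], [0, 0.8], [1, 0.2], [2, 0.3], [3, 0.9], [2, 0.7]]
# Hypothetical fraction of listeners who transcribed each word correctly
rates = [1.0, 0.8, 0.8, 0.3, 0.0, 0.2]

feature, threshold, left_mean, right_mean = best_split(rows, rates)
```

Each leaf mean serves as the predicted probability of correct transcription for words that fall on that side of the split.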
Dialogue Act Modeling for Automatic Tagging and Recognition of Conversational Speech
We describe a statistical approach for modeling dialogue acts in
conversational speech, i.e., speech-act-like units such as Statement, Question,
Backchannel, Agreement, Disagreement, and Apology. Our model detects and
predicts dialogue acts based on lexical, collocational, and prosodic cues, as
well as on the discourse coherence of the dialogue act sequence. The dialogue
model is based on treating the discourse structure of a conversation as a
hidden Markov model and the individual dialogue acts as observations emanating
from the model states. Constraints on the likely sequence of dialogue acts are
modeled via a dialogue act n-gram. The statistical dialogue grammar is combined
with word n-grams, decision trees, and neural networks modeling the
idiosyncratic lexical and prosodic manifestations of each dialogue act. We
develop a probabilistic integration of speech recognition with dialogue
modeling, to improve both speech recognition and dialogue act classification
accuracy. Models are trained and evaluated using a large hand-labeled database
of 1,155 conversations from the Switchboard corpus of spontaneous
human-to-human telephone speech. We achieved good dialogue act labeling
accuracy (65% based on errorful, automatically recognized words and prosody,
and 71% based on word transcripts, compared to a chance baseline accuracy of
35% and human accuracy of 84%) and a small reduction in word recognition error. Comment: 35 pages, 5 figures. Changes in copy editing (note title spelling
changed
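The combination of a dialogue act n-gram with per-utterance evidence can be sketched as Viterbi decoding over a hidden Markov model. In this minimal sketch, the hidden states are dialogue acts, a bigram (n = 2) supplies the transition probabilities, and a per-utterance likelihood stands in for the lexical and prosodic models. All probabilities and the three-act inventory are hypothetical, chosen only to make the example run; they are not values from the paper.

```python
import math

def viterbi(obs_loglik, log_trans, log_init):
    """Most likely dialogue act sequence.
    obs_loglik: list of dicts, act -> log P(evidence for utterance t | act)
    log_trans:  dict of dicts, prev_act -> act -> log P(act | prev_act)
    log_init:   dict, act -> log P(act at the first utterance)
    """
    acts = list(log_init)
    # Best log score of any path ending in each act after the first utterance.
    score = {a: log_init[a] + obs_loglik[0][a] for a in acts}
    back = []
    for ll in obs_loglik[1:]:
        prev, score, ptr = score, {}, {}
        for a in acts:
            best_prev = max(acts, key=lambda p: prev[p] + log_trans[p][a])
            score[a] = prev[best_prev] + log_trans[best_prev][a] + ll[a]
            ptr[a] = best_prev
        back.append(ptr)
    # Trace back from the best final act.
    path = [max(acts, key=lambda a: score[a])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

def logs(d):
    """Convert a dict of probabilities to log probabilities."""
    return {k: math.log(v) for k, v in d.items()}

# Hypothetical numbers for illustration only.
init = logs({"Statement": 0.6, "Question": 0.3, "Backchannel": 0.1})
trans = {
    "Statement": logs({"Statement": 0.5, "Question": 0.2, "Backchannel": 0.3}),
    "Question": logs({"Statement": 0.8, "Question": 0.1, "Backchannel": 0.1}),
    "Backchannel": logs({"Statement": 0.7, "Question": 0.2, "Backchannel": 0.1}),
}
evidence = [
    logs({"Statement": 0.2, "Question": 0.7, "Backchannel": 0.1}),
    logs({"Statement": 0.6, "Question": 0.3, "Backchannel": 0.1}),
    logs({"Statement": 0.3, "Question": 0.1, "Backchannel": 0.6}),
]

path = viterbi(evidence, trans, init)
```

Working in log space avoids numerical underflow on long conversations.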
Machine learning approaches to improving mispronunciation detection on an imbalanced corpus
This thesis reports investigations into the task of phone-level pronunciation error detection, the performance of which is heavily affected by the imbalanced distribution of the classes in a manually annotated data set of non-native English (Read Aloud responses from the TOEFL Junior Pilot assessment). To address the problems caused by this extreme class imbalance, two machine learning approaches, cost-sensitive learning and over-sampling, are explored to improve classification performance. Specifically, approaches that assign weights inversely proportional to class frequencies and the synthetic minority over-sampling technique (SMOTE) were applied to a range of classifiers using feature sets that included information about the acoustic signal, the linguistic properties of the utterance, and word identity. Empirical experiments demonstrate that both balancing approaches lead to a substantial performance improvement (in terms of F1 score) over the baseline on this extremely imbalanced data set. In addition, this thesis discusses which features are the most important and which classifiers are most effective for the task of identifying phone-level pronunciation errors in non-native speech.
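The two balancing ideas named in this abstract can be sketched in a few lines. The weighting below uses one common normalization, n_samples / (n_classes * class_count), which is an assumption and not necessarily the one used in the thesis. The second function shows only the interpolation step of SMOTE; the nearest-neighbour search and random draw of the full algorithm are omitted. Labels and feature values are hypothetical.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class inversely to its frequency:
    n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * m) for c, m in counts.items()}

def interpolate(x, neighbor, u):
    """SMOTE interpolation step: a synthetic minority sample on the segment
    between a minority point x and one of its minority-class neighbours,
    where u in [0, 1] is normally drawn at random."""
    return [xi + u * (ni - xi) for xi, ni in zip(x, neighbor)]

# Hypothetical phone-level labels: 18 correct phones, 2 mispronounced ones.
labels = ["correct"] * 18 + ["error"] * 2
weights = inverse_frequency_weights(labels)

# Hypothetical 2-D feature vectors of two minority-class ("error") phones.
synthetic = interpolate([1.0, 2.0], [3.0, 6.0], 0.25)
```

With these weights, each rare "error" example contributes nine times as much to a weighted loss as a "correct" one, so the two classes carry equal total weight.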
Automatic Pronunciation Assessment -- A Review
Pronunciation assessment and its application in computer-aided pronunciation
training (CAPT) have seen impressive progress in recent years. With the rapid
growth in language processing and deep learning over the past few years, there
is a need for an updated review. In this paper, we review methods employed in
pronunciation assessment for both phonemic and prosodic aspects. We categorize the main
challenges observed in prominent research trends and highlight existing
limitations and available resources. This is followed by a discussion of the
remaining challenges and possible directions for future work. Comment: 9 pages, accepted to EMNLP Finding
Essential Speech and Language Technology for Dutch: Results by the STEVIN-programme
Computational Linguistics; Germanic Languages; Artificial Intelligence (incl. Robotics); Computing Methodologie
Methods for pronunciation assessment in computer aided language learning
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2011. Cataloged from PDF version of thesis. Includes bibliographical references (p. 149-176). Learning a foreign language is a challenging endeavor that entails acquiring a wide range of new knowledge, including words, grammar, gestures, sounds, etc. Mastering these skills requires extensive practice by the learner, and opportunities may not always be available. Computer Aided Language Learning (CALL) systems provide non-threatening environments where foreign language skills can be practiced wherever and whenever a student desires. These systems often have several technologies to identify the different types of errors made by a student. This thesis focuses on the problem of identifying mispronunciations made by a foreign language student using a CALL system. We make several assumptions about the nature of the learning activity: it takes place using a dialogue system, it is a task- or game-oriented activity, the student should not be interrupted by the pronunciation feedback system, and the goal of the feedback system is to identify severe mispronunciations with high reliability. Detecting mispronunciations requires a corpus of speech with human judgements of pronunciation quality. Typical approaches to collecting such a corpus use an expert phonetician both to phonetically transcribe and to assign judgements of quality to each phone in a corpus. This is time consuming and expensive. It also places an extra burden on the transcriber. We describe a novel method for obtaining phone-level judgements of pronunciation quality by utilizing non-expert, crowd-sourced, word-level judgements of pronunciation. Foreign language learners typically exhibit high variation and pronunciation shapes distinct from those of native speakers, which makes analysis for mispronunciation difficult.
We detail a simple but effective method for transforming the vowel space of non-native speakers to make mispronunciation detection more robust and accurate. We show that this transformation not only enhances performance on a simple classification task, but also results in distributions that can be better exploited for mispronunciation detection. This transformation of the vowel space is exploited to train a mispronunciation detector using a variety of features derived from acoustic model scores and vowel class distributions. We confirm that the transformation technique results in a more robust and accurate identification of mispronunciations than traditional acoustic models. by Mitchell A. Peabody. Ph.D