12 research outputs found

    Developing Speech Recognition and Synthesis Technologies to Support Computer-Aided Pronunciation Training for Chinese Learners of English

    Get PDF
    PACLIC 23 / City University of Hong Kong / 3-5 December 2009

    Methods for pronunciation assessment in computer aided language learning

    Get PDF
    Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2011. Cataloged from PDF version of thesis. Includes bibliographical references (p. 149-176).

    Learning a foreign language is a challenging endeavor that entails acquiring a wide range of new knowledge, including words, grammar, gestures, sounds, etc. Mastering these skills requires extensive practice by the learner, and opportunities to practice may not always be available. Computer Aided Language Learning (CALL) systems provide non-threatening environments where foreign language skills can be practiced wherever and whenever a student desires. These systems often include several technologies to identify the different types of errors made by a student. This thesis focuses on the problem of identifying mispronunciations made by a foreign language student using a CALL system. We make several assumptions about the nature of the learning activity: it takes place using a dialogue system; it is a task- or game-oriented activity; the student should not be interrupted by the pronunciation feedback system; and the goal of the feedback system is to identify severe mispronunciations with high reliability.

    Detecting mispronunciations requires a corpus of speech with human judgements of pronunciation quality. Typical approaches to collecting such a corpus use an expert phonetician to both phonetically transcribe and assign judgements of quality to each phone in a corpus. This is time-consuming and expensive, and it places an extra burden on the transcriber. We describe a novel method for obtaining phone-level judgements of pronunciation quality by utilizing non-expert, crowd-sourced, word-level judgements of pronunciation.

    Foreign language learners typically exhibit high variation and pronunciation patterns distinct from those of native speakers, which makes mispronunciation analysis difficult. We detail a simple but effective method for transforming the vowel space of non-native speakers to make mispronunciation detection more robust and accurate. We show that this transformation not only enhances performance on a simple classification task, but also results in distributions that can be better exploited for mispronunciation detection. The transformed vowel space is then used to train a mispronunciation detector with a variety of features derived from acoustic model scores and vowel class distributions. We confirm that the transformation technique results in more robust and accurate identification of mispronunciations than traditional acoustic models.

    by Mitchell A. Peabody. Ph.D.
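
    The vowel-space transformation described above can be sketched as speaker-level formant normalization. The snippet below uses Lobanov-style z-scoring per speaker; the thesis's actual transform is not reproduced here, so the function and its inputs are illustrative assumptions:

```python
import numpy as np

def lobanov_normalize(formants, speaker_ids):
    """Z-score each speaker's formant values (Lobanov-style vowel-space
    normalization). `formants` is (n_tokens, n_formants); `speaker_ids`
    labels the speaker of each token. An illustrative stand-in, not the
    thesis's exact method."""
    formants = np.asarray(formants, dtype=float)
    speaker_ids = np.asarray(speaker_ids)
    out = np.empty_like(formants)
    for spk in np.unique(speaker_ids):
        mask = speaker_ids == spk
        mu = formants[mask].mean(axis=0)
        sd = formants[mask].std(axis=0)
        out[mask] = (formants[mask] - mu) / sd  # per-speaker z-score
    return out
```

    After normalization, vowel tokens from different speakers lie on a comparable scale, which is what makes a shared classifier over native and non-native speech feasible.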

    Hizketa-ezagutzan oinarritutako estrategiak, euskarazko online OBHI (Ordenagailu Bidezko Hizkuntza Ikaskuntza) sistemetarako (Speech-recognition-based strategies for online Basque CALL systems)

    Get PDF
    211 p. (eng), 217 p. (eusk.). This thesis examines two implementations of automatic speech recognition for Basque in Computer-Aided Language Learning (CALL; Basque: OBHI) systems: Computer-Aided Pronunciation Training (Basque: OBEL) and Spoken Grammar Practice (Basque: AGP). In the classic pronunciation-training system, the user reads a sentence aloud and receives a score for each phoneme. For grammar practice, we propose a word-by-word utterance verification technique (Basque: HHEE), a system that checks exercises as they are being solved. Both systems rest on utterance verification techniques such as the Goodness of Pronunciation (GOP) score.

    To implement these systems, acoustic models had to be trained; for this we used the Basque Speecon-like database, the only publicly available database for Basque. To obtain good acoustic models, the database had to be adapted by creating a lexicon with alternative pronunciations, and phased training was also tried, yielding a phone error rate (PER) of 12.21%. The first system was tested under laboratory conditions, with competitive results. However, the pronunciation-training and grammar-practice systems in this thesis are intended for deployment in a client/server architecture so that learners can access them from anywhere. To enable this, HTML5 specifications are used to stream audio to the server as it is recorded, and a new online Cepstral Mean and Variance Normalisation (CMVN) technique is proposed to reduce the effect of channel differences in the audio signals recorded by users. This technique builds on a method introduced in this thesis, Multi Normalization Scoring (MNS), and a new online Voice Activity Detector (VAD) is also proposed on the same basis. Finally, different parameters were evaluated using neural networks, and the GOP score was found to be the most effective.
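
    The Goodness of Pronunciation (GOP) score underlying both systems can be sketched as follows, in the standard Witt-and-Young formulation; the log-likelihood inputs are assumed to come from an acoustic model and the values are illustrative:

```python
import numpy as np

def gop_score(loglik_target, loglik_all_phones, n_frames):
    """GOP of a phone segment: log-likelihood of the canonical phone
    minus that of the best competing phone, normalised by segment
    length. Scores near 0 suggest a good pronunciation; strongly
    negative scores suggest a mispronunciation."""
    return (loglik_target - np.max(loglik_all_phones)) / n_frames
```

    In practice a threshold on this score, tuned on held-out judgements, decides whether a phone is flagged to the learner.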

    Robust speech recognition with spectrogram factorisation

    Get PDF
    Communication by speech is intrinsic for humans. Since the breakthrough of mobile devices and wireless communication, digital transmission of speech has become ubiquitous. Similarly, the distribution and storage of audio and video data have increased rapidly. However, despite being technically capable of recording and processing audio signals, only a fraction of digital systems and services are actually able to work with spoken input, that is, to operate on the lexical content of speech. One persistent obstacle for practical deployment of automatic speech recognition systems is inadequate robustness against noise and other interferences, which regularly corrupt signals recorded in real-world environments. Speech and diverse noises are both complex signals, which are not trivially separable. Despite decades of research and a multitude of different approaches, the problem has not been solved to a sufficient extent.

    In particular, the mathematically ill-posed problem of separating multiple sources from a single-channel input requires advanced models and algorithms to be solvable. One promising path is using a composite model of long-context atoms to represent a mixture of non-stationary sources based on their spectro-temporal behaviour. Algorithms derived from the family of non-negative matrix factorisations have been applied to such problems to separate and recognise individual sources like speech.

    This thesis describes a set of tools developed for non-negative modelling of audio spectrograms, especially involving speech and real-world noise sources. An overview is provided of the complete framework, starting from model and feature definitions, advancing to factorisation algorithms, and finally describing different routes for separation, enhancement, and recognition tasks. Current issues and their potential solutions are discussed both theoretically and from a practical point of view. The included publications describe factorisation-based recognition systems, which have been evaluated on publicly available speech corpora in order to determine the efficiency of various separation and recognition algorithms. Several variants and system combinations that have been proposed in the literature are also discussed. The work covers a broad span of factorisation-based system components, which together aim at providing a practically viable solution to robust processing and recognition of speech in everyday situations.
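
    The factorisation at the core of such systems can be sketched with the classic Lee-Seung multiplicative-update algorithm, minimising Euclidean distance between a magnitude spectrogram V and the product of spectral atoms W with activations H; the rank and iteration count below are illustrative, not taken from the thesis:

```python
import numpy as np

def nmf(V, rank, n_iter=500, eps=1e-9, seed=0):
    """Factorise a non-negative matrix V (e.g. a magnitude spectrogram,
    frequency x time) as V ~ W @ H using multiplicative updates for
    Euclidean distance. A minimal sketch of the kind of factorisation
    the thesis framework builds on."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + eps  # spectral atoms
    H = rng.random((rank, m)) + eps  # time-varying activations
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

    Separation then amounts to training atom sets for speech and noise and reconstructing only the speech part of the mixture from its atoms and activations.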

    Computer tool for use by children with learning difficulties in spelling

    Get PDF
    The development of a computer tool to be used by children with learning difficulties in spelling is described in this thesis. Children with spelling disabilities were observed by the author, and their errors were recorded. Based on analysis of these errors, a scheme of error classification was devised. It was hypothesized that there were regularities in the errors; that the classification scheme describing these errors could provide adequate information to enable a computer program to 'debug' the children's errors and to reconstruct the intended words; and that the children would be able to recognize correct spellings even if they could not produce them. Two computer programs, EDITCOST and PHONCODE, were developed. These incorporated information about the types of errors made by the children, described in terms of the classification scheme. They were used both to test the hypotheses and as potential components of a larger program to be used as a compensatory tool.

    The main conclusions drawn from this research are: the errors made by children with learning difficulties in spelling show regularities both in the phoneme-grapheme correspondences and at the level of the orthography; the classification scheme developed, based on the children's errors, describes those errors and provides adequate information to enable a computer program to 'debug' them and reconstruct the intended words; and computer tools in the form of interactive spelling correctors are able to offer a correction for a substantial proportion of the child's errors, could be extended to provide more information about those errors, and are also suitable for use with other groups of children.
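
    An interactive corrector of the EDITCOST kind can be sketched as a weighted edit distance over a lexicon; the unit costs and function names below are illustrative, not the thesis's actual parameters:

```python
def edit_distance(a, b, sub_cost=1, ins_cost=1, del_cost=1):
    """Weighted Levenshtein distance between strings a and b. An
    EDITCOST-style corrector could lower the substitution cost for
    confusions that the error classification marks as common."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * del_cost
    for j in range(1, n + 1):
        d[0][j] = j * ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else sub_cost
            d[i][j] = min(d[i - 1][j] + del_cost,      # delete from a
                          d[i][j - 1] + ins_cost,      # insert into a
                          d[i - 1][j - 1] + cost)      # substitute/match
    return d[m][n]

def correct(attempt, lexicon):
    """Offer the lexicon word closest to the child's attempt."""
    return min(lexicon, key=lambda w: edit_distance(attempt, w))
```

    Presenting the closest matches, rather than auto-replacing, fits the finding that the children could often recognize a correct spelling they could not produce.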

    Introduction to Psycholinguistics

    Get PDF

    Modeling of Polish Intonation for Statistical-Parametric Speech Synthesis

    Get PDF
    Wydział Neofilologii (Faculty of Modern Languages). This work presents an attempt to build a neurobiologically inspired Convolutional Neural Network-based model of the mappings from discrete high-level linguistic categories to a continuous signal of fundamental frequency in Polish neutral read speech. After a brief introduction to the research problem in the context of intonation, speech synthesis, and the phonetics-phonology gap, the work describes the training of the model on a special speech corpus, and an evaluation of the naturalness of the F0 contour produced by the trained model through ABX and MOS perception experiments conducted with the help of a specially built Neural Source Filter resynthesizer. Finally, an in-depth exploration of the phonology-to-phonetics mappings learned by the model is presented; the Layer-wise Relevance Propagation explainability method was used to perform an extensive quantitative analysis of the relevance of 1297 specially engineered linguistic input features, and of their groupings at various levels of abstraction, for the specific contours of the fundamental frequency. The work ends with an in-depth interpretation of these results, a discussion of the strengths and weaknesses of the methods used, and a list of proposed improvements.

    The research presented in this work was partially carried out under Harmonia research grant no. UMO-2014/14/M/HS2/00631 awarded by the National Science Centre (Narodowe Centrum Nauki).
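
    The core mapping the model learns, from framewise linguistic categories to an F0 contour, can be sketched as a single zero-padded 1-D convolution over time; the real model is a deeper CNN, and the single layer, shapes, and weights here are illustrative assumptions:

```python
import numpy as np

def conv1d_f0(features, kernel, bias=0.0):
    """Map framewise linguistic features (n_frames x n_feats) to an F0
    contour with one 1-D convolution of odd length over the time axis.
    Each output frame pools a local window of linguistic context."""
    n_frames, _ = features.shape
    k_len = kernel.shape[0]
    pad = k_len // 2
    x = np.pad(features, ((pad, pad), (0, 0)))  # zero-pad in time
    f0 = np.empty(n_frames)
    for t in range(n_frames):
        # local window of linguistic context -> one F0 value
        f0[t] = np.sum(x[t:t + k_len] * kernel) + bias
    return f0
```

    Stacking such layers widens the temporal context each F0 value depends on, which is what makes phrase-level intonation patterns learnable.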

    L’individualità del parlante nelle scienze fonetiche: applicazioni tecnologiche e forensi (Speaker individuality in the phonetic sciences: technological and forensic applications)

    Full text link