8 research outputs found

    The Prosody of Uncertainty for Spoken Dialogue Intelligent Tutoring Systems

    Get PDF
    The speech medium is more than an audio conveyance of word strings: it carries meta-information about the content of the speech. The prosody of speech, its pauses and intonation, adds an extra dimension of diagnostic information about the quality of a speaker's answers, suggesting an important avenue of research for spoken dialogue tutoring systems. Tutoring systems that are sensitive to such cues may employ different tutoring strategies based on detected student uncertainty, and they may be able to perform a more precise assessment of the area of student difficulty. However, properly identifying the cues can be challenging, typically requiring thousands of hand-labeled utterances for training in machine learning. This study proposes and explores means of exploiting alternate, automatically generated information, namely utterance correctness and the amount of practice a student has had, as indicators of student uncertainty. It finds correlations between various prosodic features and these automatic indicators, compares the results against a small set of annotated utterances, and finally demonstrates a Bayesian classifier that uses correctness scores as class labels.
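    As a concrete illustration of the final step, the sketch below trains a Gaussian Naive Bayes classifier on a handful of prosodic features, using automatically obtained correctness labels as the class variable. This is a minimal sketch, not the study's implementation; the feature set (F0 statistics, pause length, speaking rate) and all values are illustrative placeholders.

```python
# Minimal sketch: correctness labels as a proxy for uncertainty.
# Feature values below are illustrative, not from the study's data.
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical per-utterance prosodic features:
# [mean F0 (Hz), F0 range (Hz), pre-speech pause (s), speaking rate (syll/s)]
X = np.array([
    [182.0, 55.0, 0.31, 4.8],   # fluent, correct answer
    [175.0, 48.0, 0.28, 5.1],
    [201.0, 92.0, 1.42, 3.2],   # hesitant, incorrect answer
    [195.0, 88.0, 1.10, 3.5],
])
y = np.array([1, 1, 0, 0])      # 1 = correct (proxy for certain), 0 = incorrect

clf = GaussianNB().fit(X, y)

# Score a new utterance: a long pause and wide F0 range suggest uncertainty
probe = np.array([[198.0, 90.0, 1.3, 3.4]])
print(clf.predict(probe), clf.predict_proba(probe))
```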

    Hierarchy-based Partition Models: Using Classification Hierarchies to

    Get PDF
    We propose a novel machine learning technique that can be used to estimate probability distributions for categorical random variables that are equipped with a natural set of classification hierarchies, such as words equipped with word class hierarchies, WordNet hierarchies, and suffix and affix hierarchies. We evaluate the estimator on bigram language modelling with a hierarchy based on word suffixes, using English, Danish, and Finnish data from the Europarl corpus with training sets of up to 1–1.5 million words. The results show that the proposed estimator outperforms modified Kneser-Ney smoothing in terms of perplexity on unseen data. This suggests that important information is hidden in the classification hierarchies that we routinely use in computational linguistics, but that we are unable to utilize this information fully because our current statistical techniques are either based on simple counting models or designed for sample spaces with a distance metric, rather than sample spaces with a non-metric topology given by a classification hierarchy.
    Keywords: machine learning; categorical variables; classification hierarchies; language modelling; statistical estimation
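    The following sketch illustrates the general flavor of hierarchy-based estimation, not the paper's estimator: a bigram probability is interpolated with counts pooled over a suffix class of the conditioning word, so rare or unseen words borrow statistics from words that share a suffix. The suffix length and mixing weight are arbitrary choices for illustration.

```python
# Minimal sketch: back off a bigram estimate through a suffix class.
from collections import Counter

corpus = "the cats walked slowly the dogs walked quickly".split()
bigrams = list(zip(corpus, corpus[1:]))

def suffix(w, k=3):
    return w[-k:]

word_bi = Counter(bigrams)
word_uni = Counter(corpus[:-1])
# Counts pooled at the suffix-class level of the conditioning word
class_bi = Counter((suffix(w1), w2) for w1, w2 in bigrams)
class_uni = Counter(suffix(w1) for w1, _ in bigrams)

def p_bigram(w2, w1, lam=0.7):
    """Interpolate the word-level MLE with the suffix-class MLE."""
    p_word = word_bi[(w1, w2)] / word_uni[w1] if word_uni[w1] else 0.0
    s = suffix(w1)
    p_class = class_bi[(s, w2)] / class_uni[s] if class_uni[s] else 0.0
    return lam * p_word + (1 - lam) * p_class

# "talked" never occurs as a conditioning word, but its suffix class "ked"
# (shared with "walked") still gives "slowly" nonzero probability mass.
print(p_bigram("slowly", "talked"))   # 0.15
```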

    Gesture in Automatic Discourse Processing

    Get PDF
    Computers cannot fully understand spoken language without access to the wide range of modalities that accompany speech. This thesis addresses the particularly expressive modality of hand gesture, and focuses on building structured statistical models at the intersection of speech, vision, and meaning. My approach is distinguished in two key respects. First, gestural patterns are leveraged to discover parallel structures in the meaning of the associated speech. This differs from prior work that attempted to interpret individual gestures directly, an approach that was prone to a lack of generality across speakers. Second, I present novel, structured statistical models for multimodal language processing, which enable learning about gesture in its linguistic context, rather than in the abstract. These ideas find successful application in a variety of language processing tasks: resolving ambiguous noun phrases, segmenting speech into topics, and producing keyframe summaries of spoken language. In all three cases, the addition of gestural features, extracted automatically from video, yields significantly improved performance over a state-of-the-art text-only alternative. This marks the first demonstration that hand gesture improves automatic discourse processing.
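    To make the multimodal idea concrete, here is a minimal sketch (not the thesis's structured models) of how a gestural cue can supplement a textual cue when deciding whether two noun phrases corefer; the features and all values below are illustrative placeholders.

```python
# Minimal sketch: combine a textual cue with an automatically extracted
# gestural cue (e.g., similarity of hand position during two NP mentions).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Features per NP pair: [string match (0/1), hand-position similarity (0..1)]
X = np.array([
    [1, 0.9],  # same words, speaker gestures to the same region -> corefer
    [0, 0.8],  # different words, but matching deictic gesture -> corefer
    [1, 0.1],  # same words, unrelated gestures -> still corefer
    [0, 0.2],  # no textual or gestural agreement -> distinct referents
])
y = np.array([1, 1, 1, 0])

model = LogisticRegression().fit(X, y)

# A textually ambiguous pair: the gesture feature tips the decision
print(model.predict_proba([[0, 0.85]]))
```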

    A Vectorized Processing Algorithm for Continuous Speech Recognition and Associated FPGA-Based Architecture

    Get PDF
    This work analyzes Continuous Automatic Speech Recognition (CSR) and, in contrast to prior work, shows that CSR algorithms can be specified in a highly parallel form. Through use of the MATLAB software package, the parallelism is exploited to create a compact, vectorized algorithm that is able to execute the CSR task. After an in-depth analysis of the SPHINX 3 Large Vocabulary Continuous Speech Recognition (LVCSR) engine, the major functional units were redesigned in the MATLAB environment, with special effort taken to flatten the algorithms and restructure the data to allow for matrix-based computations. Performing this conversion reduced the original 14,000 lines of C++ code to fewer than 200 lines of highly vectorized operations, substantially increasing the potential instruction-level parallelism of the system. Using this vector model as a baseline, a custom hardware system was then created that is capable of performing the speech recognition task in real time on a Xilinx Virtex-4 FPGA device. Through the creation of independent hardware engines for each stage of the speech recognition process, the throughput of each is maximized by customizing the logic to the specific task. Further, a unique architecture was designed that allows for a static data path throughout the hardware, effectively removing the need for complex bus arbitration in the system. By making use of shared memory resources and applying a token-passing scheme to the system, both the data movement within the design and the amount of active data are continually minimized during run-time. These results provide a novel method for performing speech recognition in both hardware and software, helping to further the development of systems capable of recognizing human speech.
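    The sketch below illustrates the flattening idea in NumPy rather than MATLAB: the Gaussian mixture scores for every tied HMM state are computed against one frame in a few matrix expressions instead of nested loops, as in GMM acoustic scoring. The shapes and values are illustrative and are not taken from SPHINX 3.

```python
# Minimal sketch: vectorized GMM acoustic scoring for one frame.
import numpy as np

rng = np.random.default_rng(0)
D, G = 13, 8                      # feature dimension, Gaussians per mixture
S = 200                           # number of senones (tied HMM states)

means = rng.normal(size=(S, G, D))
variances = rng.uniform(0.5, 2.0, size=(S, G, D))
log_weights = np.log(np.full((S, G), 1.0 / G))
frame = rng.normal(size=D)        # one MFCC frame

# Diagonal-covariance Gaussian log-densities for all S*G components at once
diff = frame - means                                            # (S, G, D)
log_norm = -0.5 * (D * np.log(2 * np.pi) + np.log(variances).sum(-1))
log_dens = log_norm - 0.5 * ((diff ** 2) / variances).sum(-1)   # (S, G)

# Log-sum-exp over mixture components -> one score per senone
comp = log_weights + log_dens
m = comp.max(axis=1, keepdims=True)
senone_scores = (m + np.log(np.exp(comp - m).sum(1, keepdims=True))).ravel()
print(senone_scores.shape)        # (200,)
```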

    Methods for pronunciation assessment in computer aided language learning

    Get PDF
    Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2011. Cataloged from PDF version of thesis. Includes bibliographical references (p. 149-176).
    Learning a foreign language is a challenging endeavor that entails acquiring a wide range of new knowledge including words, grammar, gestures, sounds, etc. Mastering these skills requires extensive practice by the learner, and opportunities to practice may not always be available. Computer Aided Language Learning (CALL) systems provide non-threatening environments where foreign language skills can be practiced wherever and whenever a student desires. These systems often include several technologies to identify the different types of errors made by a student. This thesis focuses on the problem of identifying mispronunciations made by a foreign language student using a CALL system. We make several assumptions about the nature of the learning activity: it takes place using a dialogue system, it is a task- or game-oriented activity, the student should not be interrupted by the pronunciation feedback system, and the goal of the feedback system is to identify severe mispronunciations with high reliability. Detecting mispronunciations requires a corpus of speech with human judgements of pronunciation quality. Typical approaches to collecting such a corpus use an expert phonetician to both phonetically transcribe and assign judgements of quality to each phone in a corpus. This is time consuming and expensive, and it places an extra burden on the transcriber. We describe a novel method for obtaining phone-level judgements of pronunciation quality by utilizing non-expert, crowd-sourced, word-level judgements of pronunciation. Foreign language learners typically exhibit high variation, and their pronunciations take shapes distinct from those of native speakers, making analysis for mispronunciation difficult. We detail a simple but effective method for transforming the vowel space of non-native speakers to make mispronunciation detection more robust and accurate. We show that this transformation not only enhances performance on a simple classification task, but also results in distributions that can be better exploited for mispronunciation detection. This transformation of the vowel space is then exploited to train a mispronunciation detector using a variety of features derived from acoustic model scores and vowel class distributions. We confirm that the transformation technique results in a more robust and accurate identification of mispronunciations than traditional acoustic models.
    by Mitchell A. Peabody. Ph.D.
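    As one plausible reading of the vowel-space transformation (the thesis's exact method is not specified in the abstract), the sketch below fits an affine map from a non-native speaker's per-vowel formant means onto a native reference by least squares; all formant values are illustrative placeholders.

```python
# Minimal sketch: map a non-native vowel space onto a native reference
# with an affine transform fit on per-vowel (F1, F2) means.
import numpy as np

# Per-vowel mean (F1, F2) in Hz: rows are /iy/, /ae/, /aa/, /uw/
native    = np.array([[270, 2290], [660, 1720], [730, 1090], [300,  870]])
nonnative = np.array([[310, 2100], [600, 1650], [680, 1150], [350,  980]])

# Fit x_native ~= A @ x_nonnative + b via least squares on augmented coords
X = np.hstack([nonnative, np.ones((len(nonnative), 1))])   # (4, 3)
W, *_ = np.linalg.lstsq(X, native, rcond=None)             # (3, 2)

def transform(f1f2):
    """Project a non-native (F1, F2) measurement into the native space."""
    return np.append(f1f2, 1.0) @ W

print(transform(np.array([320, 2050])))   # pulled toward the native /iy/
```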

    An embedded automatic speech recognition platform to assist people with reduced mobility

    Get PDF
    The pursuit of greater independence and autonomy for people with disabilities has proven to be a decisive factor in improving their quality of life through the use of assistive technologies. Speech is the most basic, common, and efficient form of communication between human beings, so voice command input can be an alternative for people with reduced mobility who retain good speech ability to control a computer or other devices. The goal of this work is to develop a voice command interface, based on automatic speech recognition, that can be easily adapted and incorporated into systems and tools that help control the home environment (home automation). To this end, two development approaches were carried out. The first consisted of a pilot experiment intended to build an initial base of knowledge for developing applications that use voice command recognition. This stage relied on a dedicated hardware module that receives voice commands directly through a microphone, forming a speaker-dependent system capable of recognizing isolated-word commands to control the lights of an RGB LED. The second approach integrates open hardware with free and open-source software components, with voice commands delivered to the system through a smartphone configured with a VoIP (Voice over IP) softphone. In this case, the softphone registers with the Asterisk communication server, which implements a telephone exchange with an interactive voice response (IVR) unit. Integrated with the server is the speech recognition tool Julius. These components are embedded on the low-cost BeagleBone Black platform. The system is speaker dependent and able to recognize three-word phrases to control the lighting, the television, and door access in a hypothetical home environment consisting of a living room, kitchen, bedroom, bathroom, and outdoor area. The results obtained from the tests indicate accuracy rates of 95.9% and 94.77% for the interfaces developed in the first and second approaches, respectively. These figures suggest that it is feasible to employ the developed recognition modules in implementing assistive technology solutions.
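    To make the second approach concrete, here is a minimal sketch of the command-dispatch layer only: the phrases, device names, and print-based actuation are hypothetical, and the integration with the recognizer (for example, Julius's TCP module mode) and with Asterisk is assumed rather than shown.

```python
# Minimal sketch: map three-word recognized phrases to device actions.
COMMANDS = {
    ("turn", "on", "light"):   ("living-room-light", "on"),
    ("turn", "off", "light"):  ("living-room-light", "off"),
    ("turn", "on", "tv"):      ("tv", "on"),
    ("open", "front", "door"): ("front-door", "open"),
}

def dispatch(phrase: str) -> None:
    """Look up a recognized three-word phrase and act on the device."""
    key = tuple(phrase.lower().split())
    if key in COMMANDS:
        device, action = COMMANDS[key]
        print(f"{device} -> {action}")      # replace with a GPIO/relay call
    else:
        print(f"unrecognized command: {phrase!r}")

dispatch("turn on light")
dispatch("open front door")
```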