110 research outputs found

    Automatic Framework to Aid Therapists to Diagnose Children who Stutter


    Frame-level features conveying phonetic information for language and speaker recognition

    150 p. This thesis, developed in the Software Technologies Working Group of the Department of Electricity and Electronics of the University of the Basque Country, focuses on the research field of spoken language and speaker recognition technologies. More specifically, the research carried out studies the design of a set of features conveying spectral acoustic and phonotactic information, searches for the optimal feature extraction parameters, and analyses the integration and usage of the features in language recognition systems, as well as the complementarity of these approaches with regard to state-of-the-art systems. The study reveals that systems trained on the proposed set of features, denoted Phone Log-Likelihood Ratios (PLLRs), are highly competitive, outperforming other state-of-the-art systems on several benchmarks. Moreover, PLLR-based systems also provide complementary information with regard to other phonotactic and acoustic approaches, which makes them suitable in fusions to improve the overall performance of spoken language recognition systems. The usage of these features is also studied in speaker recognition tasks. In this context, the results attained by the approaches based on PLLR features are not as remarkable as those of systems based on standard acoustic features, but they still provide complementary information that can be used to enhance the overall performance of speaker recognition systems.
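    PLLRs, as described in the literature this thesis builds on, are frame-level log-odds transforms of the posterior probabilities emitted by a phone recogniser. Below is a minimal sketch of that transform, assuming NumPy; the phone decoder, dimensionality reduction, and normalisation steps the thesis itself would detail are not reproduced, and the posterior matrix here is random placeholder input.

```python
import numpy as np

def pllr_features(phone_posteriors, eps=1e-10):
    """Map frame-level phone posteriors to PLLR (log-odds) features.

    phone_posteriors: (n_frames, n_phones) array, each row summing to 1.
    Returns an array of the same shape with logit-transformed values.
    """
    p = np.clip(phone_posteriors, eps, 1.0 - eps)  # avoid log(0)
    return np.log(p / (1.0 - p))                   # log-odds of each phone vs. the rest

# Placeholder posteriors standing in for a phone decoder's output:
# 300 frames, 40 phone classes.
posteriors = np.random.dirichlet(np.ones(40), size=300)
pllrs = pllr_features(posteriors)
print(pllrs.shape)  # (300, 40)
```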

    Unsupervised spoken keyword spotting and learning of acoustically meaningful units

    Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009. Cataloged from PDF version of thesis. Includes bibliographical references (p. 103-106). The problem of keyword spotting in audio data has been explored for many years. Typically, researchers use supervised methods to train statistical models to detect keyword instances. However, such supervised methods require large quantities of annotated data that are unlikely to be available for the majority of languages in the world. This thesis addresses this lack-of-annotation problem and presents two completely unsupervised spoken keyword spotting systems that do not require any transcribed data. In the first system, a Gaussian mixture model is trained, without any transcription information, to label speech frames with a Gaussian posteriorgram. Given several spoken samples of a keyword, segmental dynamic time warping is used to compare the Gaussian posteriorgrams between keyword samples and test utterances. The keyword detection result is then obtained by ranking the distortion scores of all the test utterances. In the second system, to avoid the need for spoken samples, a joint-multigram model is used to build a mapping from keyword text samples to Gaussian component indices. A keyword instance in the test data can then be detected by calculating the similarity score of the Gaussian component index sequences between keyword samples and test utterances. The two proposed systems are evaluated on the TIMIT and MIT Lecture corpora, and the results demonstrate their viability and effectiveness. Furthermore, encouraged by the success of using unsupervised methods to perform keyword spotting, we present a preliminary investigation into the unsupervised detection of acoustically meaningful units in speech. by Yaodong Zhang. S.M.
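    A minimal sketch of the first system's pipeline, assuming scikit-learn and NumPy: a GMM fitted on unlabelled frames produces a posteriorgram (per-frame posterior over its components), and a plain dynamic time warping pass scores a keyword posteriorgram against a test utterance. The thesis's segmental DTW with sliding windows and its distortion-ranking stage are not reproduced, and all inputs here are random placeholders for real MFCC frames.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def posteriorgram(gmm, frames):
    """Posterior probability of each Gaussian component for every frame."""
    return gmm.predict_proba(frames)  # (n_frames, n_components)

def dtw_distortion(P, Q, eps=1e-10):
    """Align two posteriorgrams with DTW; the frame-level distance is
    -log of the inner product of posterior vectors."""
    n, m = len(P), len(Q)
    local = -np.log(np.clip(P @ Q.T, eps, None))
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = local[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[n, m] / (n + m)  # length-normalised distortion

# Placeholder 39-dim MFCC-like frames; no transcriptions anywhere.
gmm = GaussianMixture(n_components=50, covariance_type="diag")
gmm.fit(np.random.randn(5000, 39))
keyword = posteriorgram(gmm, np.random.randn(60, 39))
utterance = posteriorgram(gmm, np.random.randn(400, 39))
print(dtw_distortion(keyword, utterance))  # lower = better match
```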

    Towards an automatic speech recognition system for use by deaf students in lectures

    According to the Royal National Institute for Deaf People, there are nearly 7.5 million hearing-impaired people in Great Britain. Human-operated machine transcription systems, such as Palantype, achieve low word error rates in real time. The disadvantage is that they are very expensive to use because of the difficulty of training operators, making them impractical for everyday use in higher education. Existing automatic speech recognition systems also achieve low word error rates, the disadvantage being that they only work for read speech in a restricted domain; moving a system to a new domain requires a large amount of relevant data for training acoustic and language models. The adopted solution makes use of an existing continuous speech phoneme recognition system as a front-end to a word recognition sub-system. The sub-system generates a lattice of word hypotheses using dynamic programming, with robust parameter estimation obtained using evolutionary programming. Sentence hypotheses are obtained by parsing the word lattice using a beam search and contributing knowledge consisting of anti-grammar rules, which check the syntactic incorrectness of word sequences, and word frequency information. On an unseen spontaneous lecture taken from the Lund Corpus, using a dictionary containing 2637 words, the system achieved 81.5% words correct with 15% simulated phoneme error, and 73.1% words correct with 25% simulated phoneme error. The system was also evaluated on 113 Wall Street Journal sentences. The achievements of the work are: a domain-independent method, using the anti-grammar, to reduce the word lattice search space whilst allowing normal spontaneous English to be spoken; a system designed to allow integration with new sources of knowledge, such as semantics or prosody, providing a test-bench for determining the impact of different knowledge sources upon word lattice parsing without the need for the underlying speech recognition hardware; and the robustness of the word lattice generation, using parameters that withstand changes in vocabulary and domain.
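    A toy sketch of the lattice-parsing idea, in Python: sentence hypotheses are expanded through a small hand-made word lattice under a beam, and any hypothesis that triggers an anti-grammar rule (here, invented forbidden word bigrams) is pruned outright. The thesis's actual rules, lattice format, word-frequency knowledge, and scoring are not reproduced; this only illustrates the pruning mechanism.

```python
import heapq

# Hypothetical lattice: edges (start_node, end_node, word, log_score).
lattice = [
    (0, 1, "the", -1.0), (0, 1, "a", -1.5),
    (1, 2, "cat", -2.0), (1, 2, "cap", -2.2),
    (2, 3, "sat", -1.8), (2, 3, "at", -2.5),
]
FINAL_NODE = 3

# Invented anti-grammar: word bigrams ruled syntactically impossible.
anti_grammar = {("cat", "at"), ("cap", "at")}

def beam_parse(lattice, beam_width=3):
    """Beam search over the word lattice, pruning anti-grammar violations."""
    edges_from = {}
    for start, end, word, score in lattice:
        edges_from.setdefault(start, []).append((end, word, score))
    beam = [(0.0, 0, ())]  # (cost = negated log score, node, words so far)
    complete = []
    while beam:
        expanded = []
        for cost, node, words in beam:
            if node == FINAL_NODE:
                complete.append((cost, words))
                continue
            for end, word, score in edges_from.get(node, []):
                if words and (words[-1], word) in anti_grammar:
                    continue  # anti-grammar rule fires: discard hypothesis
                expanded.append((cost - score, end, words + (word,)))
        beam = heapq.nsmallest(beam_width, expanded)  # keep best hypotheses
    return min(complete) if complete else None

print(beam_parse(lattice))  # (4.8, ('the', 'cat', 'sat'))
```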