7,742 research outputs found

    Analysing recognition errors in unlimited-vocabulary speech recognition

    Advances in unlimited-vocabulary speech recognition for morphologically rich languages

    Automatic speech recognition systems are devices or computer programs that convert human speech into text or perform actions based on what is said to the system. Typical applications include dictation, automatic transcription of large audio or video databases, speech-controlled user interfaces, and automated telephone services. If the recognition system is not limited to a certain topic and vocabulary, covering the words of the target language as well as possible while maintaining high recognition accuracy becomes an issue. The conventional way to model the target language, especially in English recognition systems, is to limit the recognition to the most common words of the language. A vocabulary of 60 000 words is usually enough to cover the language adequately for arbitrary topics. In morphologically rich languages such as Finnish, Estonian and Turkish, on the other hand, long words can be formed by inflecting and compounding, which makes it difficult to cover the language adequately with vocabulary-based approaches. This thesis deals with methods that can be used to build efficient speech recognition systems for morphologically rich languages. Before the statistical n-gram language models are trained on a large text corpus, the words in the corpus are automatically segmented into smaller fragments, referred to as morphs. The morphs are then used as the modelling units of the n-gram models instead of whole words. This makes it possible to train the model on the whole text corpus without limiting the vocabulary, and it enables the model to create even unseen words by joining morphs together. Since the segmentation algorithm is unsupervised and data-driven, it can be readily used for many languages. Speech recognition experiments are carried out on various Finnish recognition tasks, and some of the experiments are repeated on an Estonian task. It is shown that the morph-based language models reduce recognition errors when compared to word-based models. It seems to be important, however, that the n-gram models are allowed to use long morph contexts, especially if the morphs used by the model are short. This can be achieved by using growing and pruning algorithms to train variable-length n-gram models. The thesis also presents data structures that can be used for representing the variable-length n-gram models efficiently in recognition systems. An analysis of the recognition errors made by Finnish recognition systems shows that speaker adaptive training and discriminative training methods help to reduce errors in different situations. The errors are also analysed according to word frequencies and manually defined error classes.
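
The idea of using morphs as n-gram units can be illustrated with a minimal sketch. It assumes a hand-made segmentation table (`SEGMENTATION`) and a word-boundary token (`BOUNDARY`), both hypothetical; the unsupervised segmentation algorithm and the variable-length n-gram training described in the abstract are not reproduced here.

```python
from collections import Counter

# Hypothetical hand-made segmentation table. The thesis learns segmentations
# with an unsupervised, data-driven algorithm, which is not reproduced here.
SEGMENTATION = {
    "talossa": ["talo", "ssa"],
    "talosta": ["talo", "sta"],
    "autossa": ["auto", "ssa"],
}

BOUNDARY = "<w>"  # assumed marker so that morphs can be rejoined into words


def to_morphs(sentence):
    """Replace each word by its morphs, separated by word-boundary tokens."""
    tokens = [BOUNDARY]
    for word in sentence.split():
        tokens.extend(SEGMENTATION.get(word, [word]))
        tokens.append(BOUNDARY)
    return tokens


def bigram_counts(sentences):
    """Count morph bigrams over a corpus (the raw statistics of a 2-gram model)."""
    counts = Counter()
    for sentence in sentences:
        tokens = to_morphs(sentence)
        counts.update(zip(tokens, tokens[1:]))
    return counts


def join_morphs(tokens):
    """Rebuild words from a morph sequence using the boundary tokens."""
    words, current = [], []
    for token in tokens:
        if token == BOUNDARY:
            if current:
                words.append("".join(current))
            current = []
        else:
            current.append(token)
    if current:
        words.append("".join(current))
    return words


counts = bigram_counts(["talossa", "talosta", "autossa"])

# "autosta" never occurs in the toy corpus, yet both of its morphs
# ("auto", "sta") are in the model's unit inventory, so a recogniser
# that outputs morphs can still produce it.
unseen = join_morphs([BOUNDARY, "auto", "sta", BOUNDARY])
```

The bigram ("auto", "sta") itself has a zero count in this toy corpus, so a real model would need smoothing or back-off to assign it probability; the sketch only shows why the unit inventory no longer limits the vocabulary.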

    Morphologically motivated word classes for very large vocabulary speech recognition of Finnish and Estonian

    We study class-based n-gram and neural network language models for very large vocabulary speech recognition of two morphologically rich languages: Finnish and Estonian. Due to morphological processes such as derivation, inflection and compounding, the models need to be trained with vocabulary sizes of several millions of word types. Class-based language modelling is in this case a powerful approach to alleviate the data sparsity and reduce the computational load. For a very large vocabulary, bigram statistics may not be an optimal way to derive the classes. We thus study utilizing the output of a morphological analyzer to achieve efficient word classes. We show that efficient classes can be learned by refining the morphological classes to smaller equivalence classes using merging, splitting and exchange procedures with suitable constraints. This type of classification can improve the results, particularly when language model training data is not very large. We also extend the previous analyses by rescoring the hypotheses obtained from a very large vocabulary recognizer using class-based neural network language models. We show that despite the fixed vocabulary, carefully constructed classes for word-based language models can in some cases result in lower error rates than subword-based unlimited vocabulary language models.
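
For orientation, the usual textbook factorisation of a class-based bigram model is sketched below. The abstract does not state the exact formulation used in the paper, so this is an assumption; $c(w)$ denotes the single class assigned to word $w$.

```latex
P(w_i \mid w_{i-1}) \;\approx\; P\bigl(c(w_i) \mid c(w_{i-1})\bigr)\, P\bigl(w_i \mid c(w_i)\bigr)
```

With millions of word types but far fewer classes, the class transition term has far fewer parameters to estimate than a word bigram, which is one way to read the abstract's point about data sparsity and computational load.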

    Keskusteluavustimen kehittäminen kuulovammaisia varten automaattista puheentunnistusta käyttäen [Developing a conversation assistant for the hearing impaired using automatic speech recognition]

    Understanding and participating in conversations has been reported as one of the biggest challenges hearing impaired people face in their daily lives. These communication problems have been shown to have wide-ranging negative consequences, affecting their quality of life and the opportunities available to them in education and employment. A conversational assistance application was investigated to alleviate these problems. The application uses automatic speech recognition technology to provide real-time speech-to-text transcriptions to the user, with the goal of helping deaf and hard of hearing persons in conversational situations. To validate the method and investigate its usefulness, a prototype application was developed for testing purposes using open-source software. A user test was designed and performed with test participants representing the target user group. The results indicate that the Conversation Assistant method is valid, meaning it can help the hearing impaired to follow and participate in conversational situations. Speech recognition accuracy, especially in noisy environments, was identified as the primary target for further development to increase the usefulness of the application. Recognition speed, by contrast, was deemed sufficient and already clearly surpasses the transcription speed of human transcribers.

    IMAGINE Final Report

    MILITARY COMMUNICATIONS AND INFORMATION SYSTEMS CONFERENCE Is speech technology ready for use now?

    Research and development in speech technology has been performed for almost 30 years now. Starting from experimental systems, a set of products has been developed in this time. From the point of view of potential users, the main question remains: has the technology reached a state where it can be used meaningfully? This paper discusses this question and gives an overview of the tasks that speech recognition can solve and the state of usability for each task.

    The use of scaffolding-based software in developing pronunciation

    This research looked at the use of scaffolding-based software in helping learners to develop pronunciation and fluency modelled on standard American English. The study used Vygotsky’s zone of proximal development (ZPD) theory and Scaffolding Learning Principles as a basis for observing how learners of English progressed through the learning process. Firstly, the research examined an accent-reduction software package to find out how its design supports scaffolding principles. To determine the effectiveness of the software on learners’ general pronunciation, a pre-test and a post-test were used. The data obtained from the two tests showed a significant improvement in learners’ general pronunciation after using the pronunciation learning software. Secondly, case studies were conducted to investigate Persian ESL learners’ progress in pronouncing English consonants that are absent from the phonemic inventory of Persian. The selected cases were recorded during class time, while they were working with the software. The recordings were then analysed using PRAAT, a speech analysis programme. Later, two raters helped the researcher to determine the quality of the sounds produced by the learners. The results from the case studies showed that, with the appropriate scaffolds provided by the software in the form of explicit instruction, native models and multimodal feedback, the learners showed microgenesis improvements towards the native model and progressed within the ZPD to pronounce the consonants that were absent from the inventory system of their first language. Finally, learners were asked about their perceptions of the software in an interview session after the instructional programme. Based on their responses to the interview questions, it was found that the learners positively perceived the use of the scaffolding-based accent reduction software to improve their general pronunciation.

    Viseme-based Lip-Reading using Deep Learning

    Research in automated lip reading is a rich discipline with many facets that have been the subject of investigation, including audio-visual data, feature extraction, classification networks and classification schemas. The most advanced lip-reading systems can predict entire sentences with thousands of different words, and the majority of them use ASCII characters as the classification schema. The classification performance of such systems, however, has been insufficient, and the need to cover an ever-expanding range of vocabulary using as few classes as possible is a challenge. The work in this thesis contributes to the area of classification schemas by proposing an automated lip-reading model that predicts sentences using visemes as the classification schema. This is an alternative to ASCII characters, the conventional class system used to predict sentences. This thesis provides a review of current trends in deep learning-based automated lip reading and addresses a gap in the research on classification schemas. A new line of research is opened up in which an alternative way to do lip reading is explored and, in doing so, lip-reading results for predicting sentences from a benchmark dataset are attained that improve upon the current state of the art. In this thesis, a neural network-based lip-reading system is proposed. The system is lexicon-free and uses purely visual cues. With only a limited number of visemes as classes to recognise, the system is designed to lip read sentences covering a wide range of vocabulary and to recognise words that may not be included in system training. The lip-reading system predicts sentences in a two-stage procedure, with visemes recognised in the first stage and words classified in the second stage. The second stage therefore has to overcome both the one-to-many mapping problem posed in lip reading, where one set of visemes can map to several words, and the problem of visemes being confused or misclassified to begin with. To develop the proposed lip-reading system, a number of tasks have been performed in this thesis. These include the classification of continuous sequences of visemes, and the proposal of viseme-to-word conversion models that are both effective at predicting words and robust to viseme confusion or misclassification. The initial system reported has been tested on the challenging BBC Lip Reading Sentences 2 (LRS2) benchmark dataset, attaining a word accuracy rate of 64.6%. Compared with the state-of-the-art works in lip reading sentences reported at the time, the system achieved significantly improved performance. The lip-reading system is further improved by using a language model that has been demonstrated to be effective at discriminating between homopheme words and robust to incorrectly classified visemes. This yields improved performance in predicting spoken sentences from the LRS2 dataset, with a word accuracy rate of 79.6%. This is still better than another lip-reading system trained and evaluated on the same dataset, which attained a word accuracy rate of 77.4% and is, to the best of our knowledge, the next best observed result attained on LRS2.
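
The one-to-many mapping problem mentioned above can be made concrete with a small sketch. The phoneme-to-viseme table and the pronunciation lexicon below are hypothetical and heavily simplified; they are not the viseme inventory or the conversion models used in the thesis.

```python
from collections import defaultdict

# Hypothetical, heavily simplified phoneme-to-viseme table. Real viseme
# inventories are larger and differ between studies.
PHONEME_TO_VISEME = {
    "p": "P", "b": "P", "m": "P",  # bilabials look alike on the lips
    "f": "F", "v": "F",
    "t": "T", "d": "T", "n": "T",
    "ae": "A",
}

# Hypothetical toy pronunciation lexicon (word -> phoneme list).
LEXICON = {
    "pat": ["p", "ae", "t"],
    "bat": ["b", "ae", "t"],
    "mat": ["m", "ae", "t"],
    "ban": ["b", "ae", "n"],
    "fat": ["f", "ae", "t"],
}


def viseme_sequence(phonemes):
    """Map a phoneme sequence to the viseme sequence a camera would see."""
    return tuple(PHONEME_TO_VISEME[p] for p in phonemes)


def build_viseme_index(lexicon):
    """Group words by viseme sequence; each group is a set of homophemes."""
    index = defaultdict(list)
    for word, phonemes in lexicon.items():
        index[viseme_sequence(phonemes)].append(word)
    return index


index = build_viseme_index(LEXICON)

# One viseme sequence, several candidate words: the second stage has to
# pick among these, typically with help from sentence context.
candidates = sorted(index[("P", "A", "T")])
```

In this toy setup, four of the five words collapse onto the same viseme sequence, which is why the abstract stresses a language model that can discriminate between homopheme words.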

    EFL Students' Experience with Learning Pronunciation

    The aim of this thesis is to investigate and analyse the experience of EFL students with teaching and learning pronunciation from elementary to secondary school, and at the same time to investigate and analyse their opinion about the emphasis that should be given to pronunciation in school classes. To collect their opinions, the author created an anonymous questionnaire, which was filled out by thirty EFL students of the Technical University of Liberec. Concerning the students' experience, the author mainly focuses on the position of pronunciation among other language skills and systems, and also explores which of them were given the most and the least attention in school classes. Next, the importance of the teacher in teaching and learning pronunciation, and the importance of other sources that contributed to the students' improvement of English pronunciation, were analysed. The research showed that the respondents acknowledge the importance of pronunciation in English classes, but their own experiences do not correlate with their opinions.

    Towards an automatic speech recognition system for use by deaf students in lectures

    According to the Royal National Institute for Deaf People there are nearly 7.5 million hearing-impaired people in Great Britain. Human-operated machine transcription systems, such as Palantype, achieve low word error rates in real time. The disadvantage is that they are very expensive to use because of the difficulty in training operators, making them impractical for everyday use in higher education. Existing automatic speech recognition systems also achieve low word error rates, the disadvantage being that they work only for read speech in a restricted domain. Moving a system to a new domain requires a large amount of relevant data for training acoustic and language models. The adopted solution makes use of an existing continuous speech phoneme recognition system as a front-end to a word recognition sub-system. The sub-system generates a lattice of word hypotheses using dynamic programming, with robust parameter estimation obtained using evolutionary programming. Sentence hypotheses are obtained by parsing the word lattice using a beam search and contributing knowledge consisting of anti-grammar rules, which check the syntactic incorrectness of word sequences, and word frequency information. On an unseen spontaneous lecture taken from the Lund Corpus, and using a dictionary containing 2637 words, the system achieved 81.5% words correct with 15% simulated phoneme error, and 73.1% words correct with 25% simulated phoneme error. The system was also evaluated on 113 Wall Street Journal sentences. The achievements of the work are: a domain-independent method, using the anti-grammar, to reduce the word lattice search space whilst allowing normal spontaneous English to be spoken; a system designed to allow integration with new sources of knowledge, such as semantics or prosody, providing a test-bench for determining the impact of different knowledge upon word lattice parsing without the need for the underlying speech recognition hardware; and the robustness of the word lattice generation, using parameters that withstand changes in vocabulary and domain.
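
A rough sketch of how a beam search might combine lattice scores, word frequencies and anti-grammar rules is given below. It is an illustration under strong assumptions, not the parser from the thesis: the lattice is simplified to a sequence of slots, the anti-grammar is reduced to a set of forbidden word pairs, and every score, frequency and rule is invented for the example.

```python
import math


def beam_search(lattice, word_freq, anti_grammar, beam_width=3):
    """Return the best word sequence through a simplified word lattice.

    lattice      -- list of slots; each slot maps candidate word -> log match score
    word_freq    -- relative word frequencies, used here as a unigram prior
    anti_grammar -- set of (previous, next) pairs treated as syntactically incorrect
    """
    beam = [(0.0, [])]  # (accumulated log score, words so far)
    for slot in lattice:
        extended = []
        for score, words in beam:
            for word, match_score in slot.items():
                if words and (words[-1], word) in anti_grammar:
                    continue  # prune sequences the anti-grammar rules out
                prior = math.log(word_freq.get(word, 1e-6))
                extended.append((score + match_score + prior, words + [word]))
        # Keep only the highest-scoring partial hypotheses.
        beam = sorted(extended, key=lambda h: h[0], reverse=True)[:beam_width]
    return beam[0][1] if beam else []


# Invented toy data: three slots with competing word hypotheses.
lattice = [
    {"the": -0.2, "a": -0.3},
    {"a": -0.1, "cat": -0.4},
    {"sat": -0.2},
]
word_freq = {"the": 0.05, "a": 0.04, "cat": 0.001, "sat": 0.001}
anti_grammar = {("the", "a"), ("a", "a")}  # hypothetical rules

best = beam_search(lattice, word_freq, anti_grammar)
best_unconstrained = beam_search(lattice, word_freq, set())
```

Without the anti-grammar the frequent word "a" wins the second slot on frequency alone; with it, the implausible sequences are pruned before they can crowd out "the cat sat". Negative rules of this kind only shrink the search space and do not require a full grammar of spontaneous speech.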