Large vocabulary continuous speech recognition of highly inflectional language (Czech).

Abstract

The thesis concerns the development of a large vocabulary continuous speech recognition (LVCSR) system for highly inflectional languages, with special emphasis on language modeling. The idea and uses of automatic speech recognition are introduced, and the basic principles of the statistical approach to speech recognition and the decomposition of the system into its basic components are explained. An overview of existing statistical language modeling techniques is given, and methods for inferring reliable probability estimates from sparse data, as well as measures of language model quality, are described. A theoretical background to finite-state machinery is provided, together with the application of the finite-state machine framework to LVCSR. The goals of the thesis were to build an LVCSR system for the Czech language using standard techniques that had been used for English, to analyze the system's performance, and to propose and implement techniques that would improve recognition accuracy. The development of the baseline system is described. The properties of the Czech language, especially from the point of view of automatic speech recognition, are analyzed. The outcomes of this theoretical analysis are exploited, and language models that take into account the specific features of Czech are presented. Class-based language models are described; they strengthen language model robustness, thereby reducing perplexity and consequently improving recognition accuracy. Finally, a model that uses subword units (morphemes) as the basic language modeling units is introduced. Such a model offers better coverage of unseen text than standard word-based models of the same vocabulary size.

Available from STL Prague, CZ / NTK - National Technical Library (SIGLE, CZ, Czech Republic)

    Last time updated on 14/06/2016