37 research outputs found

    System for fast lexical and phonetic spoken term detection in a Czech cultural heritage archive

    The main objective of the work presented in this paper was to develop a complete system that would accomplish the original visions of the MALACH project. Those goals were to employ automatic speech recognition and information retrieval techniques to provide improved access to the large video archive containing recorded testimonies of Holocaust survivors. The system has so far been developed for the Czech part of the archive only. It takes advantage of a state-of-the-art speech recognition system tailored to the challenging properties of the recordings in the archive (elderly speakers, spontaneous speech, emotionally loaded content) and its close coupling with the actual search engine. The design of the algorithm, which adopts the spoken term detection approach, is focused on the speed of retrieval. The resulting system is able to search through the 1,000 hours of video constituting the Czech portion of the archive and find query word occurrences in a matter of seconds. The phonetic search, implemented alongside the search based on lexicon words, makes it possible to find even words outside the ASR system lexicon, such as names, geographic locations or Jewish slang.

    Large vocabulary continuous speech recognition of highly inflectional language (Czech).

    The thesis concerns the development of a large vocabulary continuous speech recognition (LVCSR) system for highly inflectional languages, with special emphasis on language modeling. It introduces the idea and use of automatic speech recognition and explains the basic principles of the statistical approach to speech recognition and the decomposition of the system into its basic components. An overview of existing statistical language modeling techniques is given, and methods of inferring reliable probability estimates from sparse data and measures of language model quality are described. A theoretical background to finite-state machinery and the application of the finite-state machine framework to LVCSR are presented. The goals of the thesis were to build an LVCSR system for the Czech language using standard techniques that were used for English, to analyze the system performance, and to propose and implement techniques that would improve the recognition accuracy. The development of the baseline system is described. The properties of the Czech language, especially from the automatic speech recognition point of view, were analyzed. The outcomes of this theoretical analysis are exploited, and language models that take into account the specific features of the Czech language are presented. Class-based language models are described that strengthen the language model robustness and therefore reduce the perplexity and consequently improve the recognition accuracy. Finally, a model that uses subword parts (morphemes) as the basic language modeling units is introduced. Such a model offers better coverage of unseen text than standard word-based models given the same vocabulary size. Available from STL Prague, CZ / NTK - National Technical Library (SIGLE).

    Czech translation of the EBUContentGenre thesaurus

    EBUContentGenre is a thesaurus containing a hierarchical description of the genres used in the TV broadcasting industry. The thesaurus is part of a complex metadata specification called EBUCore, intended for multifaceted description of audiovisual content. EBUCore (http://tech.ebu.ch/docs/tech/tech3293v1_3.pdf) is a set of descriptive and technical metadata based on the Dublin Core and adapted to media. EBUCore is the flagship metadata specification of the European Broadcasting Union, the largest professional association of broadcasters in the world. It is developed and maintained by EBU's Technical Department (http://tech.ebu.ch). The translated thesaurus can be used for effective cataloguing of (mostly TV) audiovisual content and for the subsequent development of systems for automatic cataloguing (topic/genre detection).

    Benefit of proper language processing for Czech speech retrieval in the CL-SR task at CLEF 2006

    The paper describes the system built by the team from the University of West Bohemia for participation in the CLEF 2006 CL-SR track. We decided to concentrate only on monolingual searching in the Czech test collection and to investigate the effect of proper language processing on retrieval performance. We employed a Czech morphological analyser and tagger for that purpose. For the actual search system, we used the classical tf.idf approach with blind relevance feedback as implemented in the Lemur toolkit. The results indicate that suitable linguistic preprocessing is indeed crucial for Czech IR performance.

    Experiments with automatic query formulation in the extended Boolean model

    This paper concentrates on experiments with the automatic creation of queries from natural language topics, suitable for use in an Extended Boolean information retrieval system. Because the available methods are scarce or inadequate, we propose a new method based on pairing terms into a binary tree structure. The results of this method are compared with the results achieved by our implementation of the known method proposed by Salton and with the results obtained with manually created queries. All experiments were performed on the same collection that was used in the CLEF 2007 campaign.

    Language model adaptation using different class-based models

    The paper presents two different methods for adding previously unseen words into the LVCSR system. Both methods employ the principles of class-based language modeling: the first exploits task-specific knowledge, while the second is fully automatic and task independent. Extensive tests of the proposed language models in a real-time ASR system showed that both techniques provide a consistent improvement in recognition accuracy. Moreover, the contributions from the two methods appear to be additive, yielding a total improvement of up to 2 % absolute.

    Wizard of Oz data collection for the Czech senior companion dialogue system

    In this paper, we present the setup of a Wizard of Oz (WoZ) environment used to collect data for the implementation of the Czech Senior Companion dialogue system. We also discuss some aspects of using the WoZ method for the collection of emotional data and summarize some statistics about the recorded data set. The domain of the collected data is limited to reminiscing about photographs. In each session, a dialogue between an elderly person and a (WoZ) experimenter was recorded. Both audio and video data were collected.

    Letter of dismissal (Απολυτήριον γράμμα)

    Original. Bifolium. Seal of the Vatopedi Monastery. Dimensions: 235 × 170. Writing material: paper. Ink colour: brown. Condition: very good.