
    Automatic processing of computer-transcribed spoken documents from multimedia archives

    This thesis addresses the complex task of structuring the output of an automatic speech recognition system (i.e., appropriately segmenting it, analysing it textually and phonetically, and subsequently editing it) so that it is as readable as possible for humans while also being prepared for effective machine processing and search. The work was motivated by a research project supported by the Ministry of Culture of the Czech Republic, whose goal was to transcribe the spoken documents in the archives of Czech and Czechoslovak Radio and make them available for search. Given the size of the archive (213,000 documents from the years 1923 to 2014), it was necessary to design and implement procedures and technologies capable of handling not only the vast amount of data but also specific issues associated with the varying acoustic quality of the recordings, the presence of both Czech and Slovak in the documents, speaker changes, speech interleaved with jingles, musical interludes and songs, and noise in the background of the speech.
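    One of the structuring steps described above is dividing the raw recogniser output into readable, sentence-like units. The sketch below is a minimal, hypothetical illustration of pause-based segmentation of an ASR token stream; the data layout, the 0.6 s threshold, and the function name are assumptions made for illustration, not the thesis's actual method.

    # Illustrative sketch only: split a raw ASR token stream into
    # sentence-like units using inter-word pause lengths, then restore
    # basic readability (capitalisation, terminal period).
    def segment_by_pauses(words, pause_threshold=0.6):
        """words: list of (token, start_sec, end_sec) tuples from an ASR
        system. Returns capitalised, period-terminated sentence strings."""
        sentences, current = [], []
        for idx, (token, start, end) in enumerate(words):
            current.append(token)
            next_start = words[idx + 1][1] if idx + 1 < len(words) else None
            # Close the unit on a long silence or at the end of the stream.
            if next_start is None or next_start - end >= pause_threshold:
                text = " ".join(current)
                sentences.append(text[0].upper() + text[1:] + ".")
                current = []
        return sentences

    # Toy usage: a 0.9 s pause separates the two greetings.
    print(segment_by_pauses([("dobrý", 0.0, 0.4), ("den", 0.45, 0.7),
                             ("vítáme", 1.6, 2.0), ("vás", 2.05, 2.3)]))

    A real pipeline of the kind the abstract describes would also need speaker-change and music/jingle detection before this step, which a fixed pause threshold alone cannot provide.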

    Audio-Visual Speech Processing for Multimedia Localisation

    For many years, film and television have dominated the entertainment industry. Recently, with the introduction of a range of digital formats and mobile devices, multimedia’s ubiquity as the dominant form of entertainment has increased dramatically. This, in turn, has increased demand on the entertainment industry, with production companies looking to increase their revenue by providing entertainment media to a growing international market. This brings with it challenges in the form of multimedia localisation - the process of preparing content for international distribution. The industry is now looking to modernise production processes - moving what were once wholly manual practices to semi-automated workflows. A key aspect of the localisation process is the alignment of content, such as subtitles or audio, when adapting content from one region to another. One method of automating this is to use the audio content as a guide, providing a solution via audio-to-text alignment. While many approaches for audio-to-text alignment currently exist, they all require language models, meaning that dozens of language models would be needed before these approaches could be reliably deployed in large production companies. To address this, the thesis explores the development of audio-to-text alignment procedures that do not rely on language models, instead providing a language-independent method for aligning multimedia content. To achieve this, the project explores both audio and visual speech processing, with a focus on voice activity detection, as a means of segmenting and aligning audio and text data. The thesis first presents a novel method for detecting speech activity in entertainment media. This method is compared with the current state of the art and demonstrates a significant improvement over baseline methods. Secondly, the thesis explores a novel set of features for detecting voice activity in visual speech data. Here, we show that the combination of landmark-based and appearance-based features outperforms recent methods for visual voice activity detection, and specifically that the incorporation of landmark features is particularly crucial when presented with challenging natural speech data. Lastly, a speech activity-based alignment framework is presented which demonstrates encouraging results. Here, we show that Dynamic Time Warping (DTW) can be used for segment matching and alignment of audio and subtitle data, and we also present a novel method for aligning scene-level content which outperforms DTW for sequence alignment of finer-level data. To conclude, we demonstrate that combining global and local alignment approaches achieves strong alignment estimates, but that the resulting output is not sufficient for wholly automated subtitle alignment. We therefore propose that this work be used as a platform for the development of lexical-discovery-based alignment techniques, as the general alignment provided by our system would improve symbolic sequence discovery for sparse dictionary-based systems.
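    The abstract names Dynamic Time Warping (DTW) as the mechanism for matching speech-activity segments against subtitle cues. The sketch below is a minimal, generic DTW implementation under assumed inputs (segment durations in seconds, absolute-difference local cost); the function name and features are illustrative and not taken from the thesis, but the recurrence and backtracking are the standard DTW core.

    # Generic DTW over two sequences of segment descriptors.
    import numpy as np

    def dtw_align(seq_a, seq_b, cost):
        """Return the optimal warping path between two segment sequences."""
        n, m = len(seq_a), len(seq_b)
        acc = np.full((n + 1, m + 1), np.inf)
        acc[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = cost(seq_a[i - 1], seq_b[j - 1])
                # Classic DTW recurrence: match, insertion, or deletion.
                acc[i, j] = d + min(acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
        # Backtrack from (n, m) to recover the warping path.
        path, i, j = [], n, m
        while i > 0 and j > 0:
            path.append((i - 1, j - 1))
            step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
            if step == 0:
                i, j = i - 1, j - 1
            elif step == 1:
                i -= 1
            else:
                j -= 1
        return path[::-1]

    # Toy usage: align detected speech-segment durations (seconds) against
    # subtitle-cue durations using absolute difference as the local cost.
    speech_durations = [2.1, 0.8, 3.5, 1.2]
    subtitle_durations = [2.0, 1.0, 3.4, 1.1]
    print(dtw_align(speech_durations, subtitle_durations,
                    cost=lambda a, b: abs(a - b)))

    In practice the local cost would likely combine several segment descriptors rather than duration alone, which is consistent with the abstract's finding that plain DTW is outperformed by their scene-level method on finer-grained data.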