50 research outputs found

    Transformer-based Automatic Speech Recognition of Formal and Colloquial Czech in MALACH Project

    Full text link
    Czech is a very specific language due to its large differences between the formal and the colloquial form of speech. While the formal (written) form is used mainly in official documents, literature, and public speeches, the colloquial (spoken) form is used widely among people in casual speeches. This gap introduces serious problems for ASR systems, especially when training or evaluating ASR models on datasets containing a lot of colloquial speech, such as the MALACH project. In this paper, we are addressing this problem in the light of a new paradigm in end-to-end ASR systems -- recently introduced self-supervised audio Transformers. Specifically, we are investigating the influence of colloquial speech on the performance of Wav2Vec 2.0 models and their ability to transcribe colloquial speech directly into formal transcripts. We are presenting results with both formal and colloquial forms in the training transcripts, language models, and evaluation transcripts.Comment: to be published in Proceedings of TSD 202

    System for fast lexical and phonetic spoken term detection in a czech cultural heritage archive,”

    Get PDF
    Abstract The main objective of the work presented in this paper was to develop a complete system that would accomplish the original visions of the MALACH project. Those goals were to employ automatic speech recognition and information retrieval techniques to provide improved access to the large video archive containing recorded testimonies of the Holocaust survivors. The system has been so far developed for the Czech part of the archive only. It takes advantage of the state-of-the art speech recognition system tailored to the challenging properties of the recordings in the archive (elderly speakers, spontaneous speech, emotionally loaded content) and its close coupling with the actual search engine. The design of the algorithm adopting the spoken term detection approach is focused on the speed of the retrieval. The resulting system is able to search through the 1,000 hours of video constituting the Czech portion of the archive and find query word occurrences in the matter of seconds. The phonetic search implemented alongside the search based on the lexicon words allows to find even the words outside the ASR system lexicon such as names, geographic locations or Jewish slang

    Sample Size for Maximum Likelihood Estimates of Gaussian Model

    Full text link

    Transfer Learning of Transformer-based Speech Recognition Models from Czech to Slovak

    Full text link
    In this paper, we are comparing several methods of training the Slovak speech recognition models based on the Transformers architecture. Specifically, we are exploring the approach of transfer learning from the existing Czech pre-trained Wav2Vec 2.0 model into Slovak. We are demonstrating the benefits of the proposed approach on three Slovak datasets. Our Slovak models scored the best results when initializing the weights from the Czech model at the beginning of the pre-training phase. Our results show that the knowledge stored in the Cezch pre-trained model can be successfully reused to solve tasks in Slovak while outperforming even much larger public multilingual models.Comment: Accepted to TSD 202

    Využití morfologické informace pro robustní jazykové modelování v českém ASR systému

    No full text
    Článek ukazuje, že použití bohatých morfologických značek v rámci třídového n-gramového jazykového modelu a kombinace tohoto modelu se standardním slovním n-gramovým modelem může zlepšit přesnost rozpoznávání oproti slovnímu modelu v úloze automatického přepisu spontánních českých rozhovorů.Automatic speech recognition, or more precisely language modeling, of the Czech language has to face challenges that are not present in the language modeling of English. Those include mainly the rapid vocabulary growth and closely connected unreliable estimates of the language model parameters. These phenomena are caused mostly by the highly inflectional nature of the Czech language. On the other hand, the rich morphology together with the well-developed automatic systems for morphological tagging can be exploited to reinforce the language model probability estimates. This paper shows that using rich morphological tags within the concept of class-based n-gram language model with many-to-many word-to-class mapping and combination of this model with the standard word-based n-gram can improve the recognition accuracy over the word-based baseline on the task of automatic transcription of unconstrained spontaneous Czech interviews

    Speaker-clustered acoustic models evaluated on GPU for on-line subtitling of parliament meetings

    Get PDF
    This paper describes the effort with building speaker-clustered acoustic models as a part of the real-time LVCSR system that is used more than one year by the Czech TV for automatic subtitling of parliament meetings broadcasted on the channel ČT24. Speaker-clustered acoustic models are more acoustically homogeneous and therefore give better recognition performance than single gender-independent model or even gender-dependent models. Frequent changes of speakers and a direct connection of the LVCSR system to the audio channel require an automatic switching/fusion of models as quickly as possible. An important part of the solution is real time likelihood evaluations of all clustered acoustic models, taking advantage of a fast GPU(Graphic Processing Unit). The proposed method achieved a WER reduction to the baseline gender-independent model over 2.34% relatively with more than 2M Gaussian mixtures evaluated in real-time

    Using Morphological Information for Robust Language Modeling

    No full text
    Abstract-Automatic speech recognition, or more precisely language modeling, of the Czech language has to face challenges that are not present in the language modeling of English. Those include mainly the rapid vocabulary growth and closely connected unreliable estimates of the language model parameters. These phenomena are caused mostly by the highly inflectional nature of the Czech language. On the other hand, the rich morphology together with the well-developed automatic systems for morphological tagging can be exploited to reinforce the language model probability estimates. This paper shows that using rich morphological tags within the concept of class-based n-gram language model with many-to-many word-to-class mapping and combination of this model with the standard word-based n-gram can improve the recognition accuracy over the word-based baseline on the task of automatic transcription of unconstrained spontaneous Czech interviews. Index Terms-Language models, speech recognition and synthesis

    Using Morphological Information for Robust Language Modeling in Czech ASR System

    No full text
    corecore