Transformer-based Automatic Speech Recognition of Formal and Colloquial Czech in MALACH Project
Czech is a very specific language due to the large differences between the
formal and the colloquial form of speech. While the formal (written) form is
used mainly in official documents, literature, and public speeches, the
colloquial (spoken) form is used widely among people in casual speech. This
gap introduces serious problems for ASR systems, especially when training or
evaluating ASR models on datasets containing a lot of colloquial speech, such
as the MALACH project. In this paper, we address this problem in
light of a new paradigm in end-to-end ASR systems -- the recently introduced
self-supervised audio Transformers. Specifically, we investigate the
influence of colloquial speech on the performance of Wav2Vec 2.0 models and
their ability to transcribe colloquial speech directly into formal transcripts.
We present results with both formal and colloquial forms in the training
transcripts, language models, and evaluation transcripts.
Comment: to be published in Proceedings of TSD 202
System for Fast Lexical and Phonetic Spoken Term Detection in a Czech Cultural Heritage Archive
Abstract: The main objective of the work presented in this paper was to develop a complete system that would accomplish the original visions of the MALACH project. Those goals were to employ automatic speech recognition and information retrieval techniques to provide improved access to the large video archive containing recorded testimonies of Holocaust survivors. The system has so far been developed for the Czech part of the archive only. It takes advantage of the state-of-the-art speech recognition system tailored to the challenging properties of the recordings in the archive (elderly speakers, spontaneous speech, emotionally loaded content) and its close coupling with the actual search engine. The design of the algorithm adopting the spoken term detection approach is focused on the speed of the retrieval. The resulting system is able to search through the 1,000 hours of video constituting the Czech portion of the archive and find query word occurrences in a matter of seconds. The phonetic search implemented alongside the search based on lexicon words makes it possible to find even words outside the ASR system lexicon, such as names, geographic locations, or Jewish slang.
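The two-stage search the abstract describes (a lexicon-word index plus a phonetic fallback for out-of-vocabulary terms) can be sketched as follows; this is a simplified illustration assuming a substitution-only distance, where a real system would use full edit-distance matching over recognized phone strings:

```python
from collections import defaultdict

def build_index(transcripts):
    """Inverted index for lexical search: word -> list of (recording, position)."""
    index = defaultdict(list)
    for rec_id, words in transcripts.items():
        for pos, w in enumerate(words):
            index[w].append((rec_id, pos))
    return index

def phonetic_search(query_phones, phone_strings, max_dist=1):
    """Fallback for out-of-vocabulary queries: slide the query phone
    sequence over each recording's phone string and report the first
    window within max_dist substitutions (a crude stand-in for the
    edit-distance matching a real system would use)."""
    hits = []
    q = len(query_phones)
    for rec_id, phones in phone_strings.items():
        for i in range(len(phones) - q + 1):
            if sum(a != b for a, b in zip(phones[i:i + q], query_phones)) <= max_dist:
                hits.append((rec_id, i))
                break
    return hits

index = build_index({"rec1": ["lodz", "ghetto"], "rec2": ["praha"]})
print(index["ghetto"])  # -> [('rec1', 1)]
```

Searching an index of word positions is what makes the lexical retrieval fast; the phonetic pass only runs for queries missing from the lexicon.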
Using Morphological Information for Robust Language Modeling in a Czech ASR System
Automatic speech recognition, or more precisely
language modeling, of the Czech language has to face challenges
that are not present in the language modeling of English. Those
include mainly the rapid vocabulary growth and closely connected
unreliable estimates of the language model parameters. These phenomena
are caused mostly by the highly inflectional nature of the
Czech language. On the other hand, the rich morphology together
with the well-developed automatic systems for morphological tagging
can be exploited to reinforce the language model probability
estimates. This paper shows that using rich morphological tags
within the concept of class-based n-gram language model with
many-to-many word-to-class mapping and combination of this
model with the standard word-based n-gram can improve the
recognition accuracy over the word-based baseline on the task
of automatic transcription of unconstrained spontaneous Czech
interviews.
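The interpolation of the class-based and word-based models described above can be sketched numerically; a toy illustration in Python with invented probabilities and morphological tags, not the paper's actual word-to-class mapping:

```python
def class_model_prob(prev, w, class_of, class_bigram, p_word_given_class):
    """Class-based n-gram probability: P(c(w) | c(prev)) * P(w | c(w))."""
    p_cc = class_bigram.get((class_of[prev], class_of[w]), 0.0)
    return p_cc * p_word_given_class.get(w, 0.0)

def interpolated_prob(p_word, p_class, lam=0.6):
    """Linear interpolation of the word-based and class-based models."""
    return lam * p_word + (1.0 - lam) * p_class

# Toy numbers for illustration only.
class_of = {"velmi": "ADV", "dobrý": "ADJ"}
class_bigram = {("ADV", "ADJ"): 0.5}
p_word_given_class = {"dobrý": 0.1}

p_cls = class_model_prob("velmi", "dobrý", class_of, class_bigram, p_word_given_class)
p = interpolated_prob(0.2, p_cls)  # assumed word bigram P(dobrý | velmi) = 0.2
# arithmetic: 0.6 * 0.2 + 0.4 * (0.5 * 0.1) = 0.14
```

The class model gives a non-zero, smoothed estimate even for word bigrams unseen in training, which is what reinforces the sparse word-based estimates in a highly inflectional language.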
Speaker-clustered acoustic models evaluated on GPU for on-line subtitling of parliament meetings
This paper describes our effort in building speaker-clustered acoustic models as part of the real-time LVCSR system that has been used for more than a year by Czech TV for automatic subtitling of parliament meetings broadcast on the channel ČT24. Speaker-clustered acoustic models are more acoustically homogeneous and therefore give better recognition performance than a single gender-independent model or even gender-dependent models. Frequent speaker changes and a direct connection of the LVCSR system to the audio channel require automatic switching/fusion of models as quickly as possible. An important part of the solution is the real-time likelihood evaluation of all clustered acoustic models, taking advantage of a fast GPU (Graphics Processing Unit). The proposed method achieved a relative WER reduction of more than 2.34% over the baseline gender-independent model, with more than 2M Gaussian mixtures evaluated in real time.
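The model switching described above can be sketched as a likelihood race between the clustered models over a short block of frames; a minimal CPU illustration in Python, with single diagonal-covariance Gaussians standing in for full GMMs and toy numbers throughout:

```python
import math

def diag_gauss_loglik(x, mean, var):
    """Log-likelihood of one feature frame under a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def select_model(frames, models):
    """Pick the cluster model with the highest total log-likelihood over
    a short block of frames (a simplified stand-in for the fast GPU
    evaluation of all clustered models described in the paper)."""
    scores = {name: sum(diag_gauss_loglik(f, m["mean"], m["var"]) for f in frames)
              for name, m in models.items()}
    return max(scores, key=scores.get)

models = {
    "cluster_A": {"mean": [0.0, 0.0], "var": [1.0, 1.0]},
    "cluster_B": {"mean": [5.0, 5.0], "var": [1.0, 1.0]},
}
frames = [[4.8, 5.1], [5.2, 4.9]]
print(select_model(frames, models))  # -> cluster_B
```

Because every model is scored on every frame, the switching decision costs nothing extra once the GPU has evaluated all clusters in parallel.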
Using Morphological Information for Robust Language Modeling
Abstract: Automatic speech recognition, or more precisely language modeling, of the Czech language has to face challenges that are not present in the language modeling of English. Those include mainly the rapid vocabulary growth and closely connected unreliable estimates of the language model parameters. These phenomena are caused mostly by the highly inflectional nature of the Czech language. On the other hand, the rich morphology together with the well-developed automatic systems for morphological tagging can be exploited to reinforce the language model probability estimates. This paper shows that using rich morphological tags within the concept of class-based n-gram language model with many-to-many word-to-class mapping and combination of this model with the standard word-based n-gram can improve the recognition accuracy over the word-based baseline on the task of automatic transcription of unconstrained spontaneous Czech interviews. Index Terms: Language models, speech recognition and synthesis.
Evaluation of Full-Covariance Gaussian Mixture Models on GPU
Gaussian mixture models (GMMs) are often used in various data processing and classification tasks to model a continuous probability density in a multi-dimensional space. In cases where the dimension of the feature space is relatively high (e.g. in automatic speech recognition (ASR)), a GMM with a higher number of Gaussians with diagonal covariances (DC) is used instead of full covariances (FC), for two reasons. The first reason is the problem of how to estimate robust FC matrices with a limited training data set. The second reason is the much higher computational cost of the GMM evaluation. The first reason was addressed in many recent publications.
In contrast, this paper describes an efficient Graphics Processing Unit (GPU) implementation of the FC-GMM evaluation, which addresses the second reason. The performance was tested on acoustic models for ASR, and it is shown that even a low-end laptop GPU is capable of evaluating a large acoustic model in a fraction of real speech time. Three variants of the algorithm were implemented and compared on various GPUs: NVIDIA CUDA, NVIDIA OpenCL, and ATI/AMD OpenCL.
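The per-frame quadratic form that dominates the FC-GMM evaluation can be written compactly; a NumPy sketch of the log-likelihood computation, used here purely as a CPU stand-in for the GPU kernels the paper benchmarks:

```python
import numpy as np

def fc_gmm_loglik(X, weights, means, covs):
    """Log-likelihood of feature vectors X (N x D) under a full-covariance
    GMM. The per-component quadratic form (x - mu)^T S^-1 (x - mu) is the
    expensive part that a GPU implementation evaluates in parallel."""
    N, D = X.shape
    comp = np.empty((len(weights), N))
    for k, (w, mu, S) in enumerate(zip(weights, means, covs)):
        diff = X - mu                                  # (N, D)
        Sinv = np.linalg.inv(S)
        _, logdet = np.linalg.slogdet(S)
        quad = np.einsum("nd,de,ne->n", diff, Sinv, diff)
        comp[k] = np.log(w) - 0.5 * (D * np.log(2 * np.pi) + logdet + quad)
    # numerically stable log-sum-exp over mixture components
    m = comp.max(axis=0)
    return m + np.log(np.exp(comp - m).sum(axis=0))
```

In practice a precomputed Cholesky factor of each covariance replaces the explicit inverse; the inverse is kept here only for brevity.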
Optimized acoustic likelihoods computation for NVIDIA and ATI/AMD graphics processors
In this paper, we describe an optimized version of a Gaussian-mixture-based acoustic model likelihood evaluation algorithm for graphics processing units (GPUs). The evaluation of these likelihoods is one of the most computationally intensive parts of automatic speech recognizers, but it can be parallelized and offloaded to GPU devices. Our approach offers a significant speed-up over the recently published approaches, because it utilizes the GPU architecture in a more effective manner. All the recent implementations have been intended only for NVIDIA graphics processors, programmed in either the CUDA or OpenCL GPU programming framework. We present results for both CUDA and OpenCL. Further, we have developed an OpenCL implementation optimized for ATI/AMD GPUs. Results suggest that even very large acoustic models can be used in real-time speech recognition engines on computers and laptops equipped with a low-end GPU. In addition, the completely asynchronous GPU management provides additional CPU resources for the decoder part of the LVCSR. The optimized implementation enables us to apply fusion techniques together with evaluating many (10 or even more) speaker-specific acoustic models. We apply this technique to a real-time parliamentary speech recognition system where the speaker changes frequently.
Optimized Evaluation of Gaussian Mixture Models on GPU
In this paper we present a highly optimized implementation of the
Gaussian mixture acoustic model evaluation algorithm. Evaluation
of these likelihoods is one of the most computationally
intensive parts of automatic speech recognizers, but it can be
well parallelized and offloaded to GPU devices. Our approach
offers a significant speed-up compared to the recently published
approaches, since it exploits the GPU architecture better. All
the recent implementations were programmed in either the CUDA
or OpenCL GPU programming framework. We present results
for both CUDA and OpenCL.
Results suggest that even very large acoustic models can
be utilized in real-time speech recognition engines on computers
and laptops equipped with a low-end GPU. Optimizing the
acoustic likelihood computation on the GPU makes it possible to use
the remaining GPU resources for offloading other compute-intensive
parts of the LVCSR decoder.
Another possible use of the freed GPU resources is to evaluate
several acoustic models at the same time and use fusion
techniques or model selection techniques to improve the quality
of the resulting conditional likelihoods under diverse conditions.
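One simple instance of the fusion mentioned above is a weighted log-sum-exp combination of per-model frame log-likelihoods; a minimal sketch, with uniform weights assumed purely for illustration:

```python
import math

def fuse_loglikelihoods(logliks, weights=None):
    """Fuse per-model log-likelihoods for one frame into a single value
    via a weighted log-sum-exp, computed in the log domain for
    numerical stability. Uniform weights are used by default."""
    if weights is None:
        weights = [1.0 / len(logliks)] * len(logliks)
    m = max(logliks)
    return m + math.log(sum(w * math.exp(l - m) for w, l in zip(weights, logliks)))

fused = fuse_loglikelihoods([-10.0, -12.0])
```

Model selection is the limiting case where all the weight is placed on the best-scoring model; the fused score then reduces to a plain maximum over models.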
Subtitling Live TV Broadcasts from the Sochi Olympics: Some Interesting Insights
In this paper, we describe our effort and some interesting insights obtained during the captioning of more than 70 hours of live TV broadcasts from the Olympic Games in Sochi. The closed captioning was prepared for ČT Sport, the sports channel of the public service broadcaster in the Czech Republic. We briefly discuss our solution for a distributed captioning architecture for live TV programs using the re-speaking approach, several modifications of the existing live captioning application (especially the LVCSR system), and the way a real TV commentary is re-spoken for individual sports. We show that, after intensive training, a re-speaker can achieve an accuracy (more than 98%) and readability of captions that clearly outperform captions created by automatic recognition of the original TV soundtrack.