71 research outputs found
Recurrent DNNs and its Ensembles on the TIMIT Phone Recognition Task
In this paper, we have investigated recurrent deep neural networks (DNNs) in
combination with regularization techniques such as dropout, zoneout, and a
regularization post-layer. As a benchmark, we chose the TIMIT phone recognition
task due to its popularity and broad availability in the community. It also
simulates a low-resource scenario, which makes it relevant for under-resourced
languages. We also prefer the phone recognition task because it is much more
sensitive to acoustic model quality than a large-vocabulary continuous speech
recognition task. In recent years, recurrent DNNs have pushed down the error
rates in automatic speech recognition, but no clear winner has emerged among
the proposed architectures. Dropout was used as the regularization technique in
most cases, while its combination with other regularization techniques and with
model ensembles was omitted. In our experiments, an ensemble of recurrent DNNs
performed best, achieving an average phone error rate (PER) of 14.84 % over 10
experiments (minimum 14.69 %) on the core test set, which is slightly lower
than the best published PER to date, to our knowledge. Finally, in contrast to
most papers, we have published open-source scripts to make the results easy to
replicate and to help continue the development.
Comment: Submitted to SPECOM 2018, 20th International Conference on Speech and Computer
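The abstract does not say how the ensemble combines its members; a common choice, sketched below, is to average the per-frame phone posteriors of the individual networks before picking the best phone. The function name and the toy posterior matrices are illustrative, not taken from the paper.

```python
import numpy as np

def ensemble_posteriors(model_outputs):
    """Average per-frame phone posteriors from several models.

    model_outputs: list of (frames, phones) arrays, one per model,
    each row a probability distribution over phone classes.
    """
    avg = np.mean(np.stack(model_outputs), axis=0)
    return avg.argmax(axis=1)  # best phone index per frame

# toy example: two models, 3 frames, 4 phone classes
m1 = np.array([[0.7, 0.1, 0.1, 0.1],
               [0.2, 0.6, 0.1, 0.1],
               [0.1, 0.1, 0.2, 0.6]])
m2 = np.array([[0.5, 0.3, 0.1, 0.1],
               [0.1, 0.7, 0.1, 0.1],
               [0.1, 0.1, 0.5, 0.3]])
print(ensemble_posteriors([m1, m2]).tolist())  # [0, 1, 3]
```

Averaging posteriors (rather than, say, majority voting on decoded outputs) keeps the combination differentiable and tends to smooth out the individual models' errors.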
Transformer-based Automatic Speech Recognition of Formal and Colloquial Czech in MALACH Project
Czech is a very specific language due to the large differences between its
formal and colloquial forms. While the formal (written) form is used mainly in
official documents, literature, and public speeches, the colloquial (spoken)
form is widely used among people in casual speech. This gap introduces serious
problems for ASR systems, especially when training or evaluating ASR models on
datasets containing a lot of colloquial speech, such as those of the MALACH
project. In this paper, we address this problem in light of a new paradigm in
end-to-end ASR systems -- the recently introduced self-supervised audio
Transformers. Specifically, we investigate the influence of colloquial speech
on the performance of Wav2Vec 2.0 models and their ability to transcribe
colloquial speech directly into formal transcripts. We present results with
both formal and colloquial forms in the training transcripts, language models,
and evaluation transcripts.
Comment: to be published in Proceedings of TSD 202
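Wav2Vec 2.0 models of the kind studied above are fine-tuned with a CTC objective, so their frame-level outputs are turned into text by taking the best label per frame, collapsing repeats, and dropping blanks. A minimal greedy CTC decoder is sketched below; the vocabulary and logits are made up for illustration.

```python
import numpy as np

def greedy_ctc_decode(logits, vocab, blank=0):
    """Greedy CTC decoding: best id per frame, collapse repeats, drop blanks."""
    ids = logits.argmax(axis=1)
    out = []
    prev = blank
    for i in ids:
        if i != blank and i != prev:
            out.append(vocab[i])
        prev = i
    return "".join(out)

vocab = {1: "a", 2: "h", 3: "o"}          # id 0 is the CTC blank
logits = np.array([[0.1, 0.2, 0.9, 0.1],  # -> h
                   [0.9, 0.1, 0.2, 0.1],  # -> blank
                   [0.1, 0.8, 0.1, 0.2],  # -> a
                   [0.1, 0.7, 0.1, 0.2],  # -> a (repeat, collapsed)
                   [0.9, 0.0, 0.1, 0.2]]) # -> blank
print(greedy_ctc_decode(logits, vocab))  # "ha"
```

In practice the paper's setup would decode real model logits, possibly with a language model; greedy decoding is only the simplest instance of the scheme.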
Estimation of Single-Gaussian and Gaussian mixture models for pattern recognition
Single-Gaussian and Gaussian-mixture models are utilized in various
pattern recognition tasks. The model parameters are usually estimated via
Maximum Likelihood Estimation (MLE) with respect to the available training
data. However, if only a small amount of training data is available, the
resulting model will not generalize well; loosely speaking, classification
performance on an unseen test set may be poor. In this paper, we propose a
novel estimation technique for the model variances. Once the variances have
been estimated using MLE, they are multiplied by a scaling factor that reflects
the amount of uncertainty present in the limited sample set. The optimal value
of the scaling factor is based on the Kullback-Leibler criterion and on the
assumption that the training and test sets are sampled from the same source
distribution. In addition, in the case of GMMs, the proper number of components
can be determined
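The core of the proposed correction can be sketched in a few lines: estimate the (biased) MLE variance, then multiply it by a scaling factor greater than one. The paper derives the optimal factor from a Kullback-Leibler criterion; the fixed value used below is purely illustrative.

```python
import numpy as np

def scaled_variance(samples, alpha):
    """MLE variance of a 1-D sample, inflated by an uncertainty factor alpha."""
    mu = samples.mean()
    var_mle = ((samples - mu) ** 2).mean()  # biased maximum-likelihood estimate
    return alpha * var_mle

# toy sample [0, 2]: MLE variance is 1.0, so alpha = 2.0 yields 2.0
print(scaled_variance(np.array([0.0, 2.0]), alpha=2.0))  # 2.0
```

The point of the inflation is that with few samples the MLE variance systematically underestimates the spread of unseen data, so a model with widened variances generalizes better to the test set.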
System for fast lexical and phonetic spoken term detection in a Czech cultural heritage archive
Abstract: The main objective of the work presented in this paper was to develop a complete system that would accomplish the original visions of the MALACH project. Those goals were to employ automatic speech recognition and information retrieval techniques to provide improved access to the large video archive containing recorded testimonies of Holocaust survivors. The system has so far been developed for the Czech part of the archive only. It takes advantage of a state-of-the-art speech recognition system tailored to the challenging properties of the recordings in the archive (elderly speakers, spontaneous speech, emotionally loaded content) and its close coupling with the actual search engine. The design of the algorithm, which adopts the spoken term detection approach, is focused on retrieval speed. The resulting system is able to search through the 1,000 hours of video constituting the Czech portion of the archive and find query word occurrences in a matter of seconds. The phonetic search implemented alongside the search based on lexicon words makes it possible to find even words outside the ASR system lexicon, such as names, geographic locations, or Jewish slang.
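The phonetic search described above can find out-of-vocabulary terms by matching phone strings rather than lexicon words. A minimal sketch of the matching step follows; the phone transcripts and document names are invented, and a real system would search a phone lattice with approximate matching rather than exact 1-best strings.

```python
def phonetic_search(query_phones, transcripts):
    """Return (doc_id, position) pairs where the query phone string occurs."""
    hits = []
    for doc_id, phones in transcripts.items():
        for start in range(len(phones) - len(query_phones) + 1):
            if phones[start:start + len(query_phones)] == query_phones:
                hits.append((doc_id, start))
    return hits

# toy phone transcripts of two recordings
transcripts = {
    "testimony_01": ["p", "r", "a", "h", "a", "s", "t", "o"],
    "testimony_02": ["o", "p", "r", "a", "h", "a"],
}
print(phonetic_search(["p", "r", "a", "h", "a"], transcripts))
# [('testimony_01', 0), ('testimony_02', 1)]
```

Because the query is compared at the phone level, a name absent from the ASR lexicon can still be located as long as its pronunciation can be generated.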
The amount of data needed for maximum-likelihood estimation of a Gaussian model as a function of the dimension of the pattern space.
Growing awareness of sustainability has compelled supply chain domain experts to explore its relevance in this context. As a result, a number of studies in recent years have focused on investigating sustainable supply chain practices across the globe. Short food supply chains (SFSCs) have emerged as a promising sustainable alternative to industrialized agro-food supply systems. However, the academic literature has not fully explored the linkage between SFSCs and sustainability. This study therefore aims to explore how SFSCs conform to the dimensions of sustainability using the sustainability framework (social, economic, and environmental). The findings are based on a systematic literature review of 44 articles published between 2000 and 2018, selected from six electronic databases. All articles were carefully analyzed by the researchers, seeking to identify the relationship or proximity of the information found in the papers to the SFSC concept. Our study highlights the societal, environmental, and cultural benefits of SFSCs in addition to the associated economic and safety benefits. It thus adds to the scant literature on SFSCs and shows a clear linkage between SFSCs and a five-dimensional sustainability framework. We also propose a set of research questions that set a direction for future research.
Improving ASR Accuracy by Elongating Voiceless Phonemes in the Speech of Patients Using an Electrolarynx
Patients who have undergone total laryngectomy and use an electrolarynx for voice production suffer from poor intelligibility. In many cases, this may lead to a fear of speaking to strangers, even over the phone. Automatic Speech Recognition (ASR) systems could help patients overcome this problem in many ways. Unfortunately, even state-of-the-art ASR systems cannot provide results comparable to those obtained for conventional speakers. The problem is mainly caused by the similarity between voiced and voiceless phoneme pairs. In many cases, a language model can help to resolve the issue, but only if the word context is sufficiently long.
Therefore, adjustment of the acoustic data and/or the acoustic model is necessary to increase recognition accuracy. In this paper, we propose the elongation of voiceless phonemes to improve recognition accuracy and enrich the ASR system with a model that takes this elongation into account. The idea of elongation is verified on a set of ASR experiments with artificially elongated voiceless phonemes. To enrich the ASR system, a DNN model for rescoring lattices based on phoneme duration is proposed. The new system is compared with a standard ASR system. It is also verified that an ASR system built using elongated synthetic data can successfully recognize elongated words pronounced by a real speaker.
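The artificial elongation used in the experiments can be illustrated by time-stretching only the segments labelled as voiceless; a crude way to do this is sample repetition. The segment boundaries, stretch factor, and function name below are illustrative, as the abstract does not specify the exact elongation method.

```python
import numpy as np

def elongate_voiceless(signal, segments, factor=2):
    """Stretch voiceless segments of a signal by integer sample repetition.

    segments: list of (start, end, is_voiceless) index triples covering signal.
    """
    out = []
    for start, end, voiceless in segments:
        chunk = signal[start:end]
        out.append(np.repeat(chunk, factor) if voiceless else chunk)
    return np.concatenate(out)

sig = np.arange(6, dtype=float)       # toy 6-sample "signal"
segs = [(0, 3, False), (3, 6, True)]  # second segment marked voiceless
print(elongate_voiceless(sig, segs).tolist())
# [0.0, 1.0, 2.0, 3.0, 3.0, 4.0, 4.0, 5.0, 5.0]
```

A production system would use a proper time-scale modification method (e.g. overlap-add) to avoid artifacts, but sample repetition suffices to show the idea of lengthening only the voiceless regions.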