Improving bottleneck features for Vietnamese large vocabulary continuous speech recognition system using deep neural networks
In this paper, pre-training based on denoising auto-encoders is investigated and shown to provide good models for initializing the bottleneck networks of a Vietnamese speech recognition system, resulting in better recognition performance than the baseline bottleneck features reported previously. The experiments are carried out on a dataset of speech from the Voice of Vietnam (VOV) channel. The results show that DBNF extraction for Vietnamese recognition decreases the relative word error rate by 14% and 39% compared to the baseline bottleneck features and the MFCC baseline, respectively.
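The denoising auto-encoder pre-training described above can be sketched in a minimal form (a toy NumPy example with synthetic data and invented dimensions, not the paper's actual configuration): corrupt the input, train the network to reconstruct the clean version, then keep the narrow hidden layer's activations as bottleneck features.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy "acoustic features": 200 frames of 39-dim MFCC-like vectors (synthetic).
X = rng.normal(size=(200, 39))

# One denoising auto-encoder with a narrow (bottleneck) hidden layer.
n_in, n_hid = 39, 8
W = rng.normal(scale=0.1, size=(n_in, n_hid))
b = np.zeros(n_hid)
b_out = np.zeros(n_in)

lr = 0.05
for epoch in range(50):
    # Corrupt the input with Gaussian noise; reconstruct the clean input.
    X_noisy = X + rng.normal(scale=0.3, size=X.shape)
    H = sigmoid(X_noisy @ W + b)          # encode
    X_rec = H @ W.T + b_out               # decode (tied weights)
    err = X_rec - X                       # reconstruction error
    # Back-propagate the squared-error loss through the tied weights.
    dH = (err @ W) * H * (1 - H)
    gW = X_noisy.T @ dH + err.T @ H
    W -= lr * gW / len(X)
    b -= lr * dH.mean(axis=0)
    b_out -= lr * err.mean(axis=0)

# After pre-training, the hidden activations serve as bottleneck features.
bottleneck_features = sigmoid(X @ W + b)
print(bottleneck_features.shape)  # (200, 8)
```

In a real system this pre-trained encoder would then be fine-tuned inside the full network, and its 8-dimensional activations fed to the recognizer in place of (or alongside) the raw MFCCs.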
Modularity and Neural Integration in Large-Vocabulary Continuous Speech Recognition
This thesis tackles the problems of modularity in large-vocabulary continuous speech recognition with the use of neural networks
Topic Identification for Speech without ASR
Modern topic identification (topic ID) systems for speech use automatic speech recognition (ASR) to produce speech transcripts, and perform supervised classification on such ASR outputs. However, under resource-limited conditions, the manually transcribed speech required to develop standard ASR systems can be severely limited or unavailable. In this paper, we investigate alternative unsupervised solutions to obtaining tokenizations of speech in terms of a vocabulary of automatically discovered word-like or phoneme-like units, without depending on the supervised training of ASR systems. Moreover, using automatic phoneme-like tokenizations, we demonstrate that a convolutional neural network based framework for learning spoken document representations provides competitive performance compared to a standard bag-of-words representation, as evidenced by comprehensive topic ID evaluations on both single-label and multi-label classification tasks.
Comment: 5 pages, 2 figures; accepted for publication at Interspeech 201
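The unsupervised tokenization idea can be illustrated with a toy sketch (the unit labels and documents below are invented): once an unsupervised system has turned each spoken document into a sequence of discovered phoneme-like units, a simple bag-of-units vector can feed a standard topic ID classifier without any ASR.

```python
from collections import Counter

# Hypothetical output of an unsupervised unit-discovery system: each spoken
# document becomes a sequence of automatically discovered phoneme-like units.
documents = {
    "doc1": ["u3", "u7", "u7", "u12", "u3"],
    "doc2": ["u7", "u1", "u1", "u12"],
}

# Build a fixed vocabulary over all discovered units.
vocab = sorted({u for seq in documents.values() for u in seq})

def bag_of_units(seq):
    """Count-based document vector over discovered units (no ASR needed)."""
    counts = Counter(seq)
    return [counts.get(u, 0) for u in vocab]

vectors = {doc: bag_of_units(seq) for doc, seq in documents.items()}
print(vocab)            # ['u1', 'u12', 'u3', 'u7']
print(vectors["doc1"])  # [0, 1, 2, 2]
```

The paper's CNN-based representation replaces these raw counts with learned embeddings over the same unit sequences, which is what makes it competitive with the bag-of-words baseline.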
Multi-language neural network language models
Recently there has been a lot of interest in neural network based language models. These models typically consist of vocabulary-dependent input and output layers and one or more vocabulary-independent hidden layers. One standard issue with these approaches is that large quantities of training data are needed to ensure robust parameter estimates. This poses a significant problem when only limited data is available. One possible way to address this issue is augmentation: model-based, in the form of language model interpolation, and data-based, in the form of data augmentation. However, these approaches may not always be usable due to the vocabulary-dependent input and output layers, which seriously restrict the nature of the data that can be used for augmentation. This paper describes a general solution whereby only the vocabulary-independent hidden layers are augmented. Such an approach makes it possible to examine augmentation from previously impossible domains. Moreover, this approach paves a direct way for multi-task learning with these models. As a proof of concept, this paper examines the use of multilingual data for augmenting the hidden layers of recurrent neural network language models. Experiments are conducted using a set of language packs released within the IARPA Babel program.
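The hidden-layer sharing scheme can be sketched as follows (a toy NumPy model with invented sizes, not the paper's architecture): each language keeps its own vocabulary-dependent input and output layers, while a single vocabulary-independent hidden layer is shared and can therefore be trained on data from any language.

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-language vocabularies (input/output layers are vocabulary dependent).
vocab_size = {"en": 50, "vi": 40}
embed_dim, hidden_dim = 16, 32

# Language-specific input embeddings and output layers...
inputs  = {l: rng.normal(scale=0.1, size=(v, embed_dim)) for l, v in vocab_size.items()}
outputs = {l: rng.normal(scale=0.1, size=(hidden_dim, v)) for l, v in vocab_size.items()}
# ...but a single shared, vocabulary-independent hidden layer.
W_shared = rng.normal(scale=0.1, size=(embed_dim, hidden_dim))

def forward(lang, word_id):
    """Predict a next-word distribution for `lang`; during training the
    shared hidden layer receives gradient from every language's data."""
    h = np.tanh(inputs[lang][word_id] @ W_shared)   # shared hidden layer
    logits = h @ outputs[lang]                      # language-specific softmax
    p = np.exp(logits - logits.max())
    return p / p.sum()

print(forward("en", 3).shape)  # (50,)
print(forward("vi", 3).shape)  # (40,)
```

Because `W_shared` never touches a vocabulary, data from any language (or task) can update it, which is exactly what makes cross-domain augmentation possible here.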
Automatic Speech Recognition for Low-Resource and Morphologically Complex Languages
The application of deep neural networks to the task of acoustic modeling for automatic speech recognition (ASR) has resulted in dramatic decreases in word error rates, allowing for the use of this technology in smartphones and personal home assistants for high-resource languages. Developing ASR models of this caliber, however, requires hundreds or thousands of hours of transcribed speech recordings, which presents challenges for most of the world's languages. In this work, we investigate the applicability of three distinct architectures that have previously been used for ASR in languages with limited training resources. We tested these architectures using publicly available ASR datasets for several typologically and orthographically diverse languages, whose data was produced under a variety of conditions using different speech collection strategies, practices, and equipment. Additionally, we performed data augmentation on this audio, such that the amount of data could increase nearly tenfold, synthetically creating a higher-resource training condition. The architectures and their individual components were modified, and parameters explored, in search of a best-fit combination of features and modeling schemas for a specific language morphology. Our results point to the importance of considering language-specific and corpus-specific factors and experimenting with multiple approaches when developing ASR systems for resource-constrained languages.
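One common way to grow a small ASR corpus several-fold, as described above, is speed perturbation; a minimal sketch (assuming plain NumPy resampling by linear interpolation, not the authors' exact pipeline) looks like this:

```python
import numpy as np

def speed_perturb(signal, factor):
    """Resample a waveform by linear interpolation to change its speed;
    factor < 1 slows the audio down, factor > 1 speeds it up."""
    n_out = int(round(len(signal) / factor))
    old_idx = np.arange(len(signal))
    new_idx = np.linspace(0, len(signal) - 1, n_out)
    return np.interp(new_idx, old_idx, signal)

rng = np.random.default_rng(0)
audio = rng.normal(size=16000)  # one second of fake 16 kHz audio

# Generate several perturbed copies per utterance to grow the training set.
augmented = [speed_perturb(audio, f) for f in (0.9, 1.0, 1.1)]
print([len(a) for a in augmented])  # [17778, 16000, 14545]
```

Combining a few speed factors with other transforms (noise, reverberation, volume) is how a corpus can approach the near-tenfold increase mentioned in the abstract.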
Automatic Speech Recognition for Low-resource Languages and Accents Using Multilingual and Crosslingual Information
This thesis explores methods to rapidly bootstrap automatic speech recognition systems for languages which lack resources for speech and language processing. We focus on finding approaches that allow using data from multiple languages to improve performance for those languages at different levels, such as feature extraction, acoustic modeling, and language modeling. On the application side, this thesis also includes research work on non-native and code-switching speech.
Multi-Task Neural Networks for Speech Recognition
The first part of this Master's thesis covers a theoretical investigation of the principles and usage of neural networks, including their applicability to speech recognition tasks. It then summarizes the operating principles of multi-task neural networks and some recent experiments with them. The practical part of the thesis reports changes made to a neural network training tool to support multi-task training. The preparation of the experimental setting is then described, including a number of scripts written especially for this purpose. The experiments presented in the thesis explore the idea of using articulatory characteristics of phonemes as secondary tasks for multi-task training. The experiments are conducted on two datasets of different quality and size, representing different languages - English and Vietnamese. Articulatory characteristics are also combined with other secondary tasks, such as context, to see how well they complement each other. A comparison is made between networks of different sizes to see how network size affects the effectiveness of multi-task training. These experiments show that multi-task training with articulatory characteristics as secondary tasks can enhance training and yield better phoneme accuracy. Finally, the multi-task networks are tested in a speech recognition system as a feature extractor.
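The shared-hidden-layer, two-head structure used in multi-task training can be sketched as follows (a toy NumPy forward pass with invented layer sizes and targets; the thesis's actual networks differ):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n_in, n_hid = 39, 64
n_phonemes = 45   # primary task: phoneme classification
n_artic = 10      # secondary task: an articulatory class (e.g. place)

# Shared hidden layer feeding two task-specific output heads.
W_h = rng.normal(scale=0.1, size=(n_in, n_hid))
W_phn = rng.normal(scale=0.1, size=(n_hid, n_phonemes))
W_art = rng.normal(scale=0.1, size=(n_hid, n_artic))

x = rng.normal(size=(1, n_in))   # one acoustic frame
h = np.tanh(x @ W_h)             # shared representation
p_phn = softmax(h @ W_phn)       # primary output
p_art = softmax(h @ W_art)       # secondary output

# Joint loss: the secondary task acts as a regularizer on the shared layer.
y_phn, y_art, alpha = 7, 2, 0.3
loss = -np.log(p_phn[0, y_phn]) - alpha * np.log(p_art[0, y_art])
print(float(loss))
```

At test time the secondary head is dropped; the shared hidden layer (shaped by both losses) is what gets reused, e.g. as the feature extractor mentioned at the end of the abstract.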
Towards automatic assessment of spontaneous spoken English
With increasing global demand for learning English as a second language, there has been considerable interest in methods of automatic assessment of spoken language proficiency, for use both in interactive electronic learning tools and for grading candidates for formal qualifications. This paper presents an automatic system for the assessment of spontaneous spoken language. Prompts or questions requiring spontaneous speech responses elicit more natural speech, which better reflects a learner's proficiency level than read speech. In addition to the challenges of highly variable non-native learner speech and noisy real-world recording conditions, this requires an automatic system to handle disfluent, non-grammatical, spontaneous speech with the underlying text unknown. To handle these challenges, a strong deep-learning-based speech recognition system is applied in combination with a Gaussian Process (GP) grader. A range of features derived from the audio using the recognition hypothesis are investigated for their efficacy in the automatic grader. The proposed system is shown to predict grades at a similar level to the original examiner graders on real candidate entries. Interpolation with the examiner grades further boosts performance. The ability to reject poorly estimated grades is also important, and measures are proposed to evaluate the performance of rejection schemes. The GP variance is used to decide which automatic grades should be rejected; backing off to an expert grader for the least confident grades gives gains.
Cambridge Assessment English
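The GP-based rejection scheme can be sketched with a minimal GP regression in NumPy (toy one-dimensional features and invented grades; the real system uses richer audio-derived features): test points far from the training data get high predictive variance, and those grades are backed off to an expert grader.

```python
import numpy as np

def rbf(a, b, length=1.0):
    """Squared-exponential kernel between two 1-D point sets."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

# Toy 1-D "feature" per candidate and examiner grades for the training set.
x_train = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y_train = np.array([2.0, 2.5, 3.5, 4.0, 4.5])
x_test = np.array([0.75, 5.0])  # second point lies far from the training data

noise = 0.1
K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
K_inv = np.linalg.inv(K)
K_s = rbf(x_test, x_train)

mean = K_s @ K_inv @ y_train                    # predicted grades
var = 1.0 - np.sum(K_s @ K_inv * K_s, axis=1)   # predictive variance

# Reject grades whose GP variance is too high; back off to an examiner.
threshold = 0.5
for m, v in zip(mean, var):
    if v > threshold:
        print(f"variance {v:.2f}: reject, send to expert grader")
    else:
        print(f"variance {v:.2f}: accept grade {m:.2f}")
```

The rejection threshold trades automation against reliability: lowering it sends more candidates to human graders, which is the effect the paper's rejection-performance measures are designed to quantify.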