109 research outputs found

    Impact of deep MLP architecture on different acoustic modeling techniques for under-resourced speech recognition

    Posterior based acoustic modeling techniques such as Kullback–Leibler divergence based HMM (KL-HMM) and Tandem are able to exploit out-of-language data through posterior features, estimated by a Multi-Layer Perceptron (MLP). In this paper, we investigate the performance of posterior based approaches in the context of under-resourced speech recognition when a standard three-layer MLP is replaced by a deeper five-layer MLP. The deeper MLP architecture yields similar gains of about 15% (relative) for Tandem, KL-HMM as well as for a hybrid HMM/MLP system that directly uses the posterior estimates as emission probabilities. The best performing system, a bilingual KL-HMM based on a deep MLP, jointly trained on Afrikaans and Dutch data, performs 13% better than a hybrid system using the same bilingual MLP and 26% better than a subspace Gaussian mixture system only trained on Afrikaans data. Index Terms — KL-HMM, Tandem, hybrid system, deep MLPs, under-resourced speech recognition.
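
The posterior-based scoring these systems rely on can be pictured as comparing an MLP posterior vector against each HMM state's categorical distribution via KL divergence. The following is a minimal illustrative sketch, not the authors' implementation; the toy distributions are invented:

```python
import math

def kl_divergence(y, p, eps=1e-12):
    """D_KL(y || p) for two categorical distributions over phone classes."""
    return sum(yi * math.log((yi + eps) / (pi + eps)) for yi, pi in zip(y, p))

def best_state(posterior, state_dists):
    """KL-HMM-style local score: pick the HMM state whose categorical
    distribution is closest (in KL divergence) to the MLP posterior."""
    scores = [kl_divergence(posterior, p) for p in state_dists]
    return min(range(len(scores)), key=scores.__getitem__), scores

# Toy example: 3-class posteriors for one frame, two HMM states.
frame_posterior = [0.7, 0.2, 0.1]
states = [[0.6, 0.3, 0.1],   # state 0: close to the frame posterior
          [0.1, 0.1, 0.8]]   # state 1: far from it
idx, scores = best_state(frame_posterior, states)
print(idx)  # state 0 wins
```

In a full KL-HMM these per-frame scores replace the usual likelihoods inside Viterbi decoding; here only the frame-level comparison is shown.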

    Automatic Speech Recognition for Low-resource Languages and Accents Using Multilingual and Crosslingual Information

    This thesis explores methods to rapidly bootstrap automatic speech recognition systems for languages that lack resources for speech and language processing. We focus on approaches that use data from multiple languages to improve performance for those languages at several levels, such as feature extraction, acoustic modeling and language modeling. On the application side, this thesis also includes research on non-native and code-switching speech.

    Analysis of Data Augmentation Methods for Low-Resource Maltese ASR

    Recent years have seen an increased interest in the computational speech processing of Maltese, but resources remain sparse. In this paper, we consider data augmentation techniques for improving speech recognition for low-resource languages, focusing on Maltese as a test case. We consider three different types of data augmentation: unsupervised training, multilingual training and the use of synthesized speech as training data. The goal is to determine which of these techniques, or combination of them, is the most effective at improving speech recognition for languages where the starting point is a small corpus of approximately 7 hours of transcribed speech. Our results show that combining the data augmentation techniques studied here leads to an absolute WER improvement of 15% without the use of a language model.
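
For context, the word error rate (WER) figures quoted in abstracts like this one are computed from the word-level edit distance between reference and hypothesis transcripts. A minimal sketch, not tied to any particular paper's scoring toolkit:

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference
    length, computed via Levenshtein distance over word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[-1][-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words
```

An "absolute WER improvement of 15%" means this ratio drops by 0.15, e.g. from 0.50 to 0.35.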

    Multilingual Deep Neural Network based Acoustic Modeling For Rapid Language Adaptation

    This paper presents a study on multilingual deep neural network (DNN) based acoustic modeling and its application to new languages. We investigate the effect of phone merging on multilingual DNNs in the context of rapid language adaptation. Moreover, the combination of multilingual DNNs with Kullback–Leibler divergence based acoustic modeling (KL-HMM) is explored. Using ten different languages from the GlobalPhone database, our studies reveal that crosslingual acoustic model transfer through multilingual DNNs is superior to unsupervised RBM pre-training and greedy layer-wise supervised training. We also found that KL-HMM based decoding consistently outperforms conventional hybrid decoding, especially in low-resource scenarios. Furthermore, the experiments indicate that multilingual DNN training benefits equally from simple phoneset concatenation and from manually derived universal phonesets.
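
The "simple phoneset concatenation" baseline mentioned above can be pictured as tagging each phone with its language ID so the multilingual DNN gets one disjoint output unit per language-phone pair. A hypothetical sketch; the phone lists are invented toy examples:

```python
def concat_phonesets(phonesets):
    """Build a multilingual DNN target inventory by prefixing each phone
    with its language ID, keeping the per-language sets disjoint."""
    targets = []
    for lang, phones in sorted(phonesets.items()):
        targets.extend(f"{lang}:{ph}" for ph in phones)
    return targets

inventory = concat_phonesets({
    "de": ["a", "e", "sh"],   # toy phone lists, not real inventories
    "fr": ["a", "e", "r"],
})
print(inventory)
# ['de:a', 'de:e', 'de:sh', 'fr:a', 'fr:e', 'fr:r']
```

A universal phoneset would instead merge units like `de:a` and `fr:a` into a single shared target; the paper's finding is that both choices train comparably well.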

    Development of Bilingual ASR System for MediaParl Corpus

    The development of an Automatic Speech Recognition (ASR) system for the bilingual MediaParl corpus is challenging for several reasons: (1) reverberant recordings, (2) accented speech, and (3) no prior information about the language. In that context, we employ frequency domain linear prediction based (FDLP) features to reduce the effect of reverberation, exploit bilingual deep neural networks in Tandem and hybrid acoustic modeling approaches to significantly improve ASR for accented speech, and develop a fully bilingual ASR system using entropy-based decoding-graph selection. Our experiments indicate that the proposed bilingual ASR system performs similarly to a language-specific ASR system once approximately five seconds of speech are available.
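
One way to picture an entropy-based selection criterion of this kind: compute the entropy of the acoustic model's frame posteriors under each language and pick the language whose posteriors are, on average, the least uncertain. This is a loose sketch of the general idea under invented toy data, not the paper's exact procedure:

```python
import math

def entropy(posterior, eps=1e-12):
    """Shannon entropy (in nats) of a categorical posterior distribution."""
    return -sum(p * math.log(p + eps) for p in posterior if p > 0)

def select_language(posteriors_by_lang):
    """Pick the language whose frame posteriors have the lowest mean
    entropy, i.e. the language the model is most confident about."""
    avg = {lang: sum(entropy(p) for p in frames) / len(frames)
           for lang, frames in posteriors_by_lang.items()}
    return min(avg, key=avg.get)

# Toy frames: the French model is confident, the German one is near-uniform.
frames = {
    "fr": [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]],
    "de": [[0.4, 0.3, 0.3], [0.34, 0.33, 0.33]],
}
print(select_language(frames))  # fr
```

With a few seconds of speech (a few hundred frames), such an average becomes stable, which matches the abstract's "approximately five seconds" observation.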

    A survey on Automatic Speech Recognition systems for Portuguese language and its variations

    Communication is an essential part of being human and living in society. There are many languages and variations of them: a person may speak English in one place yet be unable to communicate effectively with someone who speaks English with a different accent. Voice and speech data are important in several application areas, such as health, security, biometric analysis and education. However, most studies focus on English, Arabic or Asian languages, neglecting other relevant languages such as Portuguese, which leaves many research questions open. It is therefore crucial to understand the field: what are the most used techniques for feature extraction and classification, and so on. This paper presents a survey of automatic speech recognition components for the understudied Portuguese language and its variations. Covering a total of 101 papers from 2012 to 2018, it explains the trends in Portuguese-based automatic speech recognition and, as its main contribution, presents and discusses several possible unexplored methods in a collaborative and overall way.

    Multilingual audio information management system based on semantic knowledge in complex environments

    This paper proposes a multilingual audio information management system based on semantic knowledge in complex environments. The complex environment is defined by limited resources (financial, material, human, and audio resources); the poor quality of the audio signal taken from an internet radio channel; the multilingual context (Spanish, French, and Basque, which is in an under-resourced situation in some areas); and the regular appearance of cross-lingual elements between the three languages. In addition, the system is constrained by the requirements of the local multilingual industrial sector. We present the first evolutionary system based on a scalable architecture that is able to fulfill these specifications with automatic adaptation based on automatic semantic speech recognition, folksonomies, automatic configuration selection, machine learning, neural computing methodologies, and collaborative networks. As a result, the initial goals have been accomplished and the usability of the final application has been tested successfully, even with non-experienced users.

    This work is funded by grant TEC201677791-C4 from the Plan Nacional de I+D+i, Ministry of Economic Affairs and Competitiveness of Spain; the DomusVi Foundation (Kms para recorder); the Basque Government (ELKARTEK KK-2018/00114, GEJ IT1189-19); the Government of Gipuzkoa (DG18/14, DG17/16); UPV/EHU (GIU19/090); and COST Actions (CA18106, CA15225).