Search CORE

2,197 research outputs found

Language modeling and transcription of the TED corpus lectures

Author: Cettolo M.
Federico M.
Leeuwis E.
Publication venue: IEEE
Publication date: 01/01/2003
Field of study

Transcribing lectures is a challenging task, both in acoustic and in language modeling. In this work, we present our first results on the automatic transcription of lectures from the TED corpus, recently released by ELRA and LDC. In particular, we concentrated our effort on language modeling. Baseline acoustic and language models were developed using respectively 8 hours of TED transcripts and various types of texts: conference proceedings, lecture transcripts, and conversational speech transcripts. Then, adaptation of the language model to single speakers was investigated by exploiting different kinds of information: automatic transcripts of the talk, the title of the talk, the abstract and, finally, the paper. In the last case, a 39.2% WER was achieved

Archivio della ricerca - Fondazione Bruno Kessler

University of Twente Research Information

Reducing speech recognition time and memory use by means of compound (de-)composition

Author: Martens Jean-Pierre
Réveil Bert
Publication venue: STW Technology Foundation
Publication date: 01/01/2008
Field of study

This paper tackles the problem of Out Of Vocabulary words in Automatic Speech Transcription applications for a compound language (Dutch). A seemingly attractive way to reduce the amount of OOV words in compound languages is to extend the AST system with a compound (de-)composition module. However, thus far, successful implementations of this approach are rather scarce. We developed a novel data driven compound (de-)composition module and tested it in two different AST experiments. For equal lexicon sizes, we see that our compound processor lowers the OOV rate. Moreover we are able to transform that gain in OOV rate into a reduction of the Word Error Rate of the transcription system. Using our approach we built a system with an 84K lexicon that performs as accurately as a baseline system with a 168K lexicon, but our system is 5-6% faster and requires about 50% less storage for the lexical component, even though this component is encoded in an optimal way (prefix-suffix tree compression)

Ghent University Academic Bibliography

Analysis of Data Augmentation Methods for Low-Resource Maltese ASR

Author: Borg Claudia
DeMarco Andrea
Gatt Albert
Mena Carlos
van der Plas Lonneke
Williams Aiden
Publication venue
Publication date: 20/01/2023
Field of study

Recent years have seen an increased interest in the computational speech processing of Maltese, but resources remain sparse. In this paper, we consider data augmentation techniques for improving speech recognition for low-resource languages, focusing on Maltese as a test case. We consider three different types of data augmentation: unsupervised training, multilingual training and the use of synthesized speech as training data. The goal is to determine which of these techniques, or combination of them, is the most effective to improve speech recognition for languages where the starting point is a small corpus of approximately 7 hours of transcribed speech. Our results show that combining the data augmentation techniques studied here lead us to an absolute WER improvement of 15% without the use of a language model.Comment: 12 page

arXiv.org e-Print Archive

Modeling of Filled Pauses and Onomatopoeas for Spontaneous Speech Recognition

Author: Andrej Zgank
Mirjam Sepesy Maucec
Publication venue: 'IntechOpen'
Publication date: 16/08/2010
Field of study

IntechOpen

I hear you eat and speak: automatic recognition of eating condition and food type, use-cases, and impact on ASR performance

Author: Batliner A
Hantke S
Kurle R
Mousa AELD
Ringeval F
Schuller B
Weninger F
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 14/04/2016
Field of study

We propose a new recognition task in the area of computational paralinguistics: automatic recognition of eating conditions in speech, i. e., whether people are eating while speaking, and what they are eating. To this end, we introduce the audio-visual iHEARu-EAT database featuring 1.6 k utterances of 30 subjects (mean age: 26.1 years, standard deviation: 2.66 years, gender balanced, German speakers), six types of food (Apple, Nectarine, Banana, Haribo Smurfs, Biscuit, and Crisps), and read as well as spontaneous speech, which is made publicly available for research purposes. We start with demonstrating that for automatic speech recognition (ASR), it pays off to know whether speakers are eating or not. We also propose automatic classification both by brute-forcing of low-level acoustic features as well as higher-level features related to intelligibility, obtained from an Automatic Speech Recogniser. Prediction of the eating condition was performed with a Support Vector Machine (SVM) classifier employed in a leave-one-speaker-out evaluation framework. Results show that the binary prediction of eating condition (i. e., eating or not eating) can be easily solved independently of the speaking condition; the obtained average recalls are all above 90%. Low-level acoustic features provide the best performance on spontaneous speech, which reaches up to 62.3% average recall for multi-way classification of the eating condition, i. e., discriminating the six types of food, as well as not eating. The early fusion of features related to intelligibility with the brute-forced acoustic feature set improves the performance on read speech, reaching a 66.4% average recall for the multi-way classification task. Analysing features and classifier errors leads to a suitable ordinal scale for eating conditions, on which automatic regression can be performed with up to 56.2% determination coefficient

Directory of Open Access Journals

Spiral - Imperial College Digital Repository

SPA: Web-based platform for easy access to speech processing modules

Author: Abad A.
Batista F.
Curto P.
Ferreira J.
Matos D. M. de.
Moniz H.
Ribeiro E.
Ribeiro R.
Trancoso I.
Publication venue: ELDA/ELRA
Publication date: 01/01/2016
Field of study

This paper presents SPA, a web-based Speech Analytics platform that integrates several speech processing modules and that makes it possible to use them through the web. It was developed with the aim of facilitating the usage of the modules, without the need to know about software dependencies and specific configurations. Apart from being accessed by a web-browser, the platform also provides a REST API for easy integration with other applications. The platform is flexible, scalable, provides authentication for access restrictions, and was developed taking into consideration the time and effort of providing new services. The platform is still being improved, but it already integrates a considerable number of audio and text processing modules, including: Automatic transcription, speech disfluency classification, emotion detection, dialog act recognition, age and gender classification, non-nativeness detection, hyperarticulation detection, dialog act recognition, and two external modules for feature extraction and DTMF detection. This paper describes the SPA architecture, presents the already integrated modules, and provides a detailed description for the ones most recently integrated.info:eu-repo/semantics/publishedVersio

Repositório Institucional do ISCTE-IUL