83 research outputs found

Phoneme-Grapheme Based Speech Recognition System

Author: Bengio Samy
Bourlard Hervé
Magimai.-Doss Mathew
Stephenson Todd Andrew
Publication venue: IDIAP
Publication date: 10/03/2006

State-of-the-art automatic speech recognition (ASR) systems typically use phonemes as subword units. In this paper, we investigate a system in which the word models are defined in terms of two different subword units, namely phonemes and graphemes. We train models for both subword units and then perform decoding using either both or just one of them. We have studied this system for American English, where the correspondence between graphemes and phonemes is weak. The results of our studies show that there is good potential in using graphemes as auxiliary subword units.

Infoscience - École polytechnique fédérale de Lausanne

Flexible decision trees for grapheme based speech recognition

Author: Mimer Borislava
Schultz Tanja
Stüker Sebastian
Publication venue: Cottbus
Publication date: 01/01/2004

KITopen

A Grapheme based Speech Recognition System for Russian

Author: Schultz Tanja
Stüker Sebastian
Publication date: 16/06/2008

KITopen

Character-Level Incremental Speech Recognition with Recurrent Neural Networks

Author: Hwang Kyuyeon
Sung Wonyong
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 28/01/2016

In real-time speech recognition applications, latency is an important issue. We have developed a character-level incremental speech recognition (ISR) system that responds quickly even while the user is still speaking, with the hypotheses gradually improving as the speech proceeds. The algorithm employs a speech-to-character unidirectional recurrent neural network (RNN), which is trained end-to-end with connectionist temporal classification (CTC), and an RNN-based character-level language model (LM). The output values of the CTC-trained RNN are character-level probabilities, which are processed by beam search decoding. The RNN LM augments the decoding by providing long-term dependency information. We propose tree-based online beam search with additional depth-pruning, which enables the system to process infinitely long input speech with low latency. This system not only responds quickly to speech but can also dictate out-of-vocabulary (OOV) words according to their pronunciation. The proposed model achieves a word error rate (WER) of 8.90% on the Wall Street Journal (WSJ) Nov'92 20K evaluation set when trained on the WSJ SI-284 training set.

Comment: To appear in ICASSP 201
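As a rough illustration of how per-frame character probabilities from a CTC-trained network become text, the sketch below uses greedy best-path decoding. This is a deliberately simpler stand-in for the tree-based online beam search with an RNN LM described in the abstract, not the authors' method. The blank symbol `_`, the function names, and the toy alphabet are all assumptions made for this example.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol; real systems reserve a dedicated index


def ctc_collapse(frame_labels, blank=BLANK):
    """Apply the CTC collapsing rule: merge consecutive repeats, then drop blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != blank)


def greedy_decode(frame_probs, alphabet, blank=BLANK):
    """Pick the most probable symbol in each frame, then collapse the path.

    frame_probs: one list of probabilities per frame, aligned with `alphabet`.
    """
    best_path = [
        alphabet[max(range(len(probs)), key=probs.__getitem__)]
        for probs in frame_probs
    ]
    return ctc_collapse(best_path, blank)
```

A blank between two identical characters is what lets a doubled letter survive collapsing: the frame sequence `hh_e_ll_lo` collapses to `hello`, whereas `hhelllo` would collapse to `helo`.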

arXiv.org e-Print Archive

Crossref

Towards Rapid Language Portability of Speech Processing Systems

Author: Schultz Tanja
Publication date: 16/06/2008

KITopen

Advances in All-Neural Speech Recognition

Author: Droppo J.
Stolcke A.
Yu C.
Zweig G.
Publication date: 25/01/2017

This paper advances the design of CTC-based all-neural (or end-to-end) speech recognizers. We propose a novel symbol inventory, and a novel iterated-CTC method in which a second system is used to transform a noisy initial output into a cleaner version. We present a number of stabilization and initialization methods we have found useful in training these networks. We evaluate our system on the commonly used NIST 2000 conversational telephony test set, and significantly exceed the previously published performance of similar systems, both with and without the use of an external language model and decoding technology.

arXiv.org e-Print Archive

Crossref

On the Choice of Modeling Unit for Sequence-to-Sequence Speech Recognition

Author: Bruguier Antoine
Irie Kazuki
Kannan Anjuli
Nguyen Patrick
Prabhavalkar Rohit
Rybach David
Publication venue: International Speech Communication Association
Publication date: 01/01/2019

In conventional speech recognition, phoneme-based models outperform grapheme-based models for non-phonetic languages such as English. The performance gap between the two typically reduces as the amount of training data is increased. In this work, we examine the impact of the choice of modeling unit for attention-based encoder-decoder models. We conduct experiments on the LibriSpeech 100hr, 460hr, and 960hr tasks, using various target units (phoneme, grapheme, and word-piece); across all tasks, we find that grapheme or word-piece models consistently outperform phoneme-based models, even though they are evaluated without a lexicon or an external language model. We also investigate model complementarity: we find that we can improve WERs by up to 9% relative by rescoring N-best lists generated from a strong word-piece based baseline with either the phoneme or the grapheme model. Rescoring an N-best list generated by the phonemic system, however, provides limited improvements. Further analysis shows that the word-piece-based models produce more diverse N-best hypotheses, and thus lower oracle WERs, than phonemic models.

Comment: To appear in the proceedings of INTERSPEECH 201
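The N-best rescoring mentioned in the abstract can be sketched generically as follows. This is a minimal illustration that assumes a simple linear interpolation of log-probabilities; the abstract does not say how the two models' scores were combined, so the `weight` parameter, the function name, and the scoring callback are hypothetical.

```python
def rescore_nbest(nbest, second_model_score, weight=0.5):
    """Return the hypothesis with the best interpolated score.

    nbest: list of (hypothesis, first_pass_log_prob) pairs.
    second_model_score: callable mapping a hypothesis to a log-probability
        under the rescoring model (hypothetical interface).
    weight: interpolation weight given to the rescoring model (assumed form).
    """
    rescored = [
        (hyp, (1 - weight) * first_score + weight * second_model_score(hyp))
        for hyp, first_score in nbest
    ]
    # Higher log-probability is better.
    return max(rescored, key=lambda pair: pair[1])[0]
```

With `weight=0` the function simply returns the first-pass best hypothesis, so any gain from rescoring comes from hypotheses lower in the list that the second model prefers. That is consistent with the abstract's observation that more diverse N-best lists leave more room for improvement.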

arXiv.org e-Print Archive

Crossref

Publikationsserver der RWTH Aachen University

Acoustic Modelling for Under-Resourced Languages

Author: Stüker Sebastian
Publication venue: KIT-Bibliothek, Karlsruhe
Publication date: 01/01/2009

Automatic speech recognition systems have so far been developed for only a few of the 4,000-7,000 existing languages. In this thesis we examine methods to rapidly create acoustic models in new, possibly under-resourced languages, in a time- and cost-effective manner. For this we examine the use of multilingual models, the application of articulatory features across languages, and the automatic discovery of word-like units in unwritten languages.

KITopen

A study of phoneme and grapheme based context-dependent ASR systems

Author: Dines John
Magimai.-Doss Mathew
Publication venue: IDIAP
Publication date: 11/02/2010

In this paper we present a study of automatic speech recognition systems using context-dependent phonemes and graphemes as sub-word units, based on the conventional HMM/GMM system as well as a tandem system. Experimental studies conducted on three different continuous speech recognition tasks show that systems using only context-dependent graphemes can yield competitive performance on small to medium vocabulary tasks when compared to a context-dependent phoneme-based automatic speech recognition system. In particular, we demonstrate the utility of tandem features, which use an MLP trained to estimate phoneme posterior probabilities, in improving the performance of grapheme-based recognition systems by incorporating phonemic knowledge into the system without having to explicitly define a phonetically transcribed lexicon.

Infoscience - École polytechnique fédérale de Lausanne