Text Preprocessing for Speech Synthesis

Authors: Pfitzinger, Hartmut R.; Reichel, Uwe D.
Publication date: 01/01/2006

In this paper we describe our text preprocessing modules for English text-to-speech synthesis. These modules comprise rule-based text normalization (covering sentence segmentation and normalization of non-standard words), statistical part-of-speech tagging, and statistical syllabification, grapheme-to-phoneme conversion, and word stress assignment, which rely in part on rule-based morphological analysis.

CiteSeerX

Open Access LMU

A Comparison of Different Machine Transliteration Models

Authors: Choi, K.; Isahara, H.; Oh, J.
Publication venue: AI Access Foundation
Publication date: 06/10/2011

Machine transliteration is a method for automatically converting words in one language into phonetically equivalent ones in another language. Machine transliteration plays an important role in natural language applications such as information retrieval and machine translation, especially for handling proper nouns and technical terms. Four machine transliteration models -- grapheme-based transliteration model, phoneme-based transliteration model, hybrid transliteration model, and correspondence-based transliteration model -- have been proposed by several researchers. To date, however, there has been little research on a framework in which multiple transliteration models can operate simultaneously. Furthermore, there has been no comparison of the four models within the same framework and using the same data. We addressed these problems by 1) modeling the four models within the same framework, 2) comparing them under the same conditions, and 3) developing a way to improve machine transliteration through this comparison. Our comparison showed that the hybrid and correspondence-based models were the most effective and that the four models can be used in a complementary manner to improve machine transliteration performance.

arXiv.org e-Print Archive

Crossref

Few-Shot and Zero-Shot Learning for Historical Text Normalization

Authors: Bollmann, Marcel; Korchagina, Natalia; Søgaard, Anders
Publication date: 01/01/2019

Historical text normalization often relies on small training datasets. Recent work has shown that multi-task learning can lead to significant improvements by exploiting synergies with related datasets, but there has been no systematic study of different multi-task learning architectures. This paper evaluates 63 multi-task learning configurations for sequence-to-sequence-based historical text normalization across ten datasets from eight languages, using autoencoding, grapheme-to-phoneme mapping, and lemmatization as auxiliary tasks. We observe consistent, significant improvements across languages when training data for the target task is limited, but minimal or no improvements when training data is abundant. We also show that zero-shot learning outperforms the simple, but relatively strong, identity baseline.
Comment: Accepted at DeepLo-201

arXiv.org e-Print Archive

Crossref

Publikationer från Linköpings universitet

Copenhagen University Research Information System

ZORA

Digitala Vetenskapliga Arkivet - Academic Archive On-line

Zero-shot keyword spotting for visual speech recognition in-the-wild

Authors: Fei Tao; JS Chung; K Audhkhasi; K He; M Cooke; S Fernández; S Hochreiter; S Watanabe; Z Akata
Publication date: 25/07/2018

Visual keyword spotting (KWS) is the problem of estimating whether a text query occurs in a given recording using only video information. This paper focuses on visual KWS for words unseen during training, a real-world, practical setting which so far has received no attention from the community. To this end, we devise an end-to-end architecture comprising (a) a state-of-the-art visual feature extractor based on spatiotemporal Residual Networks, (b) a grapheme-to-phoneme model based on sequence-to-sequence neural networks, and (c) a stack of recurrent neural networks which learn how to correlate visual features with the keyword representation. Unlike prior work on KWS, which tries to learn word representations merely from sequences of graphemes (i.e. letters), we propose the use of a grapheme-to-phoneme encoder-decoder model which learns how to map words to their pronunciation. We demonstrate that our system obtains very promising visual-only KWS results on the challenging LRS2 database, for keywords unseen during training. We also show that our system outperforms a baseline which addresses KWS via automatic speech recognition (ASR), and that it drastically improves over other recently proposed ASR-free KWS methods.
Comment: Accepted at ECCV-201

arXiv.org e-Print Archive

Crossref