A Comparison of Different Machine Transliteration Models
Machine transliteration automatically converts words in one language into
phonetically equivalent ones in another. It plays an important role in natural
language applications such as information retrieval and machine translation,
especially for handling proper nouns and technical terms. Four machine
transliteration models have been proposed by several researchers:
grapheme-based, phoneme-based, hybrid, and correspondence-based. To date,
however, there has been
little research on a framework in which multiple transliteration models can
operate simultaneously. Furthermore, there has been no comparison of the four
models within the same framework and using the same data. We addressed these
problems by 1) implementing the four models within the same framework, 2) comparing
them under the same conditions, and 3) developing a way to improve machine
transliteration through this comparison. Our comparison showed that the hybrid
and correspondence-based models were the most effective and that the four
models can be used in a complementary manner to improve machine transliteration
performance.
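To make the model distinction concrete, here is a minimal sketch (ours, not the authors') of how a hybrid model can interpolate grapheme-based and phoneme-based candidate scores; the stub generators, toy candidates, and the weight lam are hypothetical placeholders.

```python
# Hypothetical sketch of a hybrid transliteration model: a linear
# interpolation of grapheme-based and phoneme-based candidate scores.
# The two component models are stubbed out; a real system would back
# them with trained source-to-target transduction models.

def grapheme_model(source_word):
    """Stub: P(target | source graphemes) for a few toy candidates."""
    return {"keompyuteo": 0.6, "kompyuta": 0.4}

def phoneme_model(source_word):
    """Stub: P(target | source phonemes) for a few toy candidates."""
    return {"keompyuteo": 0.7, "kamputa": 0.3}

def hybrid_scores(source_word, lam=0.5):
    """Interpolate the two models: lam * P_g + (1 - lam) * P_p."""
    g = grapheme_model(source_word)
    p = phoneme_model(source_word)
    candidates = set(g) | set(p)
    return {c: lam * g.get(c, 0.0) + (1 - lam) * p.get(c, 0.0)
            for c in candidates}

if __name__ == "__main__":
    scores = hybrid_scores("computer", lam=0.5)
    best = max(scores, key=scores.get)
    print(best, scores[best])  # highest-scoring transliteration
```

With lam = 1 this reduces to the grapheme-based model and with lam = 0 to the phoneme-based one, which illustrates one way the models can be used in a complementary manner.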
Few-Shot and Zero-Shot Learning for Historical Text Normalization
Historical text normalization often relies on small training datasets. Recent
work has shown that multi-task learning can lead to significant improvements by
exploiting synergies with related datasets, but there has been no systematic
study of different multi-task learning architectures. This paper evaluates
63 multi-task learning configurations for sequence-to-sequence-based historical
text normalization across ten datasets from eight languages, using
autoencoding, grapheme-to-phoneme mapping, and lemmatization as auxiliary
tasks. We observe consistent, significant improvements across languages when
training data for the target task is limited, but minimal or no improvements
when training data is abundant. We also show that zero-shot learning
outperforms the simple, but relatively strong, identity baseline.
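As an illustration of the setup described above, the following sketch trains a shared character encoder with one decoder per task, alternating tasks across batches; the architecture sizes, task names, and random toy batches are assumptions, not the paper's configuration.

```python
# Minimal sketch of multi-task sequence-to-sequence training with a
# shared character encoder and one decoder per task (normalization
# plus the auxiliary tasks). Details are illustrative placeholders.
import random
import torch
import torch.nn as nn

VOCAB, HID, PAD = 64, 128, 0
TASKS = ["normalize", "autoencode", "g2p", "lemmatize"]

class MultiTaskSeq2Seq(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HID, padding_idx=PAD)
        self.encoder = nn.GRU(HID, HID, batch_first=True)  # shared
        self.decoders = nn.ModuleDict(
            {t: nn.GRU(HID, HID, batch_first=True) for t in TASKS})
        self.heads = nn.ModuleDict(
            {t: nn.Linear(HID, VOCAB) for t in TASKS})

    def forward(self, task, src, tgt_in):
        _, h = self.encoder(self.embed(src))          # shared encoding
        dec, _ = self.decoders[task](self.embed(tgt_in), h)
        return self.heads[task](dec)                  # per-step logits

model = MultiTaskSeq2Seq()
opt = torch.optim.Adam(model.parameters())
loss_fn = nn.CrossEntropyLoss(ignore_index=PAD)

for step in range(100):
    task = random.choice(TASKS)              # alternate tasks per batch
    src = torch.randint(1, VOCAB, (8, 12))   # toy character batch
    tgt = torch.randint(1, VOCAB, (8, 12))
    logits = model(task, src, tgt[:, :-1])   # teacher forcing
    loss = loss_fn(logits.reshape(-1, VOCAB), tgt[:, 1:].reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
```

The shared encoder is where synergies between the target task and the auxiliary tasks would accrue; each decoder stays task-specific.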
Letter to Sound Rules for Accented Lexicon Compression
This paper presents trainable methods for generating letter-to-sound rules
from a given lexicon, for use in pronouncing out-of-vocabulary words and as a
method for lexicon compression.
As the relationship between a string of letters and a string of phonemes
representing its pronunciation is not trivial for many languages, we discuss
two alignment procedures, one fully automatic and one hand-seeded, which
produce reasonable alignments of letters to phones.
Top-down induction tree models are trained on the aligned entries. We show
that combined phoneme/stress prediction is better than separate prediction
processes, and better still when the model also includes the last phonemes
transcribed and part-of-speech information. For the lexicons we have tested,
our models have a word accuracy (including stress) of 78% for OALD, 62% for
CMU, and 94% for BRULEX. The extremely high scores on the training sets allow
substantial size reductions (to less than 1/20 of the original size).
WWW site: http://tcts.fpms.ac.be/synthesis/mbrdico
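A minimal sketch of this kind of decision-tree letter-to-sound prediction, assuming a pre-aligned toy lexicon: each letter carries a joint phoneme+stress label, and the previously transcribed phoneme is fed back as a feature. The lexicon, labels, and feature window here are illustrative placeholders, not the paper's data.

```python
# Sketch: decision-tree letter-to-sound rules on a pre-aligned lexicon.
# Each letter is labelled with a joint "phoneme/stress" symbol, and the
# features combine a letter context window with the last phoneme
# transcribed, as in the combined prediction described above.
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import OrdinalEncoder

# (word, per-letter phoneme/stress labels), aligned one-to-one
LEXICON = [("cat", ["k/0", "ae/1", "t/0"]),
           ("cab", ["k/0", "ae/1", "b/0"]),
           ("bat", ["b/0", "ae/1", "t/0"])]

def rows(word, labels):
    padded = "#" + word + "#"   # '#' marks word boundaries
    prev = "-"                  # last phoneme transcribed so far
    for i, lab in enumerate(labels):
        yield [padded[i], padded[i + 1], padded[i + 2], prev], lab
        prev = lab

X, y = [], []
for word, labels in LEXICON:
    for feats, lab in rows(word, labels):
        X.append(feats); y.append(lab)

enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
tree = DecisionTreeClassifier().fit(enc.fit_transform(X), y)

# Predict greedily, feeding each predicted phoneme back as context.
word, prev, out = "bab", "-", []
padded = "#" + word + "#"
for i in range(len(word)):
    feats = [[padded[i], padded[i + 1], padded[i + 2], prev]]
    prev = tree.predict(enc.transform(feats))[0]
    out.append(prev)
print(out)
```

Because training entries that the tree reproduces exactly need not be stored, such a model can stand in for most of the lexicon, which is the compression idea above.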
Predictors of developmental dyslexia in European orthographies with varying complexity
Background: The relationship between phoneme awareness, rapid automatized naming (RAN), verbal short-term/working memory (ST/WM) and diagnostic category is investigated in control and dyslexic children, and the extent to which this depends on orthographic complexity.
Methods: General cognitive, phonological and literacy skills were tested in 1138 control and 1114 dyslexic children speaking 6 different languages spanning a large range of orthographic complexity (Finnish, Hungarian, German, Dutch, French, English).
Results: Phoneme deletion and RAN were strong concurrent predictors of developmental dyslexia, while verbal ST/WM and general verbal abilities played a comparatively minor role. In logistic regression models, more participants were classified correctly when orthography was more complex. The impact of phoneme deletion and RAN-digits was stronger in complex than in less complex orthographies.
Conclusions: Findings are largely consistent with the literature on predictors of dyslexia and literacy skills, while uniquely demonstrating how orthographic complexity exacerbates some symptoms of dyslexia.
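For illustration only, a sketch of the kind of logistic regression classification reported above, run on synthetic z-scores; the variable names, effect sizes, and data are invented, not the study's.

```python
# Illustrative-only sketch: logistic regression classifying dyslexic
# vs. control children from phoneme deletion, RAN, and verbal
# short-term/working memory scores. All data here is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
dyslexic = rng.integers(0, 2, n)                 # 0 = control, 1 = dyslexic
phoneme_deletion = rng.normal(-dyslexic, 1.0)    # dyslexic group scores lower
ran_digits = rng.normal(dyslexic, 1.0)           # dyslexic group names slower
st_wm = rng.normal(-0.3 * dyslexic, 1.0)         # weaker predictor

X = np.column_stack([phoneme_deletion, ran_digits, st_wm])
clf = LogisticRegression().fit(X, dyslexic)
print("classification accuracy:", clf.score(X, dyslexic))
print("coefficients:", dict(zip(
    ["phoneme_deletion", "ran_digits", "st_wm"], clf.coef_[0].round(2))))
```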