
    A Comparison of Different Machine Transliteration Models

    Machine transliteration is a method for automatically converting words in one language into phonetically equivalent ones in another language. Machine transliteration plays an important role in natural language applications such as information retrieval and machine translation, especially for handling proper nouns and technical terms. Four machine transliteration models -- grapheme-based transliteration model, phoneme-based transliteration model, hybrid transliteration model, and correspondence-based transliteration model -- have been proposed by several researchers. To date, however, there has been little research on a framework in which multiple transliteration models can operate simultaneously. Furthermore, there has been no comparison of the four models within the same framework and using the same data. We addressed these problems by 1) modeling the four models within the same framework, 2) comparing them under the same conditions, and 3) developing a way to improve machine transliteration through this comparison. Our comparison showed that the hybrid and correspondence-based models were the most effective and that the four models can be used in a complementary manner to improve machine transliteration performance.

    Few-Shot and Zero-Shot Learning for Historical Text Normalization

    Historical text normalization often relies on small training datasets. Recent work has shown that multi-task learning can lead to significant improvements by exploiting synergies with related datasets, but there has been no systematic study of different multi-task learning architectures. This paper evaluates 63 multi-task learning configurations for sequence-to-sequence-based historical text normalization across ten datasets from eight languages, using autoencoding, grapheme-to-phoneme mapping, and lemmatization as auxiliary tasks. We observe consistent, significant improvements across languages when training data for the target task is limited, but minimal or no improvements when training data is abundant. We also show that zero-shot learning outperforms the simple, but relatively strong, identity baseline. Comment: Accepted at DeepLo-201

    The issue of semantic mediation in word and number naming


    Letter to Sound Rules for Accented Lexicon Compression

    This paper presents trainable methods for generating letter to sound rules from a given lexicon for use in pronouncing out-of-vocabulary words and as a method for lexicon compression. As the relationship between a string of letters and a string of phonemes representing its pronunciation is not trivial for many languages, we discuss two alignment procedures, one fully automatic and one hand-seeded, which produce reasonable alignments of letters to phones. Top Down Induction Tree models are trained on the aligned entries. We show how combined phoneme/stress prediction is better than separate prediction processes, and still better when including in the model the last phonemes transcribed and part of speech information. For the lexicons we have tested, our models have a word accuracy (including stress) of 78% for OALD, 62% for CMU and 94% for BRULEX. The extremely high scores on the training sets allow substantial size reductions (more than 1/20). WWW site: http://tcts.fpms.ac.be/synthesis/mbrdico Comment: 4 pages 1 figur

    Predictors of developmental dyslexia in European orthographies with varying complexity

    Background: The relationship between phoneme awareness, rapid automatized naming (RAN), verbal short-term/working memory (ST/WM) and diagnostic category is investigated in control and dyslexic children, as is the extent to which this relationship depends on orthographic complexity. Methods: General cognitive, phonological and literacy skills were tested in 1138 control and 1114 dyslexic children speaking 6 different languages spanning a large range of orthographic complexity (Finnish, Hungarian, German, Dutch, French, English). Results: Phoneme deletion and RAN were strong concurrent predictors of developmental dyslexia, while verbal ST/WM and general verbal abilities played a comparatively minor role. In logistic regression models, more participants were classified correctly when orthography was more complex. The impact of phoneme deletion and RAN-digits was stronger in complex than in less complex orthographies. Conclusions: Findings are largely consistent with the literature on predictors of dyslexia and literacy skills, while uniquely demonstrating how orthographic complexity exacerbates some symptoms of dyslexia.