Hyphenation: from transformer models and word embeddings to a new linguistic rule-set
Modern language models, especially those based on deep neural networks, frequently use bottom-up vocabulary generation techniques like Byte Pair Encoding (BPE) to create word pieces enabling them to model any sequence of text, even with a fixed-size vocabulary significantly smaller than the full training vocabulary.
The resulting language models often prove extremely capable. Yet when integrated into traditional Automatic Speech Recognition (ASR) pipelines, these language models can perform poorly on rare or unseen text, because the resulting word pieces often don’t map cleanly to phoneme sequences (consider, for instance, Multilingual BERT’s unfortunate splitting of Sonnenlicht into Sonne+nl+icht). This impairs the acoustic model’s ability to generate the required token sequences, preventing good options from being considered in the first place.
While approaches like Morfessor attempt to solve this problem using more refined algorithms, they use only the written form of a word as input, splitting words into parts without regard to the word’s actual meaning.
Meanwhile, word embeddings for languages like Dutch have become widely available and high-quality; this project investigates whether this knowledge about a word’s usage in context can be leveraged to improve hyphenation quality.
For this purpose, the following approach is evaluated: a baseline Transformer model generates hyphenation candidates for a given word based on its written form, and those candidates are subsequently reranked based on the embedding of the hyphenated word. The results will be compared with those obtained by Morfessor on the same dataset.
Finally, a new set of linguistic rules for Dutch hyphenation (suitable for use with Liang’s hyphenation algorithm from TeX82) will be presented. The output of these rules will be compared to currently available rule-sets.
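For readers unfamiliar with Liang’s algorithm, the following is a minimal sketch of how pattern-based hyphenation works in general: each pattern carries digits between its letters, the highest digit at each inter-letter gap wins, and odd values permit a break while even values forbid one. The function names, the `TOY_PATTERNS` list and the minimum-fragment parameters are assumptions made for illustration only; the patterns are invented and are not taken from the rule-set described in the abstract or from any real pattern file.

```python
import re

# Hypothetical toy patterns, invented for illustration only.
TOY_PATTERNS = ["n1n", "n1l", "1h", "c2h"]


def parse_pattern(pattern):
    """Split a pattern such as 'n1l' into its letters and per-gap digits."""
    letters = re.sub(r"\d", "", pattern)
    digits = [0] * (len(letters) + 1)  # digits[i] is the gap before letters[i]
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            digits[pos] = int(ch)
        else:
            pos += 1
    return letters, digits


def hyphenate(word, patterns, left_min=2, right_min=2):
    """Return the pieces of `word` allowed by Liang-style patterns."""
    padded = "." + word.lower() + "."      # '.' marks the word boundaries
    scores = [0] * (len(padded) + 1)       # scores[i] is the gap before padded[i]
    for letters, digits in map(parse_pattern, patterns):
        start = padded.find(letters)
        while start != -1:
            for offset, digit in enumerate(digits):
                # The highest digit seen at a gap wins.
                scores[start + offset] = max(scores[start + offset], digit)
            start = padded.find(letters, start + 1)
    pieces, last = [], 0
    # The gap before word[k] is the gap before padded[k + 1].
    for k in range(left_min, len(word) - right_min + 1):
        if scores[k + 1] % 2 == 1:         # odd value: break allowed here
            pieces.append(word[last:k])
            last = k
    pieces.append(word[last:])
    return pieces
```

With the toy patterns above, `hyphenate("sonnenlicht", TOY_PATTERNS)` yields `['son', 'nen', 'licht']`: the pattern `1h` alone would allow a break before the h, but the higher even value in `c2h` suppresses it.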
Morphological Analysis as Classification: an Inductive-Learning Approach
Morphological analysis is an important subtask in text-to-speech conversion,
hyphenation, and other language engineering tasks. The traditional approach to
performing morphological analysis is to combine a morpheme lexicon, sets of
(linguistic) rules, and heuristics to find a most probable analysis. In
contrast we present an inductive learning approach in which morphological
analysis is reformulated as a segmentation task. We report on a number of
experiments in which five inductive learning algorithms are applied to three
variations of the task of morphological analysis. Results show (i) that the
generalisation performance of the algorithms is good, and (ii) that the lazy
learning algorithm IB1-IG performs best on all three tasks. We conclude that
lazy learning of morphological analysis as a classification task is indeed a
viable approach; moreover, it has the strong advantages over the traditional
approach of avoiding the knowledge-acquisition bottleneck, being fast and
deterministic in learning and processing, and being language-independent.
Comment: 11 pages, 5 encapsulated PostScript figures; uses the non-standard
NeMLaP proceedings style nemlap.sty; inputs ipamacs (International Phonetic
Alphabet) and epsf macros
Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities (ACRH-2). 29 November 2012, Lisbon, Portugal
A Finite State and Data-Oriented Method for Grapheme to Phoneme Conversion
A finite-state method, based on leftmost longest-match replacement, is
presented for segmenting words into graphemes, and for converting graphemes
into phonemes. A small set of hand-crafted conversion rules for Dutch achieves
a phoneme accuracy of over 93%. The accuracy of the system is further improved
by using transformation-based learning. The phoneme accuracy of the best system
(using a large set of rule templates and a 'lazy' variant of Brill's algorithm),
trained on only 40K words, reaches 99%.
Comment: 8 page
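The leftmost longest-match idea mentioned in this abstract can be sketched as a greedy loop: scan the word from the left and, at each position, take the longest multi-letter grapheme that matches, falling back to a single letter. This is only an illustration in plain Python, not the finite-state implementation the abstract describes; the function name and the small `TOY_GRAPHEMES` inventory are assumptions and do not reproduce the paper's rule set.

```python
# Hypothetical, incomplete inventory of multi-letter Dutch graphemes,
# chosen for illustration only.
TOY_GRAPHEMES = {"sch", "ch", "ng", "oe", "ij", "aa", "ee", "oo", "uu"}


def segment_graphemes(word, graphemes=TOY_GRAPHEMES):
    """Greedy leftmost longest-match segmentation of `word` into graphemes."""
    max_len = max(len(g) for g in graphemes)
    segments, i = [], 0
    while i < len(word):
        # Try the longest candidate first; a single letter always matches.
        for size in range(min(max_len, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if size == 1 or chunk in graphemes:
                segments.append(chunk)
                i += size
                break
    return segments
```

For example, `segment_graphemes("schoen")` returns `['sch', 'oe', 'n']`, because "sch" is preferred over the shorter match "s" at the first position.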
A language-sensitive text editor for Dutch
Modern word processors are beginning to offer a range of facilities for spelling, grammar and style checking in English. For Dutch, hardly anything is available as yet. Many commercial word processing packages do include a hyphenation routine and a lexicon-based spelling checker, but the practical usefulness of these tools is limited by certain properties of Dutch orthography, as we will explain below. In this chapter we describe a text editor which incorporates a great deal of lexical, morphological and syntactic knowledge of Dutch and monitors the orthographical quality of Dutch texts. Section 1 deals with those aspects of Dutch orthography which pose problems to human authors as well as to computational language-sensitive text editing tools. In section 2 we describe the design and implementation of the text editor we have built. Section 3 is mainly devoted to a provisional evaluation of the system.
Pattern-driven morphological decomposition
Scientific publication. Faculteit der Lettere