6 research outputs found

African language technology: the data-driven perspective

Authors: De Pauw Guy; de Schryver Gilles-Maurice
Publication venue: 'European Academy of Applied and Social Sciences (EURAASS)'
Publication date: 01/01/2009

Ghent University Academic Bibliography

Institutional Repository Universiteit Antwerpen

Diacritic Restoration and the Development of a Part-of-Speech Tagset for the Māori Language

Author: Cocks John
Publication venue: 'University of Waikato'
Publication date: 16/03/2012

This thesis investigates two fundamental problems in natural language processing: diacritic restoration and part-of-speech tagging. Over the past three decades, statistical approaches to diacritic restoration and part-of-speech tagging have attracted growing interest as a consequence of the increasing availability of manually annotated training data in major languages such as English and French. However, these approaches are not practical for most minority languages, where appropriate training data is either non-existent or not publicly available. Furthermore, before developing a part-of-speech tagging system, a suitable tagset is required for that language. In this thesis, we make the following contributions to bridge this gap. First, we propose a method for diacritic restoration based on naive Bayes classifiers that act at word-level. Classifications are based on a rich set of features, extracted automatically from training data in the form of diacritically marked text. This method requires no additional resources, which makes it language independent. The algorithm was evaluated on one language, namely Māori, and an accuracy exceeding 99% was observed. Second, we present our work on creating one of the necessary resources for the development of a part-of-speech tagging system in Māori, that of a suitable tagset. The tagset described was developed in accordance with the EAGLES guidelines for morphosyntactic annotation of corpora, and was the result of in-depth analysis of the Māori grammar.
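
The word-level naive Bayes idea in the abstract above can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the abstract does not list its feature set, so the two features used here (the macron-stripped previous and next word), the add-one smoothing, the class name `WordLevelNaiveBayes`, and the toy phrases are all assumptions made for illustration.

```python
import math
import unicodedata
from collections import Counter, defaultdict

MACRON = "\u0304"  # combining macron, the diacritic used in Māori orthography


def strip_macrons(word):
    """Return the word with macrons removed (e.g. 'wāhine' -> 'wahine')."""
    decomposed = unicodedata.normalize("NFD", word)
    return unicodedata.normalize("NFC", decomposed.replace(MACRON, ""))


def context_features(tokens, i):
    """Assumed feature set: macron-stripped neighbours of token i."""
    prev_word = tokens[i - 1] if i > 0 else "<s>"
    next_word = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    return [("prev", strip_macrons(prev_word)), ("next", strip_macrons(next_word))]


class WordLevelNaiveBayes:
    """One small classifier per stripped word; classes are its marked forms."""

    def __init__(self):
        self.class_counts = defaultdict(Counter)    # stripped -> marked form counts
        self.feature_counts = defaultdict(Counter)  # marked form -> feature counts
        self.vocabulary = set()                     # all features seen in training

    def train(self, sentences):
        for sentence in sentences:
            tokens = unicodedata.normalize("NFC", sentence.lower()).split()
            for i, word in enumerate(tokens):
                self.class_counts[strip_macrons(word)][word] += 1
                for feat in context_features(tokens, i):
                    self.feature_counts[word][feat] += 1
                    self.vocabulary.add(feat)

    def restore(self, sentence):
        tokens = unicodedata.normalize("NFC", sentence.lower()).split()
        output = []
        for i, token in enumerate(tokens):
            candidates = self.class_counts.get(strip_macrons(token))
            if not candidates:
                output.append(token)  # unseen word: leave unchanged
                continue
            feats = context_features(tokens, i)
            total = sum(candidates.values())

            def log_score(form):
                # log prior + sum of add-one-smoothed log likelihoods
                score = math.log(candidates[form] / total)
                denom = sum(self.feature_counts[form].values()) + len(self.vocabulary)
                for feat in feats:
                    score += math.log((self.feature_counts[form][feat] + 1) / denom)
                return score

            output.append(max(sorted(candidates), key=log_score))
        return " ".join(output)


# Toy training phrases (hypothetical data, far too small for real use).
nb = WordLevelNaiveBayes()
nb.train([
    "te wahine", "ngā wāhine", "te wahine pai",
    "ngā wāhine pai", "te tangata", "ngā tāngata",
])
restored_plural = nb.restore("nga wahine")
restored_singular = nb.restore("te wahine pai")
```

Because the features are computed on macron-stripped neighbours, the same features are available at training time (marked text) and at restoration time (unmarked text), which is what lets the classifier train on diacritically marked text alone.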

Research Commons@Waikato

Spell-checking in Spanish: the case of diacritic accents

Authors: Atserias Batalla Jordi; Fuentes Fort Maria; Nazar Rogelio; Renau Irene
Publication date: 01/01/2012

This article presents the problem of diacritic restoration (or diacritization) in the context of spell-checking, with the focus on an orthographically rich language such as Spanish. We argue that despite the large volume of work published on the topic of diacritization, currently available spell-checking tools have still not found a proper solution to the problem in those cases where both forms of a word are listed in the checker’s dictionary. This is the case, for instance, when a word form exists with and without diacritics, such as continuo ‘continuous’ and continuó ‘he/she/it continued’, or when different diacritics make other word distinctions, as in continúo ‘I continue’. We propose a very simple solution based on a word bigram model derived from correctly typed Spanish texts and evaluate the ability of this model to restore diacritics in artificial as well as real errors. The case of diacritics is only meant to be an example of the possible applications for this idea, yet we believe that the same method could be applied to other kinds of orthographic or even grammatical errors. Moreover, given that no explicit linguistic knowledge is required, the proposed model can be used with other languages provided that a large normative corpus is available.

Peer reviewed. Postprint (author’s final draft).
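
The word bigram idea in the abstract above can be sketched in a few lines of Python. This is a hedged illustration, not the article's method: the greedy left-to-right decoding, the use of only the preceding word as context, the unigram tie-break, the function names `train` and `restore`, and the four-sentence toy corpus are all assumptions.

```python
import unicodedata
from collections import Counter, defaultdict

# Combining acute accent and diaeresis; the tilde of "ñ" is deliberately kept.
ACCENT_MARKS = {"\u0301", "\u0308"}


def strip_accents(word):
    """Remove acute accents and diaereses, e.g. 'continuó' -> 'continuo'."""
    decomposed = unicodedata.normalize("NFD", word)
    kept = "".join(ch for ch in decomposed if ch not in ACCENT_MARKS)
    return unicodedata.normalize("NFC", kept)


def train(sentences):
    """Collect unigram and bigram counts plus accent variants per stripped form."""
    unigrams, bigrams = Counter(), Counter()
    variants = defaultdict(set)
    for sentence in sentences:
        tokens = ["<s>"] + unicodedata.normalize("NFC", sentence.lower()).split()
        for prev, word in zip(tokens, tokens[1:]):
            unigrams[word] += 1
            bigrams[(prev, word)] += 1
            variants[strip_accents(word)].add(word)
    return unigrams, bigrams, variants


def restore(sentence, model):
    """Greedily pick, for each token, the variant best supported by the bigram
    (previous restored word, variant); ties fall back to unigram frequency."""
    unigrams, bigrams, variants = model
    prev = "<s>"
    output = []
    for token in unicodedata.normalize("NFC", sentence.lower()).split():
        candidates = variants.get(strip_accents(token), {token})
        best = max(sorted(candidates),
                   key=lambda c: (bigrams[(prev, c)], unigrams[c]))
        output.append(best)
        prev = best
    return " ".join(output)


# Toy "normative corpus" (hypothetical data; a real model needs a large corpus).
model = train([
    "el ruido continuo molesta",
    "un ruido continuo",
    "ella continuó hablando",
    "yo continúo trabajando",
])
fixed_past = restore("ella continuo hablando", model)
fixed_present = restore("yo continuo trabajando", model)
fixed_adjective = restore("el ruido continuo molesta", model)
```

All three forms of the ambiguous word are valid dictionary entries, so a dictionary lookup alone cannot flag the error; in this sketch only the counts for the preceding word decide between them.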

UPCommons. Portal del coneixement obert de la UPC

AfLaT 2010: proceedings of the second workshop on African language technology (AfLaT 2010)

Authors: De Pauw Guy; de Schryver Gilles-Maurice; Groenewald Handré
Publication venue: European Language Resources Association (ELRA)
Publication date: 01/01/2010

Ghent University Academic Bibliography

A grapheme-based approach for accent restoration in Gikuyu

Authors: De Pauw Guy; Githinji P.W.; Waiganjo Wagacha P.
Publication date: 01/01/2006

Institutional Repository Universiteit Antwerpen

On reconstructing Proto-Bantu grammar

Publication date: 01/01/2022

This book is about reconstructing the grammar of Proto-Bantu, the ancestral language at the origin of current-day Bantu languages. While Bantu is a low-level branch of Niger-Congo, the world’s biggest phylum, it is still Africa’s biggest language family. This edited volume attempts to retrieve the phonology, morphology and syntax used by the earliest Bantu speakers to communicate with each other, discusses methods to do so, and looks at issues raised by these academic endeavours. It is a collective effort involving a fine mix of junior and senior scholars representing several generations of expert historical-comparative Bantu research. It is the first systematic approach to Proto-Bantu grammar since Meeussen’s Bantu Grammatical Reconstructions (1967). Based on new bodies of evidence from the last five decades, most notably from northwestern Bantu languages, this book considerably transforms our understanding of Proto-Bantu grammar and offers new methodological approaches to Bantu grammatical reconstruction.

Institutional Repository of the Freie Universität Berlin