640 research outputs found

Web-based corpus acquisition for Swahili language modelling

Author: Kivaisi Alexander
Mbogho Audrey
Publication date: 01/01/2012

Finding large amounts of text data for use in natural language technology is difficult for under-resourced languages such as Swahili. The corpora that are readily accessible for these languages are not sufficient for language technologies, whose requirements can run into the hundreds of millions of words. This paper describes how we can take advantage of search engines such as Google together with crawling tools to collect Swahili text from the Web. We also share our experience of cleaning up and normalising the resulting text data. Finally, we show some preliminary results of the evaluation of the language models built from our corpus and compare them with those built from the Helsinki Corpus.

UCT Computer Science Research Document Archive

State-of-the-art software to support intelligent lexicography

Author: de Schryver Gilles-Maurice
Publication venue: 中国社会科学 = China Sociale Wetenschappen Publishing House
Publication date: 01/01/2010

Ghent University Academic Bibliography

AfLaT 2010: proceedings of the second workshop on African language technology (AfLaT 2010)

Author: De Pauw Guy
de Schryver Gilles-Maurice
Groenewald Handré
Publication venue: European Language Resources Association (ELRA)
Publication date: 01/01/2010

Ghent University Academic Bibliography

Proceedings of the EACL 2009 Workshop on Language Technologies for African Languages

Author: De Pauw Guy
de Schryver Gilles-Maurice
Levin Lori
Publication venue: Association for Computational Linguistics (ACL)
Publication date: 01/01/2009

Ghent University Academic Bibliography

Word sense disambiguation of Swahili: Extending Swahili language technology with machine learning

Author: Ng'ang'a Wanjiku
Publication venue: Helsingfors universitet
Publication date: 01/11/2005

Helsingin yliopiston digitaalinen arkisto

HausaNLP at SemEval-2023 Task 12: Leveraging African Low Resource TweetData for Sentiment Analysis

Author: Abdullahi Abdulkadir
Adamu Shamsuddeen Umaru
Ahmad Mahmoud Said
Aliyu Saminu Mohammad
Bello Musa
Gadanya Murja Sani
Imam Amina Abubakar
Jamoh Abdulmalik Yusuf
Lawan Falalu Ibrahim
Muaz Sanah Abdullahi
Rabiu Nur Bala
Salahudeen Saheed Abdullahi
Shuaibu Aliyu Rabiu
Wali Ahmad Mustapha
Yusuf Aliyu
Publication date: 26/04/2023

We present the findings of SemEval-2023 Task 12, a shared task on sentiment analysis for low-resource African languages using a Twitter dataset. The task featured three subtasks: subtask A is monolingual sentiment classification with 12 tracks, each covering a single language; subtask B is multilingual sentiment classification using the tracks in subtask A; and subtask C is zero-shot sentiment classification. We present the results and findings of subtasks A, B and C. We also release the code on GitHub. Our goal is to leverage low-resource tweet data using pre-trained Afro-xlmr-large, AfriBERTa-Large, Bert-base-arabic-camelbert-da-sentiment (Arabic-camelbert), Multilingual-BERT (mBERT) and BERT models for sentiment analysis of 14 African languages. The datasets for these subtasks consist of gold-standard, multi-class labeled Twitter data from these languages. Our results demonstrate that the Afro-xlmr-large model performed better than the other models on most of the language datasets. Similarly, the Nigerian languages Hausa, Igbo, and Yoruba achieved better performance than the other languages, which can be attributed to the higher volume of data available for them.

arXiv.org e-Print Archive

Lexical and Grammar Resource Engineering for Runyankore & Rukiga: A Symbolic Approach

Author: Bamutura David
Publication date: 01/01/2021

Current research in computational linguistics and natural language processing (NLP) requires the existence of language resources. Whereas these resources are available for a few well-resourced languages, there are many languages that have been neglected. Among the neglected and/or under-resourced languages are Runyankore and Rukiga (henceforth referred to as Ry/Rk). Recently, the NLP community has started to acknowledge that resources for under-resourced languages should also be given priority. Why? One reason is that, as far as language typology is concerned, the few well-resourced languages do not represent the structural diversity of the remaining languages. The central focus of this thesis is enabling the computational analysis and generation of utterances in Ry/Rk. Ry/Rk are two closely related languages spoken by about 3.4 and 2.4 million people respectively. They belong to the Nyoro-Ganda (JE10) language zone of the Great Lakes, Narrow Bantu of the Niger-Congo language family. The computational processing of these languages is achieved by formalising the grammars of these two languages using Grammatical Framework (GF) and its Resource Grammar Library (RGL). In addition to the grammar, a general-purpose computational lexicon for the two languages is developed. Although we utilise the lexicon to tremendously increase the lexical coverage of the grammars, the lexicon can be used for other NLP tasks. In this thesis a symbolic/rule-based approach is taken because the lack of adequate language resources makes the use of data-driven NLP approaches unsuitable for these languages.

Chalmers Research

Analogical classification in formal grammar

Author: Guzmán Naranjo Matías
Publication venue: Language Science Press
Publication date: 06/12/2018

The organization of the lexicon, and especially the relations between groups of lexemes, is a strongly debated topic in linguistics. Some authors have insisted on the lack of any structure of the lexicon. In this vein, Di Sciullo & Williams (1987: 3) claim that “[t]he lexicon is like a prison – it contains only the lawless, and the only thing that its inmates have in common is lawlessness”. In the alternative view, the lexicon is assumed to have a rich structure that captures all regularities and partial regularities that exist between lexical entries. Two very different schools of linguistics have insisted on the organization of the lexicon. On the one hand, for theories like HPSG (Pollard & Sag 1994), but also some versions of construction grammar (Fillmore & Kay 1995), the lexicon is assumed to have a very rich structure which captures common grammatical properties between its members. In this approach, a type hierarchy organizes the lexicon according to common properties between items. For example, Koenig (1999: 4, among others), working from an HPSG perspective, claims that the lexicon “provides a unified model for partial regularities, medium-size generalizations, and truly productive processes”. On the other hand, from the perspective of usage-based linguistics, several authors have drawn attention to the fact that lexemes which share morphological or syntactic properties tend to be organized in clusters of surface (phonological or semantic) similarity (Bybee & Slobin 1982; Skousen 1989; Eddington 1996). This approach, often called analogical, has developed highly accurate computational and non-computational models that can predict the classes to which lexemes belong. Like the organization of lexemes in type hierarchies, analogical relations between items help speakers to make sense of intricate systems, and reduce apparent complexity (Köpcke & Zubin 1984).
Despite this core commonality, and despite the fact that most linguists seem to agree that analogy plays an important role in language, there has been remarkably little work on bringing together these two approaches. Formal grammar traditions have been very successful in capturing grammatical behaviour, but, in the process, have downplayed the role analogy plays in linguistics (Anderson 2015). In this work, I aim to change this state of affairs: first, by providing an explicit formalization of how analogy interacts with grammar, and second, by showing that analogical effects and relations closely mirror the structures in the lexicon. I will show that both formal grammar approaches and usage-based analogical models capture mutually compatible relations in the lexicon.

Language Science Press
