12 research outputs found

    Looking forward by looking back: Applying lessons from 20 years of African language technology

    This paper takes a frank look at what has and has not been achieved in African language technology during the past two decades. Several questions are addressed: What was the status of technology for African languages 20 years ago? What were the major initiatives during that time? What were their successes and failures? What can we learn from these experiences? How does this inform the work that we are planning going forward? Examining in particular the history of Swahili, it is argued that technology projects have often achieved their expressed aims, but have collectively not significantly advanced the normalization of African languages as operable within the technical sphere, even while Africa has become blanketed with mobile technology. It is argued that future projects will succeed only by asserting the goal that technology of 2035 must be fully operational in users' primary languages, and gearing policy, funding, and individual project efforts toward gathering and deploying linguistic data for a large number of African languages to meet cutting-edge technologies as they emerge

    Measuring verb similarity using binary coefficients with application to isiXhosa and isiZulu

    Natural Language Processing (NLP) for underresourced languages may benefit from a bootstrapping approach to utilise the sparse resources across closely related languages. This brings to the fore the question of language similarity, and therewith the question of how to measure that, so as to make informed predictions on potential success of bootstrapping. We present a method for measuring morphosyntactic similarity by developing Context Free Grammars (CFGs) for isiXhosa and isiZulu verb fragments that are relevant for the use case of weather forecast generation. We then investigate morphosyntactic similarity of the CFGs using parse tree analysis and four binary similarity measures. In particular, we selected four binary similarity measures from other domains and adapted them to our data, which are the word sets generated from the respective CFGs. The similarity measures together with the parse tree analysis are used to study the extent to which both languages can be generated by a singular grammar fragment. The resulting 52 rules for isiXhosa and 49 rules for isiZulu overlap on 42 rules. This supports the existing intuition of similarity, as they are in the same language cluster. The morphosyntactic similarity measured with the binary coefficients reached 59.5% overall (adapted Driver-Kroeber), with 99.5% for the past tense only. This lower score cf. the structure of the CFG is attributable to the small differences in terminals in mainly the prefix of the verb. The parse tree analysis and binary similarity measures show that a modularised set of rules for the prefix, verb root, and suffix would allow the generation of the two languages with a single grammar where only the prefix requires differentiation
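    As a rough illustration of what a binary similarity coefficient computes over two sets, the sketch below uses the Jaccard coefficient. This is an assumption made for illustration: the abstract does not say which four measures were chosen beyond an adapted Driver-Kroeber, and that adaptation is not reproduced here. The function names and the toy word labels are hypothetical. The rule counts (52, 49, 42 shared) come from the abstract, but the paper applied its measures to generated word sets, not to rule counts, so the resulting value is not one of the paper's results.

```python
def jaccard_from_counts(shared, only_a, only_b):
    """Jaccard coefficient from contingency counts: shared / (shared + only_a + only_b)."""
    total = shared + only_a + only_b
    return shared / total if total else 0.0


def jaccard(items_a, items_b):
    """Jaccard coefficient of two collections, treated as sets."""
    set_a, set_b = set(items_a), set(items_b)
    return jaccard_from_counts(
        len(set_a & set_b),  # items present in both sets
        len(set_a - set_b),  # items only in the first set
        len(set_b - set_a),  # items only in the second set
    )


# Toy example with placeholder labels (not real isiXhosa or isiZulu words).
toy_similarity = jaccard({"w1", "w2"}, {"w2", "w3"})

# Rule counts from the abstract: 52 and 49 rules with 42 shared,
# i.e. 10 rules only in the first grammar and 7 only in the second.
rule_overlap = jaccard_from_counts(42, 52 - 42, 49 - 42)
print(round(rule_overlap, 3))  # 42 / 59, about 0.712
```

    Working from contingency counts keeps the coefficient itself separate from how the sets were produced, so another binary measure could be swapped in by changing a single function.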

    Contextualising Levels of Language Resourcedness affecting Digital Processing of Text

    Application domains such as digital humanities and tools like chatbots involve some form of processing natural language, from digitising hardcopies to speech generation. The language of the content is typically characterised as either a low resource language (LRL) or high resource language (HRL), also known as resource-scarce and well-resourced languages, respectively. African languages have been characterized as resource-scarce languages (Bosch et al. 2007; Pretorius & Bosch 2003; Keet & Khumalo 2014) and English is by far the most well-resourced language. Varied language resources are used to develop software systems for these languages to accomplish a wide range of tasks. In this paper we argue that the dichotomous typology LRL and HRL for all languages is problematic. Through a clear understanding of language resources situated in a society, a matrix is developed that characterizes languages as Very LRL, LRL, RL, HRL and Very HRL. The characterization is based on the typology of contextual features for each category, rather than counting tools, and motivation is provided for each feature and each characterization. The contextualisation of resourcedness, with a focus on African languages in this paper, and an increased understanding of where on the scale the language used in a project is, may assist in, among others, better planning of research and implementation projects. We thus argue in this paper that the characterization of language resources within a given scale in a project is an indispensable component particularly in the context of low-resourced languages

    The Virtual Institute for Afrikaans and the Afrikaans community’s market needs

    The Virtual Institute for Afrikaans (VivA) is a research institute and service provider for Afrikaans in digital contexts. It is a registered non-profit company, with the Afrikaanse Taal- en Kultuurvereniging (ATKV), North-West University (NWU), Suid-Afrikaanse Akademie vir Wetenskap en Kuns (SAAWK), and Trust vir Afrikaanse Onderwys (TAO) as its founding members. In order to make informed choices regarding VivA’s product and service offering, mixed method research was conducted to determine shortcomings in the Afrikaans offering of digital language products. For purposes of the quantitative research, an online questionnaire was completed by 319 respondents (demographic representation of mostly white, mother-tongue speakers of Afrikaans between the ages of 30 and 65), while a focus group with ten respondents (mostly white, mother-tongue speakers of Afrikaans between 15 and 62) was used to gather qualitative information. The focus group session was recorded, transcribed, coded and then analysed to derive seven key themes that are associated with VivA. One of the key findings is that a large part of the Afrikaans users in this sample did not know of the existence of the Afrikaans Wiktionary and Wikipedia. This finding directed VivA’s priorities in other directions, although it will keep on exploring ideas and methods to change this perception. It was also clear that Afrikaans users have a need for four specific Afrikaans electronic aids, namely an online/mobile version of the Afrikaanse Woordelys en Spelreëls (Afrikaans Word-list and Spelling Rules); an Afrikaans grammar checker; a terminology bank; and automatic translation tools. Despite the fact that the majority of respondents had a fairly negative experience with regard to automatic translation assistance, it was found that a significant number of respondents are still positive about it, and have a strong need for such a high-quality product.
On the basis of this research, the needs of the Afrikaans community related to language products and services were determined, and various products and services were introduced in order to meet these identified needs. Hence, VivA’s initial products and services offering includes: a dictionary portal (where users can access various free and commercial dictionaries online, as well as via an online and offline Android and iOS app); grammar portal (where users, especially international researchers, can access extensive information about the phonology, morphology and syntax of Afrikaans, presented comparatively with Dutch and Frisian as part of the international Taalportaal project); language advice portal (where users can get telephonic and online answers to language-related questions from a professional language advisor); corpus portal (where users can do online corpus queries in a large and growing collection of written and transcribed spoken Afrikaans corpora); and information portal (with access to a blog, competitions, etcetera). The article concludes with an overview of potential future research and development topics, including a motivation for the need for regular technology audits.
The Virtual Institute for Afrikaans (VivA) is a research institute and service provider for Afrikaans in digital contexts. In order to make responsible choices regarding VivA’s product and service offering, quantitative and qualitative research was conducted to determine shortcomings in the Afrikaans market for digital language products. Seven themes were identified from the focus group discussion. One of the most important findings is that a large part of the Afrikaans users in this sample did not know of the Afrikaans Wiktionary and Wikipedia. It became clear that Afrikaans users especially need four electronic Afrikaans aids, namely an online/mobile version of the Afrikaanse Woordelys en Spelreëls; an Afrikaans grammar checker; a terminology bank; and automatic translation tools. Although the majority of respondents had a fairly negative experience with automatic translation assistance, it was found that a significant number are nevertheless positive about it and have a strong need for such a high-quality product. On the basis of this research, the market needs of the Afrikaans community were determined and various products and services were proposed. To meet the identified market needs, VivA’s initial products and services include, among others: Woordeboekportaal (dictionary portal); Taalportaal (grammar portal); Adviesportaal (language advice portal); Korpusportaal (corpus portal); and Inligtingsportaal (information portal).
http://reference.sabinet.co.za/sa_epublication/akgees
http://www.scielo.org.za/scielo.php?script=sci_serial&pid=0041-4751&lng=en

    The construction of a linguistic linked data framework for bilingual lexicographic resources

    Little-known lexicographic resources can be of tremendous value to users once digitised. By extending the digitisation efforts for a lexicographic resource, converting the human-readable digital object to a state that is also machine-readable, structured data can be created that is semantically interoperable, thereby enabling the lexicographic resource to access, and be accessed by, other semantically interoperable resources. The purpose of this study is to formulate a process for converting a lexicographic resource in print form to a machine-readable bilingual lexicographic resource applying linguistic linked data principles, using the English-Xhosa Dictionary for Nurses as a case study. This is accomplished by creating a linked data framework, in which data are expressed in the form of RDF triples and URIs, in a manner which allows for extensibility to a multilingual resource. Click languages with characters not typically represented by the Roman alphabet are also considered. The purpose of this linked data framework is to define each lexical entry as “historically dynamic”, instead of “ontologically static” (Rafferty, 2016:5). For a framework which has instances in constant evolution, focus is thus given to the management of provenance and linked data generation thereof. The output is an implementation framework which provides methodological guidelines for similar language resources in the interdisciplinary field of Library and Information Science
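    To make "data expressed in the form of RDF triples and URIs" concrete, here is a minimal sketch in plain Python that builds triples for a lexical entry and serialises them as N-Triples. Several things are assumptions and not taken from the study: the use of the W3C OntoLex-Lemon vocabulary, the `http://example.org/lexicon/` namespace, the helper names, and the sample entries ("water" / "amanzi"), which are illustrative and not drawn from the English-Xhosa Dictionary for Nurses. Provenance modelling, which the abstract emphasises, is left out.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
ONTOLEX = "http://www.w3.org/ns/lemon/ontolex#"  # assumed vocabulary
BASE = "http://example.org/lexicon/"             # hypothetical namespace


def uri(value):
    """Wrap an absolute URI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(text, lang):
    """Build a language-tagged literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}'


def lexical_entry(entry_id, written_form, lang):
    """Return three triples: entry type, its canonical form, and the written form."""
    entry = uri(BASE + entry_id)
    form = uri(BASE + entry_id + "/form")
    return [
        (entry, uri(RDF_TYPE), uri(ONTOLEX + "LexicalEntry")),
        (entry, uri(ONTOLEX + "canonicalForm"), form),
        (form, uri(ONTOLEX + "writtenRep"), literal(written_form, lang)),
    ]


def to_ntriples(triples):
    """Serialise (subject, predicate, object) tuples, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


# Illustrative entries only; "xh" is the ISO 639-1 code for isiXhosa.
triples = lexical_entry("water-en", "water", "en") + lexical_entry("amanzi-xh", "amanzi", "xh")
print(to_ntriples(triples))
```

    Giving each entry and each form its own URI is what would let further languages, or links to other resources, be attached later without restructuring existing data.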

    CLARIN. The infrastructure for language resources

    CLARIN, the "Common Language Resources and Technology Infrastructure", has established itself as a major player in the field of research infrastructures for the humanities. This volume provides a comprehensive overview of the organization, its members, its goals and its functioning, as well as of the tools and resources hosted by the infrastructure. The many contributors representing various fields, from computer science to law to psychology, analyse a wide range of topics, such as the technology behind the CLARIN infrastructure, the use of CLARIN resources in diverse research projects, the achievements of selected national CLARIN consortia, and the challenges that CLARIN has faced and will face in the future. The book will be published in 2022, 10 years after the establishment of CLARIN as a European Research Infrastructure Consortium by the European Commission (Decision 2012/136/EU)

    CLARIN

    The book provides a comprehensive overview of the Common Language Resources and Technology Infrastructure – CLARIN – for the humanities. It covers a broad range of CLARIN language resources and services, its underlying technological infrastructure, the achievements of national consortia, and challenges that CLARIN will tackle in the future. The book is published 10 years after the establishment of CLARIN as a European Research Infrastructure Consortium