Language Identification for South African Bantu Languages Using Rank Order Statistics

GR Botha; P McNamee; P Zulu; W Li

Publication date: 1 January 2019
Publisher: 'Springer Fachmedien Wiesbaden GmbH'

Abstract

Language identification is an important preprocessing step in many data management, information retrieval, and transformation systems. However, Bantu languages are known to be difficult to identify because of the lack of data and the similarity between the languages. This paper investigates the performance of n-gram counting using rank orders to discriminate among the Bantu languages spoken in South Africa, using varying test and training data sizes. The highest average accuracy obtained was 99.3%, with a test size of 495 characters and a training size of 600,000 characters. The lowest average accuracy obtained was 78.72%, with a test size of 15 characters and a training size of 200,000 characters.
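The abstract names the technique but gives no implementation details, so the following is only a minimal sketch of rank-order n-gram classification in the style commonly attributed to Cavnar and Trenkle, not the authors' actual method. The function names (`ngram_profile`, `out_of_place`, `identify`), the n-gram length limit `n_max`, the profile size `top_k`, and the toy training strings are all assumptions made for illustration.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top_k=300):
    """Map each of the top_k most frequent character n-grams to its rank (0 = most frequent)."""
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    # Sort by descending count, then alphabetically so that ties are deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))[:top_k]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile, max_penalty):
    """Sum the rank differences; an n-gram missing from lang_profile costs max_penalty."""
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )


def identify(text, lang_profiles, n_max=3, top_k=300):
    """Return the label whose profile has the smallest out-of-place distance to the text."""
    doc = ngram_profile(text, n_max, top_k)
    return min(
        lang_profiles,
        key=lambda lang: out_of_place(doc, lang_profiles[lang], top_k),
    )


# Toy placeholder strings (assumption): these are not real language samples.
profiles = {
    "lang_a": ngram_profile("abababab"),
    "lang_b": ngram_profile("xyxyxyxy"),
}
print(identify("abab", profiles))  # prints "lang_a"
```

In a setup like the one the abstract describes, each language profile would be built from a training text of a given size (for example 200,000 or 600,000 characters), and `identify` would be called on test strings of a given length (for example 15 or 495 characters). Shorter test strings yield fewer n-grams to rank, which is consistent with the lower accuracy the abstract reports for them.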

Available Versions

UCT Computer Science Research Document Archive

oai:pubs.cs.uct.ac.za:1334

Last time updated on 28/10/2019

Crossref

Last time updated on 10/08/2021