175 research outputs found

    Development of a stemmer for the isiXhosa language

    IsiXhosa is one of the eleven official languages of South Africa and the second most widely spoken. However, it has received little attention in computational linguistics, and natural language processing work on it is almost non-existent. Document retrieval using unstructured queries requires some form of language processing, and documents can be retrieved more efficiently with a technique called stemming. The field concerned with document storage and retrieval is called Information Retrieval (IR). IR systems use a stemmer both to index document representations and to process the terms in users’ queries so that matching documents can be retrieved. In this dissertation, we present a stemmer that can be used for both purposes. The stemmer is intended for use in IR systems, such as Google, to retrieve documents written in isiXhosa. Many public schools in the Eastern Cape Province of South Africa offer isiXhosa as a subject, and a number of South African universities teach it. For a language of this importance, valuable information that is available online should be made accessible to users through IR systems. To develop a stemmer for isiXhosa, we first investigated how stemmers have been developed for other languages. The investigation showed that the Porter stemming algorithm in particular is the main reference for many other stemmers. We found that Porter’s algorithm could not be used in its entirety for the isiXhosa stemmer because of the morphological complexity of the language. We therefore developed an affix-removal stemmer with embedded rules that determine the order in which affixes are stripped.
The rule is as follows: the word under consideration is checked against the exceptions list; if it is not in the list, stripping proceeds in the following order: prefix removal, then suffix removal, and finally the result is saved as the stem. The stemmer was tested and evaluated on sample data randomly collected from isiXhosa textbooks and an isiXhosa dictionary. It achieved 91 percent accuracy, with an error rate of 9 percent. We consider these results to be within the accepted range and conclude that the stemmer can be used in IR systems to help retrieve isiXhosa documents. It is currently a noun stemmer only; in the future it can be extended to stem verbs as well. The stemmer can also be used in the development of spell-checkers for isiXhosa.
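
The stripping order described in this abstract (exceptions check, then prefix removal, then suffix removal) can be sketched roughly as follows. This is a minimal illustration only, not the dissertation's implementation: the exceptions list, the affix lists, the minimum stem length, and all function names are assumptions chosen for the example.

```python
# Minimal sketch of an ordered affix-removal stemmer.
# All data below is illustrative and hypothetical; it is NOT the rule set
# or exceptions list from the dissertation.

EXCEPTIONS = {"into"}  # hypothetical: words returned unchanged
PREFIXES = ("aba", "ama", "imi", "isi", "izi", "ubu", "uku", "ulu", "um", "in")
SUFFIXES = ("ana", "ini", "eni")
MIN_STEM_LEN = 2  # assumed guard so that stripping never empties a word


def strip_prefix(word: str) -> str:
    """Remove the longest matching prefix, if enough of the word remains."""
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LEN:
            return word[len(prefix):]
    return word


def strip_suffix(word: str) -> str:
    """Remove the longest matching suffix, if enough of the word remains."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    return word


def stem(word: str) -> str:
    """Exceptions check first, then prefix removal, then suffix removal."""
    word = word.lower()
    if word in EXCEPTIONS:
        return word
    return strip_suffix(strip_prefix(word))


if __name__ == "__main__":
    for w in ("abantu", "umntu", "into"):
        print(w, "->", stem(w))
```

With these illustrative lists, "abantu" and "umntu" both reduce to "ntu", while "into" is returned unchanged because it is in the hypothetical exceptions list. A real stemmer would need linguistically validated affix and exception lists.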

    Language and Identity Theories and experiences in lexicography and linguistic policies in a global world

    This book was conceived during the closing event of the DiM project, developed within the framework of the Erasmus plus KA204 - Strategic Partnerships for Adult Education programme. Its fourteen chapters intend to offer food for thought on some of the currently most debated questions for linguists in the global village, and are divided into three thematic sections: 1) multilingualism, minority languages and the eternal dichotomy between orality and writing; 2) lexicography and L2 teaching; 3) the role of linguistics in particularly complex multilingual contexts. The book was published thanks to a grant obtained in 2018 by Regione Friuli Venezia Giulia

    English speakers' common orthographic errors in Arabic as L2 writing system : an analytical case study

    PhD Thesis. Research on the Arabic Writing System (WS) is quite limited, and research on writing errors in L2WS Arabic in relation to a specific L1WS appears to be especially neglected. This study attempts to identify, describe, and explain common orthographic errors in Arabic writing amongst English-speaking learners. First, it outlines the characteristics of the Arabic Writing System (AWS) and the available empirical studies of L2WS Arabic. The study adopted the Error Analysis approach, using a mixed-method design that combined quantitative and qualitative tools (writing tests, a questionnaire, and interviews). The data were collected from several institutions around the UK and comprised 82 questionnaire responses, 120 different writing samples from 44 intermediate learners, and six teacher interviews. The hypotheses for this research were: a) English-speaking learners of Arabic make common orthographic errors similar to those of Arabic native speakers; b) English-speaking learners share several common orthographic errors with other learners of Arabic as a second/foreign language (AFL); and c) English-speaking learners of Arabic produce their own common orthographic errors which are specifically related to the differences between the two WSs. The results confirmed all three hypotheses. Specifically, English-speaking learners of L2WS Arabic commonly made six error types: letter ductus (letter shape), orthography (spelling), phonology, letter dots, allographemes (i.e. letterform), and direction. Gemination and L1WS transfer errors were not found to occur at major rates. Another important result was that five letter groups, in addition to two individual letters, are particularly challenging for English-speaking learners. The results indicated that errors were likely to stem from one of four factors: script confusion, orthographic difficulties, phonological realisation, and teaching/learning strategies.
The results are considered generalizable because the data were collected from several institutions in different parts of the UK. Suggestions and implications, as well as recommendations for further research, are outlined in the concluding chapter.