Amharic Speech Recognition for Speech Translation
State-of-the-art speech translation can be seen as a cascade of Automatic Speech Recognition (ASR), Statistical Machine Translation and Text-To-Speech synthesis. This study experiments with Amharic speech recognition for Amharic-English speech translation in the tourism domain. Since there is no Amharic speech corpus, we developed a read-speech corpus of 7.43 hours in the tourism domain. The corpus was recorded in a normal working environment after translating the standard Basic Traveler Expression Corpus (BTEC). In our ASR experiments, phoneme and syllable units are used for acoustic models, while morpheme and word units are used for language models. Encouraging ASR results are achieved using morpheme-based language models and phoneme-based acoustic models, with recognition accuracies of 89.1%, 80.9%, 80.6%, and 49.3% at the character, morph, word and sentence level respectively. We are now working towards designing Amharic-English speech translation by cascading components under different error correction algorithms.
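The abstract reports accuracy at several unit levels but does not say how it is computed. As an illustration only, the sketch below assumes the common edit-distance definition, accuracy = (N - errors) / N, where N is the number of reference units and errors counts substitutions, deletions and insertions. The function names `edit_distance` and `unit_accuracy` are hypothetical and do not come from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of units."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def unit_accuracy(ref_units, hyp_units):
    """Assumed definition: (N - S - D - I) / N over the reference units."""
    if not ref_units:
        raise ValueError("reference must contain at least one unit")
    errors = edit_distance(ref_units, hyp_units)
    return (len(ref_units) - errors) / len(ref_units)


# Word level: split on whitespace. Character level: pass list(text) instead.
reference = "a b c d".split()
hypothesis = "a x c".split()
word_acc = unit_accuracy(reference, hypothesis)
```

The same function covers the character, morph and word levels by changing how the text is split into units. Sentence-level accuracy is usually the fraction of sentences recognised without any error.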
The Effect of Normalization for Bi-directional Amharic-English Neural Machine Translation
Machine translation (MT) is one of the main tasks in natural language
processing whose objective is to translate texts automatically from one natural
language to another. Nowadays, using deep neural networks for MT tasks has
received great attention. These networks require lots of data to learn abstract
representations of the input and store it in continuous vectors. This paper
presents the first relatively large-scale Amharic-English parallel sentence
dataset. Using these compiled data, we build bi-directional Amharic-English
translation models by fine-tuning the existing Facebook M2M100 pre-trained
model achieving a BLEU score of 37.79 in Amharic-English and 32.74 in
English-Amharic translation. Additionally, we explore the effects of Amharic
homophone normalization on the machine translation task. The results show that
the normalization of Amharic homophone characters increases the performance of
Amharic-English machine translation in both directions.
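Homophone normalization can be sketched as a character mapping that collapses letter series with the same pronunciation onto one canonical series. The abstract does not list the mapping the authors used, so the series pairs below are an assumption based on commonly cited Amharic homophones, and `normalize_homophones` is a hypothetical name.

```python
# Assumed homophone series, given as (variant base, canonical base) code points.
# Each Ethiopic consonant series occupies consecutive code points, one per vowel order.
HOMOPHONE_SERIES = [
    (0x1210, 0x1200),  # ሐ-series -> ሀ-series
    (0x1280, 0x1200),  # ኀ-series -> ሀ-series
    (0x1220, 0x1230),  # ሠ-series -> ሰ-series
    (0x12D0, 0x12A0),  # ዐ-series -> አ-series
    (0x1340, 0x1338),  # ፀ-series -> ጸ-series
]
VOWEL_ORDERS = 7  # the seven basic orders of each series


def build_table():
    """Map every variant character to the canonical character of the same order."""
    table = {}
    for variant_base, canonical_base in HOMOPHONE_SERIES:
        for order in range(VOWEL_ORDERS):
            table[variant_base + order] = canonical_base + order
    return table


_TABLE = build_table()


def normalize_homophones(text):
    """Replace assumed homophone variants; all other characters pass through."""
    return text.translate(_TABLE)


normalized = normalize_homophones("ሠላም")  # first letter mapped from ሠ to ሰ
```

In a translation pipeline a step like this would typically be applied to the Amharic side of the parallel data before tokenization, so that spelling variants of the same word share one surface form.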
Extended Parallel Corpus for Amharic-English Machine Translation
This paper describes the acquisition, preprocessing, segmentation, and
alignment of an Amharic-English parallel corpus. It will be useful for machine
translation of an under-resourced language, Amharic. The corpus is larger than
previously compiled corpora; it is released for research purposes. We trained
neural machine translation and phrase-based statistical machine translation
models using the corpus. In the automatic evaluation, neural machine
translation models outperform phrase-based statistical machine translation
models. Comment: Accepted to 2nd AfricanNLP workshop at EACL 202
Classifying Amharic News Text Using Self-Organizing Maps
The paper addresses using artificial neural networks for classification of Amharic news items. Amharic is the language for countrywide communication in Ethiopia and has its own writing system containing extensive systematic redundancy. It is quite dialectally diversified and probably representative of the languages of a continent that so far has received little attention within the language processing field.
The experiments investigated document clustering around user queries using Self-Organizing Maps, an unsupervised learning neural network strategy. The best ANN model showed a precision of 60.0% when trying to cluster unseen data, and a precision of 69.5% when trying to classify it.
The Development of Oromo Writing System
The development and use of languages for official, educational, religious and other purposes has been a major political issue in many developing multilingual countries. A number of these countries, including China and India, have recognised the issues and developed language policies that give some ethnic groups the right to develop their languages and cultures using writing systems based on scripts suitable for these purposes. Other countries, such as Ethiopia (a multilingual African state), for a long time preferred a policy of one language and one script, in the belief that this would help assimilate the various ethnic groups and create a homogeneous population with one language and culture. Rather than realizing that aim, the policy became a significant source of conflict and of demands for political independence among disfavoured groups.
This thesis addresses the development of a writing system for Oromo, a language spoken by approximately 40 percent of the total population of Ethiopia, which remained officially unwritten until the early 1990s. It begins by reviewing the early history of Oromo writing and discusses the Ethiopian language policies, analysing materials written in various scripts and by certain writers from the 19th century onward. The adoption of Roman script for Oromo writing and the debates that followed are explored, with an examination of some phonological aspects of the Oromo language and the implications of representing them using the Roman alphabet.
This thesis argues that the Oromo language has thrived during the past few years, having implemented a Roman-based alphabetical script. There have been, and continue to be, internal and external challenges confronting the development of the Oromo writing system. These need to be carefully considered and addressed by stakeholders, primarily the Oromo people and the Ethiopian government, in order for the Oromo language to establish itself as a fully codified language in the modern nation-state.