    Amharic Speech Recognition for Speech Translation

    State-of-the-art speech translation can be seen as a cascade of Automatic Speech Recognition, Statistical Machine Translation and Text-To-Speech synthesis. This study experiments with Amharic speech recognition for Amharic-English speech translation in the tourism domain. Since no Amharic speech corpus existed for this domain, we developed a read-speech corpus of 7.43 hours. The corpus was recorded in a normal working environment after translating the standard Basic Traveler Expression Corpus (BTEC). In our ASR experiments, phoneme and syllable units are used for acoustic models, while morphemes and words are used for language models. Encouraging ASR results are achieved using morpheme-based language models and phoneme-based acoustic models, with recognition accuracies of 89.1%, 80.9%, 80.6%, and 49.3% at the character, morph, word and sentence level respectively. We are now working towards designing Amharic-English speech translation by cascading components under different error correction algorithms.

    The Effect of Normalization for Bi-directional Amharic-English Neural Machine Translation

    Machine translation (MT) is one of the main tasks in natural language processing; its objective is to translate texts automatically from one natural language to another. Using deep neural networks for MT tasks has recently received great attention. These networks require large amounts of data to learn abstract representations of the input and store them in continuous vectors. This paper presents the first relatively large-scale Amharic-English parallel sentence dataset. Using these compiled data, we build bi-directional Amharic-English translation models by fine-tuning the existing Facebook M2M100 pre-trained model, achieving a BLEU score of 37.79 in Amharic-English and 32.74 in English-Amharic translation. Additionally, we explore the effects of Amharic homophone normalization on the machine translation task. The results show that normalizing Amharic homophone characters increases the performance of Amharic-English machine translation in both directions.

    Extended Parallel Corpus for Amharic-English Machine Translation

    This paper describes the acquisition, preprocessing, segmentation, and alignment of an Amharic-English parallel corpus. It will be useful for machine translation of an under-resourced language, Amharic. The corpus is larger than previously compiled corpora; it is released for research purposes. We trained neural machine translation and phrase-based statistical machine translation models using the corpus. In the automatic evaluation, neural machine translation models outperform phrase-based statistical machine translation models. Comment: Accepted to 2nd AfricanNLP workshop at EACL 202

    Classifying Amharic News Text Using Self-Organizing Maps

    The paper addresses using artificial neural networks for classification of Amharic news items. Amharic is the language for countrywide communication in Ethiopia and has its own writing system containing extensive systematic redundancy. It is quite dialectally diversified and probably representative of the languages of a continent that so far has received little attention within the language processing field. The experiments investigated document clustering around user queries using Self-Organizing Maps, an unsupervised learning neural network strategy. The best ANN model showed a precision of 60.0% when trying to cluster unseen data, and a 69.5% precision when trying to classify it

    The Development of Oromo Writing System

    The development and use of languages for official, educational, religious, and other purposes has been a major political issue in many developing multilingual countries. A number of these countries, including China and India, have recognised the issues and developed language policies that give some ethnic groups the right to develop their languages and cultures using writing systems based on scripts suitable for these purposes. Other countries, such as Ethiopia (a multilingual African state), for a long time preferred a policy of one language and one script, in the belief that this would help assimilate various ethnic groups and create a homogeneous population with one language and culture. Rather than realizing that aim, the policy became a significant source of conflict and of demands for political independence among disfavoured groups. This thesis addresses the development of a writing system for Oromo, a language spoken by approximately 40 percent of the total population of Ethiopia, which remained officially unwritten until the early 1990s. It begins by reviewing the early history of Oromo writing and discusses the Ethiopian language policies, analysing materials written in various scripts and certain writers starting from the 19th century. The adoption of Roman script for Oromo writing and the debates that followed are explored, with an examination of some phonological aspects of the Oromo language and the implications of representing them using the Roman alphabet. This thesis argues that the Oromo language has thrived during the past few years, having implemented a Roman-based alphabetical script. There have been and continue to be, however, internal and external challenges confronting the development of the Oromo writing system. These need to be carefully considered and addressed by stakeholders, primarily the Oromo people and the Ethiopian government, in order for the Oromo language to establish itself as a fully codified language in the modern nation-state.