355 research outputs found

Preliminary Experiments on Unsupervised Word Discovery in Mboshi

Author: Adda Gilles
Adda-Decker Martine
Allauzen Alexandre
Besacier Laurent
Bonneau-Maynard Helene
Godard Pierre
Kouarata Guy-Noël
Löser Kevin
Rialland Annie
Yvon François
Publication venue: HAL CCSD
Publication date: 01/09/2016

The necessity to document thousands of endangered languages encourages collaboration between linguists and computer scientists in order to provide the documentary linguistics community with the support of automatic processing tools. The French-German ANR-DFG project Breaking the Unwritten Language Barrier (BULB) aims at developing such tools for three mostly unwritten African languages of the Bantu family. For one of them, Mboshi, a language originating from the "Cuvette" region of the Republic of Congo, we investigate unsupervised word discovery techniques from an unsegmented stream of phonemes. We compare different models and algorithms, both monolingual and bilingual, on a new corpus in Mboshi and French, and discuss various ways to represent the data with suitable granularity. An additional French-English corpus allows us to contrast the results obtained on Mboshi and to experiment with more data.

Crossref

Hal - Université Grenoble Alpes

HAL

Innovative technologies for under-resourced language documentation: The BULB Project

Author: Adda Gilles
Adda-Decker Martine
Ambouroue Odette
Besacier Laurent
Blachon David
Bonneau-Maynard Hélène
Gauthier Elodie
Godard Pierre
Hamlaoui Fatima
Idiatov Dmitry
Kouarata Guy-Noël
Lamel Lori
Makasso Emmanuel-Moselly
Rialland Annie
Stuker Sebastian
Van De Velde Mark
Yvon François
Zerbian Sabine
Publication venue: HAL CCSD
Publication date: 01/05/2016

The project Breaking the Unwritten Language Barrier (BULB), which brings together linguists and computer scientists, aims at supporting linguists in documenting unwritten languages. To achieve this, we will develop tools tailored to the needs of documentary linguists by building upon technology and expertise from natural language processing, most prominently automatic speech recognition and machine translation. As a development and test bed, we have chosen three less-resourced African languages from the Bantu family: Basaa, Myene and Embosi. Work within the project is divided into three main steps: 1) Collection of a large corpus of speech (100h per language) at a reasonable cost. After initial recording, the data is re-spoken by a reference speaker to enhance the signal quality and orally translated into French. 2) Automatic transcription of the Bantu languages at phoneme level and of the French translation at word level. The recognized Bantu phonemes and French words will then be automatically aligned. 3) Tool development. In close cooperation and discussion with the linguists, the speech and language technologists will design and implement tools that support the linguists in their work, taking into account the linguists' needs and the technology's capabilities. Data collection has begun for the three languages. For this we use standard mobile devices and dedicated software, LIG-AIKUMA, which offers a range of speech collection modes (recording, respeaking, translation and elicitation). LIG-AIKUMA's improved features include smart generation and handling of speaker metadata as well as respeaking and parallel audio data mapping.

Hal - Université Grenoble Alpes

Innovative technologies for under-resourced language documentation: The BULB Project

Author: Adda Gilles
Adda-Decker Martine
Ambouroue Odette
Besacier Laurent
Blachon David
Bonneau-Maynard Hélène
Gauthier Elodie
Godard Pierre
Hamlaoui Fatima
Idiatov Dmitry
Kouarata Guy-Noël
Lamel Lori
Makasso Emmanuel-Moselly
Rialland Annie
Stuker Sebastian
Van De Velde Mark
Yvon François
Zerbian Sabine
Publication venue: HAL CCSD
Publication date: 01/05/2016

Hal - Université Grenoble Alpes

HAL

Hal-Diderot

2nd Conference on Language, Data and Knowledge (LDK 2019), May 20–23, 2019, Leipzig, Germany

Author: Buitelaar Paul
Chiarcos Christian
de Melo Gerard
Dojchinovski Milan
Eskevich Maria
Fäth Christian
Klimek Bettina
McCrae John P.
Publication date: 27/04/2023

OPUS Augsburg

Proceedings of the EACL 2009 Workshop on Language Technologies for African Languages

Author: De Pauw Guy
de Schryver Gilles-Maurice
Levin Lori
Publication venue: Association for Computational Linguistics (ACL)
Publication date: 01/01/2009

Ghent University Academic Bibliography

AfLaT 2010: proceedings of the second workshop on African language technology (AfLaT 2010)

Author: De Pauw Guy
de Schryver Gilles-Maurice
Groenewald Handré
Publication venue: European Language Resources Association (ELRA)
Publication date: 01/01/2010

Ghent University Academic Bibliography

Proceedings of the workshop on language technology for normalisation of less-resourced languages (SaLTMiL 8 - AfLaT 2012)

Author: De Pauw Guy
de Schryver Gilles-Maurice
Forcada Mike L
Sarasola Kepa
Tyers Francis M
Wagacha Peter W
Publication venue: European Language Resources Association
Publication date: 01/01/2012

Ghent University Academic Bibliography

English Index

Author: Pálfi Lórand-Levente
Publication venue: Aarhus University, Faculty of Arts, School of Communication and Culture
Publication date: 13/03/2007

No abstract

Tidsskrift.dk (Det Kongelige Bibliotek)

Penalizing unknown words’ emissions in HMM POS tagger based on Malay affix morphemes

Author: Aziz M.J.A.
Mohamed H.
Omar N.
Publication venue: African Journals Online (AJOL)
Publication date: 22/01/2018

The challenge in unsupervised Hidden Markov Model (HMM) training for a POS tagger is that the training depends on an untagged corpus; the only supervised data limiting the possible tags of words is a dictionary. Therefore, training cannot properly map possible tags. The exact morphemes of prefixes, suffixes and circumfixes in the agglutinative Malay language are examined to assign probable tags to unknown words based on linguistically meaningful affixes, using a morpheme-based POS guessing algorithm. The algorithm has been integrated into the Viterbi algorithm, which uses HMM-trained parameters to tag new sentences. In the experiment, the tagger first uses character-based prediction to handle unknown words; next, the morpheme-based POS guessing algorithm; and lastly, a combination of the two.
Keywords: Malay POS tagger; morpheme-based; HMM

AJOL - African Journals Online

Investigating the effectiveness of available tools for translating into tshiVenda

Author: Nemutamvuni Mulalo Edward
Publication date: 01/11/2018

Text in English. Abstracts in English and Venda. This study investigated the effectiveness of the tools available for translating from English into Tshivenḓa and vice versa. It addressed the lack of effective translation tools for this language pair. Tshivenḓa is one of South Africa’s minority languages, and its lack of effective translation tools negatively affects language practitioners’ work and puts translation quality assurance at risk. Translation tools, both computer-based and non-computer-based, abound for developed languages such as English and French. Based on the results of this research project, the researcher made recommendations that could remedy the situation. South Africa is a democratic country with a number of language-related policies, which creates a conducive context for stakeholders with a passion for language to fully develop Tshivenḓa in all dimensions. All languages have evolved, and all were once underdeveloped. This shows that the development of Tshivenḓa is also possible, as it was for Afrikaans, which did not exist before 1652 and has since evolved and overtaken all indigenous South African languages. The study reviewed the literature on translation and translation tools, drawn from both published and unpublished sources. It used mixed methods research, i.e. quantitative and qualitative research methods, which complemented each other throughout the research. Data were gathered through questionnaires and interviews that employed both open-ended and closed-ended questions. Both purposive (judgemental) and snowball (chain) sampling were applied.
Data analysis combined several methods, owing to the nature of mixed methods research. Guided by an analytic comparison approach, which groups related data together during analysis and presentation, the study relied on both statistical and textual analyses. Themes were constructed to present the gathered data clearly. In the final chapters, the researcher discussed the findings and evaluated the entire research before making recommendations and drawing conclusions.
African Languages. M.A. (African Languages)

Unisa Institutional Repository