355 research outputs found

    Preliminary Experiments on Unsupervised Word Discovery in Mboshi

    No full text
    The need to document thousands of endangered languages encourages collaboration between linguists and computer scientists, in order to provide the documentary linguistics community with automatic processing tools. The French-German ANR-DFG project Breaking the Unwritten Language Barrier (BULB) aims to develop such tools for three mostly unwritten African languages of the Bantu family. For one of them, Mboshi, a language originating from the "Cuvette" region of the Republic of Congo, we investigate unsupervised word discovery techniques from an unsegmented stream of phonemes. We compare different models and algorithms, both monolingual and bilingual, on a new corpus in Mboshi and French, and discuss various ways to represent the data with suitable granularity. An additional French-English corpus allows us to contrast the results obtained on Mboshi and to experiment with more data.

    Innovative technologies for under-resourced language documentation: The BULB Project

    No full text
    The project Breaking the Unwritten Language Barrier (BULB), which brings together linguists and computer scientists, aims to support linguists in documenting unwritten languages. To achieve this, we will develop tools tailored to the needs of documentary linguists by building on technology and expertise from natural language processing, most prominently automatic speech recognition and machine translation. As a development and test bed, we have chosen three less-resourced African languages from the Bantu family: Basaa, Myene and Embosi. Work within the project is divided into three main steps: 1) Collection of a large corpus of speech (100h per language) at a reasonable cost. After initial recording, the data is re-spoken by a reference speaker to enhance the signal quality and orally translated into French. 2) Automatic transcription of the Bantu languages at phoneme level and of the French translation at word level. The recognized Bantu phonemes and French words will then be automatically aligned. 3) Tool development. In close cooperation and discussion with the linguists, the speech and language technologists will design and implement tools that support the linguists in their work, taking into account the linguists' needs and the technology's capabilities. Data collection has begun for the three languages. For this we use standard mobile devices and dedicated software, LIG-AIKUMA, which offers a range of speech collection modes (recording, respeaking, translation and elicitation). LIG-AIKUMA's improved features include smart generation and handling of speaker metadata as well as respeaking and parallel audio data mapping.

    English Index

    No abstract

    Penalizing unknown words’ emissions in HMM POS tagger based on Malay affix morphemes

    The challenge in unsupervised Hidden Markov Model (HMM) training for a POS tagger is that training depends on an untagged corpus; the only supervised data limiting the possible tags of words is a dictionary. Training therefore cannot properly map possible tags. The exact morphemes of prefixes, suffixes and circumfixes in the agglutinative Malay language are examined to assign probable tags to unknown words based on linguistically meaningful affixes, using a morpheme-based POS guessing algorithm. The algorithm has been integrated into the Viterbi algorithm, which uses HMM-trained parameters to tag new sentences. In the experiment, the tagger handles unknown words in three ways: first with character-based prediction, next with the morpheme-based POS guessing algorithm, and lastly with a combination of the two. Keywords: Malay POS tagger; morpheme-based; HMM

    Investigating the effectiveness of available tools for translating into tshiVenda

    Text in English. Abstracts in English and Venda. This study investigated the effectiveness of the available tools for translating from English into Tshivenḓa and vice versa. It addressed the lack of effective translation tools for this language pair. Tshivenḓa is one of South Africa’s minority languages, and its lack of effective translation tools negatively affects language practitioners’ work and puts translation quality assurance at risk. Translation tools, both computer-based and non-computer-based, abound for developed languages such as English and French. Based on the results of this research project, the researcher made recommendations that could remedy the situation. South Africa is a democratic country with a number of language-related policies, which creates a conducive context for stakeholders with a passion for language to develop Tshivenḓa fully in all dimensions. All languages have evolved, and all were once underdeveloped. This shows that the development of Tshivenḓa is also possible, as it was for Afrikaans, which did not exist before 1652 and has since evolved and overtaken all indigenous South African languages. The study reviewed the literature on translation and translation tools, drawn from both published and unpublished sources. It used mixed methods research, i.e. quantitative and qualitative methods, which complemented each other throughout the research. Data were gathered through questionnaires and interviews using both open and closed-ended questions. Both purposive (judgemental) and snowball (chain) sampling were applied. Because of the mixed methods design, data analysis combined several methods: guided by an analytic comparison approach, in which related data were grouped together during analysis and presentation, the study relied on both statistical and textual analyses. Themes were constructed to present the gathered data clearly. In the final chapters, the researcher discussed the findings and evaluated the entire research before making recommendations and drawing conclusions. African Languages. M.A. (African Languages)