    Innovative technologies for under-resourced language documentation: The BULB Project

    The project Breaking the Unwritten Language Barrier (BULB), which brings together linguists and computer scientists, aims at supporting linguists in documenting unwritten languages. To achieve this, we will develop tools tailored to the needs of documentary linguists by building upon technology and expertise from natural language processing, most prominently automatic speech recognition and machine translation. As a development and test bed we have chosen three less-resourced African languages from the Bantu family: Basaa, Myene and Embosi. Work within the project is divided into three main steps: 1) Collection of a large corpus of speech (100h per language) at a reasonable cost. After initial recording, the data is re-spoken by a reference speaker to enhance the signal quality and orally translated into French. 2) Automatic transcription of the Bantu languages at phoneme level and of the French translation at word level. The recognized Bantu phonemes and French words will then be automatically aligned. 3) Tool development. In close cooperation and discussion with the linguists, the speech and language technologists will design and implement tools that support the linguists in their work, taking into account the linguists' needs and the technology's capabilities. Data collection has begun for the three languages. For this we use standard mobile devices and a dedicated application, LIG-AIKUMA, which offers a range of speech collection modes (recording, respeaking, translation and elicitation). LIG-AIKUMA's improved features include smart generation and handling of speaker metadata as well as respeaking and parallel audio data mapping.

    Compiling a dictionary of an unwritten language: a non-corpus-based approach

    In this article an account is given of the fieldwork experience of the Dictionary of the Flemish Dialects (Woordenboek van de Vlaamse Dialecten, WVD), Ghent University, Belgium. The focus is on the practical aspects of lexicographic fieldwork methods. It is maintained that analysing 'metalinguistic conversations' with groups of respondents, in which their lexicographic competence is explored, is a suitable way of collecting lexicographic data. Fieldwork by correspondence (questionnaires) can amplify and verify the data collected through interviews. Keywords: lexicography, unwritten language, dialect, regional dictionary, fieldwork, general vocabulary, Dutch, Southern Dutch, Flemish, Brabant dialect, Limburg dialect, the Netherlands, Belgium, systematic arrangement, methodology, questionnaire, interview, word atlas, language variation

    Root-Oriented Words Generation: An Easier Way Towards Dictionary Making for the Dusunic Family of Languages

    Dictionary production is one of the most effective methods of preserving languages and cultures. Speakers of the Dusunic Family of Languages (DFL) in Sabah, Malaysia would welcome efforts to document their languages through dictionary production, as dictionaries, vocabularies and phrase books are still lacking. Furthermore, more than half of the languages in DFL are unwritten. However, compiling a dictionary by conventional means is tedious and time-consuming. The Dusunic languages, which are facing the threat of extinction, do not have the luxury of time to wait for dictionary production via the conventional method. Hence, this study explores a method called Root-Oriented Words Generation (ROWG), which is formulated on the basis of DFL orthography and generates lists of one- and two-syllable words. From these word lists, root word registers were compiled, which can then be used as a database for dictionary production. The findings show that ROWG was able to generate exhaustive word lists for DFL and to compile a large root word register. The study thus highlights the feasibility and viability of using ROWG to produce a root word register for DFL, which could significantly reduce the time needed for dictionary production. For future studies, it is recommended that ROWG be extended to words of more than two syllables. This study shows the potential of ROWG to address the looming demise of DFL by providing a more efficient way of compiling root words for dictionary making.
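
    The abstract above does not spell out the ROWG procedure, so the following is only a minimal sketch of one way orthography-driven word generation could work: enumerate every syllable the spelling rules permit, then combine syllables into one- and two-syllable candidate strings for speakers to review. The letter inventories and the names ONSETS, VOWELS, CODAS, syllables and candidate_words are hypothetical placeholders, not the actual DFL orthography or the authors' implementation.

```python
from itertools import product

# Hypothetical placeholder inventories (assumption): NOT the real DFL orthography.
ONSETS = ["", "k", "t"]   # "" allows vowel-initial syllables
VOWELS = ["a", "i"]
CODAS = ["", "n"]         # "" allows open syllables


def syllables(onsets, vowels, codas):
    """Return every onset + vowel + coda combination as a string."""
    return [o + v + c for o, v, c in product(onsets, vowels, codas)]


def candidate_words(sylls, max_syllables=2):
    """Return the sorted, de-duplicated strings built from 1..max_syllables syllables."""
    words = set()
    for n in range(1, max_syllables + 1):
        for combo in product(sylls, repeat=n):
            words.add("".join(combo))
    return sorted(words)


if __name__ == "__main__":
    sylls = syllables(ONSETS, VOWELS, CODAS)
    words = candidate_words(sylls)
    print(len(sylls), "syllables,", len(words), "candidate words")
```

    Such a list contains many strings that are not real words; under this sketch's assumptions, the root word register would come from a separate review step in which speakers keep only the attested forms.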