
    Parallel Speech Collection for Under-resourced Language Studies Using the Lig-Aikuma Mobile Device App

    This paper reports on our ongoing efforts to collect speech data in under-resourced or endangered languages of Africa. Data collection is carried out using an improved version of the Android application Aikuma, originally developed by Steven Bird and colleagues. Features were added to the app to facilitate the collection of parallel speech data in line with the requirements of the French-German ANR/DFG BULB (Breaking the Unwritten Language Barrier) project. The resulting app, called Lig-Aikuma, runs on various mobile phones and tablets and offers a range of speech collection modes (recording, respeaking, translation and elicitation). Lig-Aikuma's improved features include smart generation and handling of speaker metadata as well as respeaking and parallel audio data mapping. It was used for field data collections in Congo-Brazzaville, resulting in a total of over 80 hours of speech. Design issues of the mobile app, as well as its use during two recording campaigns, are further described in this paper.
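    As an illustration of the "parallel audio data mapping" and speaker-metadata handling mentioned above, the minimal sketch below pairs a respoken or translated recording with its source file through a JSON metadata sidecar. The field names, file layout and the save_entry helper are assumptions made for illustration, not Lig-Aikuma's actual on-device format.

```python
# Minimal sketch of the kind of speaker metadata and parallel-audio mapping
# Lig-Aikuma manages. The file layout and field names are assumptions for
# illustration, not the app's actual on-device format.
import json
from dataclasses import dataclass, asdict
from pathlib import Path
from typing import List, Optional

@dataclass
class SpeakerMetadata:
    speaker_id: str
    age: int
    gender: str
    languages: List[str]      # languages spoken by the speaker
    region: str

@dataclass
class RecordingEntry:
    audio_path: str                 # path to the WAV file
    mode: str                       # "recording", "respeaking", "translation" or "elicitation"
    speaker: SpeakerMetadata
    parent_audio: Optional[str]     # original recording this one respeaks/translates, if any

def save_entry(entry: RecordingEntry, out_dir: Path) -> Path:
    """Write a JSON sidecar next to the audio file so each derived recording
    stays linked to its source recording (the parallel audio data mapping)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    sidecar = out_dir / (Path(entry.audio_path).stem + ".json")
    sidecar.write_text(json.dumps(asdict(entry), ensure_ascii=False, indent=2))
    return sidecar

if __name__ == "__main__":
    speaker = SpeakerMetadata("spk01", 34, "F", ["mboshi", "french"], "Congo-Brazzaville")
    original = RecordingEntry("session1/orig_0001.wav", "recording", speaker, None)
    respoken = RecordingEntry("session1/resp_0001.wav", "respeaking", speaker, "session1/orig_0001.wav")
    for entry in (original, respoken):
        print(save_entry(entry, Path("metadata")))
```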

    A Very Low Resource Language Speech Corpus for Computational Language Documentation Experiments

    Most speech and language technologies are trained with massive amounts of speech and text data. However, most of the world's languages do not have such resources or a stable orthography. Systems constructed under these almost-zero-resource conditions are promising not only for speech technology but also for computational language documentation. The goal of computational language documentation is to help field linguists (semi-)automatically analyze and annotate audio recordings of endangered and unwritten languages. Example tasks are automatic phoneme discovery or lexicon discovery from the speech signal. This paper presents a speech corpus collected during a realistic language documentation process. It is made up of 5k speech utterances in Mboshi (Bantu C25) aligned to French text translations. Speech transcriptions are also made available: they correspond to a non-standard graphemic form close to the language's phonology. We present how the data was collected, cleaned and processed, and we illustrate its use through a zero-resource task: spoken term discovery. The dataset is made available to the community for reproducible computational language documentation experiments and their evaluation. (Accepted to LREC 2018.)
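    To make the corpus structure concrete, the following is a minimal sketch of iterating over aligned speech, graphemic transcription and French translation triples before running, for example, a spoken term discovery experiment. The directory layout, file extensions and the iter_utterances helper are assumptions for illustration and may not match the released corpus format.

```python
# Minimal sketch of iterating over the aligned speech / transcription /
# translation triples such a corpus provides. The layout below (one
# <id>.wav / <id>.mb.txt / <id>.fr.txt file set per utterance) is an
# assumption for illustration, not the corpus's actual release format.
from pathlib import Path
from typing import Iterator, Tuple

def iter_utterances(root: str) -> Iterator[Tuple[Path, str, str]]:
    """Yield (wav_path, mboshi_graphemic_transcription, french_translation)."""
    for wav in sorted(Path(root).glob("*.wav")):
        mb = wav.parent / f"{wav.stem}.mb.txt"
        fr = wav.parent / f"{wav.stem}.fr.txt"
        if mb.exists() and fr.exists():
            yield (wav,
                   mb.read_text(encoding="utf-8").strip(),
                   fr.read_text(encoding="utf-8").strip())

if __name__ == "__main__":
    # count utterances and inspect one pair before running, e.g.,
    # a spoken term discovery or word segmentation experiment
    triples = list(iter_utterances("mboshi_corpus/train"))
    print(f"{len(triples)} aligned utterances")
    if triples:
        wav, mboshi, french = triples[0]
        print(wav.name, "|", mboshi, "|", french)
```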

    Innovative technologies for under-resourced language documentation: The BULB Project

    The project Breaking the Unwritten Language Barrier (BULB), which brings together linguists and computer scientists, aims at supporting linguists in documenting unwritten languages. In order to achieve this, we will develop tools tailored to the needs of documentary linguists by building upon technology and expertise from the area of natural language processing, most prominently automatic speech recognition and machine translation. As a development and test bed for this we have chosen three less-resourced African languages from the Bantu family: Basaa, Myene and Embosi. Work within the project is divided into three main steps: 1) collection of a large corpus of speech (100h per language) at a reasonable cost; after initial recording, the data is re-spoken by a reference speaker to enhance the signal quality and orally translated into French; 2) automatic transcription of the Bantu languages at the phoneme level and of the French translation at the word level; the recognized Bantu phonemes and French words will then be automatically aligned; 3) tool development: in close cooperation and discussion with the linguists, the speech and language technologists will design and implement tools that support the linguists in their work, taking into account the linguists' needs and the technology's capabilities. Data collection has begun for the three languages. For this we use standard mobile devices and dedicated software, LIG-AIKUMA, which offers a range of speech collection modes (recording, respeaking, translation and elicitation). LIG-AIKUMA's improved features include smart generation and handling of speaker metadata as well as respeaking and parallel audio data mapping.
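    The abstract states that the recognized Bantu phonemes and French words will be automatically aligned, without naming the alignment model. As a rough illustration of what such a step can look like, below is a minimal IBM-Model-1-style EM sketch over invented toy data; it is not the project's actual aligner.

```python
# Minimal IBM-Model-1-style EM over (phoneme sequence, French word sequence)
# pairs, shown only to illustrate the kind of phoneme-to-word alignment step
# described in the abstract. The toy data is invented for illustration.
from collections import defaultdict

def train_translation_probs(pairs, iterations=10):
    """Estimate t[(phoneme, word)] with IBM-Model-1 EM from parallel pairs."""
    phonemes = {p for ph_seq, _ in pairs for p in ph_seq}
    t = defaultdict(lambda: 1.0 / max(len(phonemes), 1))  # uniform init
    for _ in range(iterations):
        count = defaultdict(float)   # expected co-occurrence counts (E-step)
        total = defaultdict(float)
        for ph_seq, fr_seq in pairs:
            for p in ph_seq:
                norm = sum(t[(p, w)] for w in fr_seq)
                for w in fr_seq:
                    c = t[(p, w)] / norm
                    count[(p, w)] += c
                    total[w] += c
        for (p, w), c in count.items():  # re-estimate probabilities (M-step)
            t[(p, w)] = c / total[w]
    return t

def align(ph_seq, fr_seq, t):
    """Greedy alignment: link each phoneme to its most likely French word."""
    return [(p, max(fr_seq, key=lambda w: t[(p, w)])) for p in ph_seq]

if __name__ == "__main__":
    toy = [
        ("m w a n a".split(), ["l'enfant"]),
        ("m w a n a a l i n g i".split(), ["l'enfant", "veut"]),
        ("a l i n g i m a i".split(), ["veut", "de", "l'eau"]),
    ]
    t = train_translation_probs(toy)
    print(align("m w a n a".split(), ["l'enfant", "veut"], t))
```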

    Lessons Learned After Development and Use of a Data Collection App for Language Documentation (Lig-Aikuma)

    Lig-Aikuma is a free Android app running on various mobile phones and tablets. It offers a range of speech collection modes (recording, respeaking, translation and elicitation) and the possibility to share recordings between users. More than 250 hours of speech in 6 different languages from sub-Saharan Africa (including 3 oral languages in the process of being documented) have already been collected with Lig-Aikuma. This paper presents the lessons learned after 3 years of development and use of Lig-Aikuma. While significant data collections were conducted, they were not carried out without difficulties. Some mixed results lead us to stress the importance of design choices, the data sharing architecture and the user manual. We also discuss other potential uses of the app discovered during its deployment: data collection for language revitalisation, data collection for speech technology development (ASR) and enrichment of existing corpora through the addition of spoken comments.

    Speech Collection for the Study of Under-resourced or Endangered Languages with the Lig-Aikuma Mobile Application

    This paper reports on ongoing work on the collection of speech data in under-resourced or endangered African languages. Data collection was carried out using a modified version of the Android application AIKUMA, initially developed by Steven Bird and colleagues (Bird et al., 2014). The modifications follow the specifications of the French-German ANR/DFG BULB project in order to facilitate the field collection of parallel speech corpora. The resulting application, called LIG-AIKUMA, has been successfully tested on several smartphones and tablets and offers several operating modes (speech recording, speech respeaking, translation and elicitation). Among other features, LIG-AIKUMA allows the generation and advanced handling of metadata files, as well as the handling of alignment information between parallel spoken sentences in the respeaking and translation modes. The application was used during field collection campaigns in Congo-Brazzaville, enabling the acquisition of 80 hours of speech. The design of the application and an illustration of its use in two collection campaigns are described in more detail in this paper.

    Amharic Speech Recognition for Speech Translation

    State-of-the-art speech translation can be seen as a cascade of Automatic Speech Recognition, Statistical Machine Translation and Text-To-Speech synthesis. In this study we experiment with Amharic speech recognition for Amharic-English speech translation in the tourism domain. Since no Amharic speech corpus was available, we developed a 7.43-hour read-speech corpus in the tourism domain. The Amharic speech corpus was recorded, under a normal working environment, after translating the standard Basic Traveler Expression Corpus (BTEC). In our ASR experiments, phoneme and syllable units are used for acoustic models, while morphemes and words are used for language models. Encouraging ASR results are achieved using morpheme-based language models and phoneme-based acoustic models, with recognition accuracies of 89.1%, 80.9%, 80.6% and 49.3% at the character, morpheme, word and sentence levels, respectively. We are now working towards designing Amharic-English speech translation through cascaded components with different error correction algorithms.
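    The figures above are recognition accuracies computed over different units. A common convention is accuracy = 1 - (edit distance / reference length) over the chosen unit; the sketch below assumes that convention and is not necessarily the exact scoring protocol used in the study.

```python
# Unit-level recognition accuracy as 1 - (Levenshtein distance / reference
# length). This is a common convention, assumed here for illustration; the
# study's exact scoring protocol is not specified in the abstract.
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of units."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # unit missing in hyp
                                   d[j - 1] + 1,     # extra unit in hyp
                                   prev + (r != h))  # substitution (or match)
    return d[-1]

def accuracy(ref_units, hyp_units):
    """Recognition accuracy over the chosen unit (words, characters, ...)."""
    if not ref_units:
        return 1.0 if not hyp_units else 0.0
    return 1.0 - edit_distance(ref_units, hyp_units) / len(ref_units)

if __name__ == "__main__":
    ref = "the castle is open to visitors"
    hyp = "the castle is open for visitor"
    print("word accuracy:", round(accuracy(ref.split(), hyp.split()), 3))
    print("char accuracy:", round(accuracy(list(ref), list(hyp)), 3))
    # sentence-level accuracy is simply the fraction of hypotheses that
    # match their reference exactly
```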