    A Very Low Resource Language Speech Corpus for Computational Language Documentation Experiments

    Most speech and language technologies are trained with massive amounts of speech and text information. However, most of the world's languages do not have such resources or a stable orthography. Systems constructed under these almost zero resource conditions are promising not only for speech technology but also for computational language documentation. The goal of computational language documentation is to help field linguists to (semi-)automatically analyze and annotate audio recordings of endangered and unwritten languages. Example tasks are automatic phoneme discovery or lexicon discovery from the speech signal. This paper presents a speech corpus collected during a realistic language documentation process. It is made up of 5k speech utterances in Mboshi (Bantu C25) aligned to French text translations. Speech transcriptions are also made available: they correspond to a non-standard graphemic form close to the language phonology. We present how the data was collected, cleaned and processed and we illustrate its use through a zero-resource task: spoken term discovery. The dataset is made available to the community for reproducible computational language documentation experiments and their evaluation. Comment: accepted to LREC 201

    Innovative technologies for under-resourced language documentation: The BULB Project

    The project Breaking the Unwritten Language Barrier (BULB), which brings together linguists and computer scientists, aims at supporting linguists in documenting unwritten languages. In order to achieve this we will develop tools tailored to the needs of documentary linguists by building upon technology and expertise from the area of natural language processing, most prominently automatic speech recognition and machine translation. As a development and test bed for this we have chosen three less-resourced African languages from the Bantu family: Basaa, Myene and Embosi. Work within the project is divided into three main steps: 1) Collection of a large corpus of speech (100h per language) at a reasonable cost. After initial recording, the data is re-spoken by a reference speaker to enhance the signal quality and orally translated into French. 2) Automatic transcription of the Bantu languages at phoneme level and the French translation at word level. The recognized Bantu phonemes and French words will then be automatically aligned. 3) Tool development. In close cooperation and discussion with the linguists, the speech and language technologists will design and implement tools that will support the linguists in their work, taking into account the linguists' needs and technology's capabilities. The data collection has begun for the three languages. For this we use standard mobile devices and dedicated software, LIG-AIKUMA, which offers a range of different speech collection modes (recording, respeaking, translation and elicitation). LIG-AIKUMA's improved features include smart generation and handling of speaker metadata as well as respeaking and parallel audio data mapping.

    Linguistic unit discovery from multi-modal inputs in unwritten languages: Summary of the "Speaking Rosetta" JSALT 2017 Workshop

    We summarize the accomplishments of a multi-disciplinary workshop exploring the computational and scientific issues surrounding the discovery of linguistic units (subwords and words) in a language without orthography. We study the replacement of orthographic transcriptions by images and/or translated text in a well-resourced language to help unsupervised discovery from raw speech. Comment: Accepted to ICASSP 201

    PRESERVING VERNACULARS IN INDONESIA: A BILINGUAL VERNACULAR-ENGLISH DICTIONARY APPROACH

    English learners in Indonesia learn English through Indonesian, the language of instruction in the country's education system, despite the fact that 80% of the country's population speak vernaculars as their mother tongue. The provision of learning materials, including bilingual dictionaries, therefore follows this convention, while bilingual dictionaries accommodating learners who speak vernaculars natively are barely provided. This situation requires every Indonesian to comprehend the Indonesian language first in order to learn English, although theories of foreign language learning suggest otherwise. Apart from this, the use of the vernaculars of Indonesia itself tends to decline, yet bilingual dictionaries linking the vernaculars with a widely known language such as English are still lacking. This article elaborates on the issues of (1) English vocabulary learning and (2) the maintenance of the vernaculars of Indonesia, with discussions of Butzkamm's theory and UNESCO's suggestion on foreign language learning, Nation's New General Service List as the core of the English vocabulary, and the application of technology in the lexicography of bilingual dictionaries. Taking the Cirebon dialect of Javanese as an example, this article suggests that providing a bilingual dictionary that functions both as a reference for English vocabulary learning and as documentation for vernacular maintenance is possible.

    Southeast Asia in the Ancient Indian Ocean World; Combining Historical Linguistic and Archaeological Approaches

    This PhD dissertation examines the role of insular Southeast Asia in the trans-regional networks of maritime trade that shaped the history of the Indian Ocean. The work brings together data and approaches from archaeology, historical linguistics and other disciplines, proposing a reconstruction of cultural and linguistic contact between Southeast Asia and its maritime neighbours to the west in order to advance our historical understanding of this part of the world. Numerous biological, commercial and technical items are examined. The study underlines that the analysis of lexical data is one of the strongest tools to detect and analyse contact between two or more speech communities. It demonstrates how Southeast Asian products and concepts were mainly dispersed by speakers of Malay varieties, although other communities played a role as well. Through an interdisciplinary approach, the study offers new perspectives on the role of insular Southeast Asian agents in cultural dynamism and interethnic contact in the pre-modern Indian Ocean World.

    Bayesian Models for Unit Discovery on a Very Low Resource Language

    Developing speech technologies for low-resource languages has become a very active research field over the last decade. Among others, Bayesian models have shown some promising results on artificial examples but still lack in situ experiments. Our work applies state-of-the-art Bayesian models to unsupervised Acoustic Unit Discovery (AUD) in a real low-resource language scenario. We also show that Bayesian models can naturally integrate information from other resourceful languages by means of an informative prior, leading to more consistent discovered units. Finally, discovered acoustic units are used, either as the 1-best sequence or as a lattice, to perform word segmentation. Word segmentation results show that this Bayesian approach clearly outperforms a Segmental-DTW baseline on the same corpus. Comment: Accepted to ICASSP 201