10,115 research outputs found

    Machine-assisted translation by Human-in-the-loop Crowdsourcing for Bambara

    Language is more than a tool for conveying information; it is used in all aspects of our lives. Yet only a small number of the world's 7,000 languages are highly resourced by human language technologies (HLT). Although Africa accounts for over 2,000 of these languages, only a few African languages are highly resourced, in the sense that a considerable amount of parallel digital data exists for them. We present a novel approach to machine translation (MT) for under-resourced languages that improves the quality of the model using a paradigm called "humans in the loop." This thesis describes the work carried out to create a Bambara-French MT system, including data discovery, data preparation, model hyper-parameter tuning, the development of a crowdsourcing platform for humans in the loop, vocabulary sizing, and segmentation. We achieved a BLEU (bilingual evaluation understudy) score of 17.5. The results confirm that MT for Bambara, despite our small data set, is viable. This work has the potential to contribute to the reduction of language barriers between the people of Sub-Saharan Africa and the rest of the world.

    A Very Low Resource Language Speech Corpus for Computational Language Documentation Experiments

    Most speech and language technologies are trained with massive amounts of speech and text information. However, most of the world's languages do not have such resources or a stable orthography. Systems constructed under these almost zero-resource conditions are promising not only for speech technology but also for computational language documentation. The goal of computational language documentation is to help field linguists to (semi-)automatically analyze and annotate audio recordings of endangered and unwritten languages. Example tasks are automatic phoneme discovery or lexicon discovery from the speech signal. This paper presents a speech corpus collected during a realistic language documentation process. It is made up of 5k speech utterances in Mboshi (Bantu C25) aligned to French text translations. Speech transcriptions are also made available: they correspond to a non-standard graphemic form close to the language phonology. We present how the data was collected, cleaned and processed, and we illustrate its use through a zero-resource task: spoken term discovery. The dataset is made available to the community for reproducible computational language documentation experiments and their evaluation. Comment: accepted to LREC 201

    Computerization of African languages-French dictionaries

    This paper reports on work done during the DiLAF project. The project consists of converting five bilingual African language-French dictionaries, originally in Word format, into XML following the LMF model. The languages processed are Bambara, Hausa, Kanuri, Tamajaq and Songhai-zarma, which are still considered under-resourced with respect to Natural Language Processing tools. Once converted, the dictionaries are available online on the Jibiki platform for lookup and modification. The DiLAF project is presented first, followed by a description of each dictionary. The methodology for converting .doc files to XML is then presented, along with a specific point on the usage of Unicode. Each step of the conversion into XML and LMF is then detailed. The last part presents the Jibiki lexical resources management platform used for the project. Comment: 8 page

    Transnational reflections on transnational research projects on men, boys and gender relations

    This article reflects on the research project, ‘Engaging South African and Finnish youth towards new traditions of non-violence, equality and social well-being’, funded by the Finnish and South African national research councils, in the context of wider debates on research, projects and transnational processes. The project is located within a broader analysis of research projects and projectization (the reduction of research to separate projects), and the increasing tendencies for research to be framed within and as projects, with their own specific temporal and organizational characteristics. This approach is developed further in terms of different understandings of research across borders: international, comparative, multinational and transnational. Special attention is given to differences between research projects that are within Europe and the EU, and projects that are between the global North and the global South. The theoretical, political and practical challenges of the North-South research project are discussed.

    Innovative technologies for under-resourced language documentation: The BULB Project

    The project Breaking the Unwritten Language Barrier (BULB), which brings together linguists and computer scientists, aims at supporting linguists in documenting unwritten languages. To achieve this, we will develop tools tailored to the needs of documentary linguists by building upon technology and expertise from the area of natural language processing, most prominently automatic speech recognition and machine translation. As a development and test bed for this we have chosen three less-resourced African languages from the Bantu family: Basaa, Myene and Embosi. Work within the project is divided into three main steps: 1) Collection of a large corpus of speech (100h per language) at a reasonable cost. After initial recording, the data is re-spoken by a reference speaker to enhance the signal quality and orally translated into French. 2) Automatic transcription of the Bantu languages at phoneme level and the French translation at word level. The recognized Bantu phonemes and French words will then be automatically aligned. 3) Tool development. In close cooperation and discussion with the linguists, the speech and language technologists will design and implement tools that will support the linguists in their work, taking into account the linguists' needs and technology's capabilities. The data collection has begun for the three languages. For this we use standard mobile devices and a dedicated application, LIG-AIKUMA, which offers a range of different speech collection modes (recording, respeaking, translation and elicitation). LIG-AIKUMA's improved features include smart generation and handling of speaker metadata as well as respeaking and parallel audio data mapping.

    Disrupting Digital Monolingualism: A report on multilingualism in digital theory and practice

    This report covers the Disrupting Digital Monolingualism (DDM) virtual workshop held in June 2020. The DDM workshop sought to draw together a wide range of stakeholders active in confronting the current language bias in most of the digital platforms, tools, algorithms, methods, and datasets which we use in our study or practice, and to reverse the powerful impact this bias has on geocultural knowledge dynamics in the wider world. The workshop aimed to describe the state of the art across different academic disciplines and professional fields, and foster collaboration across diverse perspectives around four points of focus: Linguistic and geocultural diversity in digital knowledge infrastructures; Working with multilingual methods and data; Transcultural and translingual approaches to digital study; and Artificial intelligence, machine learning and NLP in language worlds. Event website: https://languageacts.org/digital-mediations/event/disrupting-digital-monolingualism/ This report forms part of a series of reports produced by the Digital Mediations strand of the Language Acts & Worldmaking project, in this case in collaboration with the translingual strand of the Cross-Language Dynamics project (based at the Institute of Modern Languages Research), both funded by the UK Arts and Humanities Research Council’s Open World Research Initiative. Digital Mediations explores interactions and tensions between digital culture, multilingualism and language fields including the Modern Languages.