
    Collaboration in the Production of a Massively Multilingual Lexicon

    This paper discusses the multiple approaches to collaboration that the Kamusi Project is employing in the creation of a massively multilingual lexical resource. The project’s data structure enables the inclusion of large amounts of rich data within each sense-specific entry, with transitive concept-based links across languages. Data collection involves mining existing data sets, language experts using an online editing system, crowdsourcing, and games with a purpose. The paper discusses the benefits and drawbacks of each of these elements, and the steps the project is taking to account for those. Special attention is paid to guiding crowd members with targeted questions that produce results in a specific format. Collaboration is seen as an essential method for generating large amounts of linguistic data, as well as for validating the data so it can be considered trustworthy

    Kamusi and Amazigh: Solutions for Dialects within a Global Linguistic Data Infrastructure

    Agzul: Tazerawt-gi, terza amek ara d nesuffeɣ amawal n tmaziɣt ara yesdukkelen akk tantaliwin-is mebla ma tufrar-d ta ɣef tiyiḍ. Aya- nezmer ad t-nefru s tarrayt n uskenawal ara d -yeddemen akk tantaliwin-agi i yesdukkelen tameslyat akken ma llant yerna ad sishelent asuɣel ɣer tutlayin niḍen. Abstract: The major problem concerning the making of a common Berber dictionary is that language comes in many forms, none of which can be considered representative. We suggest that these problems can be overcome through a lexicographic approach that considers language variants as a matter of data organization. Instead of searching for a recognized definitive form for a language or region, it is possible to list all the forms encountered. Finally, there might be a monolingual Amazigh dictionary that reports on all variants of the language in its unit and provides bridges of translation to other languages around the world. Keywords: common Berber dictionary, lexicographical approach, variants, monolingual Amazigh dictionary, translatio

    Sierra Leone: Krio and the Quest for National Integration

    The Republic of Sierra Leone is smaller in size, population and number of languages than many other countries on the West African coast, such as Ghana, Ivory Coast and Nigeria. A particularly interesting phenomenon is, however, present in the configuration of the languages used in the country and in how language links the general population. Though two proportionately large indigenous languages, Temne and Mende, are spoken in the country, the language that has spread and serves as a universal lingua franca, known by as much as 95% of the population of Sierra Leone, is in fact an English-based creole known as Krio, which is the mother tongue of a much smaller group of speakers primarily localized in and near the capital city Freetown. This chapter examines the growing significance of Krio in Sierra Leone and how it originally developed as a contact language among different groups of resettled emancipated slaves and other indigenous inhabitants of the Freetown area. The implications of the growth of Krio for national language policy and the position of English as the official language are examined, as well as the existence of ambivalent and changing attitudes towards the Krio language

    Multilingual Lexicography with a Focus on Less-Resourced Languages: Data Mining, Expert Input, Crowdsourcing, and Gamification

    This paper looks at the challenges that the Kamusi Project faces in acquiring open lexical data for less-resourced languages (LRLs), of a range, depth, and quality that can be useful within Human Language Technology (HLT). These challenges include accessing and reforming existing lexicons into interoperable data, recruiting language specialists and citizen linguists, and obtaining large volumes of quality input from the crowd. We introduce our crowdsourcing model, specifically (1) motivating participation using a “play to pay” system, games, social rewards, and material prizes; (2) steering the crowd to contribute structured and reliable data via targeted questions; and (3) evaluating participants’ input through crowd validation and statistical analysis to ensure that only trustworthy material is incorporated into Kamusi’s master database. We discuss the mobile application Kamusi has developed for crowd participation, which elicits high-quality structured data directly from each language’s speakers through narrow questions that can be answered with a minimum of time and effort. Through the integration of existing lexicons, expert input, and innovative methods of acquiring knowledge from the crowd, an accurate and reliable multilingual dictionary with a focus on LRLs will grow and become available as a free public resource

    Final FLaReNet deliverable: Language Resources for the Future - The Future of Language Resources

    Language Technologies (LT), together with their backbone, Language Resources (LR), provide essential support to the challenge of Multilingualism and ICT of the future. The main task of language technologies is to bridge language barriers and to help create a new environment where information flows smoothly across frontiers and languages, regardless of the country and language of origin. To achieve this goal, all players involved need to act as a community able to join forces on a set of shared priorities. However, the field of Language Resources and Technology has long suffered from an excess of individuality and fragmentation, with a lack of coherence concerning the priorities for the field and the direction in which to move, not to mention a common timeframe. The context encountered by the FLaReNet project was thus that of an active field needing a coherence that can only be given by sharing common priorities and endeavours. FLaReNet has contributed to the creation of this coherence by gathering a wide community of experts and making them participate in the definition of an exhaustive set of recommendations

    Evaluating Novel Speech Transcription Architectures on the Spanish RTVE2020 Database

    This work presents three novel speech recognition architectures evaluated on the Spanish RTVE2020 dataset, employed as the main evaluation set in the Albayzín S2T Transcription Challenge 2020. The main objective was to improve the performance of the systems previously submitted by the authors to the challenge, in which the primary system took second place. The novel systems are based on both DNN-HMM and E2E acoustic models, for which fully supervised and self-supervised learning methods were included. As a result, the new speech recognition engines clearly outperformed the initial systems, improving the best WER from 19.27 to 17.60, the latter achieved by the DNN-HMM based system. This work therefore describes an interesting benchmark of the latest acoustic models over a highly challenging dataset, and identifies the best-suited ones depending on the expected quality, the available resources and the required latency

    The European Language Resources and Technologies Forum: Shaping the Future of the Multilingual Digital Europe

    Proceedings of the 1st FLaReNet Forum on the European Language Resources and Technologies, held in Vienna, at the Austrian Academy of Science, on 12-13 February 2009

    Problems and Procedures to Make Wordnet Data (Retro)Fit for a Multilingual Dictionary

    The data compiled through many Wordnet projects can be a rich source of seed information for a multilingual dictionary. However, the original Princeton WordNet was not intended as a dictionary per se, and spawning other languages from it introduces inherent ambiguity that confounds precise inter-lingual linking. This paper discusses a new presentation of existing Wordnet data that displays joints (distance between predicted links) and substitution (degree of equivalence between confirmed pairs) as a two-tiered horizontal ontology. Improvements to make Wordnet data function as lexicography include term-specific English definitions where the topical synset glosses are inadequate, validation of mappings between each member of an English synset and each member of the synsets from other languages, removal of erroneous translation terms, creation of own-language definitions for the many languages where those are absent, and validation of predicted links between non-English pairs. The paper describes the current state and future directions of a system to crowdsource human review and expansion of Wordnet data, using gamification to build consensus-validated, dictionary-caliber data for languages now in the Global WordNet as well as new languages that do not have formal Wordnet projects of their own