73 research outputs found

    Does Manipulating Tokenization Aid Cross-Lingual Transfer? A Study on POS Tagging for Non-Standardized Languages

    One of the challenges with finetuning pretrained language models (PLMs) is that their tokenizer is optimized for the language(s) it was pretrained on, but brittle when it comes to previously unseen variations in the data. This can for instance be observed when finetuning PLMs on one language and evaluating them on data in a closely related language variety with no standardized orthography. Despite the high linguistic similarity, tokenization no longer corresponds to meaningful representations of the target data, leading to low performance in, e.g., part-of-speech tagging. In this work, we finetune PLMs on seven languages from three different families and analyze their zero-shot performance on closely related, non-standardized varieties. We consider different measures for the divergence in the tokenization of the source and target data, and the way they can be adjusted by manipulating the tokenization during the finetuning step. Overall, we find that the similarity between the percentage of words that get split into subwords in the source and target data (the split word ratio difference) is the strongest predictor for model performance on target data.

    OcWikiDisc : a Corpus of Wikipedia Talk Pages in Occitan

    This paper presents OcWikiDisc, a new freely available corpus in Occitan, as well as language identification experiments on Occitan done as part of the corpus building process. Occitan is a regional language spoken mainly in the south of France and in parts of Spain and Italy. It exhibits rich diatopic variation, it is not standardized, and it is still low-resourced, especially when it comes to large downloadable corpora. We introduce OcWikiDisc, a corpus extracted from the talk pages associated with the Occitan Wikipedia. The version of the corpus with the most restrictive language filtering contains 8K user messages for a total of 618K tokens. The language filtering is performed based on language identification experiments with five off-the-shelf tools, including the new fasttext's language identification model from Meta AI's No Language Left Behind initiative, released in July 2022.

    Natural language processing for similar languages, varieties, and dialects: A survey

    There has been a lot of recent interest in the natural language processing (NLP) community in the computational processing of language varieties and dialects, with the aim to improve the performance of applications such as machine translation, speech recognition, and dialogue systems. Here, we attempt to survey this growing field of research, with focus on computational methods for processing similar languages, varieties, and dialects. In particular, we discuss the most important challenges when dealing with diatopic language variation, and we present some of the available datasets, the process of data collection, and the most common data collection strategies used to compile datasets for similar languages, varieties, and dialects. We further present a number of studies on computational methods developed and/or adapted for preprocessing, normalization, part-of-speech tagging, and parsing similar languages, language varieties, and dialects. Finally, we discuss relevant applications such as language and dialect identification and machine translation for closely related languages, language varieties, and dialects.

    Analyse morphosyntaxique de l'occitan languedocien : l'amitié entre un petit languedocien et un gros catalan (POS-tagging the Lengadocian dialect of Occitan: a little Lengadocian befriends a big Catalan)

    In this study, we examine the question of Occitan POS-tagging. We use Talismane, a supervised machine learning NLP tool, requiring annotated data for training and optionally a lexicon. We show that, with insufficient annotated data for Occitan, it is possible to obtain good results (92%) by using data from an etymologically close language, in this case Catalan. We used the Catalan Ancora corpus (500,000 tokens) and an Occitan Languedocien lexicon (250,000 entries). Using the larger Catalan corpus improved results by +3% with respect to the result obtained using the only Occitan training corpus available to date (2,800 tokens). Keywords: natural language processing for under-resourced languages, Occitan, POS-tagging

    NLP for Language Varieties of Italy: Challenges and the Path Forward

    Italy is characterized by a one-of-a-kind linguistic diversity landscape in Europe, which implicitly encodes local knowledge, cultural traditions, artistic expression, and history of its speakers. However, over 30 language varieties in Italy are at risk of disappearing within few generations. Language technology has a main role in preserving endangered languages, but it currently struggles with such varieties as they are under-resourced and mostly lack standardized orthography, being mainly used in spoken settings. In this paper, we introduce the linguistic context of Italy and discuss challenges facing the development of NLP technologies for Italy's language varieties. We provide potential directions and advocate for a shift in the paradigm from machine-centric to speaker-centric NLP. Finally, we propose building a local community towards responsible, participatory development of speech and language technologies for languages and dialects of Italy.

    Tools for linguistic variation

    Índice / Index / Sommaire:
    - Introducción a los problemas y métodos según los principios de la Escuela Dialectométrica de Salzburgo (con ejemplos sacados del “Atlante Italo-Svizzero”, AIS) (Hans Goebl)
    - Some further dialectometrical stops (John Nerbonne, Jelena Prokic, Martijn Wieling and Charlotte Gooskens)
    - Tools for dialect syntax: the case of CORDIAL-SIN (an annotated corpus of Portuguese dialects) (Ernestina Carrilho)
    - Le projet Vivaldi: présentation d’un atlas linguistique parlant virtual (Roland Bauer)
    - Le Thesaurus Occitan: une base de données multimedia consacrée aux dialectes occitans (Guylaine Brun-Trigaud)
    - The Thesaurus Occitan: a multimedia database dedicated to Occitan dialects; presentation of its morphosyntax module (Pierre-Aurélien Georges)
    - New methods for the study of grammatical variation and the Audible Corpus of Spoken Rural Spanish (Inés Fernández Ordóñez)
    - The application of speech synthesis and speech recognition techniques in dialectal studies (María Pilar Perea)
    - Relevancia del análisis lingüístico en el tratamiento cuantitativo de la variación dialectal (Esteve Clua)
    - El procesamiento informático de los materiales del Atlas Lingüístico de la Península Ibérica de Tomás Navarro Tomás (Pilar García Mouton)
    - Un retrato del artículo vasco en el año 1895 mediante el programa VDM (Ekaitz Santazilia)
    - Technology for prosodic variation (Gotzon Aurrekoetxea and Aitor Iglesias)

    Le projet RESTAURE (The RESTAURE project)

    The project Ressources Informatisées et traitement automatique pour les langues régionales (RESTAURE) is funded by the ANR and was launched in January 2015 for a duration of 42 months. It has three main objectives: acquiring and standardizing resources (corpora and lexicons); developing tools for corpus acquisition and analysis; and disseminating the results to the general public. The project covers three regional languages of France: Picard, Alsatian, and Occitan. Each language is represented by a partner laboratory: LESCLAP in Amiens for Picard, LiLPa in Strasbourg for Alsatian, and CLLE-ERSS in Toulouse for Occitan. A further laboratory in the Paris region, LIMSI-CNRS, works on the natural language processing aspects. The main motivation for the project is the lack of digital resources for the regional languages of France, in particular for the three languages covered by the project.

    The future of dialects: Selected papers from Methods in Dialectology XV

    Traditional dialects have been encroached upon by the increasing mobility of their speakers and by the onslaught of national languages in education and mass media. Typically, older dialects are “leveling” to become more like national languages. This is regrettable when the last articulate traces of a culture are lost, but it also promotes a complex dynamics of interaction as speakers shift from dialect to standard and to intermediate compromises between the two in their forms of speech. Varieties of speech thus live on in modern communities, where they still function to mark provenance, but increasingly cultural and social provenance as opposed to pure geography. They arise at times from the need to function throughout the different groups in society, but they also may have roots in immigrants’ speech, and just as certainly from the ineluctable dynamics of groups wishing to express their identity to themselves and to the world. The future of dialects is a selection of the papers presented at Methods in Dialectology XV, held in Groningen, the Netherlands, 11-15 August 2014. While the focus is on methodology, the volume also includes specialized studies on varieties of Catalan, Breton, Croatian, (Belgian) Dutch, English (in the US, the UK and in Japan), German (including Swiss German), Italian (including Tyrolean Italian), Japanese, and Spanish as well as on heritage languages in Canada.
