    Digital Resilience: Harnessing the Power of the Collective for Language Preservation. A Success Story for Catalan

    The widespread growth of Artificial Intelligence (AI) has made language technology more accessible than ever before, bringing it into our daily lives. Nevertheless, the rapid development of language technology intrinsically carries a bias towards Anglo-centric and Euro-centric perspectives, leading to limited representation and recognition for minoritized languages. It is a fact that the online presence of a language ensures its survival. This article explores the success story of Catalan, a minoritized language with an active online community, which has laid the foundations for language technology development. The Catalan-speaking community’s success story demonstrates how communities can play a significant role in the survival and evolution of minoritized languages in the digital age.

    to post-edit or to translate ... That is the question: a case study of a recommender system for Quality Estimation of Machine Translation based on linguistic features

    Putting a machine translation system into production does not by itself guarantee its efficient use. One also needs to know when it is profitable to use machine translation as opposed to translating from scratch, which is why being able to estimate the quality of a machine translation is crucial. This thesis investigates the task of quality estimation of machine translation for a specific machine translation system and a specific domain by developing a recommender system for Spanish to English. The work further investigates how quality estimation can benefit from the use of linguistic characteristics in contrast to the more common shallower features. The data was collected from real translators who performed a post-editing task, and the linguistic features were manually annotated. First, we build a classification model that selects sentences for post-editing or translating. Secondly, we perform a regression task based on three quality indicators: Quality, Time and HTER. Although experimentation shows some promising results, overall the selected features are not discriminative enough for the recommender system to be implemented into production. Results are discussed at different levels, suggesting a replication at a larger scale, with automatic annotation of informative linguistic features.
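HTER, one of the three quality indicators named in the abstract above, is commonly defined as the number of edits needed to turn a machine translation into its post-edited version, divided by the length of the post-edited version. The sketch below is a simplified illustration and not the thesis's implementation: it counts only word-level insertions, deletions and substitutions (full TER also allows block shifts), and the names `word_edit_distance` and `approx_hter` are hypothetical.

```python
def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    # prev[j] holds the distance between the hypothesis prefix seen so far
    # and the first j reference words.
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def approx_hter(mt_output, post_edited):
    """Approximate HTER: edits divided by post-edited length (no shifts)."""
    hyp = mt_output.split()
    ref = post_edited.split()
    if not ref:
        raise ValueError("post-edited text must contain at least one word")
    return word_edit_distance(hyp, ref) / len(ref)


# One missing word against six post-edited words gives about 0.167.
score = approx_hter("the cat sat on mat", "the cat sat on the mat")
```

A lower score means the translator changed less of the machine output, so a sentence with a low value is a plausible candidate for post-editing rather than translation from scratch.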

    Transfer Learning with Shallow Decoders: BSC at WMT2021’s Multilingual Low-Resource Translation for Indo-European Languages Shared Task

    This paper describes the participation of the BSC team in the WMT2021's Multilingual Low-Resource Translation for Indo-European Languages Shared Task. The system aims to solve the Subtask 2: Wikipedia cultural heritage articles, which involves translation in four Romance languages: Catalan, Italian, Occitan and Romanian. The submitted system is a multilingual semi-supervised machine translation model. It is based on a pre-trained language model, namely XLM-RoBERTa, that is later fine-tuned with parallel data obtained mostly from OPUS. Unlike other works, we only use XLM to initialize the encoder and randomly initialize a shallow decoder. The reported results are robust and perform well for all tested languages. Postprint (author's final draft).

    Unsupervised Feature Selection for Effective Parallel Corpus Filtering

    This work presents an unsupervised method of selecting filters and threshold values for the OpusFilter parallel corpus cleaning toolbox. The method clusters sentence pairs into noisy and clean categories and uses the features of the noisy cluster center as filtering parameters. Our approach utilizes feature importance analysis to disregard filters that do not differentiate between clean and noisy data. A randomly sampled subset of a given corpus is used for filter selection and ineffective filters are not run for the full corpus. We use a set of automatic evaluation metrics to assess the quality of translation models trained with data filtered by our method and data filtered with OpusFilter’s default parameters. The trained models cover English-German and English-Ukrainian in both directions. The proposed method outperforms the default parameters in all translation directions for almost all evaluation metrics. Peer reviewed.
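The clustering step described in the abstract above can be illustrated with a small sketch. It is not the authors' code and does not call OpusFilter: it assumes each sentence pair has already been scored by a few filters, that every score lies in [0, 1] with higher meaning cleaner, and that a plain two-cluster k-means is an acceptable stand-in for the clustering used in the paper. The feature importance analysis is omitted, and all function names are hypothetical.

```python
from statistics import mean


def two_means(points, iterations=20):
    """Cluster score vectors into two groups with plain k-means (k=2)."""
    # Deterministic start (an assumption made for reproducibility):
    # seed the centers with the lowest- and highest-scoring points.
    ordered = sorted(points, key=sum)
    centers = [list(ordered[0]), list(ordered[-1])]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest center (squared Euclidean distance).
        labels = [
            min((0, 1),
                key=lambda k: sum((x - c) ** 2 for x, c in zip(p, centers[k])))
            for p in points
        ]
        # Move each center to the mean of the points assigned to it.
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = [mean(col) for col in zip(*members)]
    return centers, labels


def thresholds_from_noisy_center(points):
    """Return the center of the lower-scoring (noisy) cluster as thresholds."""
    centers, _ = two_means(points)
    return min(centers, key=sum)


def keep_pair(scores, thresholds):
    """Keep a sentence pair only if every score exceeds its threshold."""
    return all(s > t for s, t in zip(scores, thresholds))


# Hypothetical scores for six sentence pairs from two hypothetical filters.
sample = [(0.9, 0.95), (0.85, 0.9), (0.95, 0.9),
          (0.2, 0.3), (0.1, 0.2), (0.3, 0.1)]
limits = thresholds_from_noisy_center(sample)
kept = [p for p in sample if keep_pair(p, limits)]
```

Because the thresholds come from a sample rather than from the full corpus, the same `limits` could then be applied to the remaining sentence pairs without clustering them again, which mirrors the sampling idea in the abstract.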

    Four Approaches to Low-Resource Multilingual NMT : The Helsinki Submission to the AmericasNLP 2023 Shared Task

    The Helsinki-NLP team participated in the AmericasNLP 2023 Shared Task with 6 submissions for all 11 language pairs arising from 4 different multilingual systems. We provide a detailed look at the work that went into collecting and preprocessing the data that led to our submissions. We explore various setups for multilingual Neural Machine Translation (NMT), namely knowledge distillation and transfer learning, multilingual NMT including a high-resource language (English), language-specific fine-tuning, and multilingual NMT exclusively using low-resource data. Our multilingual Model B ranks first in 4 out of the 11 language pairs. Peer reviewed.