1,684 research outputs found

    Chinese-Catalan: A neural machine translation approach based on pivoting and attention mechanisms

    This article innovatively addresses machine translation from Chinese to Catalan using neural pivot strategies trained without any direct parallel data. Catalan is linguistically very similar to Spanish, which motivates the use of Spanish as the pivot language. For the neural architecture, we use the current state of the art, the Transformer model, which is based solely on attention mechanisms. Additionally, this work provides a new resource to the community: a human-developed gold standard of 4,000 sentences between Catalan and Chinese and all the other United Nations official languages (Arabic, English, French, Russian, and Spanish). Results show that the standard pseudo-corpus (synthetic) pivot approach performs better than the cascade approach. Peer reviewed. Postprint (author's final draft).
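
The two pivot strategies named in this abstract can be illustrated with a minimal sketch. It assumes toy word-by-word "translators" built from hypothetical dictionaries (`ZH_ES`, `ES_CA`); the paper itself uses Transformer models, and none of the names below come from it. In the cascade approach, two systems are chained at test time; in the pseudo-corpus approach, the pivot side of a source-pivot corpus is translated once to create synthetic source-target training pairs.

```python
from typing import Callable, Dict, List, Tuple

Translator = Callable[[str], str]

def word_translator(table: Dict[str, str]) -> Translator:
    """Build a toy translator that maps whitespace-separated words via a
    lookup table. Unknown words pass through unchanged. Stand-in for a real
    NMT system (assumption for illustration only)."""
    def translate(sentence: str) -> str:
        return " ".join(table.get(word, word) for word in sentence.split())
    return translate

def cascade_translate(src_to_pivot: Translator,
                      pivot_to_tgt: Translator,
                      sentence: str) -> str:
    """Cascade: translate source -> pivot, then pivot -> target, at test time."""
    return pivot_to_tgt(src_to_pivot(sentence))

def build_pseudo_corpus(src_pivot_pairs: List[Tuple[str, str]],
                        pivot_to_tgt: Translator) -> List[Tuple[str, str]]:
    """Pseudo-corpus: translate the pivot side of a source-pivot corpus into
    the target language, yielding synthetic source-target pairs on which a
    direct source -> target system could then be trained."""
    return [(src, pivot_to_tgt(pivot)) for src, pivot in src_pivot_pairs]

# Hypothetical toy dictionaries (not real resources).
ZH_ES = {"你好": "hola", "世界": "mundo"}
ES_CA = {"hola": "hola", "mundo": "món"}

zh_to_es = word_translator(ZH_ES)
es_to_ca = word_translator(ES_CA)

cascade_output = cascade_translate(zh_to_es, es_to_ca, "你好 世界")
pseudo_corpus = build_pseudo_corpus([("你好 世界", "hola mundo")], es_to_ca)
```

In the cascade setting, errors from the first system feed into the second at every request, whereas the pseudo-corpus setting pays the pivot-to-target translation cost once, when the synthetic training data is built.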

    Why Catalan-Spanish Neural Machine Translation? Analysis, comparison and combination with standard Rule and Phrase-based technologies

    Catalan and Spanish are two related languages, given that both derive from Latin. They share similarities at several linguistic levels, including morphology, syntax and semantics. This makes them particularly interesting for the MT task. Given the recent appearance and popularity of neural MT, this paper analyzes the performance of this new approach compared to the well-established rule-based and phrase-based MT systems. Experiments are reported on a large database of 180 million words. Results, in terms of standard automatic measures, show that neural MT clearly outperforms the rule-based and phrase-based MT systems on the in-domain test set, but it is worse on the out-of-domain test set. A naive system combination works especially well for the latter. In-domain manual analysis shows that neural MT tends to improve both adequacy and fluency, for example, by generating more natural translations instead of literal ones, choosing the adequate target word when the source word has several translations, and improving gender agreement. However, out-of-domain manual analysis shows that neural MT is more affected by unknown words or contexts. Postprint (published version).

    Comparing rule-based and data-driven approaches to Spanish-to-Basque machine translation

    In this paper, we compare the rule-based and data-driven approaches in the context of Spanish-to-Basque Machine Translation. The rule-based system we consider has been developed specifically for Spanish-to-Basque machine translation, and is tuned to this language pair. In contrast, the data-driven system we use is generic, and has not been specifically designed to deal with Basque. Spanish-to-Basque Machine Translation is a challenge for data-driven approaches for at least two reasons. First, there is a lack of bilingual data on which a data-driven MT system can be trained. Second, Basque is a morphologically rich agglutinative language, and translating into Basque requires generating a large amount of morphological information, a difficult task for a generic system not specifically tuned to Basque. We present the results of a series of experiments, obtained on two different corpora, one being “in-domain” and the other “out-of-domain” with respect to the data-driven system. We show that n-gram-based automatic evaluation and edit-distance-based human evaluation yield two different sets of results. According to BLEU, the data-driven system outperforms the rule-based system on the in-domain data, while according to the human evaluation, the rule-based approach achieves higher scores for both corpora.

    English-Catalan Neural Machine Translation in the Biomedical Domain through the cascade approach

    This paper describes the methodology followed to build a neural machine translation system in the biomedical domain for the English-Catalan language pair. This task can be considered low-resource from the point of view of both the domain and the language pair. To address it, this paper reports experiments on a cascade pivot strategy through Spanish for neural machine translation, using the English-Spanish SCIELO and Spanish-Catalan El Periódico databases. To test the final performance of the system, we have created a new test data set for English-Catalan in the biomedical domain, which is freely available on request. Comment: Full workshop proceedings can be found at https://multilingualbio.bsc.es/wp-content/uploads/2018/03/LREC-2018-PROCEEDINGS-MultilingualBIO.pd

    Consumer Eroski parallel corpus

    This paper introduces the Consumer Eroski Parallel Corpus, a collection of articles originally written in Spanish and later translated into three languages also spoken in Spain: Basque, Catalan and Galician. The articles have been aligned automatically across the four languages at the sentence level using Moore's bilingual sentence alignment tool (2002). The Spanish section is also annotated morphosyntactically for parts of speech using SVMtool (Giménez and Márquez 2004). The Basque, Catalan and Galician sections may be annotated in a future release with the collaboration of Computational Linguistics Groups in Spain. To my knowledge, the Consumer Eroski Parallel Corpus is the first resource that encompasses a substantial body of parallel text from these four languages spoken in Spain. I would like to thank the Eroski Foundation for granting permission to share the corpus in the public domain. Making this resource public will provide additional opportunities to test, train and develop natural language processing tools in the computational linguistics community. It may also help translators as a reference. With the addition of an advanced search interface, currently under development, the corpus may be consulted by Basque and Romance linguists interested in cross-linguistic research.

    Byte-based neural machine translation

    This paper presents experiments comparing character-based and byte-based neural machine translation systems. The main motivation of the byte-based neural machine translation system is to build multilingual neural machine translation systems that can share the same vocabulary. We compare the performance of both systems in several language pairs and we see that the performance in test is similar for most language pairs while the training time is slightly reduced in the case of byte-based neural machine translation. Postprint (author's final draft).
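
The shared-vocabulary motivation can be illustrated with a minimal sketch, assuming UTF-8 as the byte encoding (the abstract does not say which encoding or tokenizer the authors use, and the function names below are hypothetical). A character vocabulary grows with every new script, while a byte vocabulary never exceeds 256 symbols, at the cost of longer sequences for non-ASCII text.

```python
from typing import List, Set

def char_tokens(text: str) -> List[str]:
    """Character-level tokens: one token per Unicode code point."""
    return list(text)

def byte_tokens(text: str) -> List[int]:
    """Byte-level tokens: one token per UTF-8 byte (values 0..255)."""
    return list(text.encode("utf-8"))

def char_vocab(sentences: List[str]) -> Set[str]:
    """Character vocabulary: grows as new scripts and symbols appear."""
    return {ch for sentence in sentences for ch in sentence}

def byte_vocab(sentences: List[str]) -> Set[int]:
    """Byte vocabulary: bounded by 256 entries regardless of language."""
    return {b for sentence in sentences for b in sentence.encode("utf-8")}

catalan = "món"    # 3 characters; 'ó' takes 2 bytes in UTF-8
chinese = "你好"    # 2 characters; each takes 3 bytes in UTF-8

catalan_bytes = byte_tokens(catalan)
chinese_bytes = byte_tokens(chinese)

# Byte sequences decode back to the original text without loss.
round_trip = bytes(chinese_bytes).decode("utf-8")
```

Here `catalan_bytes` has four entries for three characters and `chinese_bytes` has six entries for two characters, which shows the trade-off: a fixed, language-independent vocabulary in exchange for longer input and output sequences.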

    Recent advances in Apertium, a free/open-source rule-based machine translation platform for low-resource languages

    This paper presents an overview of Apertium, a free and open-source rule-based machine translation platform. Translation in Apertium happens through a pipeline of modular tools, and the platform continues to be improved as more language pairs are added. Several advances have been implemented since the last publication, including some new optional modules: a module that allows rules to process recursive structures at the structural transfer stage, a module that deals with contiguous and discontiguous multi-word expressions, and a module that resolves anaphora to aid translation. Also highlighted is the hybridisation of Apertium through statistical modules that augment the pipeline, and statistical methods that augment existing modules. This includes morphological disambiguation, weighted structural transfer, and lexical selection modules that learn from limited data. The paper also discusses how a platform like Apertium can be a critical part of access to language technology for so-called low-resource languages, which might be ignored or deemed unapproachable by popular corpus-based translation technologies. Finally, the paper presents some of the released and unreleased language pairs, concluding with a brief look at some supplementary Apertium tools that prove valuable to users as well as language developers. All Apertium-related code, including language data, is free/open-source and available at https://github.com/apertium