372 research outputs found

What’s in a domain? Towards fine-grained adaptation for machine translation

Author: van der Wees M.E.
Publication date: 01/01/2017

International Migration, Integration and Social Cohesion online publications

TectoMT – a deep-linguistic core of the combined Chimera MT system

Authors: Bojar Ondřej; Hajič Jan; Popel Martin; Rosa Rudolf; Sudarikov Roman
Publication date: 01/01/2016

Chimera is a machine translation system that combines the TectoMT deep-linguistic core with the phrase-based MT system Moses. For the English–Czech pair it also uses the Depfix post-correction system. All the components run on Unix/Linux platforms and are open source (available from the Perl repository CPAN and the LINDAT/CLARIN repository). The main website is https://ufal.mff.cuni.cz/tectomt. The development is currently supported by the QTLeap 7th FP project (http://qtleap.eu).

Biblio at Institute of Formal and Applied Linguistics

Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)

Publication venue: The Association for Computational Linguistics
Publication date: 01/01/2018

Peer reviewed

Helsingin yliopiston digitaalinen arkisto

Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond

Authors: Artetxe Mikel; Schwenk Holger
Publication date: 11/07/1922

We introduce an architecture to learn joint multilingual sentence representations for 93 languages, belonging to more than 30 different language families and written in 28 different scripts. Our system uses a single BiLSTM encoder with a shared BPE vocabulary for all languages, which is coupled with an auxiliary decoder and trained on publicly available parallel corpora. This enables us to learn a classifier on top of the resulting sentence embeddings using English annotated data only, and transfer it to any of the 93 languages without any modification. Our approach sets a new state of the art on zero-shot cross-lingual natural language inference for all but one of the 14 languages in the XNLI dataset. We also achieve very competitive results in cross-lingual document classification (MLDoc dataset). Our sentence embeddings are also strong at parallel corpus mining, establishing a new state of the art in the BUCC shared task for 3 of its 4 language pairs. Finally, we introduce a new test set of aligned sentences in 122 languages based on the Tatoeba corpus, and show that our sentence embeddings obtain strong results in multilingual similarity search even for low-resource languages. Our PyTorch implementation, pre-trained encoder and the multilingual test set will be freely available.

arXiv.org e-Print Archive

University of Michigan Library Repository

International Conference on Human-Informed Translation and Interpreting Technology (HiT-IT 2023) Proceedings

Authors: Corpas Pastor G.; Mitkov R.; Monti J.; Orăsan C.
Publication venue: Incoma Ltd.
Publication date: 01/01/2023

Università degli Studi di Napoli L'Orientale: CINECA IRIS