372 research outputs found
TectoMT â a deep-Âlinguistic core of the combined Chimera MT system
Chimera is a machine translation system that combines the TectoMT deep-linguistic core with phrase-based MT system Moses. For EnglishâCzech pair it also uses the Depfix post-correction system. All the components run on Unix/Linux platform and are open source (available from Perl repository CPAN and the LINDAT/CLARIN repository). The main website is https://ufal.mff.cuni.cz/tectomt. The development is currently supported by the QTLeap 7th FP project (http://qtleap.eu)
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)
Peer reviewe
Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond
We introduce an architecture to learn joint multilingual sentence
representations for 93 languages, belonging to more than 30 different language
families and written in 28 different scripts. Our system uses a single BiLSTM
encoder with a shared BPE vocabulary for all languages, which is coupled with
an auxiliary decoder and trained on publicly available parallel corpora. This
enables us to learn a classifier on top of the resulting sentence embeddings
using English annotated data only, and transfer it to any of the 93 languages
without any modification. Our approach sets a new state-of-the-art on zero-shot
cross-lingual natural language inference for all the 14 languages in the XNLI
dataset but one. We also achieve very competitive results in cross-lingual
document classification (MLDoc dataset). Our sentence embeddings are also
strong at parallel corpus mining, establishing a new state-of-the-art in the
BUCC shared task for 3 of its 4 language pairs. Finally, we introduce a new
test set of aligned sentences in 122 languages based on the Tatoeba corpus, and
show that our sentence embeddings obtain strong results in multilingual
similarity search even for low-resource languages. Our PyTorch implementation,
pre-trained encoder and the multilingual test set will be freely available
- âŠ