
    ParaNMT-50M: Pushing the Limits of Paraphrastic Sentence Embeddings with Millions of Machine Translations

    We describe ParaNMT-50M, a dataset of more than 50 million English-English sentential paraphrase pairs. We generated the pairs automatically by using neural machine translation to translate the non-English side of a large parallel corpus, following Wieting et al. (2017). Our hope is that ParaNMT-50M can be a valuable resource for paraphrase generation and can provide a rich source of semantic knowledge to improve downstream natural language understanding tasks. To show its utility, we use ParaNMT-50M to train paraphrastic sentence embeddings that outperform all supervised systems on every SemEval semantic textual similarity competition, in addition to showing how it can be used for paraphrase generation.
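    The generation recipe described above can be sketched in a few lines: each (English, foreign) pair in a parallel corpus yields an English-English paraphrase pair by machine-translating the foreign side back into English. The `translate` function below is a hypothetical stand-in for a trained NMT system, and the toy corpus is illustrative only.

    ```python
    # Sketch of back-translation paraphrase mining, assuming some NMT system
    # behind `translate`; here a toy lookup table stands in for the model.

    def translate(foreign_sentence):
        # Hypothetical placeholder for a foreign-to-English NMT model.
        toy_nmt = {"Ahoj svete": "Hello world"}
        return toy_nmt.get(foreign_sentence, foreign_sentence)

    def make_paraphrase_pairs(parallel_corpus):
        """Turn (english, foreign) bitext into English-English paraphrase pairs."""
        pairs = []
        for english, foreign in parallel_corpus:
            back_translation = translate(foreign)
            if back_translation != english:  # keep only non-identical pairs
                pairs.append((english, back_translation))
        return pairs

    corpus = [("Hi world", "Ahoj svete")]
    print(make_paraphrase_pairs(corpus))  # [('Hi world', 'Hello world')]
    ```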

    Towards Phrase-based Unsupervised Machine Translation: Phrase Representations

    Existing unsupervised machine translation approaches achieve promising results, but still perform modestly compared to supervised methods. This work takes an important step towards a new research direction of phrase-based unsupervised machine translation. Since current word-based models rely on representations of words, phrase-based models require appropriate phrase representations. These representations should be learned without supervision, address phrase-specific issues such as multiword expressions, and their embedding space has to satisfy certain constraints for unsupervised translation to perform reasonably. We specify what makes phrase representations effective in the context of unsupervised machine translation, define an unsupervised compositional modeling framework for phrases, and show how to use this framework to satisfy the proposed requirements, thus obtaining effective phrase representations. We make the code and trained models publicly available as an open-source project.
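    The simplest instance of compositional phrase modeling is to build a phrase vector directly from its word vectors, e.g. by averaging. The abstract's framework is richer (it handles multiword expressions and embedding-space constraints for unsupervised MT); the toy vocabulary and mean-pooling below are illustrative assumptions only.

    ```python
    import numpy as np

    # Minimal compositional baseline: a phrase embedding as the mean of its
    # word vectors. The two-dimensional toy vectors are assumptions for
    # illustration, not learned representations.

    word_vectors = {
        "machine": np.array([1.0, 0.0]),
        "translation": np.array([0.0, 1.0]),
    }

    def phrase_embedding(phrase):
        """Compose a phrase vector by averaging its word vectors."""
        vectors = [word_vectors[w] for w in phrase.split()]
        return np.mean(vectors, axis=0)

    print(phrase_embedding("machine translation"))  # [0.5 0.5]
    ```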

    Compositional Morphology for Word Representations and Language Modelling

    This paper presents a scalable method for integrating compositional morphological representations into a vector-based probabilistic language model. Our approach is evaluated in the context of log-bilinear language models, rendered suitably efficient for implementation inside a machine translation decoder by factoring the vocabulary. We perform both intrinsic and extrinsic evaluations, presenting results on a range of languages which demonstrate that our model learns morphological representations that both perform well on word similarity tasks and lead to substantial reductions in perplexity. When used for translation into morphologically rich languages with large vocabularies, our models obtain improvements of up to 1.2 BLEU points relative to a baseline system using back-off n-gram models.
    Comment: Proceedings of the 31st International Conference on Machine Learning (ICML).
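    The additive compositional idea behind such morphological models can be sketched as: a word's representation is the sum of its surface-form vector and its morpheme vectors, so rare inflected forms share parameters with their stems. The vectors and the segmentation below are toy assumptions, not the paper's trained parameters.

    ```python
    import numpy as np

    # Sketch of additive morphological composition: word vector =
    # surface-form vector + sum of morpheme vectors. All values are
    # illustrative assumptions.

    embeddings = {
        "imperfect": np.array([0.2, 0.1]),   # surface form
        "im-": np.array([-0.1, 0.3]),        # prefix morpheme
        "perfect": np.array([0.4, -0.2]),    # stem morpheme
    }

    def word_representation(word, morphemes):
        """Compose a word vector from its surface form plus its morphemes."""
        vec = embeddings[word].copy()
        for m in morphemes:
            vec += embeddings[m]
        return vec

    print(word_representation("imperfect", ["im-", "perfect"]))  # [0.5 0.2]
    ```

    Because the stem vector is shared, an unseen inflection of a known stem still receives a meaningful representation, which is what drives the perplexity reductions on morphologically rich languages.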