2,009 research outputs found
Cross-lingual Word Clusters for Direct Transfer of Linguistic Structure
It has been established that incorporating word cluster features derived from large unlabeled corpora can significantly improve prediction of linguistic structure. While previous work has focused primarily on English, we extend these results to other languages along two dimensions. First, we show that these results hold true for a number of languages across families. Second, and more interestingly, we provide an algorithm for inducing cross-lingual clusters and we show that features derived from these clusters significantly improve the accuracy of cross-lingual structure prediction. Specifically, we show that by augmenting direct-transfer systems with cross-lingual cluster features, the relative error of delexicalized dependency parsers, trained on English treebanks and transferred to foreign languages, can be reduced by up to 13%. When applying the same method to direct transfer of named-entity recognizers, we observe relative improvements of up to 26%.
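As a rough illustration of the idea, the sketch below builds delexicalized token features augmented with cross-lingual cluster IDs, so a model trained on English features can be applied directly to another language. The file format, feature names, helper functions, and toy data are hypothetical, not the authors' implementation.

```python
# Minimal sketch, assuming a word -> cross-lingual cluster-ID table is available.
# Names and data below are illustrative, not taken from the paper.

def load_clusters(path):
    """Read a word -> cluster-ID mapping, one 'cluster<TAB>word' pair per line."""
    clusters = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            cluster_id, word = line.rstrip("\n").split("\t")
            clusters[word] = cluster_id
    return clusters

def token_features(sent_pos, sent_words, i, clusters):
    """Delexicalized features for token i: POS context plus cluster features.

    Word forms themselves are never used as features, so a system trained on
    English can be transferred directly; the cross-lingual cluster IDs add
    lexical signal that is shared across languages.
    """
    feats = {
        "pos": sent_pos[i],
        "pos-1": sent_pos[i - 1] if i > 0 else "<S>",
        "pos+1": sent_pos[i + 1] if i + 1 < len(sent_pos) else "</S>",
        "cluster": clusters.get(sent_words[i], "<UNK>"),
    }
    if i > 0:
        feats["cluster-1"] = clusters.get(sent_words[i - 1], "<UNK>")
    return feats

# Toy usage with an invented cluster mapping shared across languages:
clusters = {"perro": "C17", "dog": "C17", "el": "C03", "the": "C03"}
feats = token_features(["DET", "NOUN"], ["el", "perro"], 1, clusters)
```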
DOCmT5: Document-Level Pretraining of Multilingual Language Models
In this paper, we introduce DOCmT5, a multilingual sequence-to-sequence language model pretrained on large-scale parallel documents. While previous approaches have focused on leveraging sentence-level parallel data, we aim to build a general-purpose pretrained model that can understand and generate long documents. We propose a simple and effective pretraining objective, Document reordering Machine Translation (DrMT), in which shuffled and masked input documents must be translated. DrMT brings consistent improvements over strong baselines on a variety of document-level generation tasks, including over 12 BLEU points for seen-language-pair document-level MT, over 7 BLEU points for unseen-language-pair document-level MT, and over 3 ROUGE-1 points for seen-language-pair cross-lingual summarization. We achieve state-of-the-art (SOTA) results on the WMT20 De-En and IWSLT15 Zh-En document translation tasks. We also conduct extensive analysis of various factors in document pretraining, including (1) the effects of pretraining data quality and (2) the effects of combining monolingual and cross-lingual pretraining. We plan to make our model checkpoints publicly available.
Comment: NAACL 2022 Findings
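To make the DrMT objective concrete, here is a minimal sketch of how one training example might be constructed from a parallel document pair: the source sentences are shuffled and partially masked, and the target is the parallel document in its original order. The mask token, sentence-level granularity, and masking rate are assumptions, not the exact recipe from the paper.

```python
# Illustrative sketch of a DrMT-style example builder; all constants are assumptions.
import random

MASK = "<mask>"

def make_drmt_example(src_sents, tgt_sents, mask_rate=0.15, seed=None):
    """Return (input_text, target_text) for one parallel document pair."""
    rng = random.Random(seed)
    # Shuffle the source sentences so the model must also recover document order.
    order = list(range(len(src_sents)))
    rng.shuffle(order)
    shuffled = [src_sents[i] for i in order]
    # Replace a fraction of the shuffled source sentences with a mask token.
    corrupted = [MASK if rng.random() < mask_rate else s for s in shuffled]
    # Target: the translated document in its original order.
    return " ".join(corrupted), " ".join(tgt_sents)

# Toy usage with a two-sentence "document" (invented data):
src = ["Guten Morgen.", "Wie geht es dir?"]
tgt = ["Good morning.", "How are you?"]
inp, out = make_drmt_example(src, tgt, seed=0)
```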