Massively Multilingual Document Alignment with Cross-lingual Sentence-Mover's Distance
Cross-lingual document alignment aims to identify pairs of documents in two
distinct languages that are of comparable content or translations of each
other. Such aligned data can be used for a variety of NLP tasks from training
cross-lingual representations to mining parallel bitexts for machine
translation training. In this paper we develop an unsupervised scoring function
that leverages cross-lingual sentence embeddings to compute the semantic
distance between documents in different languages. These semantic distances are
then used to guide a document alignment algorithm to properly pair
cross-lingual web documents across a variety of low, mid, and high-resource
language pairs. Recognizing that our proposed scoring function and other
state-of-the-art methods are computationally intractable for long web documents, we
utilize a more tractable greedy algorithm that performs comparably. We
experimentally demonstrate that our distance metric yields more accurate
alignment than current baselines, outperforming them by 7% on high-resource
language pairs, 15% on mid-resource language pairs, and 22% on low-resource
language pairs.
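
The two ideas in the abstract, a mover's-style distance over sentence embeddings and a greedy procedure that keeps it tractable, can be illustrated with a minimal Python sketch. This is one plausible reading, not the authors' implementation. It assumes sentence embeddings have already been computed by some cross-lingual encoder, gives every sentence uniform mass, and uses Euclidean distance between embeddings. The function names greedy_movers_distance and greedy_align are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def greedy_movers_distance(doc_a, doc_b):
    """Greedy approximation of a mover's distance between two documents.

    Each document is a non-empty list of sentence embeddings.
    Assumption: every sentence carries uniform mass 1/len(doc).
    """
    mass_a = [1.0 / len(doc_a)] * len(doc_a)
    mass_b = [1.0 / len(doc_b)] * len(doc_b)

    # Visit sentence pairs from closest to farthest.
    pairs = sorted(
        (euclidean(u, v), i, j)
        for i, u in enumerate(doc_a)
        for j, v in enumerate(doc_b)
    )

    total_cost = 0.0
    for dist, i, j in pairs:
        # Move as much remaining mass as this pair still allows.
        flow = min(mass_a[i], mass_b[j])
        if flow <= 0.0:
            continue
        mass_a[i] -= flow
        mass_b[j] -= flow
        total_cost += flow * dist
    return total_cost


def greedy_align(source_docs, target_docs):
    """Pair documents one-to-one, taking the closest unused pair first.

    source_docs and target_docs map a document id to its list of
    sentence embeddings. Returns a list of (source_id, target_id).
    """
    scored = sorted(
        (greedy_movers_distance(s, t), s_id, t_id)
        for s_id, s in source_docs.items()
        for t_id, t in target_docs.items()
    )
    used_sources, used_targets, aligned = set(), set(), []
    for _, s_id, t_id in scored:
        if s_id in used_sources or t_id in used_targets:
            continue
        used_sources.add(s_id)
        used_targets.add(t_id)
        aligned.append((s_id, t_id))
    return aligned
```

Sorting the sentence pairs dominates the cost of the distance computation, which avoids solving a full optimal-transport problem for every candidate document pair. The greedy result is an upper bound on the exact transport cost rather than the exact value.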