Crosslingual Document Embedding as Reduced-Rank Ridge Regression
There has recently been much interest in extending vector-based word
representations to multiple languages, such that words can be compared across
languages. In this paper, we shift the focus from words to documents and
introduce a method for embedding documents written in any language into a
single, language-independent vector space. For training, our approach leverages
a multilingual corpus where the same concept is covered in multiple languages
(but not necessarily via exact translations), such as Wikipedia. Our method,
Cr5 (Crosslingual reduced-rank ridge regression), starts by training a
ridge-regression-based classifier that uses language-specific bag-of-words
features in order to predict the concept that a given document is about. We
show that, when constraining the learned weight matrix to be of low rank, it
can be factored to obtain the desired mappings from language-specific
bags-of-words to language-independent embeddings. As opposed to most prior
methods, which use pretrained monolingual word vectors, postprocess them to
make them crosslingual, and finally average word vectors to obtain document
vectors, Cr5 is trained end-to-end and is thus natively crosslingual as well as
document-level. Moreover, since our algorithm uses the singular value
decomposition as its core operation, it is highly scalable. Experiments show
that our method achieves state-of-the-art performance on a crosslingual
document retrieval task. Finally, although not trained for embedding sentences
and words, it also achieves competitive performance on crosslingual sentence
and word retrieval tasks.
Comment: In The Twelfth ACM International Conference on Web Search and Data
Mining (WSDM '19)
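The low-rank factoring step described in the abstract above can be illustrated with a generic reduced-rank ridge regression. The following is a minimal sketch under stated assumptions, not the Cr5 authors' implementation: the function name `reduced_rank_ridge`, the regularization parameter `lam`, and the random toy data are all hypothetical, and the language-specific feature blocks of the real method are collapsed into a single matrix `X`. It uses the standard closed-form solution, which projects the ridge weights onto the top right singular vectors of the fitted values.

```python
import numpy as np

def reduced_rank_ridge(X, Y, rank, lam=1.0):
    """Return factors (A, B) such that A @ B is the rank-constrained ridge solution.

    X: (n_docs, n_features) bag-of-words matrix (hypothetical toy input)
    Y: (n_docs, n_concepts) concept-indicator matrix
    A: (n_features, rank) maps bag-of-words to embeddings
    B: (rank, n_concepts) maps embeddings to concept scores
    """
    d = X.shape[1]
    # Ordinary (full-rank) ridge solution.
    W_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
    # Fitted values on the augmented design [X; sqrt(lam) * I],
    # which turns the ridge problem into a least-squares problem.
    X_aug = np.vstack([X, np.sqrt(lam) * np.eye(d)])
    fitted = X_aug @ W_ridge
    # Top right singular vectors of the fitted values define the low-rank subspace.
    _, _, Vt = np.linalg.svd(fitted, full_matrices=False)
    V_r = Vt[:rank].T
    A = W_ridge @ V_r
    B = V_r.T
    return A, B

# Toy usage with random data (illustrative only).
rng = np.random.default_rng(0)
X = rng.random((20, 6))
Y = rng.random((20, 4))
A, B = reduced_rank_ridge(X, Y, rank=2)
embeddings = X @ A  # each row is a 2-dimensional document embedding
```

In this sketch the factor `A` plays the role of the mapping from bag-of-words features to the shared embedding space, so embedding a new document only requires one matrix product.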
Cross-Lingual Low-Resource Set-to-Description Retrieval for Global E-Commerce
With the growth of cross-border e-commerce, there is an urgent demand for
designing intelligent approaches for assisting e-commerce sellers to offer
local products for consumers from all over the world. In this paper, we explore
a new task of cross-lingual information retrieval, i.e., cross-lingual
set-to-description retrieval in cross-border e-commerce, which involves
matching product attribute sets in the source language with persuasive product
descriptions in the target language. We manually collect a new and high-quality
paired dataset, where each pair contains an unordered product attribute set in
the source language and an informative product description in the target
language. As the dataset construction process is both time-consuming and
costly, the new dataset comprises only 13.5k pairs, which is a low-resource
setting and can be viewed as a challenging testbed for model development and
evaluation in cross-border e-commerce. To tackle this cross-lingual
set-to-description retrieval task, we propose a novel cross-lingual matching
network (CLMN) with the enhancement of context-dependent cross-lingual mapping
upon the pre-trained monolingual BERT representations. Experimental results
indicate that our proposed CLMN yields impressive results on the challenging
task and the context-dependent cross-lingual mapping on BERT yields noticeable
improvement over the pre-trained multi-lingual BERT model.
Comment: AAAI 202
Content based recommendation in catalogues of multilingual documents
The thesis reviews how multilingual information systems and document recommender systems work, and proposes new models
- …