Explicit versus Latent Concept Models for Cross-Language Information Retrieval
Cimiano P, Schultz A, Sizov S, Sorg P, Staab S. Explicit versus Latent Concept Models for Cross-Language Information Retrieval. In: Boutilier C, ed. IJCAI 2009, Proceedings of the 21st International Joint Conference on Artificial Intelligence. Menlo Park, CA: AAAI Press; 2009: 1513-1518
Crosslingual Document Embedding as Reduced-Rank Ridge Regression
There has recently been much interest in extending vector-based word
representations to multiple languages, such that words can be compared across
languages. In this paper, we shift the focus from words to documents and
introduce a method for embedding documents written in any language into a
single, language-independent vector space. For training, our approach leverages
a multilingual corpus where the same concept is covered in multiple languages
(but not necessarily via exact translations), such as Wikipedia. Our method,
Cr5 (Crosslingual reduced-rank ridge regression), starts by training a
ridge-regression-based classifier that uses language-specific bag-of-words
features in order to predict the concept that a given document is about. We
show that, when constraining the learned weight matrix to be of low rank, it
can be factored to obtain the desired mappings from language-specific
bags-of-words to language-independent embeddings. As opposed to most prior
methods, which use pretrained monolingual word vectors, postprocess them to
make them crosslingual, and finally average word vectors to obtain document
vectors, Cr5 is trained end-to-end and is thus natively crosslingual as well as
document-level. Moreover, since our algorithm uses the singular value
decomposition as its core operation, it is highly scalable. Experiments show
that our method achieves state-of-the-art performance on a crosslingual
document retrieval task. Finally, although not trained for embedding sentences
and words, it also achieves competitive performance on crosslingual sentence
and word retrieval tasks.
Comment: In The Twelfth ACM International Conference on Web Search and Data
Mining (WSDM '19
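The low-rank factorization described in the abstract can be illustrated with a generic reduced-rank ridge regression. The following is a minimal sketch, not the Cr5 authors' implementation: the function name `reduced_rank_ridge`, the parameters `lam` and `rank`, and the dense NumPy formulation are all assumptions made for illustration. It fits a ridge solution, takes the SVD of the fitted values on the ridge-augmented data, and factors the rank-constrained weight matrix into a feature-to-embedding map `Phi` and an embedding-to-concept map `Psi`.

```python
import numpy as np

def reduced_rank_ridge(X, Y, lam=1.0, rank=2):
    """Hypothetical helper: rank-constrained ridge regression via SVD.

    X: (n_docs, n_features) bag-of-words matrix (all vocabularies concatenated).
    Y: (n_docs, n_concepts) concept indicator matrix.
    Returns Phi (n_features, rank) and Psi (rank, n_concepts) with W = Phi @ Psi.
    """
    d = X.shape[1]
    # Unconstrained ridge solution.
    W_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
    # Ridge is ordinary least squares on data augmented with sqrt(lam) * I rows.
    X_aug = np.vstack([X, np.sqrt(lam) * np.eye(d)])
    fitted = X_aug @ W_ridge
    # Top right singular vectors of the fitted values span the best rank-r subspace.
    _, _, Vt = np.linalg.svd(fitted, full_matrices=False)
    V_r = Vt[:rank].T
    Phi = W_ridge @ V_r   # maps bag-of-words features to embeddings
    Psi = V_r.T           # maps embeddings back to concept scores
    return Phi, Psi

# Toy data (assumed): 6 documents, 7 vocabulary entries, 3 concepts.
rng = np.random.default_rng(0)
X_toy = rng.random((6, 7))
Y_toy = np.tile(np.eye(3), (2, 1))   # each concept covered by two documents
Phi, Psi = reduced_rank_ridge(X_toy, Y_toy, lam=0.5, rank=2)
doc_embeddings = X_toy @ Phi         # one 2-dimensional vector per document
```

In a crosslingual setting, the rows of `Phi` that belong to one language's vocabulary would act as that language's mapping into the shared space. That reading follows the abstract, but the block layout used here is an assumption.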
Text categorization and similarity analysis: similarity measure, architecture and design
This research investigates which similarity measure is most appropriate for a document classification problem. The goal is a method that accurately finds both semantically related and version-related documents, while remaining efficient in speed and disk usage. Simhash is found to be the measure best suited to the application, and it can be combined with other software to increase accuracy. Pingar have provided an API that extracts the entities from a document and creates a taxonomy displaying their relationships; this extra information can be used to classify input documents accurately. Two algorithms incorporating the Pingar API are designed, and finally an efficient comparison algorithm is introduced to cut down the number of comparisons required.
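For readers unfamiliar with Simhash, the following is a minimal sketch of the general technique, not the implementation from this research. The whitespace tokenization, the use of MD5 as the token hash, and the names `simhash` and `hamming_distance` are assumptions chosen for illustration. Each token votes on every bit of the fingerprint, and two documents are compared by the Hamming distance between their fingerprints.

```python
import hashlib

def simhash(text, bits=64):
    """Hypothetical helper: compute a Simhash fingerprint of `text`."""
    weights = [0] * bits
    for token in text.lower().split():
        # Take the first 8 bytes of MD5 as a 64-bit token hash (an assumption).
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on bit i, depending on its hash.
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

original = simhash("the quick brown fox jumps over the lazy dog")
revised = simhash("the quick brown fox jumped over the lazy dog")
distance = hamming_distance(original, revised)
```

Because a fingerprint is a single fixed-width integer per document, storage stays small and comparison is a cheap XOR and bit count, which matches the speed and disk-usage requirement stated above.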
Thematically Reinforced Explicit Semantic Analysis
We present an extended, thematically reinforced version of Gabrilovich and
Markovitch's Explicit Semantic Analysis (ESA), where we obtain thematic
information through the category structure of Wikipedia. For this we first
define a notion of categorical tfidf which measures the relevance of terms in
categories. Using this measure as a weight we calculate a maximal spanning tree
of the Wikipedia corpus considered as a directed graph of pages and categories.
This tree provides us with a unique path of "most related categories" between
each page and the top of the hierarchy. We reinforce tfidf of words in a page
by aggregating it with categorical tfidfs of the nodes of these paths, and
define a thematically reinforced ESA semantic relatedness measure which is more
robust than standard ESA and less sensitive to noise caused by out-of-context
words. We apply our method to the French Wikipedia corpus, evaluate it through
a text classification on a 37.5 MB corpus of 20 French newsgroups and obtain a
precision increase of 9-10% compared with standard ESA.
Comment: 13 pages, 2 figures, presented at CICLing 201
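As background, standard ESA (the baseline that the abstract extends) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' thematically reinforced variant: the category-based reinforcement is omitted, and the toy concept texts, the whitespace tokenization, and the names `build_tfidf_index`, `esa_vector`, and `cosine` are hypothetical. A text is mapped to a vector of tfidf-weighted concept scores, and relatedness is the cosine between two such vectors.

```python
import math
from collections import Counter

def build_tfidf_index(concepts):
    """Hypothetical helper: map each term to its tfidf weight per concept."""
    n = len(concepts)
    tf = {name: Counter(text.lower().split()) for name, text in concepts.items()}
    df = Counter(term for counts in tf.values() for term in counts)
    index = {}
    for name, counts in tf.items():
        for term, count in counts.items():
            index.setdefault(term, {})[name] = count * math.log(n / df[term])
    return index

def esa_vector(text, index):
    """Represent a text as a weighted vector over concepts."""
    vec = Counter()
    for term, count in Counter(text.lower().split()).items():
        for name, weight in index.get(term, {}).items():
            vec[name] += count * weight
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse concept vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy stand-ins for Wikipedia articles (assumed data).
concepts = {
    "Music": "guitar piano song melody",
    "Sport": "football goal match team",
    "Cooking": "recipe oven flour bake",
}
index = build_tfidf_index(concepts)
related = cosine(esa_vector("piano song", index), esa_vector("guitar melody", index))
unrelated = cosine(esa_vector("piano song", index), esa_vector("football team", index))
```

In this toy corpus the two music phrases share no words, yet they map to the same concept and therefore score as related, while the music and sport phrases do not.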