2,931 research outputs found
Thematically Reinforced Explicit Semantic Analysis
We present an extended, thematically reinforced version of Gabrilovich and
Markovitch's Explicit Semantic Analysis (ESA), where we obtain thematic
information through the category structure of Wikipedia. For this we first
define a notion of categorical tfidf which measures the relevance of terms in
categories. Using this measure as a weight we calculate a maximal spanning tree
of the Wikipedia corpus considered as a directed graph of pages and categories.
This tree provides us with a unique path of "most related categories" between
each page and the top of the hierarchy. We reinforce tfidf of words in a page
by aggregating it with categorical tfidfs of the nodes of these paths, and
define a thematically reinforced ESA semantic relatedness measure which is more
robust than standard ESA and less sensitive to noise caused by out-of-context
words. We apply our method to the French Wikipedia corpus, evaluate it through
a text classification on a 37.5 MB corpus of 20 French newsgroups and obtain a
precision increase of 9-10% compared with standard ESA.Comment: 13 pages, 2 figures, presented at CICLing 201
Knowledge-rich Image Gist Understanding Beyond Literal Meaning
We investigate the problem of understanding the message (gist) conveyed by
images and their captions as found, for instance, on websites or news articles.
To this end, we propose a methodology to capture the meaning of image-caption
pairs on the basis of large amounts of machine-readable knowledge that has
previously been shown to be highly effective for text understanding. Our method
identifies the connotation of objects beyond their denotation: where most
approaches to image understanding focus on the denotation of objects, i.e.,
their literal meaning, our work addresses the identification of connotations,
i.e., iconic meanings of objects, to understand the message of images. We view
image understanding as the task of representing an image-caption pair on the
basis of a wide-coverage vocabulary of concepts such as the one provided by
Wikipedia, and cast gist detection as a concept-ranking problem with
image-caption pairs as queries. To enable a thorough investigation of the
problem of gist understanding, we produce a gold standard of over 300
image-caption pairs and over 8,000 gist annotations covering a wide variety of
topics at different levels of abstraction. We use this dataset to
experimentally benchmark the contribution of signals from heterogeneous
sources, namely image and text. The best result with a Mean Average Precision
(MAP) of 0.69 indicate that by combining both dimensions we are able to better
understand the meaning of our image-caption pairs than when using language or
vision information alone. We test the robustness of our gist detection approach
when receiving automatically generated input, i.e., using automatically
generated image tags or generated captions, and prove the feasibility of an
end-to-end automated process
Semantic Sort: A Supervised Approach to Personalized Semantic Relatedness
We propose and study a novel supervised approach to learning statistical
semantic relatedness models from subjectively annotated training examples. The
proposed semantic model consists of parameterized co-occurrence statistics
associated with textual units of a large background knowledge corpus. We
present an efficient algorithm for learning such semantic models from a
training sample of relatedness preferences. Our method is corpus independent
and can essentially rely on any sufficiently large (unstructured) collection of
coherent texts. Moreover, the approach facilitates the fitting of semantic
models for specific users or groups of users. We present the results of
extensive range of experiments from small to large scale, indicating that the
proposed method is effective and competitive with the state-of-the-art.Comment: 37 pages, 8 figures A short version of this paper was already
published at ECML/PKDD 201
The MeSH-gram Neural Network Model: Extending Word Embedding Vectors with MeSH Concepts for UMLS Semantic Similarity and Relatedness in the Biomedical Domain
Eliciting semantic similarity between concepts in the biomedical domain
remains a challenging task. Recent approaches founded on embedding vectors have
gained in popularity as they risen to efficiently capture semantic
relationships The underlying idea is that two words that have close meaning
gather similar contexts. In this study, we propose a new neural network model
named MeSH-gram which relies on a straighforward approach that extends the
skip-gram neural network model by considering MeSH (Medical Subject Headings)
descriptors instead words. Trained on publicly available corpus PubMed MEDLINE,
MeSH-gram is evaluated on reference standards manually annotated for semantic
similarity. MeSH-gram is first compared to skip-gram with vectors of size 300
and at several windows contexts. A deeper comparison is performed with tewenty
existing models. All the obtained results of Spearman's rank correlations
between human scores and computed similarities show that MeSH-gram outperforms
the skip-gram model, and is comparable to the best methods but that need more
computation and external resources.Comment: 6 pages, 2 table
A Unified multilingual semantic representation of concepts
Semantic representation lies at the core of several applications in Natural Language Processing. However, most existing semantic representation techniques cannot be used effectively for the representation of individual word senses. We put forward a novel multilingual concept representation, called MUFFIN , which not only enables accurate representation of word senses in different languages, but also provides multiple advantages over existing approaches. MUFFIN represents a given concept in a unified semantic space irrespective of the language of interest, enabling cross-lingual comparison of different concepts. We evaluate our approach in two different evaluation benchmarks, semantic similarity and Word Sense Disambiguation, reporting state-of-the-art performance on several standard datasets
- …