8,173 research outputs found
Semantic Sort: A Supervised Approach to Personalized Semantic Relatedness
We propose and study a novel supervised approach to learning statistical
semantic relatedness models from subjectively annotated training examples. The
proposed semantic model consists of parameterized co-occurrence statistics
associated with textual units of a large background knowledge corpus. We
present an efficient algorithm for learning such semantic models from a
training sample of relatedness preferences. Our method is corpus independent
and can essentially rely on any sufficiently large (unstructured) collection of
coherent texts. Moreover, the approach facilitates the fitting of semantic
models for specific users or groups of users. We present the results of
extensive range of experiments from small to large scale, indicating that the
proposed method is effective and competitive with the state-of-the-art.Comment: 37 pages, 8 figures A short version of this paper was already
published at ECML/PKDD 201
Automated Detection of Non-Relevant Posts on the Russian Imageboard "2ch": Importance of the Choice of Word Representations
This study considers the problem of automated detection of non-relevant posts
on Web forums and discusses the approach of resolving this problem by
approximation it with the task of detection of semantic relatedness between the
given post and the opening post of the forum discussion thread. The
approximated task could be resolved through learning the supervised classifier
with a composed word embeddings of two posts. Considering that the success in
this task could be quite sensitive to the choice of word representations, we
propose a comparison of the performance of different word embedding models. We
train 7 models (Word2Vec, Glove, Word2Vec-f, Wang2Vec, AdaGram, FastText,
Swivel), evaluate embeddings produced by them on dataset of human judgements
and compare their performance on the task of non-relevant posts detection. To
make the comparison, we propose a dataset of semantic relatedness with posts
from one of the most popular Russian Web forums, imageboard "2ch", which has
challenging lexical and grammatical features.Comment: 6 pages, 1 figure, 1 table, main proceedings of AIST-2017 (Analysis
of Images, Social Networks, and Texts
Knowledge-rich Image Gist Understanding Beyond Literal Meaning
We investigate the problem of understanding the message (gist) conveyed by
images and their captions as found, for instance, on websites or news articles.
To this end, we propose a methodology to capture the meaning of image-caption
pairs on the basis of large amounts of machine-readable knowledge that has
previously been shown to be highly effective for text understanding. Our method
identifies the connotation of objects beyond their denotation: where most
approaches to image understanding focus on the denotation of objects, i.e.,
their literal meaning, our work addresses the identification of connotations,
i.e., iconic meanings of objects, to understand the message of images. We view
image understanding as the task of representing an image-caption pair on the
basis of a wide-coverage vocabulary of concepts such as the one provided by
Wikipedia, and cast gist detection as a concept-ranking problem with
image-caption pairs as queries. To enable a thorough investigation of the
problem of gist understanding, we produce a gold standard of over 300
image-caption pairs and over 8,000 gist annotations covering a wide variety of
topics at different levels of abstraction. We use this dataset to
experimentally benchmark the contribution of signals from heterogeneous
sources, namely image and text. The best result with a Mean Average Precision
(MAP) of 0.69 indicate that by combining both dimensions we are able to better
understand the meaning of our image-caption pairs than when using language or
vision information alone. We test the robustness of our gist detection approach
when receiving automatically generated input, i.e., using automatically
generated image tags or generated captions, and prove the feasibility of an
end-to-end automated process
Similarity Models in Distributional Semantics using Task Specific Information
In distributional semantics, the unsupervised learning approach has been widely used for a large number of tasks. On the other hand, supervised learning has less coverage.
In this dissertation, we investigate the supervised learning approach for semantic relatedness tasks in distributional semantics. The investigation considers mainly semantic similarity and semantic classification tasks. Existing and newly-constructed datasets are used as an input for the experiments. The new datasets are constructed from thesauruses like Eurovoc. The Eurovoc thesaurus is a multilingual thesaurus maintained by the Publications Office of the European Union. The meaning of the words in the dataset is represented by using a distributional semantic approach.
The distributional semantic approach collects co-occurrence information from large texts and represents the words in high-dimensional vectors. The English words are represented by using UkWaK corpus while German words are represented by using DeWaC corpus. After representing each word by the high dimensional vector, different supervised machine learning methods are used on the selected tasks. The outputs from the supervised machine learning methods are evaluated by comparing the tasks performance and accuracy with the state of the art unsupervised machine learning methods’ results. In addition, multi-relational matrix factorization is introduced as one supervised learning method in distributional semantics. This dissertation shows the multi-relational matrix factorization method as a good alternative method to integrate different sources of information of words in distributional semantics.
In the dissertation, some new applications are also introduced. One of the applications is an application which analyzes a German company’s website text, and provides information about the company with a concept cloud visualization. The other applications are automatic recognition/disambiguation of the library of congress subject headings and automatic identification of synonym relations in the Dutch Parliament thesaurus applications
- …