Semantic Sort: A Supervised Approach to Personalized Semantic Relatedness
We propose and study a novel supervised approach to learning statistical
semantic relatedness models from subjectively annotated training examples. The
proposed semantic model consists of parameterized co-occurrence statistics
associated with textual units of a large background knowledge corpus. We
present an efficient algorithm for learning such semantic models from a
training sample of relatedness preferences. Our method is corpus independent
and can essentially rely on any sufficiently large (unstructured) collection of
coherent texts. Moreover, the approach facilitates the fitting of semantic
models for specific users or groups of users. We present the results of
an extensive range of experiments, from small to large scale, indicating that the
proposed method is effective and competitive with the state of the art.
Comment: 37 pages, 8 figures. A short version of this paper was already
published at ECML/PKDD 201
A Systematic Study of Knowledge Graph Analysis for Cross-language Plagiarism Detection
This is the author's version of a work that was accepted for publication in Information Processing and Management. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Information Processing and Management 52 (2016) 550-570. DOI 10.1016/j.ipm.2015.12.004
Cross-language plagiarism detection aims to detect plagiarised fragments of text among
documents in different languages. In this paper, we perform a systematic examination of
Cross-language Knowledge Graph Analysis; an approach that represents text fragments using
knowledge graphs as a language independent content model. We analyse the contributions
to cross-language plagiarism detection of the different aspects covered by knowledge
graphs: word sense disambiguation, vocabulary expansion, and representation by similarities
with a collection of concepts. In addition, we study both the relevance of concepts and
their relations when detecting plagiarism. Finally, as a key component of the knowledge
graph construction, we present a new weighting scheme of relations between concepts
based on distributed representations of concepts. Experimental results in Spanish-English
and German-English plagiarism detection show state-of-the-art performance and provide
interesting insights on the use of knowledge graphs.
© 2015 Elsevier Ltd. All rights reserved.
This research has been carried out in the framework of the European Commission WIQ-EI IRSES (No. 269180) and DIANA APPLICATIONS - Finding Hidden Knowledge in Texts: Applications (TIN2012-38603-C02-01) projects. We would like to thank Tomas Mikolov, Martin Potthast, and Luis A. Leiva for their support and comments during this research.
Franco-Salvador, M.; Rosso, P.; Montes Gomez, M. (2016). A Systematic Study of Knowledge Graph Analysis for Cross-language Plagiarism Detection. Information Processing and Management. 52(4):550-570. https://doi.org/10.1016/j.ipm.2015.12.004
SimLex-999: Evaluating Semantic Models with (Genuine) Similarity Estimation
We present SimLex-999, a gold standard resource for evaluating distributional
semantic models that improves on existing resources in several important ways.
First, in contrast to gold standards such as WordSim-353 and MEN, it explicitly
quantifies similarity rather than association or relatedness, so that pairs of
entities that are associated but not actually similar [Freud, psychology] have
a low rating. We show that, via this focus on similarity, SimLex-999
incentivizes the development of models with a different, and arguably wider,
range of applications than those which reflect conceptual association. Second,
SimLex-999 contains a range of concrete and abstract adjective, noun and verb
pairs, together with an independent rating of concreteness and (free)
association strength for each pair. This diversity enables fine-grained
analyses of the performance of models on concepts of different types, and
consequently greater insight into how architectures can be improved. Further,
unlike existing gold standard evaluations, for which automatic approaches have
reached or surpassed the inter-annotator agreement ceiling, state-of-the-art
models perform well below this ceiling on SimLex-999. There is therefore plenty
of scope for SimLex-999 to quantify future improvements to distributional
semantic models, guiding the development of the next generation of
representation-learning architectures.
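As a rough illustration of how a distributional model might be scored against a gold standard of this kind, the sketch below compares cosine similarities from toy word vectors with human ratings using Spearman's rank correlation. The abstract does not name the correlation measure, so Spearman's rho is an assumption here; the vectors and ratings are invented for the example and are not taken from SimLex-999.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical toy vectors and ratings (assumptions, not real data).
vectors = {
    "cup": [1.0, 0.0, 0.0],
    "mug": [0.9, 0.1, 0.0],
    "coffee": [0.5, 0.5, 0.0],
    "bus": [0.0, 0.0, 1.0],
}
gold = [("cup", "mug", 9.0), ("cup", "coffee", 3.0), ("cup", "bus", 0.5)]

model_scores = [cosine(vectors[a], vectors[b]) for a, b, _ in gold]
human_scores = [rating for _, _, rating in gold]
rho = spearman(model_scores, human_scores)
print(f"Spearman rho on toy pairs: {rho:.3f}")
```

A rank correlation is a natural fit for this kind of comparison because only the ordering of pairs matters, not the scale of the model's similarity scores.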
Size Matters: The Impact of Training Size in Taxonomically-Enriched Word Embeddings
Word embeddings trained on natural corpora (e.g., newspaper collections, Wikipedia or the Web) excel in capturing thematic similarity ("topical relatedness") on word pairs such as "coffee" and "cup" or "bus" and "road". However, they are less successful on pairs showing taxonomic similarity, like "cup" and "mug" (near synonyms) or "bus" and "train" (types of public transport). Moreover, purely taxonomy-based embeddings (e.g. those trained on a random-walk of WordNet's structure) outperform natural-corpus embeddings in taxonomic similarity but underperform them in thematic similarity. Previous work suggests that performance gains in both types of similarity can be achieved by enriching natural-corpus embeddings with taxonomic information from taxonomies like WordNet. This taxonomic enrichment can be done by combining natural-corpus embeddings with taxonomic embeddings (e.g. those trained on a random-walk of WordNet's structure). This paper conducts a deep analysis of this assumption and shows that both the size of the natural corpus and of the random-walk coverage of the WordNet structure play a crucial role in the performance of combined (enriched) vectors in both similarity tasks. Specifically, we show that embeddings trained on medium-sized natural corpora benefit the most from taxonomic enrichment whilst embeddings trained on large natural corpora only benefit from this enrichment when evaluated on taxonomic similarity tasks. The implication of this is that care has to be taken in controlling the size of the natural corpus and the size of the random-walk used to train vectors. In addition, we find that, whilst the WordNet structure is finite and it is possible to fully traverse it in a single pass, the repetition of well-connected WordNet concepts in extended random-walks effectively reinforces taxonomic relations in the learned embeddings.
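The abstract does not say how the two kinds of embeddings are combined. One simple possibility, assumed here purely for illustration, is to L2-normalize each vector and concatenate them. The sketch below uses invented two-dimensional vectors to show how a pair that is dissimilar in the corpus space can become more similar once taxonomic dimensions are appended.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def combine(corpus_vec, taxonomy_vec):
    """Assumed enrichment scheme: concatenate the two normalized vectors."""
    return l2_normalize(corpus_vec) + l2_normalize(taxonomy_vec)

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical toy embeddings (assumptions, not trained vectors).
corpus = {"cup": [1.0, 0.0], "mug": [0.0, 1.0]}    # far apart in corpus space
taxonomy = {"cup": [1.0, 0.0], "mug": [1.0, 0.0]}  # identical in taxonomy space

enriched = {word: combine(corpus[word], taxonomy[word]) for word in corpus}

sim_corpus = cosine(corpus["cup"], corpus["mug"])
sim_enriched = cosine(enriched["cup"], enriched["mug"])
print(f"corpus only: {sim_corpus:.2f}, enriched: {sim_enriched:.2f}")
```

Normalizing before concatenation gives both sources equal weight in the cosine; other schemes (weighted concatenation, averaging, or retrofitting) would behave differently.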
Semantics and statistics for automated image annotation
Automated image annotation consists of a number of techniques that aim to find the correlation between words and image features such as colour, shape, and texture to provide correct annotation words to images. In particular, approaches based on Bayesian theory use machine-learning techniques to learn statistical models from a training set of pre-annotated images and apply them to generate annotations for unseen images.
The focus of this thesis lies in demonstrating that an approach, which goes beyond learning the statistical correlation between words and visual features and also exploits information about the actual semantics of the words used in the annotation process, is able to improve the performance of probabilistic annotation systems. Specifically, I present three experiments. Firstly, I introduce a novel approach that automatically refines the annotation words generated by a non-parametric density estimation model using semantic relatedness measures. Initially, I consider semantic measures based on co-occurrence of words in the training set. However, this approach can exhibit limitations, as its performance depends on the quality and coverage provided by the training data. For this reason, I devise an alternative solution that combines semantic measures based on knowledge sources, such as WordNet and Wikipedia, with word co-occurrence in the training set and on the web, to achieve statistically significant results over the baseline. Secondly, I investigate the effect of using semantic measures inside an evaluation measure that computes the performance of an automated image annotation system, whose annotation words adopt the hierarchical structure of an ontology. This is the case of the ImageCLEF2009 collection. Finally, I propose a Markov Random Field that exploits the semantic context dependencies of the image. The best result obtains a mean average precision of 0.32, which is consistent with the state-of-the-art in automated image annotation for the Corel 5k dataset.
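For readers unfamiliar with the mean average precision figure quoted above, the following minimal sketch shows one common way the metric is computed over ranked retrieval lists. The thesis's exact evaluation protocol is not given in the abstract, so this formulation, along with the query words and image identifiers, is an assumption for illustration only.

```python
def average_precision(ranked, relevant):
    """Mean of precision@k over the positions k where a relevant item occurs."""
    hits = 0
    total = 0.0
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(queries):
    """Average the per-query average precision over all queries."""
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)

# Hypothetical queries: (images ranked by the system, truly relevant images).
queries = [
    (["img1", "img2", "img3"], {"img1", "img3"}),  # e.g. annotation word "sky"
    (["img2", "img1", "img3"], {"img1"}),          # e.g. annotation word "tiger"
]

map_score = mean_average_precision(queries)
print(f"MAP on toy queries: {map_score:.4f}")
```

Here the first query scores (1/1 + 2/3) / 2 and the second scores 1/2, so the mean is 2/3.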