598 research outputs found

    A Unified multilingual semantic representation of concepts

    Get PDF
    Semantic representation lies at the core of several applications in Natural Language Processing. However, most existing semantic representation techniques cannot be used effectively for the representation of individual word senses. We put forward a novel multilingual concept representation, called MUFFIN , which not only enables accurate representation of word senses in different languages, but also provides multiple advantages over existing approaches. MUFFIN represents a given concept in a unified semantic space irrespective of the language of interest, enabling cross-lingual comparison of different concepts. We evaluate our approach in two different evaluation benchmarks, semantic similarity and Word Sense Disambiguation, reporting state-of-the-art performance on several standard datasets

    NASARI: a novel approach to a Semantically-Aware Representation of items

    Get PDF
    The semantic representation of individual word senses and concepts is of fundamental importance to several applications in Natural Language Processing. To date, concept modeling techniques have in the main based their representation either on lexicographic resources, such as WordNet, or on encyclopedic resources, such as Wikipedia. We propose a vector representation technique that combines the complementary knowledge of both these types of resource. Thanks to its use of explicit semantics combined with a novel cluster-based dimensionality reduction and an effective weighting scheme, our representation attains state-of-the-art performance on multiple datasets in two standard benchmarks: word similarity and sense clustering. We are releasing our vector representations at http://lcl.uniroma1.it/nasari/

    Mining Meaning from Wikipedia

    Get PDF
    Wikipedia is a goldmine of information; not just for its many readers, but also for the growing community of researchers who recognize it as a resource of exceptional scale and utility. It represents a vast investment of manual effort and judgment: a huge, constantly evolving tapestry of concepts and relations that is being applied to a host of tasks. This article provides a comprehensive description of this work. It focuses on research that extracts and makes use of the concepts, relations, facts and descriptions found in Wikipedia, and organizes the work into four broad categories: applying Wikipedia to natural language processing; using it to facilitate information retrieval and information extraction; and as a resource for ontology building. The article addresses how Wikipedia is being used as is, how it is being improved and adapted, and how it is being combined with other structures to create entirely new resources. We identify the research groups and individuals involved, and how their work has developed in the last few years. We provide a comprehensive list of the open-source software they have produced.Comment: An extensive survey of re-using information in Wikipedia in natural language processing, information retrieval and extraction and ontology building. Accepted for publication in International Journal of Human-Computer Studie

    A Systematic Study of Knowledge Graph Analysis for Cross-language Plagiarism Detection

    Full text link
    This is the author’s version of a work that was accepted for publication in Information Processing and Management. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Information Processing and Management 52 (2016) 550–570. DOI 10.1016/j.ipm.2015.12.004Cross-language plagiarism detection aims to detect plagiarised fragments of text among documents in different languages. In this paper, we perform a systematic examination of Cross-language Knowledge Graph Analysis; an approach that represents text fragments using knowledge graphs as a language independent content model. We analyse the contributions to cross-language plagiarism detection of the different aspects covered by knowledge graphs: word sense disambiguation, vocabulary expansion, and representation by similarities with a collection of concepts. In addition, we study both the relevance of concepts and their relations when detecting plagiarism. Finally, as a key component of the knowledge graph construction, we present a new weighting scheme of relations between concepts based on distributed representations of concepts. Experimental results in Spanish–English and German–English plagiarism detection show state-of-the-art performance and provide interesting insights on the use of knowledge graphs. © 2015 Elsevier Ltd. All rights reserved.This research has been carried out in the framework of the European Commission WIQ-EI IRSES (No. 269180) and DIANA APPLICATIONS - Finding Hidden Knowledge in Texts: Applications (TIN2012-38603-C02-01) projects. We would like to thank Tomas Mikolov, Martin Potthast, and Luis A. Leiva for their support and comments during this research.Franco-Salvador, M.; Rosso, P.; Montes Gomez, M. (2016). A Systematic Study of Knowledge Graph Analysis for Cross-language Plagiarism Detection. Information Processing and Management. 52(4):550-570. https://doi.org/10.1016/j.ipm.2015.12.004S55057052

    Evaluation of taxonomic and neural embedding methods for calculating semantic similarity

    Full text link
    Modelling semantic similarity plays a fundamental role in lexical semantic applications. A natural way of calculating semantic similarity is to access handcrafted semantic networks, but similarity prediction can also be anticipated in a distributional vector space. Similarity calculation continues to be a challenging task, even with the latest breakthroughs in deep neural language models. We first examined popular methodologies in measuring taxonomic similarity, including edge-counting that solely employs semantic relations in a taxonomy, as well as the complex methods that estimate concept specificity. We further extrapolated three weighting factors in modelling taxonomic similarity. To study the distinct mechanisms between taxonomic and distributional similarity measures, we ran head-to-head comparisons of each measure with human similarity judgements from the perspectives of word frequency, polysemy degree and similarity intensity. Our findings suggest that without fine-tuning the uniform distance, taxonomic similarity measures can depend on the shortest path length as a prime factor to predict semantic similarity; in contrast to distributional semantics, edge-counting is free from sense distribution bias in use and can measure word similarity both literally and metaphorically; the synergy of retrofitting neural embeddings with concept relations in similarity prediction may indicate a new trend to leverage knowledge bases on transfer learning. It appears that a large gap still exists on computing semantic similarity among different ranges of word frequency, polysemous degree and similarity intensity

    Knowledge-based approaches to producing large-scale training data from scratch for Word Sense Disambiguation and Sense Distribution Learning

    Get PDF
    Communicating and understanding each other is one of the most important human abilities. As humans, in fact, we can easily assign the correct meaning to the ambiguous words in a text, while, at the same time, being able to abstract, summarise and enrich its content with new information that we learned somewhere else. On the contrary, machines rely on formal languages which do not leave space to ambiguity hence being easy to parse and understand. Therefore, to fill the gap between humans and machines and enabling the latter to better communicate with and comprehend its sentient counterpart, in the modern era of computer-science's much effort has been put into developing Natural Language Processing (NLP) approaches which aim at understanding and handling the ambiguity of the human language. At the core of NLP lies the task of correctly interpreting the meaning of each word in a given text, hence disambiguating its content exactly as a human would do. Researchers in the Word Sense Disambiguation (WSD) field address exactly this issue by leveraging either knowledge bases, i.e. graphs where nodes are concept and edges are semantic relations among them, or manually-annotated datasets for training machine learning algorithms. One common obstacle is the knowledge acquisition bottleneck problem, id est, retrieving or generating semantically-annotated data which are necessary to build both semantic graphs or training sets is a complex task. This phenomenon is even more serious when considering languages other than English where resources to generate human-annotated data are scarce and ready-made datasets are completely absent. With the advent of deep learning this issue became even more serious as more complex models need larger datasets in order to learn meaningful patterns to solve the task. Another critical issue in WSD, as well as in other machine-learning-related fields, is the domain adaptation problem, id est, performing the same task in different application domains. This is particularly hard when dealing with word senses, as, in fact, they are governed by a Zipfian distribution; hence, by slightly changing the application domain, a sense might become very frequent even though it is very rare in the general domain. For example the geometric sense of plane is very frequent in a corpus made of math books, while it is very rare in a general domain dataset. In this thesis we address both these problems. Inter alia, we focus on relieving the burden of human annotations in Word Sense Disambiguation thus enabling the automatic construction of high-quality sense-annotated dataset not only for English, but especially for other languages where sense-annotated data are not available at all. Furthermore, recognising in word-sense distribution one of the main pitfalls for WSD approaches, we also alleviate the dependency on most frequent sense information by automatically inducing the word-sense distribution in a given text of raw sentences. In the following we propose a language-independent and automatic approach to generating semantic annotations given a collection of sentences, and then introduce two methods for the automatic inference of word-sense distributions. Finally, we combine the two kind of approaches to build a semantically-annotated dataset that reflect the sense distribution which we automatically infer from the target text
    • …
    corecore