    Word Sense Determination from Wikipedia Data Using Neural Networks

    Many words have multiple meanings. For example, “plant” can mean a type of living organism or a factory. Being able to determine the sense of such words is very useful in natural language processing tasks, such as speech synthesis, question answering, and machine translation. For the project described in this report, we used a modular model to classify the sense of words to be disambiguated. This model consisted of two parts. The first part was a neural-network-based language model that computed continuous vector representations of words from data sets created from Wikipedia pages. The second part classified the meaning of the given word without explicitly knowing what the meaning is. In this unsupervised word sense determination task, we did not need human-tagged training data or a dictionary of senses for each word. We tested the model with some naturally ambiguous words, and compared our experimental results with the related work by Schütze in 1998. For some words, our model achieved accuracy similar to that of Schütze’s work.
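The two-part idea in this abstract (word vectors first, then grouping occurrences of an ambiguous word without sense labels) can be sketched as follows. This is a minimal illustration under stated assumptions, not the report's model: the 2-D vectors in `WORD_VECTORS`, the example contexts, and the small k-means routine are all hypothetical stand-ins, and the report itself does not say which grouping method it uses.

```python
# Toy illustration only: the vectors, contexts, and k-means routine below are
# assumptions made for this sketch, not the data or model from the report.
from math import dist

# Hypothetical 2-D word vectors; a real system would learn these from Wikipedia text.
WORD_VECTORS = {
    "leaf": (1.0, 0.1), "grow": (0.9, 0.2), "water": (0.8, 0.0),
    "worker": (0.1, 1.0), "steel": (0.0, 0.9), "production": (0.2, 0.8),
}


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def context_vector(context_words):
    """Average the vectors of the known words surrounding the ambiguous word."""
    return mean([WORD_VECTORS[w] for w in context_words if w in WORD_VECTORS])


def kmeans(points, k, iterations=10):
    """Plain k-means, seeded with the first k points so the result is deterministic."""
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        labels = [min(range(k), key=lambda j: dist(p, centroids[j])) for p in points]
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = mean(members)
    return labels


# Four occurrences of "plant", each reduced to its neighbouring words.
contexts = [
    ["leaf", "grow"],
    ["worker", "steel"],
    ["water", "leaf"],
    ["production", "worker"],
]
labels = kmeans([context_vector(c) for c in contexts], k=2)
print(labels)  # [0, 1, 0, 1]
```

The cluster ids carry no meaning of their own: the sketch only separates the "organism" contexts from the "factory" contexts, which mirrors the abstract's point that no sense dictionary or tagged data is required.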

    NASARI: a novel approach to a Semantically-Aware Representation of items

    The semantic representation of individual word senses and concepts is of fundamental importance to several applications in Natural Language Processing. To date, concept modeling techniques have in the main based their representation either on lexicographic resources, such as WordNet, or on encyclopedic resources, such as Wikipedia. We propose a vector representation technique that combines the complementary knowledge of both these types of resource. Thanks to its use of explicit semantics combined with a novel cluster-based dimensionality reduction and an effective weighting scheme, our representation attains state-of-the-art performance on multiple datasets in two standard benchmarks: word similarity and sense clustering. We are releasing our vector representations at http://lcl.uniroma1.it/nasari/

    Clustering documents with active learning using Wikipedia

    Wikipedia has been applied as a background knowledge base to various text mining problems, but very few attempts have been made to utilize it for document clustering. In this paper we propose to exploit the semantic knowledge in Wikipedia for clustering, enabling the automatic grouping of documents with similar themes. Although clustering is intrinsically unsupervised, recent research has shown that incorporating supervision improves clustering performance, even when limited supervision is provided. The approach presented in this paper applies supervision using active learning. We first utilize Wikipedia to create a concept-based representation of a text document, with each concept associated with a Wikipedia article. We then exploit the semantic relatedness between Wikipedia concepts to find pair-wise instance-level constraints for supervised clustering, guiding clustering in the direction indicated by the constraints. We test our approach on three standard text document datasets. Empirical results show that our basic document representation strategy yields performance comparable to previous attempts, and that adding constraints improves clustering performance further by up to 20%.

    Embedding Words and Senses Together via Joint Knowledge-Enhanced Training

    Word embeddings are widely used in Natural Language Processing, mainly due to their success in capturing semantic information from massive corpora. However, their creation process does not allow the different meanings of a word to be automatically separated, as it conflates them into a single vector. We address this issue by proposing a new model which learns word and sense embeddings jointly. Our model exploits large corpora and knowledge from semantic networks in order to produce a unified vector space of word and sense embeddings. We evaluate the main features of our approach both qualitatively and quantitatively in a variety of tasks, highlighting the advantages of the proposed method in comparison to state-of-the-art word- and sense-based models.

    From Word to Sense Embeddings: A Survey on Vector Representations of Meaning

    Over the past years, distributed semantic representations have proved to be effective and flexible keepers of prior knowledge to be integrated into downstream applications. This survey focuses on the representation of meaning. We start from the theoretical background behind word vector space models and highlight one of their major limitations: the meaning conflation deficiency, which arises from representing a word with all its possible meanings as a single vector. Then, we explain how this deficiency can be addressed through a transition from the word level to the more fine-grained level of word senses (in its broader acceptation) as a method for modelling unambiguous lexical meaning. We present a comprehensive overview of the wide range of techniques in the two main branches of sense representation, i.e., unsupervised and knowledge-based. Finally, this survey covers the main evaluation procedures and applications for this type of representation, and provides an analysis of four of its important aspects: interpretability, sense granularity, adaptability to different domains and compositionality. Comment: 46 pages, 8 figures. Published in Journal of Artificial Intelligence Research

    Large scale homophily analysis in twitter using a twixonomy

    In this paper we perform a large-scale homophily analysis on Twitter using a hierarchical representation of users' interests which we call a Twixonomy. In order to build a population, community, or single-user Twixonomy we first associate "topical" friends in users' friendship lists (i.e. friends representing an interest rather than a social relation between peers) with Wikipedia categories. A word sense disambiguation algorithm is used to select the appropriate wikipage for each topical friend. Starting from the set of wikipages representing "primitive" interests, we extract all paths connecting these pages with topmost Wikipedia category nodes, and we then prune the resulting graph G efficiently so as to induce a directed acyclic graph. This graph is the Twixonomy. Then, to analyze homophily, we compare different methods to detect communities in a peer friends Twitter network, and then for each community we compute the degree of homophily on the basis of a measure of pairwise semantic similarity. We show that the Twixonomy provides a means for describing users' interests in a compact and readable way and allows for a fine-grained homophily analysis. Furthermore, we show that mid-low level categories in the Twixonomy represent the best balance between informativeness and compactness of the representation.
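The pruning step mentioned in this abstract, turning a page-to-category graph into an acyclic one, can be illustrated with a depth-first search that drops every edge pointing back to a node still being explored. The abstract does not say which pruning algorithm the authors use, so this is only one common way to do it; the function name `prune_to_dag` and the small category graph are hypothetical.

```python
# Sketch under assumptions: the paper's actual pruning method is not given in
# the abstract. Removing DFS back edges is one standard way to break cycles.

def prune_to_dag(graph):
    """Return a copy of `graph` (node -> list of successors) without DFS back edges."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = {node: WHITE for node in graph}
    for targets in graph.values():
        for target in targets:
            color.setdefault(target, WHITE)  # include nodes that only appear as targets
    pruned = {node: [] for node in color}

    def visit(node):
        color[node] = GREY
        for child in graph.get(node, []):
            if color[child] == GREY:
                continue  # back edge: it would close a cycle, so drop it
            pruned[node].append(child)
            if color[child] == WHITE:
                visit(child)
        color[node] = BLACK

    for node in list(color):
        if color[node] == WHITE:
            visit(node)
    return pruned


# Hypothetical fragment of a category graph with a cycle between two categories.
categories = {
    "Jazz musicians": ["Musicians"],
    "Musicians": ["Music", "People"],
    "Music": ["Culture"],
    "Culture": ["Music"],  # closes the cycle Music -> Culture -> Music
}
dag = prune_to_dag(categories)
print(dag["Culture"])  # []
```

Which edge of a cycle gets removed depends on the traversal order, so a real pipeline would likely need a rule for choosing the edge to cut. The recursive visit is also only suitable for small graphs; a graph at Wikipedia scale would call for an iterative traversal.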