6,624 research outputs found

    An analysis of word embedding spaces and regularities

    Word embeddings are widely used in several applications due to their ability to capture semantic relationships between words as relations between vectors in high-dimensional spaces. One of the main obstacles to extracting this information is the phenomenon known as the Curse of Dimensionality: some intuitive results for well-known distances are not valid in high-dimensional contexts. In this thesis we explore the problem of distinguishing pairs of synonyms or antonyms from pairs of unrelated words using only the distance between the two words of the pair. We consider several norms and explore the problem in the two principal kinds of embeddings, GloVe and Word2Vec.
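
    As a rough illustration of comparing word pairs under several norms, the sketch below computes L1, L2, and L-infinity distances plus cosine distance for a pair of words. The three-dimensional vectors in `toy_embeddings` and the function names are made up for illustration; the abstract does not say which norms or thresholds the thesis uses, and real GloVe or Word2Vec vectors have hundreds of dimensions.

```python
import math


def minkowski_distance(u, v, p):
    """Distance induced by the L_p norm; p may be math.inf (Chebyshev)."""
    diffs = [abs(a - b) for a, b in zip(u, v)]
    if p == math.inf:
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)


def cosine_distance(u, v):
    """One minus the cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def pair_distances(word_a, word_b, embeddings):
    """Return the distance between two words under each norm considered."""
    u, v = embeddings[word_a], embeddings[word_b]
    return {
        "l1": minkowski_distance(u, v, 1),
        "l2": minkowski_distance(u, v, 2),
        "linf": minkowski_distance(u, v, math.inf),
        "cosine": cosine_distance(u, v),
    }


# Hypothetical toy vectors, NOT taken from any trained embedding.
toy_embeddings = {
    "hot": [0.9, 0.1, 0.3],
    "warm": [0.8, 0.2, 0.3],
    "lamp": [-0.2, 0.7, -0.5],
}

if __name__ == "__main__":
    print(pair_distances("hot", "warm", toy_embeddings))
    print(pair_distances("hot", "lamp", toy_embeddings))
```

    With real embeddings, one would load the trained vectors into the dictionary and compare the distance distributions of related and unrelated pairs under each norm.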

    CESI: Canonicalizing Open Knowledge Bases using Embeddings and Side Information

    Open Information Extraction (OpenIE) methods extract (noun phrase, relation phrase, noun phrase) triples from text, resulting in the construction of large Open Knowledge Bases (Open KBs). The noun phrases (NPs) and relation phrases in such Open KBs are not canonicalized, leading to the storage of redundant and ambiguous facts. Recent research has posed canonicalization of Open KBs as clustering over manually defined feature spaces. Manual feature engineering is expensive and often sub-optimal. To overcome this challenge, we propose Canonicalization using Embeddings and Side Information (CESI), a novel approach that performs canonicalization over learned embeddings of Open KBs. CESI extends recent advances in KB embedding by incorporating relevant NP and relation phrase side information in a principled manner. Through extensive experiments on multiple real-world datasets, we demonstrate CESI's effectiveness. Comment: Accepted at WWW 201

    Econometrics meets sentiment : an overview of methodology and applications

    The advent of massive amounts of textual, audio, and visual data has spurred the development of econometric methodology to transform qualitative sentiment data into quantitative sentiment variables, and to use those variables in an econometric analysis of the relationships between sentiment and other variables. We survey this emerging research field and refer to it as sentometrics, which is a portmanteau of sentiment and econometrics. We provide a synthesis of the relevant methodological approaches, illustrate with empirical results, and discuss useful software.

    Vector Representation for Sub-Graph Encoding to Resolve Entities

    Entity Resolution, i.e., determining whether two mentions refer to the same entity, is a crucial step in combining evidence from multiple sources, and is a problem encountered in a wide range of areas, from modeling causes of cancer to identifying terrorist networks. Entity mentions are represented by attributes and relations to other entities. However, entity attributes and relations from different sources often use different names and specify relationships differently, which leads to low entity resolution precision and recall. Our contribution is based on our observation that relationships are more reliable than attributes when comparison is based on relational similarity, not exact matches. Traditional graph comparison techniques rely on finding precise matches of a significant part of the graph structure, and require custom comparison functions for every type of attribute and every type of relation. This leads to a system that is difficult to maintain and enhance. We encode entity nodes and their graph neighborhoods in semantic vectors, efficiently index the vectors, and calculate vector similarity. Our approach is insensitive to small variations in relational graph representation. It uses only simple vector addition, permutation, and difference, leading to reduced computational complexity. Our preliminary experiment shows 83.05% accuracy.
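
    One way to picture encoding a graph neighborhood with vector addition and permutation is sketched below. It is an illustration under stated assumptions, not the authors' method: the random bipolar vectors, the use of cyclic rotation as the permutation, the per-relation rotation offset, and all names (`random_vector`, `encode_neighborhood`, the example relations and neighbors) are hypothetical, and the vector difference operation mentioned in the abstract is not shown.

```python
import math
import random

DIM = 512  # assumed dimensionality, chosen only for this sketch


def random_vector(name, dim=DIM):
    """Deterministic random bipolar (+1/-1) vector for a node name."""
    rng = random.Random(name)  # string seeds are reproducible across runs
    return [rng.choice((-1.0, 1.0)) for _ in range(dim)]


def relation_offset(relation, dim=DIM):
    """Assumed scheme: each relation type gets its own rotation amount."""
    return random.Random("rel:" + relation).randrange(1, dim)


def rotate(vec, k):
    """Cyclic rotation, a simple permutation of vector coordinates."""
    k %= len(vec)
    return vec[-k:] + vec[:-k]


def encode_neighborhood(edges, dim=DIM):
    """Sum the relation-permuted vectors of all neighbors of a mention.

    edges: list of (relation, neighbor_name) pairs.
    """
    total = [0.0] * dim
    for relation, neighbor in edges:
        permuted = rotate(random_vector(neighbor, dim), relation_offset(relation, dim))
        total = [t + p for t, p in zip(total, permuted)]
    return total


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Hypothetical mentions: a and b share two of three edges, c shares none.
mention_a = encode_neighborhood([("works_at", "Acme"), ("coauthor", "Lee"), ("lives_in", "Oslo")])
mention_b = encode_neighborhood([("works_at", "Acme"), ("coauthor", "Lee"), ("lives_in", "Bergen")])
mention_c = encode_neighborhood([("works_at", "Initech"), ("coauthor", "Park"), ("lives_in", "Lima")])

if __name__ == "__main__":
    print("a vs b:", cosine_similarity(mention_a, mention_b))
    print("a vs c:", cosine_similarity(mention_a, mention_c))
```

    Because the encoding is a sum, a neighborhood that differs in one edge still yields a similar vector, which matches the abstract's claim of insensitivity to small variations in the graph representation.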