637 research outputs found

    A generic open world named entity disambiguation approach for tweets

    Get PDF
    Social media is a rich source of information. To make use of this information, it is sometimes required to extract and disambiguate named entities. In this paper we focus on named entity disambiguation (NED) in twitter messages. NED in tweets is challenging in two ways. First, the limited length of Tweet makes it hard to have enough context while many disambiguation techniques depend on it. The second is that many named entities in tweets do not exist in a knowledge base (KB). In this paper we share ideas from information retrieval (IR) and NED to propose solutions for both challenges. For the first problem we make use of the gregarious nature of tweets to get enough context needed for disambiguation. For the second problem we look for an alternative home page if there is no Wikipedia page represents the entity. Given a mention, we obtain a list of Wikipedia candidates from YAGO KB in addition to top ranked pages from Google search engine. We use Support Vector Machine (SVM) to rank the candidate pages to find the best representative entities. Experiments conducted on two data sets show better disambiguation results compared with the baselines and a competitor

    WikiM: Metapaths based Wikification of Scientific Abstracts

    Full text link
    In order to disseminate the exponential extent of knowledge being produced in the form of scientific publications, it would be best to design mechanisms that connect it with already existing rich repository of concepts -- the Wikipedia. Not only does it make scientific reading simple and easy (by connecting the involved concepts used in the scientific articles to their Wikipedia explanations) but also improves the overall quality of the article. In this paper, we present a novel metapath based method, WikiM, to efficiently wikify scientific abstracts -- a topic that has been rarely investigated in the literature. One of the prime motivations for this work comes from the observation that, wikified abstracts of scientific documents help a reader to decide better, in comparison to the plain abstracts, whether (s)he would be interested to read the full article. We perform mention extraction mostly through traditional tf-idf measures coupled with a set of smart filters. The entity linking heavily leverages on the rich citation and author publication networks. Our observation is that various metapaths defined over these networks can significantly enhance the overall performance of the system. For mention extraction and entity linking, we outperform most of the competing state-of-the-art techniques by a large margin arriving at precision values of 72.42% and 73.8% respectively over a dataset from the ACL Anthology Network. In order to establish the robustness of our scheme, we wikify three other datasets and get precision values of 63.41%-94.03% and 67.67%-73.29% respectively for the mention extraction and the entity linking phase

    Entity-Oriented Search

    Get PDF
    This open access book covers all facets of entity-oriented search—where “search” can be interpreted in the broadest sense of information access—from a unified point of view, and provides a coherent and comprehensive overview of the state of the art. It represents the first synthesis of research in this broad and rapidly developing area. Selected topics are discussed in-depth, the goal being to establish fundamental techniques and methods as a basis for future research and development. Additional topics are treated at a survey level only, containing numerous pointers to the relevant literature. A roadmap for future research, based on open issues and challenges identified along the way, rounds out the book. The book is divided into three main parts, sandwiched between introductory and concluding chapters. The first two chapters introduce readers to the basic concepts, provide an overview of entity-oriented search tasks, and present the various types and sources of data that will be used throughout the book. Part I deals with the core task of entity ranking: given a textual query, possibly enriched with additional elements or structural hints, return a ranked list of entities. This core task is examined in a number of different variants, using both structured and unstructured data collections, and numerous query formulations. In turn, Part II is devoted to the role of entities in bridging unstructured and structured data. Part III explores how entities can enable search engines to understand the concepts, meaning, and intent behind the query that the user enters into the search box, and how they can provide rich and focused responses (as opposed to merely a list of documents)—a process known as semantic search. The final chapter concludes the book by discussing the limitations of current approaches, and suggesting directions for future research. Researchers and graduate students are the primary target audience of this book. A general background in information retrieval is sufficient to follow the material, including an understanding of basic probability and statistics concepts as well as a basic knowledge of machine learning concepts and supervised learning algorithms

    Web knowledge bases

    Get PDF
    Knowledge is key to natural language understanding. References to specific people, places and things in text are crucial to resolving ambiguity and extracting meaning. Knowledge Bases (KBs) codify this information for automated systems — enabling applications such as entity-based search and question answering. This thesis explores the idea that sites on the web may act as a KB, even if that is not their primary intent. Dedicated kbs like Wikipedia are a rich source of entity information, but are built and maintained at an ongoing cost in human effort. As a result, they are generally limited in terms of the breadth and depth of knowledge they index about entities. Web knowledge bases offer a distributed solution to the problem of aggregating entity knowledge. Social networks aggregate content about people, news sites describe events with tags for organizations and locations, and a diverse assortment of web directories aggregate statistics and summaries for long-tail entities notable within niche movie, musical and sporting domains. We aim to develop the potential of these resources for both web-centric entity Information Extraction (IE) and structured KB population. We first investigate the problem of Named Entity Linking (NEL), where systems must resolve ambiguous mentions of entities in text to their corresponding node in a structured KB. We demonstrate that entity disambiguation models derived from inbound web links to Wikipedia are able to complement and in some cases completely replace the role of resources typically derived from the KB. Building on this work, we observe that any page on the web which reliably disambiguates inbound web links may act as an aggregation point for entity knowledge. To uncover these resources, we formalize the task of Web Knowledge Base Discovery (KBD) and develop a system to automatically infer the existence of KB-like endpoints on the web. While extending our framework to multiple KBs increases the breadth of available entity knowledge, we must still consolidate references to the same entity across different web KBs. We investigate this task of Cross-KB Coreference Resolution (KB-Coref) and develop models for efficiently clustering coreferent endpoints across web-scale document collections. Finally, assessing the gap between unstructured web knowledge resources and those of a typical KB, we develop a neural machine translation approach which transforms entity knowledge between unstructured textual mentions and traditional KB structures. The web has great potential as a source of entity knowledge. In this thesis we aim to first discover, distill and finally transform this knowledge into forms which will ultimately be useful in downstream language understanding tasks

    Automatic Gloss Finding for a Knowledge Base using Ontological Constraints

    Full text link
    While there has been much research on automatically construct-ing structured Knowledge Bases (KBs), most of it has focused on generating facts to populate a KB. However, a useful KB must go beyond facts. For example, glosses (short natural language defi-nitions) have been found to be very useful in tasks such as Word Sense Disambiguation. However, the important problem of Auto-matic Gloss Finding, i.e., assigning glosses to entities in an ini-tially gloss-free KB, is relatively unexplored. We address that gap in this paper. In particular, we propose GLOFIN, a hierarchical semi-supervised learning algorithm for this problem which makes effective use of limited amounts of supervision and available onto-logical constraints. To the best of our knowledge, GLOFIN is the first system for this task. Through extensive experiments on real-world datasets, we demon-strate GLOFIN’s effectiveness. It is encouraging to see that GLOFIN outperforms other state-of-the-art SSL algorithms, especially in low supervision settings. We also demonstrate GLOFIN’s robustness to noise through experiments on a wide variety of KBs, ranging from user contributed (e.g., Freebase) to automatically constructed (e.g., NELL). To facilitate further research in this area, we have already made the datasets and code used in this paper publicly available. 1

    Grounding event references in news

    Get PDF
    Events are frequently discussed in natural language, and their accurate identification is central to language understanding. Yet they are diverse and complex in ontology and reference; computational processing hence proves challenging. News provides a shared basis for communication by reporting events. We perform several studies into news event reference. One annotation study characterises each news report in terms of its update and topic events, but finds that topic is better consider through explicit references to background events. In this context, we propose the event linking task which—analogous to named entity linking or disambiguation—models the grounding of references to notable events. It defines the disambiguation of an event reference as a link to the archival article that first reports it. When two references are linked to the same article, they need not be references to the same event. Event linking hopes to provide an intuitive approximation to coreference, erring on the side of over-generation in contrast with the literature. The task is also distinguished in considering event references from multiple perspectives over time. We diagnostically evaluate the task by first linking references to past, newsworthy events in news and opinion pieces to an archive of the Sydney Morning Herald. The intensive annotation results in only a small corpus of 229 distinct links. However, we observe that a number of hyperlinks targeting online news correspond to event links. We thus acquire two large corpora of hyperlinks at very low cost. From these we learn weights for temporal and term overlap features in a retrieval system. These noisy data lead to significant performance gains over a bag-of-words baseline. While our initial system can accurately predict many event links, most will require deep linguistic processing for their disambiguation

    Rapid Exploitation and Analysis of Documents

    Full text link

    Entity-centric knowledge discovery for idiosyncratic domains

    Get PDF
    Technical and scientific knowledge is produced at an ever-accelerating pace, leading to increasing issues when trying to automatically organize or process it, e.g., when searching for relevant prior work. Knowledge can today be produced both in unstructured (plain text) and structured (metadata or linked data) forms. However, unstructured content is still themost dominant formused to represent scientific knowledge. In order to facilitate the extraction and discovery of relevant content, new automated and scalable methods for processing, structuring and organizing scientific knowledge are called for. In this context, a number of applications are emerging, ranging fromNamed Entity Recognition (NER) and Entity Linking tools for scientific papers to specific platforms leveraging information extraction techniques to organize scientific knowledge. In this thesis, we tackle the tasks of Entity Recognition, Disambiguation and Linking in idiosyncratic domains with an emphasis on scientific literature. Furthermore, we study the related task of co-reference resolution with a specific focus on named entities. We start by exploring Named Entity Recognition, a task that aims to identify the boundaries of named entities in textual contents. We propose a newmethod to generate candidate named entities based on n-gram collocation statistics and design several entity recognition features to further classify them. In addition, we show how the use of external knowledge bases (either domain-specific like DBLP or generic like DBPedia) can be leveraged to improve the effectiveness of NER for idiosyncratic domains. Subsequently, we move to Entity Disambiguation, which is typically performed after entity recognition in order to link an entity to a knowledge base. We propose novel semi-supervised methods for word disambiguation leveraging the structure of a community-based ontology of scientific concepts. Our approach exploits the graph structure that connects different terms and their definitions to automatically identify the correct sense that was originally picked by the authors of a scientific publication. We then turn to co-reference resolution, a task aiming at identifying entities that appear using various forms throughout the text. We propose an approach to type entities leveraging an inverted index built on top of a knowledge base, and to subsequently re-assign entities based on the semantic relatedness of the introduced types. Finally, we describe an application which goal is to help researchers discover and manage scientific publications. We focus on the problem of selecting relevant tags to organize collections of research papers in that context. We experimentally demonstrate that the use of a community-authored ontology together with information about the position of the concepts in the documents allows to significantly increase the precision of tag selection over standard methods
    • …
    corecore