
    MAG: A Multilingual, Knowledge-base Agnostic and Deterministic Entity Linking Approach

    Entity linking has recently been the subject of a significant body of research. Currently, the best performing approaches rely on trained mono-lingual models. Porting these approaches to other languages is consequently a difficult endeavor, as it requires corresponding training data and retraining of the models. We address this drawback by presenting a novel multilingual, knowledge-base agnostic and deterministic approach to entity linking, dubbed MAG. MAG is based on a combination of context-based retrieval on structured knowledge bases and graph algorithms. We evaluate MAG on 23 datasets in 7 languages. Our results show that the best approach trained on English datasets (PBOH) achieves a micro F-measure that is up to 4 times worse on datasets in other languages. MAG, on the other hand, achieves state-of-the-art performance on English datasets and reaches a micro F-measure that is up to 0.6 higher than that of PBOH on non-English languages. Comment: Accepted at K-CAP 2017: Knowledge Capture Conference.
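
Below is a minimal sketch of the pooled micro F-measure used in evaluations like the one above. The per-document sets of (mention, entity) pairs and the example entity IDs are hypothetical, not taken from the MAG paper.

```python
# Minimal sketch: micro F-measure pools true/false positives over all
# documents before computing F1, so large documents weigh more.

def micro_f_measure(gold_by_doc, pred_by_doc):
    """Pool counts over all documents, then compute F1."""
    tp = fp = fn = 0
    for doc_id, gold in gold_by_doc.items():
        pred = pred_by_doc.get(doc_id, set())
        tp += len(gold & pred)   # correctly linked mentions
        fp += len(pred - gold)   # spurious links
        fn += len(gold - pred)   # missed links
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: two gold links, one predicted correctly.
gold = {"d1": {("Berlin", "dbr:Berlin"), ("Kant", "dbr:Immanuel_Kant")}}
pred = {"d1": {("Berlin", "dbr:Berlin"), ("Kant", "dbr:Kant_(crater)")}}
print(micro_f_measure(gold, pred))  # 0.5
```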

    Concept and entity grounding using indirect supervision

    Extracting and disambiguating entities and concepts is a crucial step toward understanding natural language text. In this thesis, we consider the problem of grounding concepts and entities mentioned in text to one or more knowledge bases (KBs). A well-studied scenario of this problem is the one in which documents are given in English and the goal is to identify concept and entity mentions and find the corresponding Wikipedia entries the mentions refer to. We extend this problem in two directions: first, we study identifying and grounding entities written in any language to the English Wikipedia; second, we investigate using multiple KBs which do not contain the rich textual and structural information that Wikipedia does. These more involved settings pose additional challenges beyond those addressed in the standard English Wikification problem. Key among them is that no supervision is available to facilitate training machine learning models. The first extension, cross-lingual Wikification, introduces problems such as recognizing multilingual named entities mentioned in text, translating non-English names into English, and computing word similarity across languages. Since it is impossible to acquire manually annotated examples for all languages, building models for all languages in Wikipedia requires exploiting indirect or incidental supervision signals which already exist in Wikipedia. For the second setting, we need to deal with the fact that most KBs do not contain the rich information Wikipedia has; consequently, the main supervision signal used to train Wikification rankers no longer exists. In this thesis, we show that supervision signals can be obtained by carefully examining the redundancy and relations between multiple KBs. By developing algorithms and models which harvest these incidental signals, we can achieve better performance on these tasks.
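
One incidental supervision signal mentioned above is Wikipedia's own inter-language link structure, which maps non-English article titles to their English counterparts. The sketch below shows how such links could serve as a candidate generator for cross-lingual grounding; the TSV format and file layout are illustrative assumptions, not the thesis's actual pipeline.

```python
# Hedged sketch: inter-language links give candidate English entities
# for a foreign-language mention "for free", with no labeled data.

from collections import defaultdict

def load_interlanguage_links(path):
    """Each line (assumed format): <lang>\t<foreign_title>\t<english_title>."""
    index = defaultdict(set)
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            lang, foreign, english = line.rstrip("\n").split("\t")
            index[(lang, foreign)].add(english)
    return index

def english_candidates(index, lang, mention):
    """Ground a foreign-language mention to English Wikipedia titles."""
    return index.get((lang, mention), set())

# e.g. english_candidates(index, "de", "Vereinte Nationen")
# might yield {"United_Nations"} without any annotated training examples.
```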

    Mining Meaning from Wikipedia

    Wikipedia is a goldmine of information; not just for its many readers, but also for the growing community of researchers who recognize it as a resource of exceptional scale and utility. It represents a vast investment of manual effort and judgment: a huge, constantly evolving tapestry of concepts and relations that is being applied to a host of tasks. This article provides a comprehensive description of this work. It focuses on research that extracts and makes use of the concepts, relations, facts and descriptions found in Wikipedia, and organizes the work into four broad categories: applying Wikipedia to natural language processing; using it to facilitate information retrieval; using it for information extraction; and using it as a resource for ontology building. The article addresses how Wikipedia is being used as is, how it is being improved and adapted, and how it is being combined with other structures to create entirely new resources. We identify the research groups and individuals involved, and how their work has developed in the last few years. We provide a comprehensive list of the open-source software they have produced. Comment: An extensive survey of re-using information in Wikipedia in natural language processing, information retrieval and extraction, and ontology building. Accepted for publication in International Journal of Human-Computer Studies.
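
As a small illustration of the kind of extraction this survey covers, the sketch below harvests wikilinks from raw wikitext; their anchor texts provide alternative surface labels for concepts. The regex covers only the common [[Target]] and [[Target|anchor]] forms, so it is a simplification rather than a full wikitext parser.

```python
# Sketch: mine (target_article, anchor_text) pairs from wikitext markup.

import re

WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:\|([^\[\]]+))?\]\]")

def extract_links(wikitext):
    """Yield (target_article, anchor_text) pairs from wikilink markup."""
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        anchor = (match.group(2) or target).strip()  # anchor defaults to target
        yield target, anchor

text = "The [[United Nations|UN]] was founded after [[World War II]]."
print(list(extract_links(text)))
# [('United Nations', 'UN'), ('World War II', 'World War II')]
```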

    Retrieval and Disambiguation of Mathematical Expressions for Mathematical Information Access

    Degree type: Doctoral degree (course-based). Dissertation committee: (Chair) Associate Professor Tetsuo Shibuya (The University of Tokyo); Professor Masami Hagiya (The University of Tokyo); Associate Professor Ichiro Hasuo (The University of Tokyo); Associate Professor Yoshimasa Tsuruoka (The University of Tokyo); Associate Professor Atsushi Fujii (Tokyo Institute of Technology). The University of Tokyo (東京大学).

    Applying Wikipedia to Interactive Information Retrieval

    There are many opportunities to improve the interactivity of information retrieval systems beyond the ubiquitous search box. One idea is to use knowledge bases (e.g. controlled vocabularies, classification schemes, thesauri and ontologies) to organize, describe and navigate the information space. These resources are popular in libraries and specialist collections, but have proven too expensive and narrow to be applied to everyday web-scale search. Wikipedia has the potential to bring structured knowledge into more widespread use. This online, collaboratively generated encyclopaedia is one of the largest and most consulted reference works in existence. It is broader, deeper and more agile than the knowledge bases put forward to assist retrieval in the past. Rendering this resource machine-readable is a challenging task that has captured the interest of many researchers. Many see it as a key step required to break the knowledge acquisition bottleneck that crippled previous efforts. This thesis claims that the roadblock can be sidestepped: Wikipedia can be applied effectively to open-domain information retrieval with minimal natural language processing or information extraction. The key is to focus on gathering and applying human-readable rather than machine-readable knowledge. To demonstrate this claim, the thesis tackles three separate problems: extracting knowledge from Wikipedia; connecting it to textual documents; and applying it to the retrieval process. First, we demonstrate that a large thesaurus-like structure can be obtained directly from Wikipedia, and that accurate measures of semantic relatedness can be efficiently mined from it. Second, we show that Wikipedia provides the necessary features and training data for existing data mining techniques to accurately detect and disambiguate topics when they are mentioned in plain text. Third, we provide two systems and user studies that demonstrate the utility of the Wikipedia-derived knowledge base for interactive information retrieval.
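
The link-based semantic relatedness mentioned above can be illustrated with a measure in the style of the Normalized Google Distance, computed over article in-links: two articles are related to the extent that other articles link to both. The function below is a hedged sketch under that assumption, not the thesis's exact formulation.

```python
# Sketch: relatedness from shared in-links, Normalized Google Distance style.
# Assumes total_articles is much larger than any single in-link set.

import math

def link_relatedness(inlinks_a, inlinks_b, total_articles):
    """Return relatedness in [0, 1]; 1 means identical link neighborhoods."""
    shared = len(inlinks_a & inlinks_b)
    if shared == 0:
        return 0.0
    distance = (
        math.log(max(len(inlinks_a), len(inlinks_b))) - math.log(shared)
    ) / (math.log(total_articles) - math.log(min(len(inlinks_a), len(inlinks_b))))
    return max(0.0, 1.0 - distance)

# Hypothetical article IDs linking to two topics.
a = {1, 2, 3, 4}      # articles linking to "Automobile"
b = {3, 4, 5, 6}      # articles linking to "Global warming"
print(link_relatedness(a, b, total_articles=1_000_000))
```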

    Harnessing sense-level information for semantically augmented knowledge extraction

    Nowadays, building accurate computational models for the semantics of language lies at the very core of Natural Language Processing and Artificial Intelligence. A first and foremost step in this respect consists in moving from word-based to sense-based approaches, in which operating explicitly at the level of word senses enables a model to produce more accurate and unambiguous results. At the same time, word senses create a bridge towards structured lexico-semantic resources, where the vast amount of available machine-readable information can help overcome the shortage of annotated data in many languages and domains of knowledge. This latter phenomenon, known as the knowledge acquisition bottleneck, is a crucial problem that hampers the development of large-scale, data-driven approaches for many Natural Language Processing tasks, especially when lexical semantics is directly involved. One of these tasks is Information Extraction, where an effective model has to cope with data sparsity, as well as with lexical ambiguity that can arise at the level of both arguments and relational phrases. Even in more recent Information Extraction approaches where semantics is implicitly modeled, these issues have not yet been addressed in their entirety. On the other hand, however, having access to explicit sense-level information is a very demanding task on its own, which can rarely be performed with high accuracy on a large scale. With this in mind, in this thesis we will tackle a two-fold objective: our first focus will be on studying fully automatic approaches to obtain high-quality sense-level information from textual corpora; then, we will investigate in depth where and how such sense-level information has the potential to enhance the extraction of knowledge from open text. In the first part of this work, we will explore three different disambiguation scenarios (semi-structured text, parallel text, and definitional text) and devise automatic disambiguation strategies that are not only capable of scaling to different corpus sizes and different languages, but that actually take advantage of a multilingual and/or heterogeneous setting to improve and refine their performance. As a result, we will obtain three sense-annotated resources that, when tested experimentally with a baseline system in a series of downstream semantic tasks (i.e. Word Sense Disambiguation, Entity Linking, Semantic Similarity), show very competitive performance on standard benchmarks against both manual and semi-automatic competitors. In the second part we will instead focus on Information Extraction, with an emphasis on Open Information Extraction (OIE), where issues like sparsity and lexical ambiguity are especially critical, and study how to best exploit sense-level information within the extraction process. We will start by showing that enforcing a deeper semantic analysis in a definitional setting enables a full-fledged extraction pipeline to compete with state-of-the-art approaches based on much larger (but noisier) data. We will then demonstrate how working at the sense level at the end of an extraction pipeline is also beneficial: indeed, by leveraging sense-based techniques, very heterogeneous OIE-derived data can be aligned semantically and unified with respect to a common sense inventory. Finally, we will briefly shift the focus to the more constrained setting of hypernym discovery, and study a sense-aware supervised framework for the task that is robust and effective, even when trained on heterogeneous OIE-derived hypernymic knowledge.
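
For a concrete picture of sense-level grounding, here is a classic baseline in sketch form: a simplified Lesk disambiguator that picks the WordNet sense whose gloss overlaps most with the surrounding context. It stands in for, and is much simpler than, the multilingual disambiguation strategies developed in the thesis.

```python
# Simplified Lesk: choose the synset whose definition shares the most
# words with the mention's context. A baseline sketch, not the thesis method.

from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def simplified_lesk(word, context_tokens):
    """Return the synset whose gloss shares most words with the context."""
    context = set(t.lower() for t in context_tokens)
    best, best_overlap = None, -1
    for synset in wn.synsets(word):
        gloss = set(synset.definition().lower().split())
        overlap = len(gloss & context)
        if overlap > best_overlap:
            best, best_overlap = synset, overlap
    return best

sense = simplified_lesk("bank", "I deposited money at the bank branch".split())
print(sense, "->", sense.definition() if sense else None)
```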

    Understanding and generating language with abstract meaning representation

    Abstract Meaning Representation (AMR) is a semantic representation for natural language that encompasses annotations related to traditional tasks such as Named Entity Recognition (NER), Semantic Role Labeling (SRL), Word Sense Disambiguation (WSD), and Coreference Resolution. AMR represents sentences as graphs, where nodes represent concepts and edges represent semantic relations between them. Sentences are represented as graphs rather than trees because nodes can have multiple incoming edges, called reentrancies. This thesis investigates the impact of reentrancies on parsing (from text to AMR) and generation (from AMR to text). For the parsing task, we showed that it is possible to adapt techniques from tree parsing to deal with reentrancies. To better analyze the quality of AMR parsers, we developed a set of fine-grained metrics and found that state-of-the-art parsers predict reentrancies poorly. Hence we provided a classification of the linguistic phenomena causing reentrancies, categorized the types of errors parsers make with respect to reentrancies, and showed that correcting these errors can lead to significant improvements. For the generation task, we showed that neural encoders that have access to reentrancies outperform those that do not, demonstrating the importance of reentrancies for generation as well. This thesis also discusses the problem of using AMR for languages other than English. Annotating new AMR datasets for other languages is an expensive process and requires defining annotation guidelines for each new language. It is therefore reasonable to ask whether we can share AMR annotations across languages. We provided evidence that AMR datasets for English can be successfully transferred to other languages: we trained parsers for Italian, Spanish, German, and Chinese to investigate the cross-linguality of AMR. We showed cases where translational divergences between languages pose a problem and cases where they do not. In summary, this thesis demonstrates the impact of reentrancies in AMR and provides insights on AMR for languages that do not yet have AMR datasets.
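
Since reentrancies are simply nodes with more than one incoming edge, they are easy to detect once an AMR graph is represented as triples. The sketch below uses an illustrative (source, relation, target) triple format rather than actual AMR notation.

```python
# Sketch: a node is a reentrancy if it has more than one incoming edge.

from collections import Counter

def find_reentrancies(triples):
    """Return the set of nodes with more than one incoming edge."""
    indegree = Counter(target for _, _, target in triples)
    return {node for node, count in indegree.items() if count > 1}

# "The boy wants to go": the boy is both the wanter and the goer,
# so node `b` is reentrant.
triples = [
    ("w", ":ARG0", "b"),   # want -> boy
    ("w", ":ARG1", "g"),   # want -> go
    ("g", ":ARG0", "b"),   # go   -> boy  (second incoming edge to b)
]
print(find_reentrancies(triples))  # {'b'}
```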