5 research outputs found

    Linking named entities to Wikipedia

    Natural language is fraught with problems of ambiguity, including name reference. A name in text can refer to multiple entities, just as an entity can be known by different names. This thesis examines how a mention in text can be linked to an external knowledge base (KB), in our case Wikipedia. The named entity linking (NEL) task requires systems to identify the KB entry, or Wikipedia article, that a mention refers to, or to return NIL if the KB does not contain the correct entry. Entity linking systems can be complex, and we present a framework for analysing their different components. We use this framework to analyse three seminal systems evaluated on a common dataset, and we show the importance of precise search for linking. The Text Analysis Conference (TAC) is a major venue for NEL research. We report on our submissions to the entity linking shared task in 2010, 2011 and 2012. The information required to disambiguate entities is often found in the text, close to the mention. We explore apposition, a common way for authors to provide information about entities. We model syntactic and semantic restrictions with a joint model that achieves state-of-the-art apposition extraction performance. We generalise from apposition to examine local descriptions specified close to the mention. We add local description to our state-of-the-art linker by using patterns to extract the descriptions and matching against this restricted context. This not only makes for a more precise match but also lets us model failure to match. Local descriptions help disambiguate entities, further improving our state-of-the-art linker. The work in this thesis seeks to link textual entity mentions to knowledge bases. Linking is important for any task where external world knowledge is used, and resolving ambiguity is fundamental to advancing research into these problems.



    Enhancing knowledge acquisition systems with user generated and crowdsourced resources

    This thesis is about enhancing knowledge acquisition systems with collaborative data and crowdsourced work from the internet. We propose two strategies and apply them to build effective entity linking and question answering (QA) systems. The first strategy integrates an information extraction system with online collaborative knowledge bases, such as Wikipedia and Freebase. We construct a Cross-Lingual Entity Linking (CLEL) system to connect Chinese entities, such as people and locations, with the corresponding English pages in Wikipedia. The main focus is to break the language barrier between Chinese entities and the English KB, and to resolve the synonymy and polysemy of Chinese entities. To address these problems, we create a cross-lingual taxonomy and a Chinese knowledge base (KB). We investigate two methods of connecting the query representation with the KB representation. Building on the CLEL system we entered in the TAC KBP 2011 evaluation, we finally propose a simple and effective generative model, which achieved much better performance. The second strategy creates annotation for QA systems with the help of crowdsourcing. Crowdsourcing distributes a task over the internet and recruits many people to complete it simultaneously. Various annotated data are required to train the data-driven statistical machine learning algorithms behind the underlying components of our QA system. This thesis demonstrates how to convert the annotation task into crowdsourcing micro-tasks, investigates different statistical methods for enhancing the quality of crowdsourced annotation, and finally uses the enhanced annotation to train learning-to-rank models for the passage ranking algorithms used in QA.

    Protocoles d'évaluation pour l'extraction d'information libre (Evaluation protocols for open information extraction)

    We would like to learn to "read automatically". Information extraction consists in transforming paragraphs of text written in natural language into a list of self-sufficient information elements, so that the information extracted from several sources can be compared and collated. Information elements are represented here as relations between entities: (Athena; is the daughter of; Zeus). Open information extraction (OIE) is a recent paradigm that aims to extract a large number of the relations contained in the analysed text, discovered as the text is processed, as opposed to the more common setting of a small number of predetermined relations. This thesis is about the evaluation of OIE methods. In the first two chapters, the extractions of an OIE system are evaluated automatically by comparing them with hand-written references, with emphasis first on the informativeness of the extraction and then on its exhaustiveness. In the following two chapters, we study and propose alternatives to the confidence function, which judges the outputs of a system. In particular, we analyse and question the methodologies by which this function is evaluated: first as a query validation model, then in comparison with the well-established framework of knowledge base completion.

    Information extraction consists in processing natural language documents into a list of self-sufficient informational elements, which allows for cross-collection into knowledge bases and for automatic processing. The facts that result from this process take the form of relationships between entities: (Athena; is the daughter of; Zeus). Open Information Extraction (OIE) is a recent paradigm whose purpose is to extract an order of magnitude more relations from the input corpus than classical IE methods, which is achieved by encoding or learning more general patterns in a less supervised fashion. In this thesis, I study and propose new evaluation protocols for the task of Open Information Extraction, with links to that of Knowledge Base Completion. In the first two chapters, I propose to automatically score the output of an OIE system against a manually established reference, with particular attention paid to the informativeness and exhaustiveness of the extractions. I then turn my focus to the confidence function that qualifies all extracted elements, evaluate it in a variety of settings, and propose alternative models.