254 research outputs found

    Automatic extraction of facts, relations, and entities for web-scale knowledge base population

    Equipping machines with knowledge, through the construction of machine-readable knowledge bases, presents a key asset for semantic search, machine translation, question answering, and other formidable challenges in artificial intelligence. However, human knowledge predominantly resides in books and other natural language text forms. This means that knowledge bases must be extracted and synthesized from natural language text. When the source of text is the Web, extraction methods must cope with ambiguity, noise, scale, and updates. The goal of this dissertation is to develop knowledge base population methods that address the aforementioned characteristics of Web text. The dissertation makes three contributions. The first contribution is a method for mining high-quality facts at scale, through distributed constraint reasoning and a pattern representation model that is robust against noisy patterns. The second contribution is a method for mining a large, comprehensive collection of relation types beyond those commonly found in existing knowledge bases. The third contribution is a method for extracting facts from dynamic Web sources such as news articles and social media, where one of the key challenges is the constant emergence of new entities. All methods have been evaluated through experiments involving Web-scale text collections.
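
    The following sketch illustrates one piece of the first contribution in miniature: pattern-based fact mining that is robust to noisy patterns. Each extraction pattern is weighted by its precision against trusted seed facts, and candidate facts are scored by the summed weight of their supporting patterns. The facts, patterns, and weights here are invented; the dissertation's actual system combines this idea with distributed constraint reasoning at Web scale.

```python
from collections import defaultdict

# Hypothetical illustration of noise-robust pattern-based fact mining:
# weight each pattern by its precision against known-good seed facts,
# then score candidate facts by the summed weight of supporting patterns.

seed_facts = {("Einstein", "bornIn", "Ulm")}

# (fact, pattern) support pairs, as a distant-supervision corpus might yield
observations = [
    (("Einstein", "bornIn", "Ulm"), "X was born in Y"),
    (("Einstein", "bornIn", "Ulm"), "X, a native of Y"),
    (("Einstein", "bornIn", "Princeton"), "X lived in Y"),  # noisy pattern
    (("Bohr", "bornIn", "Copenhagen"), "X was born in Y"),
]

def pattern_weights(observations, seeds):
    """Precision of each pattern against the seed facts."""
    hits, totals = defaultdict(int), defaultdict(int)
    for fact, pattern in observations:
        totals[pattern] += 1
        if fact in seeds:
            hits[pattern] += 1
    return {p: hits[p] / totals[p] for p in totals}

def score_facts(observations, weights):
    """Sum the weights of all patterns supporting each candidate fact."""
    scores = defaultdict(float)
    for fact, pattern in observations:
        scores[fact] += weights[pattern]
    return scores

weights = pattern_weights(observations, seed_facts)
for fact, score in sorted(score_facts(observations, weights).items(),
                          key=lambda kv: -kv[1]):
    print(fact, round(score, 2))
```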

    European Language Grid

    This open access book provides an in-depth description of the EU project European Language Grid (ELG). Its motivation lies in the fact that Europe is a multilingual society with 24 official European Union Member State languages and dozens of additional languages including regional and minority languages. The only meaningful way to enable multilingualism and to benefit from this rich linguistic heritage is through Language Technologies (LT) including Natural Language Processing (NLP), Natural Language Understanding (NLU), Speech Technologies and language-centric Artificial Intelligence (AI) applications. The European Language Grid provides a single umbrella platform for the European LT community, including research and industry, effectively functioning as a virtual home, marketplace, showroom, and deployment centre for all services, tools, resources, products and organisations active in the field. Today the ELG cloud platform already offers access to more than 13,000 language processing tools and language resources. It enables all stakeholders to deposit, upload and deploy their technologies and datasets. The platform also supports the long-term objective of establishing digital language equality in Europe by 2030 – to create a situation in which all European languages enjoy equal technological support. This is the very first book dedicated to Language Technology and NLP platforms. Cloud technology has only recently matured enough to make the development of a platform like ELG feasible on a larger scale. The book comprehensively describes the results of the ELG project. Following an introduction, the content is divided into four main parts: (I) ELG Cloud Platform; (II) ELG Inventory of Technologies and Resources; (III) ELG Community and Initiative; and (IV) ELG Open Calls and Pilot Projects
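
    A minimal sketch of what programmatic access to such a catalogue could look like. The endpoint path and query parameters below are hypothetical placeholders, not the documented ELG API; the point is only the shape of a registry lookup that filters tools and resources by keyword and language.

```python
import requests

# Illustrative only: the endpoint path and query parameters are hypothetical
# stand-ins, not the documented ELG API. The sketch shows the shape of a
# catalogue lookup: filter the registry by keyword and language.

BASE = "https://live.european-language-grid.eu"  # ELG platform host

def search_catalogue(keyword, language):
    """Query a (hypothetical) catalogue search endpoint."""
    resp = requests.get(
        f"{BASE}/catalogue/api/search",  # hypothetical path
        params={"q": keyword, "language": language},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

# e.g. find named-entity recognisers for Maltese (endpoint is hypothetical):
# results = search_catalogue("named entity recognition", "mt")
```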

    Macro-micro approach for mining public sociopolitical opinion from social media

    During the past decade, we have witnessed the emergence of social media, which has gained prominence as a means for the general public to exchange opinions on a broad range of topics. Furthermore, its social and temporal dimensions make it a rich resource for policy makers and organisations seeking to understand public opinion. In this thesis, we present our research on understanding public opinion on Twitter along three dimensions: sentiment, topics and summary. In the first line of our work, we study how to classify public sentiment on Twitter. We focus on the task of multi-target-specific sentiment recognition on Twitter, and propose an approach which utilises the syntactic information from the parse tree in conjunction with the left-right context of the target. We show state-of-the-art performance on two datasets, including a multi-target Twitter corpus on UK elections which we make publicly available to the research community. Additionally, we conduct two preliminary studies: cross-domain emotion classification on discourse around arts and cultural experiences, and social spam detection to improve the signal-to-noise ratio of our sentiment corpus. Our second line of work focuses on automatic topical clustering of tweets. Our aim is to group tweets into a number of clusters, with each cluster representing a meaningful topic, story, event or a reason behind a particular choice of sentiment. We explore various ways of tackling this challenge and propose a two-stage hierarchical topic modelling system that is efficient and effective in achieving our goal. Lastly, in our third line of work, we study the task of summarising tweets on common topics, with the goal of providing informative summaries for real-world events/stories or explanations of the sentiment expressed towards an issue/entity. As most existing tweet summarisation approaches rely on extractive methods, we propose to apply a state-of-the-art neural abstractive summarisation model to tweets. We also tackle the challenge of cross-medium supervised summarisation with no target-medium training resources. To the best of our knowledge, there is no existing work studying neural abstractive summarisation of tweets. In addition, we present a system for providing interactive visualisation of topic-entity sentiments and the corresponding summaries in chronological order. Throughout the work presented in this thesis, we conduct experiments to evaluate and verify the effectiveness of our proposed models in comparison with relevant baseline methods. Most of our evaluations are quantitative; however, we also perform qualitative analyses where appropriate. This thesis provides insights and findings that can be used to better understand public opinion on social media
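
    As a simplified illustration of the left-right context idea used for target-specific sentiment, the sketch below marks each token by whether it occurs to the left or right of the target mention and trains a plain bag-of-words classifier. This is a toy stand-in: the thesis combines the left-right context with syntactic information from parse trees, and the training tweets here are invented.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy sketch of target-dependent sentiment: split each tweet at the target
# mention and prefix tokens by side (L_/R_), so the classifier can weight
# the two contexts of the target separately. Tweets are invented examples.

def left_right_features(tweet, target):
    left, _, right = tweet.lower().partition(target.lower())
    return " ".join(["L_" + w for w in left.split()] +
                    ["R_" + w for w in right.split()])

train = [
    ("great campaign by Labour today", "Labour", "pos"),
    ("Labour made a mess of this debate", "Labour", "neg"),
    ("really impressed with the Greens", "Greens", "pos"),
    ("the Greens disappointed everyone tonight", "Greens", "neg"),
]

X = [left_right_features(tweet, target) for tweet, target, _ in train]
y = [label for _, _, label in train]

vec = CountVectorizer(token_pattern=r"\S+")
clf = LogisticRegression().fit(vec.fit_transform(X), y)

test = left_right_features("Labour made another mess tonight", "Labour")
print(clf.predict(vec.transform([test])))  # likely ['neg'] on this toy data
```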

    Unsupervised German predicate entailment using the distributional inclusion hypothesis

    Recognizing textual entailment is an important prerequisite to many tasks in NLP, e.g. question answering and semantic parsing. Knowing, for example, that buying a thing entails subsequently owning it is a relation that humans learn by interacting with the world, while machines need other ways to acquire this knowledge. Previous approaches to learning predicate entailment relations from text have focused only on English. In this thesis we present the adaptation of the unsupervised entailment graph building algorithm of Hosseini et al. to German, which can be seen as a study of the challenges in language adaptation for this task in general. We create a variety of German tools necessary for this approach and give a detailed account of the challenges faced and the insights gained from them. First, we create a German relation extraction system and compare it against the English system presented by Hosseini et al. Finding that the typing of German entities constitutes a bottleneck, we create a German fine-grained typing system for named and general entities. In doing so, we examine the methods of annotation projection and zero-shot cross-lingual transfer, finding that for German fine-grained named entity typing, zero-shot cross-lingual transfer performs best. We then move on to creating a German system that types general entities (e.g. "ex-president") as well as named entities (e.g. "Obama"), by augmenting our training data with data automatically generated from a German WordNet. We find that in this way an improvement of up to 10 percentage points in general entity typing performance can be reached, while named entity typing performance drops by only 1 percentage point. We use these components in the pipeline to construct German entailment graphs. We also present a method that uses German and English entailment graphs to generate training data for a supervised predicate entailment detection system, and show that this method outperforms current approaches to this task. In this way we create a multilingual predicate entailment detection system that outperforms both the monolingual German system and the zero-shot cross-lingual system on German test data, and also performs better than a monolingual English system on English test data
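
    The distributional inclusion hypothesis behind these entailment graphs can be stated compactly: predicate p entails predicate q if the argument contexts observed with p are largely included in those observed with q. The sketch below computes Weeds precision, one standard directional inclusion score; the feature vectors are invented toy data, not output of the actual German pipeline.

```python
# Sketch of the distributional inclusion hypothesis for predicate entailment:
# p is taken to entail q if p's weighted argument contexts are largely
# included in q's. Weeds precision is one standard directional score.
# The toy feature vectors (context -> association weight) are invented.

def weeds_precision(p_feats, q_feats):
    """Weighted proportion of p's features that also occur with q."""
    included = sum(w for f, w in p_feats.items() if f in q_feats)
    total = sum(p_feats.values())
    return included / total if total else 0.0

buy = {"person::car": 2.1, "person::house": 1.8, "company::startup": 1.2}
own = {"person::car": 1.5, "person::house": 2.0, "company::startup": 0.9,
       "person::dog": 1.1}

print(weeds_precision(buy, own))  # high: 'buy' contexts included in 'own'
print(weeds_precision(own, buy))  # lower: 'own' occurs in contexts 'buy' lacks
```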

    Populating knowledge bases with temporal information

    Recent progress in information extraction has enabled the automatic construction of large knowledge bases. Knowledge bases contain millions of entities (e.g. persons, organizations, events), their semantic classes, and facts about them. Knowledge bases have become a great asset for semantic search, entity linking, deep analytics, and question answering. However, a common limitation of current knowledge bases is their poor coverage of temporal knowledge. First, knowledge bases have so far focused on popular events and ignored long-tail events such as political scandals, local festivals, or protests. Second, they do not cover the textual phrases denoting events and temporal facts at all. The goal of this dissertation, thus, is to automatically populate knowledge bases with this kind of temporal knowledge. The dissertation makes the following contributions to address the aforementioned limitations. The first contribution is a method for extracting events from news articles. The method reconciles the extracted events into canonicalized representations and organizes them into fine-grained semantic classes. The second contribution is a method for mining the textual phrases denoting the events and facts. The method infers the temporal scopes of these phrases and maps them to a knowledge base. Our experimental evaluations demonstrate that our methods yield high-quality output compared to state-of-the-art approaches, and can indeed populate knowledge bases with temporal knowledge
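
    One simple way to think about temporal scoping, sketched below on invented data, is to approximate the time span of an event phrase from the publication dates of the articles that mention it. This is a deliberate simplification, not the dissertation's actual method, which infers temporal scopes from richer evidence.

```python
from datetime import date

# Hypothetical simplification of temporal scoping: approximate the time span
# of an event phrase from the publication dates of mentioning articles,
# trimming a small fraction of outliers at each tail. Dates are invented.

mentions = {
    "FIFA corruption scandal": [date(2015, 5, 27), date(2015, 6, 2),
                                date(2015, 6, 14), date(2016, 2, 26)],
}

def temporal_scope(dates, trim=0.1):
    """Return (begin, end) after dropping `trim` of dates at each tail."""
    ds = sorted(dates)
    k = int(len(ds) * trim)
    core = ds[k:len(ds) - k] if len(ds) > 2 * k else ds
    return core[0], core[-1]

for phrase, dates in mentions.items():
    begin, end = temporal_scope(dates)
    print(f"{phrase}: {begin} .. {end}")
```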

    Survey of the State of the Art in Natural Language Generation: Core tasks, applications and evaluation

    This paper surveys the current state of the art in Natural Language Generation (NLG), defined as the task of generating text or speech from non-linguistic input. A survey of NLG is timely in view of the changes that the field has undergone over the past decade or so, especially in relation to new (usually data-driven) methods, as well as new applications of NLG technology. This survey therefore aims to (a) give an up-to-date synthesis of research on the core tasks in NLG and the architectures adopted in which such tasks are organised; (b) highlight a number of relatively recent research topics that have arisen partly as a result of growing synergies between NLG and other areas of artificial intelligence; (c) draw attention to the challenges in NLG evaluation, relating them to similar challenges faced in other areas of Natural Language Processing, with an emphasis on different evaluation methods and the relationships between them. (Published in the Journal of AI Research (JAIR), volume 61, pp. 75-170; 118 pages, 8 figures, 1 table.)
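
    The survey's core tasks are often described as a staged pipeline running from content determination through aggregation to surface realisation. The toy sketch below walks one invented weather record through such a pipeline; real systems replace each stage with learned or rule-based components.

```python
# Toy illustration of a staged NLG architecture: content selection ->
# realisation -> aggregation. The weather record and templates are invented.

record = {"city": "Valletta", "temp_max": 31, "rain_mm": 0}

def select_content(r):
    """Decide which messages the text should convey."""
    msgs = [("temp", r["city"], r["temp_max"])]
    if r["rain_mm"] == 0:
        msgs.append(("no_rain",))
    return msgs

def realise(msg):
    """Map an abstract message to a surface string."""
    if msg[0] == "temp":
        return f"Temperatures in {msg[1]} will reach {msg[2]} degrees"
    if msg[0] == "no_rain":
        return "no rain is expected"

def aggregate(sentences):
    """Join realised messages into one sentence."""
    return ", and ".join(sentences) + "."

print(aggregate([realise(m) for m in select_content(record)]))
# -> "Temperatures in Valletta will reach 31 degrees, and no rain is expected."
```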

    Temporality and modality in entailment graph induction

    The ability to draw inferences is core to semantics and the field of Natural Language Processing. Answering a seemingly simple question like ‘Did Arsenal play Manchester yesterday?’ from textual evidence that says ‘Arsenal won against Manchester yesterday’ requires modeling the inference that ‘winning’ entails ‘playing’. One way of modeling this type of lexical semantics is with Entailment Graphs, collections of meaning postulates that can be learned in an unsupervised way from large text corpora. In this work, we explore the role that temporality and linguistic modality can play in inducing Entailment Graphs. We identify inferences that were previously not supported by Entailment Graphs (such as that ‘visiting’ entails an ‘arrival’ before the visit) and inferences that were likely to be learned incorrectly (such as that ‘winning’ entails ‘losing’). Temporality is shown to be useful in alleviating these challenges, in the Entailment Graph representation as well as the learning algorithm. An exploration of linguistic modality in the training data shows, counterintuitively, that there is valuable signal in modalized predications. We develop three datasets for evaluating a system’s capability of modeling these inferences, which were previously underrepresented in entailment rule evaluations. Finally, in support of the work on modality, we release a relation extraction system that is capable of annotating linguistic modality, together with a comprehensive modality lexicon
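
    A minimal sketch of the representational idea: entailment edges can carry a temporal relation label, so that ‘visiting’ entails an ‘arrival’ located before the visit, while ‘winning’ entails ‘playing’ at the same time. The predicates, labels, and confidence scores below are invented examples, not edges from the learned graphs.

```python
# Sketch of entailment edges augmented with a temporal relation label, in
# the spirit of the thesis. All predicates, relations, and confidences are
# invented illustrative values.

entailment_edges = [
    # (premise, hypothesis, temporal relation of hypothesis, confidence)
    ("visit(X,Y)", "arrive-at(X,Y)", "before", 0.92),
    ("win-against(X,Y)", "play(X,Y)", "same-time", 0.97),
    ("visit(X,Y)", "leave(X,Y)", "after", 0.88),
]

def entails(premise, hypothesis, edges, min_conf=0.5):
    """Return the temporal relation if premise entails hypothesis."""
    for p, h, rel, conf in edges:
        if p == premise and h == hypothesis and conf >= min_conf:
            return rel
    return None

# 'Arsenal won against Manchester' -> did they play? Yes, at the same time.
print(entails("win-against(X,Y)", "play(X,Y)", entailment_edges))
```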

    Doctor of Philosophy

    Events are an important type of information found throughout text. Event extraction is an information extraction (IE) task that involves identifying entities and objects (mainly noun phrases) that represent important roles in events of a particular type. However, the extraction performance of current event extraction systems is limited because they mainly consider local context (mostly isolated sentences) when making each extraction decision. My research aims to improve both the coverage and accuracy of event extraction by explicitly identifying event contexts before extracting individual facts. First, I introduce new event extraction architectures that incorporate discourse information across a document to seek out and validate pieces of event descriptions within the document. TIER is a multilayered event extraction architecture that performs text analysis at multiple granularities to progressively "zoom in" on relevant event information. LINKER is a unified discourse-guided approach that includes a structured sentence classifier to sequentially read a story and determine which sentences contain event information based on both the local and preceding contexts. Experimental results on two distinct event domains show that, compared to previous event extraction systems, TIER can find more event information while maintaining good extraction accuracy, and LINKER can further improve extraction accuracy. Finding documents that describe a specific type of event is also highly challenging because of the wide variety and ambiguity of event expressions. In this dissertation, I present a multifaceted event recognition approach that uses event-defining characteristics (facets), in addition to event expressions, to effectively resolve the complexity of event descriptions. I also present a novel bootstrapping algorithm to automatically learn event expressions as well as facets of events, which requires minimal human supervision. Experimental results show that the multifaceted event recognition approach can effectively identify documents that describe a particular type of event and make event extraction systems more precise
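
    The discourse-guided, two-stage idea behind LINKER can be sketched as follows: a sentence-level relevance decision that conditions on both the current and the preceding sentence, followed by role extraction restricted to relevant sentences. Keyword rules and a toy regex stand in for the learned classifiers; the cues and the document are invented.

```python
import re

# Simplified sketch in the spirit of LINKER: decide sentence relevance using
# the current sentence AND whether the preceding one was relevant, then run
# role extraction only on relevant sentences. Rules stand in for learned
# classifiers; cues and text are invented.

EVENT_CUES = {"erupted", "evacuated", "magnitude", "aftershock"}

def relevant(sentence, previous_relevant):
    """Toy stand-in for a structured sentence classifier."""
    has_cue = any(c in sentence.lower() for c in EVENT_CUES)
    continues = previous_relevant and sentence.lower().startswith(("they", "it"))
    return has_cue or continues

def extract_roles(sentence):
    """Toy role extraction: pull out casualty/evacuation counts."""
    return re.findall(r"\b(\d+)\s+(?:people|residents)\b", sentence)

doc = [
    "A magnitude 6.1 quake struck the coast on Monday.",
    "They evacuated 3000 residents overnight.",
    "The stock market was unaffected.",
]

prev = False
for sent in doc:
    prev = relevant(sent, prev)
    if prev:
        print(sent, "->", extract_roles(sent))
```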