53 research outputs found

    Extracting Temporal Expressions from Unstructured Open Resources

    AETAS is an end-to-end system built on a service-oriented architecture (SOA) that retrieves plain-text data from web and blog news, then represents and stores it in RDF, with a special focus on its temporal dimension. The system allows users to acquire, browse, and query Linked Data obtained from unstructured sources.

    Domain-sensitive Temporal Tagging for Event-centric Information Retrieval

    Temporal and geographic information is of major importance in virtually all contexts. Thus, it also occurs frequently in many types of text documents in the form of temporal and geographic expressions. Often, those are used to refer to something that was, is, or will be happening at some specific time and some specific place; in other words, temporal and geographic expressions are often used to refer to events. However, so far, event-related information needs are not well served by standard information retrieval approaches, which motivates the topic of this thesis: event-centric information retrieval. An important characteristic of temporal and geographic expressions, and thus of two components of events, is that they can be normalized so that their meaning is unambiguous and can be placed on a timeline or pinpointed on a map. In many research areas in which natural language processing is involved, e.g., in information retrieval, document summarization, and question answering, applications can benefit greatly from having access to normalized information instead of only the words as they occur in documents. In this thesis, we present several frameworks for searching and exploring document collections with respect to occurring temporal, geographic, and event information. While we rely on an existing tool for extracting and normalizing geographic expressions, we study the task of temporal tagging, i.e., the extraction and normalization of temporal expressions. A crucial issue is that so far most research on temporal tagging has dealt with English news-style documents. However, temporal expressions have to be handled in different ways depending on the domain of the documents from which they are extracted. Since we do not want to limit our research to one domain and one language, we develop the multilingual, cross-domain temporal tagger HeidelTime. It is the only publicly available temporal tagger for several languages and is easy to extend to further languages. In addition, it achieves state-of-the-art evaluation results for all addressed domains and languages, and lays the foundations for all further contributions developed in this thesis. To achieve our goal of exploiting temporal and geographic expressions for event-centric information retrieval from a variety of text documents, we introduce the concept of spatio-temporal events and several concepts to "compute" with temporal, geographic, and event information. These concepts are used to develop a spatio-temporal ranking approach, which considers not only textual, temporal, and geographic query parts but also two different types of proximity information. Furthermore, we adapt the spatio-temporal search idea by presenting a framework to directly search for events. Additionally, several map-based exploration frameworks are introduced that allow a new way of exploring event information latently contained in huge document collections. Finally, an event-centric document similarity model is developed that calculates document similarity on multilingual corpora solely based on extracted and normalized event information.

    Robust input representations for low-resource information extraction

    Recent advances in the field of natural language processing were achieved with deep learning models. This led to a wide range of new research questions concerning the stability of such large-scale systems and their applicability beyond well-studied tasks and datasets, such as information extraction in non-standard domains and languages, in particular, in low-resource environments. In this work, we address these challenges and make important contributions across fields such as representation learning and transfer learning by proposing novel model architectures and training strategies to overcome existing limitations, including a lack of training resources, domain mismatches, and language barriers. In particular, we propose solutions to close the domain gap between representation models by, e.g., domain-adaptive pre-training or our novel meta-embedding architecture for creating a joint representation of multiple embedding methods. Our broad set of experiments demonstrates state-of-the-art performance of our methods for various sequence tagging and classification tasks and highlights their robustness in challenging low-resource settings across languages and domains.
    Recent advances in natural language processing have been achieved with deep learning models. This has raised a wide range of new research questions about the stability of such large systems and their applicability beyond well-studied tasks and datasets, such as information extraction for non-standard languages, but also for text domains and tasks for which little training data is available even in English. In this work, we address these challenges and make important contributions in areas such as representation learning and transfer learning by proposing novel model architectures and training strategies to overcome existing limitations, including missing training resources, unseen domains, and language barriers. In particular, we propose solutions to close the domain gap between representation models, e.g., through domain-adaptive pre-training or our novel meta-embedding architecture for creating a joint representation of multiple embedding methods. Our comprehensive evaluation demonstrates the performance of our methods on various word-level and sentence-level classification tasks and underlines their robustness in challenging low-resource settings across languages and domains.

    Dataset and Baseline System for Multi-lingual Extraction and Normalization of Temporal and Numerical Expressions

    Temporal and numerical expression understanding is of great importance in many downstream Natural Language Processing (NLP) and Information Retrieval (IR) tasks. However, much previous work covers only a few sub-types and focuses only on entity extraction, which severely limits the usability of identified mentions. For such entities to be useful in downstream scenarios, coverage and granularity of sub-types are important, and resolution into concrete values that can be manipulated is even more so. Furthermore, most previous work addresses only a handful of languages. Here we describe a multi-lingual evaluation dataset, NTX, that spans diverse temporal and numerical expressions across 14 languages and covers extraction, normalization, and resolution. Along with the dataset we provide a robust rule-based system as a strong baseline for comparison against other models evaluated on this dataset. Data and code are available at https://aka.ms/NTX. Comment: Technical Report

    Extracting Temporal and Causal Relations between Events

    Structured information resulting from temporal information processing is crucial for a variety of natural language processing tasks, for instance to generate timeline summaries of events from news documents, or to answer temporal or causal questions about events. In this thesis we present a framework for an integrated temporal and causal relation extraction system. We first develop a robust extraction component for each type of relation, i.e., temporal order and causality. We then combine the two extraction components into an integrated relation extraction system, CATENA (CAusal and Temporal relation Extraction from NAtural language texts), by utilizing the presumption about event precedence in causality, namely that causing events must happen BEFORE resulting events. Several resources and techniques to improve our relation extraction systems are also discussed, including word embeddings and training data expansion. Finally, we report our efforts to adapt temporal information processing to languages other than English, namely Italian and Indonesian. Comment: PhD Thesis

    Populating knowledge bases with temporal information

    Recent progress in information extraction has enabled the automatic construction of large knowledge bases. Knowledge bases contain millions of entities (e.g., persons, organizations, events), their semantic classes, and facts about them. Knowledge bases have become a great asset for semantic search, entity linking, deep analytics, and question answering. However, a common limitation of current knowledge bases is the poor coverage of temporal knowledge. First of all, so far, knowledge bases have focused on popular events and ignored long-tail events such as political scandals, local festivals, or protests. Secondly, they do not cover the textual phrases denoting events and temporal facts at all. The goal of this dissertation, thus, is to automatically populate knowledge bases with this kind of temporal knowledge. The dissertation makes the following contributions to address the aforementioned limitations. The first contribution is a method for extracting events from news articles. The method reconciles the extracted events into canonicalized representations and organizes them into fine-grained semantic classes. The second contribution is a method for mining the textual phrases denoting the events and facts. The method infers the temporal scopes of these phrases and maps them to a knowledge base. Our experimental evaluations demonstrate that our methods yield high-quality output compared to state-of-the-art approaches, and can indeed populate knowledge bases with temporal knowledge.
    Progress in information extraction now makes it possible to build knowledge bases automatically. Such knowledge bases contain entities such as persons, organizations, or events, along with information about them and their semantic classes. Automatically generated knowledge bases are an essential foundation for semantic search, entity linking, text analysis, and natural-language question answering systems. A weakness of current knowledge bases, however, is their insufficient coverage of temporal information. Knowledge bases focus primarily on popular events and ignore less well-known events such as political scandals, local festivals, or demonstrations. In addition, textual phrases denoting events and temporal facts are not captured. The aim of this work is to develop methods that automatically integrate temporal knowledge into knowledge bases. To this end, the dissertation makes the following contributions: 1. The development of a method for extracting events from news articles, representing them in a canonical form, and classifying them into fine-grained semantic classes. 2. The development of a method for mining textual phrases that denote events and facts in knowledge bases, together with a method for deriving their temporal course and duration. Our experiments show that the methods we developed yield output of markedly higher quality than previous approaches and can indeed extend knowledge bases with temporal knowledge.