49 research outputs found

    Extracting Temporal Expressions from Unstructured Open Resources

    Get PDF
    AETAS is an end-to-end system with SOA approach that retrieves plain text data from web and blog news and represents and stores them in RDF, with a special focus on their temporal dimension. The system allows users to acquire, browse and query Linked Data obtained from unstructured sources

    Dataset and Baseline System for Multi-lingual Extraction and Normalization of Temporal and Numerical Expressions

    Full text link
    Temporal and numerical expression understanding is of great importance in many downstream Natural Language Processing (NLP) and Information Retrieval (IR) tasks. However, much previous work covers only a few sub-types and focuses only on entity extraction, which severely limits the usability of identified mentions. In order for such entities to be useful in downstream scenarios, coverage and granularity of sub-types are important; and, even more so, providing resolution into concrete values that can be manipulated. Furthermore, most previous work addresses only a handful of languages. Here we describe a multi-lingual evaluation dataset - NTX - covering diverse temporal and numerical expressions across 14 languages and covering extraction, normalization, and resolution. Along with the dataset we provide a robust rule-based system as a strong baseline for comparisons against other models to be evaluated in this dataset. Data and code are available at \url{https://aka.ms/NTX}.Comment: Technical Repor

    Domain-sensitive Temporal Tagging for Event-centric Information Retrieval

    Get PDF
    Temporal and geographic information is of major importance in virtually all contexts. Thus, it also occurs frequently in many types of text documents in the form of temporal and geographic expressions. Often, those are used to refer to something that was, is, or will be happening at some specific time and some specific place – in other words, temporal and geographic expressions are often used to refer to events. However, so far, event-related information needs are not well served by standard information retrieval approaches, which motivates the topic of this thesis: event-centric information retrieval. An important characteristic of temporal and geographic expressions – and thus of two components of events – is that they can be normalized so that their meaning is unambiguous and can be placed on a timeline or pinpointed on a map. In many research areas in which natural language processing is involved, e.g., in information retrieval, document summarization, and question answering, applications can highly benefit from having access to normalized information instead of only the words as they occur in documents. In this thesis, we present several frameworks for searching and exploring document collections with respect to occurring temporal, geographic, and event information. While we rely on an existing tool for extracting and normalizing geographic expressions, we study the task of temporal tagging, i.e., the extraction and normalization of temporal expressions. A crucial issue is that so far most research on temporal tagging dealt with English news-style documents. However, temporal expressions have to be handled in different ways depending on the domain of the documents from which they are extracted. Since we do not want to limit our research to one domain and one language, we develop the multilingual, cross-domain temporal tagger HeidelTime. It is the only publicly available temporal tagger for several languages and easy to extend to further languages. In addition, it achieves state-of-the-art evaluation results for all addressed domains and languages, and lays the foundations for all further contributions developed in this thesis. To achieve our goal of exploiting temporal and geographic expressions for event-centric information retrieval from a variety of text documents, we introduce the concept of spatio-temporal events and several concepts to "compute" with temporal, geographic, and event information. These concepts are used to develop a spatio-temporal ranking approach, which does not only consider textual, temporal, and geographic query parts but also two different types of proximity information. Furthermore, we adapt the spatio-temporal search idea by presenting a framework to directly search for events. Additionally, several map-based exploration frameworks are introduced that allow a new way of exploring event information latently contained in huge document collections. Finally, an event-centric document similarity model is developed that calculates document similarity on multilingual corpora solely based on extracted and normalized event information

    tieval: An Evaluation Framework for Temporal Information Extraction Systems

    Full text link
    Temporal information extraction (TIE) has attracted a great deal of interest over the last two decades, leading to the development of a significant number of datasets. Despite its benefits, having access to a large volume of corpora makes it difficult when it comes to benchmark TIE systems. On the one hand, different datasets have different annotation schemes, thus hindering the comparison between competitors across different corpora. On the other hand, the fact that each corpus is commonly disseminated in a different format requires a considerable engineering effort for a researcher/practitioner to develop parsers for all of them. This constraint forces researchers to select a limited amount of datasets to evaluate their systems which consequently limits the comparability of the systems. Yet another obstacle that hinders the comparability of the TIE systems is the evaluation metric employed. While most research works adopt traditional metrics such as precision, recall, and F1F_1, a few others prefer temporal awareness -- a metric tailored to be more comprehensive on the evaluation of temporal systems. Although the reason for the absence of temporal awareness in the evaluation of most systems is not clear, one of the factors that certainly weights this decision is the necessity to implement the temporal closure algorithm in order to compute temporal awareness, which is not straightforward to implement neither is currently easily available. All in all, these problems have limited the fair comparison between approaches and consequently, the development of temporal extraction systems. To mitigate these problems, we have developed tieval, a Python library that provides a concise interface for importing different corpora and facilitates system evaluation. In this paper, we present the first public release of tieval and highlight its most relevant features.Comment: 10 page

    Extracting Temporal and Causal Relations between Events

    Full text link
    Structured information resulting from temporal information processing is crucial for a variety of natural language processing tasks, for instance to generate timeline summarization of events from news documents, or to answer temporal/causal-related questions about some events. In this thesis we present a framework for an integrated temporal and causal relation extraction system. We first develop a robust extraction component for each type of relations, i.e. temporal order and causality. We then combine the two extraction components into an integrated relation extraction system, CATENA---CAusal and Temporal relation Extraction from NAtural language texts---, by utilizing the presumption about event precedence in causality, that causing events must happened BEFORE resulting events. Several resources and techniques to improve our relation extraction systems are also discussed, including word embeddings and training data expansion. Finally, we report our adaptation efforts of temporal information processing for languages other than English, namely Italian and Indonesian.Comment: PhD Thesi

    Tint, the Swiss-Army Tool for Natural Language Processing in Italian

    Get PDF
    In this we paper present the last version of Tint, an opensource, fast and extendable Natural Language Processing suite for Italian based on Stanford CoreNLP. The new release includes a set of text processing components for fine-grained linguistic analysis, from tokenization to relation extraction, including part-of-speech tagging, morphological analysis, lemmatization, multi-word expression recognition, dependency parsing, named-entity recognition, keyword extraction, and much more. Tint is written in Java freely distributed under the GPL license. Although some modules do not perform at a state-of-the-art level, Tint reaches very good accuracy in all modules, and can be easily used out-of-the-box
    corecore