Extracting Temporal Expressions from Unstructured Open Resources
AETAS is an end-to-end system with a service-oriented architecture (SOA) that retrieves plain-text data from web and blog news, then represents and stores it in RDF, with a special focus on its temporal dimension. The system allows users to acquire, browse, and query Linked Data obtained from unstructured sources.
Dataset and Baseline System for Multi-lingual Extraction and Normalization of Temporal and Numerical Expressions
Temporal and numerical expression understanding is of great importance in
many downstream Natural Language Processing (NLP) and Information Retrieval
(IR) tasks. However, much previous work covers only a few sub-types and focuses
only on entity extraction, which severely limits the usability of identified
mentions. In order for such entities to be useful in downstream scenarios,
coverage and granularity of sub-types are important; and, even more so,
providing resolution into concrete values that can be manipulated. Furthermore,
most previous work addresses only a handful of languages. Here we describe a
multi-lingual evaluation dataset - NTX - covering diverse temporal and
numerical expressions across 14 languages and covering extraction,
normalization, and resolution. Along with the dataset we provide a robust
rule-based system as a strong baseline for comparison against other models
evaluated on this dataset. Data and code are available at
https://aka.ms/NTX.
Comment: Technical Report
Domain-sensitive Temporal Tagging for Event-centric Information Retrieval
Temporal and geographic information is of major importance in virtually all contexts. Thus, it also occurs frequently in many types of text documents in the form of temporal and geographic expressions. Often, those are used to refer to something that was, is, or will be happening at some specific time and some specific place – in other words, temporal and geographic expressions are often used to refer to events. However, so far, event-related information needs are not well served by standard information retrieval approaches, which motivates the topic of this thesis: event-centric information retrieval.
An important characteristic of temporal and geographic expressions – and thus of two components of events – is that they can be normalized so that their meaning is unambiguous and can be placed on a timeline or pinpointed on a map. In many research areas that involve natural language processing, e.g., information retrieval, document summarization, and question answering, applications benefit greatly from having access to normalized information instead of only the words as they occur in documents.
In this thesis, we present several frameworks for searching and exploring document collections with respect to occurring temporal, geographic, and event information. While we rely on an existing tool for extracting and normalizing geographic expressions, we study the task of temporal tagging, i.e., the extraction and normalization of temporal expressions. A crucial issue is that so far most research on temporal tagging dealt with English news-style documents. However, temporal expressions have to be handled in different ways depending on the domain of the documents from which they are extracted. Since we do not want to limit our research to one domain and one language, we develop the multilingual, cross-domain temporal tagger HeidelTime. It is the only publicly available temporal tagger for several languages and easy to extend to further languages. In addition, it achieves state-of-the-art evaluation results for all addressed domains and languages, and lays the foundations for all further contributions developed in this thesis.
To achieve our goal of exploiting temporal and geographic expressions for event-centric information retrieval from a variety of text documents, we introduce the concept of spatio-temporal events and several concepts to "compute" with temporal, geographic, and event information. These concepts are used to develop a spatio-temporal ranking approach, which considers not only textual, temporal, and geographic query parts but also two different types of proximity information. Furthermore, we adapt the spatio-temporal search idea by presenting a framework to directly search for events. Additionally, several map-based exploration frameworks are introduced that allow a new way of exploring event information latently contained in huge document collections. Finally, an event-centric document similarity model is developed that calculates document similarity on multilingual corpora solely based on extracted and normalized event information.
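To make the two halves of temporal tagging concrete, the following is a minimal Python sketch of extraction (finding the span) and normalization (mapping it to an unambiguous ISO 8601 value). It is an illustration only, not HeidelTime's rule set or API: the pattern covers a single English date format, and the names `DATE_PATTERN` and `tag_dates` are hypothetical.

```python
import re
from datetime import date

# Assumption: only English "Month D, YYYY" expressions are handled.
MONTH_NAMES = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]
MONTHS = {name: number for number, name in enumerate(MONTH_NAMES, start=1)}

DATE_PATTERN = re.compile(
    r"\b(" + "|".join(MONTH_NAMES) + r") (\d{1,2}), (\d{4})\b"
)


def tag_dates(text):
    """Return (start, end, expression, iso_value) for each explicit date."""
    results = []
    for match in DATE_PATTERN.finditer(text):
        month = MONTHS[match.group(1)]
        day = int(match.group(2))
        year = int(match.group(3))
        try:
            # Normalization step: convert the surface form to ISO 8601.
            value = date(year, month, day).isoformat()
        except ValueError:
            # Skip impossible dates such as "February 30, 2013".
            continue
        results.append((match.start(), match.end(), match.group(0), value))
    return results


if __name__ == "__main__":
    sample = "The summit opened on March 5, 2014 and closed on March 7, 2014."
    for start, end, expression, value in tag_dates(sample):
        print(start, end, expression, value)
```

A real tagger also has to resolve relative and underspecified expressions ("yesterday", "in March") against a reference time, and the abstract notes that this handling depends on the document domain.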
tieval: An Evaluation Framework for Temporal Information Extraction Systems
Temporal information extraction (TIE) has attracted a great deal of interest
over the last two decades, leading to the development of a significant number
of datasets. Despite its benefits, access to such a large number of corpora
makes benchmarking TIE systems difficult. On the one hand,
different datasets have different annotation schemes, thus hindering the
comparison between competitors across different corpora. On the other hand, the
fact that each corpus is commonly disseminated in a different format requires a
considerable engineering effort for a researcher/practitioner to develop
parsers for all of them. This constraint forces researchers to select a limited
number of datasets to evaluate their systems, which consequently limits the
comparability of the systems. Yet another obstacle that hinders the
comparability of TIE systems is the evaluation metric employed. While most
research works adopt traditional metrics such as precision, recall, and $F_1$,
a few others prefer temporal awareness -- a metric tailored to be more
comprehensive on the evaluation of temporal systems. Although the reason for
the absence of temporal awareness in the evaluation of most systems is not
clear, one of the factors that certainly weighs on this decision is the necessity
to implement the temporal closure algorithm in order to compute temporal
awareness, which is neither straightforward to implement nor currently
readily available. All in all, these problems have limited the fair comparison
between approaches and consequently, the development of temporal extraction
systems. To mitigate these problems, we have developed tieval, a Python library
that provides a concise interface for importing different corpora and
facilitates system evaluation. In this paper, we present the first public
release of tieval and highlight its most relevant features.
Comment: 10 pages
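The metric issue described in this abstract can be illustrated with a small Python sketch. It computes precision, recall, and $F_1$ over sets of annotations, and then shows why a closure step matters when comparing temporal relations. This is not the tieval API and not the full temporal awareness metric: the function names are hypothetical, and the closure handles only the transitivity of BEFORE, which is a deliberate simplification.

```python
from itertools import product


def precision_recall_f1(predicted, reference):
    """Set-based precision, recall, and F1 over hashable annotations."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def before_closure(pairs):
    """Transitive closure of BEFORE pairs: (a, b) and (b, c) imply (a, c).

    Simplification: real temporal closure also covers relations such as
    AFTER, INCLUDES, and SIMULTANEOUS and their compositions.
    """
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        # Iterate over a snapshot so the set can grow inside the loop.
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure


if __name__ == "__main__":
    reference = {("e1", "e2"), ("e2", "e3")}
    system = {("e1", "e2"), ("e2", "e3"), ("e1", "e3")}
    print(precision_recall_f1(system, reference))
    print(precision_recall_f1(before_closure(system), before_closure(reference)))
```

In the example, the system's extra relation (`e1` before `e3`) is counted as an error by the plain comparison even though it follows logically from the reference annotations. Comparing the closures removes that penalty.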
Extracting Temporal and Causal Relations between Events
Structured information resulting from temporal information processing is
crucial for a variety of natural language processing tasks, for instance to
generate timeline summaries of events from news documents, or to answer
temporal/causal-related questions about some events. In this thesis we present
a framework for an integrated temporal and causal relation extraction system.
We first develop a robust extraction component for each type of relations, i.e.
temporal order and causality. We then combine the two extraction components
into an integrated relation extraction system, CATENA (CAusal and Temporal
relation Extraction from NAtural language texts), by utilizing the
presumption about event precedence in causality, that causing events must
happen BEFORE resulting events. Several resources and techniques to improve
our relation extraction systems are also discussed, including word embeddings
and training data expansion. Finally, we report our adaptation efforts of
temporal information processing for languages other than English, namely
Italian and Indonesian.
Comment: PhD Thesis
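The precedence presumption mentioned in this abstract (a causing event happens BEFORE its resulting event) can be sketched as a simple post-processing rule in Python. This is an assumed illustration, not CATENA's actual implementation: the data layout, the function name `apply_causal_precedence`, and the choice to flag rather than overwrite conflicting labels are all hypothetical.

```python
def apply_causal_precedence(temporal, causal):
    """Combine temporal labels with causal links.

    temporal: dict mapping (source, target) to a temporal label.
    causal:   iterable of (cause, effect) event pairs.

    Returns (merged, conflicts). Causal pairs without a temporal label
    receive BEFORE; pairs whose existing label is not BEFORE are left
    unchanged and reported as (cause, effect, label) conflicts.
    """
    merged = dict(temporal)
    conflicts = []
    for cause, effect in causal:
        label = merged.get((cause, effect))
        if label is None:
            # Rule from the abstract: a cause precedes its effect.
            merged[(cause, effect)] = "BEFORE"
        elif label != "BEFORE":
            conflicts.append((cause, effect, label))
    return merged, conflicts


if __name__ == "__main__":
    temporal = {("storm", "outage"): "AFTER"}
    causal = [("storm", "outage"), ("rain", "flood")]
    merged, conflicts = apply_causal_precedence(temporal, causal)
    print(merged)
    print(conflicts)
```

Flagging conflicts instead of overwriting them keeps the decision about which component to trust explicit, since either the temporal or the causal classifier may be the one in error.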
Tint, the Swiss-Army Tool for Natural Language Processing in Italian
In this paper we present the latest version of Tint, an open-source, fast, and extendable Natural Language Processing suite for Italian based on Stanford CoreNLP. The new release includes a set of text processing components for fine-grained linguistic analysis, from tokenization to relation extraction, including part-of-speech tagging, morphological analysis, lemmatization, multi-word expression recognition, dependency parsing, named-entity recognition, keyword extraction, and much more. Tint is written in Java and freely distributed under the GPL license. Although some modules do not perform at a state-of-the-art level, Tint reaches very good accuracy in all modules and can be easily used out-of-the-box.