28 research outputs found

    One, no one and one hundred thousand events: Defining and processing events in an inter-disciplinary perspective

    We present an overview of event definition and processing spanning 25 years of research in NLP. We first provide linguistic background to the notion of event, and then present past attempts to formalize this concept in annotation standards to foster the development of benchmarks for event extraction systems. This ranges from MUC-3 in 1991 to the Time and Space Track challenge at SemEval 2015. We also shed light on other disciplines in which the notion of event plays a crucial role, with a focus on the historical domain. Our goal is to provide a comprehensive study of event definitions and to investigate the potential that past efforts in the NLP community may have in a different research domain. We present the results of a questionnaire in which the notion of event for historians is put in relation to the NLP perspective.

    Temporal processing of news: annotation of temporal expressions, verbal events and temporal relations

    The ability to capture the temporal dimension of a natural language text is essential to many natural language processing applications, such as Question Answering, Automatic Summarisation, and Information Retrieval. Temporal processing is a field of Computational Linguistics which aims to access this dimension and derive a precise temporal representation of a natural language text by extracting time expressions, events and temporal relations, and then representing them according to a chosen knowledge framework. This thesis focuses on the investigation and understanding of the different ways time is expressed in natural language, on the implementation of a temporal processing system in accordance with the results of this investigation, on the evaluation of the system, and on the extensive analysis of the errors and challenges that appear during system development. The ultimate goal of this research is to develop the ability to automatically annotate temporal expressions, verbal events and temporal relations in a natural language text. Temporal expression annotation involves two stages: temporal expression identification, concerned with determining the textual extent of a temporal expression, and temporal expression normalisation, which finds the value that the temporal expression designates and represents it using an annotation standard. The research presented in this thesis approaches these tasks with a knowledge-based methodology that tackles temporal expressions according to their semantic classification. Several knowledge sources and normalisation models are experimented with to allow an analysis of their impact on system performance. The annotation of events expressed using either finite or non-finite verbs is addressed with a method that overcomes the drawback of existing methods which associate an event with the class that is most frequently assigned to it in a corpus and are limited in coverage by the small number of events present in the corpus. 
This limitation is overcome in this research by annotating each WordNet verb with an event class that best characterises that verb. This thesis also describes an original methodology for the identification of temporal relations that hold among events and temporal expressions. The method relies on sentence-level syntactic trees and a propagation of temporal relations between syntactic constituents, by analysing syntactic and lexical properties of the constituents and of the relations between them. The detailed evaluation and error analysis of the methods proposed for solving different temporal processing tasks form an important part of this research. Various corpora widely used by researchers studying different temporal phenomena are employed in the evaluation, thus enabling comparison with state of the art in the field. The detailed error analysis targeting each temporal processing task helps identify not only problems of the implemented methods, but also reliability problems of the annotated resources, and encourages potential reexaminations of some temporal processing tasks. (EThOS - Electronic Theses Online Service, United Kingdom)
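The two stages of temporal expression annotation named in the abstract above, identification and normalisation, can be illustrated with a minimal Python sketch. It is not the thesis's knowledge-based system: the regular expression, the `find_timexes` helper, the toy month lexicon and the choice of ISO 8601 strings as the normalised value are all assumptions made for illustration, and only fully specified dates such as "12 March 2009" are covered.

```python
import re

# Month names mapped to their numbers; a toy knowledge source (assumption).
MONTHS = {
    "january": 1, "february": 2, "march": 3, "april": 4,
    "may": 5, "june": 6, "july": 7, "august": 8,
    "september": 9, "october": 10, "november": 11, "december": 12,
}

# Identification: matches explicit dates of the form "12 March 2009".
DATE_PATTERN = re.compile(
    r"\b(\d{1,2}) (" + "|".join(MONTHS) + r") (\d{4})\b",
    re.IGNORECASE,
)


def find_timexes(text):
    """Return (start, end, surface text, normalised value) for each date found."""
    results = []
    for match in DATE_PATTERN.finditer(text):
        day, month_name, year = match.groups()
        # Normalisation: map the surface form to an ISO 8601 calendar date.
        # No calendar validation is done here (e.g. "45 March" would pass).
        month = MONTHS[month_name.lower()]
        value = f"{int(year):04d}-{month:02d}-{int(day):02d}"
        results.append((match.start(), match.end(), match.group(0), value))
    return results


if __name__ == "__main__":
    sample = "The treaty was signed on 12 March 2009 and ratified on 3 June 2010."
    for start, end, surface, value in find_timexes(sample):
        print(start, end, surface, value)
```

Keeping the character offsets alongside the normalised value mirrors the split between textual extent and designated value; relative expressions such as "last week" would need a reference date and are out of scope for this sketch.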

    Extracting Temporal Expressions from Unstructured Open Resources

    AETAS is an end-to-end system following an SOA approach that retrieves plain-text data from web and blog news, then represents and stores it in RDF, with a special focus on its temporal dimension. The system allows users to acquire, browse and query Linked Data obtained from unstructured sources.

    Domain-sensitive Temporal Tagging for Event-centric Information Retrieval

    Temporal and geographic information is of major importance in virtually all contexts. Thus, it also occurs frequently in many types of text documents in the form of temporal and geographic expressions. Often, those are used to refer to something that was, is, or will be happening at some specific time and some specific place – in other words, temporal and geographic expressions are often used to refer to events. However, so far, event-related information needs are not well served by standard information retrieval approaches, which motivates the topic of this thesis: event-centric information retrieval. An important characteristic of temporal and geographic expressions – and thus of two components of events – is that they can be normalized so that their meaning is unambiguous and can be placed on a timeline or pinpointed on a map. In many research areas in which natural language processing is involved, e.g., in information retrieval, document summarization, and question answering, applications can benefit greatly from having access to normalized information instead of only the words as they occur in documents. In this thesis, we present several frameworks for searching and exploring document collections with respect to occurring temporal, geographic, and event information. While we rely on an existing tool for extracting and normalizing geographic expressions, we study the task of temporal tagging, i.e., the extraction and normalization of temporal expressions. A crucial issue is that so far most research on temporal tagging has dealt with English news-style documents. However, temporal expressions have to be handled in different ways depending on the domain of the documents from which they are extracted. Since we do not want to limit our research to one domain and one language, we develop the multilingual, cross-domain temporal tagger HeidelTime. It is the only publicly available temporal tagger for several languages and is easy to extend to further languages. 
In addition, it achieves state-of-the-art evaluation results for all addressed domains and languages, and lays the foundations for all further contributions developed in this thesis. To achieve our goal of exploiting temporal and geographic expressions for event-centric information retrieval from a variety of text documents, we introduce the concept of spatio-temporal events and several concepts to "compute" with temporal, geographic, and event information. These concepts are used to develop a spatio-temporal ranking approach, which considers not only textual, temporal, and geographic query parts but also two different types of proximity information. Furthermore, we adapt the spatio-temporal search idea by presenting a framework to directly search for events. Additionally, several map-based exploration frameworks are introduced that allow a new way of exploring event information latently contained in huge document collections. Finally, an event-centric document similarity model is developed that calculates document similarity on multilingual corpora solely based on extracted and normalized event information.

    Time, events and temporal relations: an empirical model for temporal processing of Italian texts

    The aim of this work is the elaboration of a computational model for the identification of temporal relations in text/discourse, to be used as a component in more complex systems for Open-Domain Question Answering, Information Extraction and Summarization. More specifically, the thesis will concentrate on the relationships between the various elements which signal temporal relations in Italian texts/discourses, on their roles and on how they can be exploited. Time is a pervasive element of human life. It is the primary element thanks to which we are able to observe, describe and reason about what surrounds us and the world. The absence of a correct identification of the temporal ordering of what is narrated and/or described may result in poor comprehension, which can lead to misunderstanding. Normally, texts/discourses present situations standing in a particular temporal ordering. Whether these situations precede one another, overlap, or are included one within the other is inferred during the general process of reading and understanding. Nevertheless, to perform this seemingly easy task, we take into account a set of complex information involving different linguistic entities and sources of knowledge. A wide variety of devices is used in natural languages to convey temporal information. Verb tense, temporal prepositions, subordinate conjunctions and adjectival phrases are some of the most obvious. Nevertheless, even these obvious devices have different degrees of temporal transparency, which may not be as obvious as it appears on a quick and superficial analysis. One of the main shortcomings of previous research on temporal relations is that it concentrated only on a particular discourse segment, namely narrative discourse, disregarding the fact that a text/discourse is composed of different types of discourse segments and relations. A good theory or framework for temporal analysis must take into account all of them. 
In this work, we have concentrated on the elaboration of a framework which could be applied to all text/discourse segments, without paying too much attention to their type, since we claim that temporal relations can be recovered in every kind of discourse segment and not only in narrative ones. The model we propose is obtained by combining theoretical assumptions and empirical data, collected by means of two tests administered to a total of 35 subjects with different backgrounds. The main results we have obtained from these empirical studies are: (i) a general evaluation of the difficulty of the task of recovering temporal relations; (ii) information on the level of granularity of temporal relations; (iii) a saliency-based order of application of the linguistic devices used to express the temporal relations between two eventualities; (iv) the proposal of tense temporal polysemy, as a device to identify the set of preferences which can assign unique values to possibly multiple temporal relations. On the basis of the empirical data, we propose to enlarge the set of classical fine-grained interval relations (Allen, 1983) by also including coarse-grained temporal relations (Freksa, 1992). Moreover, there could be cases in which we are not able to state reliably whether a temporal relation exists or what the particular relation between two entities is. To overcome this issue we have adopted the proposal by Mani (2007), which allows the system to have differentiated levels of temporal representation on the basis of the temporal granularity associated with each discourse segment. The lack of an annotated corpus for eventualities, temporal expressions and temporal relations in Italian is the biggest shortcoming of this work, and it has prevented the implementation of the model and its evaluation. Nevertheless, we have been able to conduct a series of experiments for the validation of procedures for the further realization of a working prototype. 
In addition, we have been able to implement and validate a working prototype for spotting temporal expressions in texts/discourses.
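The fine-grained interval relations of Allen (1983) cited in the abstract above can be made concrete with a short Python sketch. It is not the thesis's model: it assumes that eventualities have already been anchored to numeric `(start, end)` intervals with `start < end`, and the function name `allen_relation` and the relation labels are chosen here for illustration. The coarse-grained relations of Freksa (1992) are not covered.

```python
def allen_relation(a, b):
    """Return the Allen (1983) relation holding from interval a to interval b.

    Each interval is a (start, end) pair with start < end.
    """
    a_start, a_end = a
    b_start, b_end = b
    if not (a_start < a_end and b_start < b_end):
        raise ValueError("intervals must satisfy start < end")

    # Disjoint intervals, or intervals that touch at a single point.
    if a_end < b_start:
        return "before"
    if b_end < a_start:
        return "after"
    if a_end == b_start:
        return "meets"
    if b_end == a_start:
        return "met-by"

    # From here on the intervals share more than one point.
    if a_start == b_start and a_end == b_end:
        return "equals"
    if a_start == b_start:
        return "starts" if a_end < b_end else "started-by"
    if a_end == b_end:
        return "finishes" if a_start > b_start else "finished-by"
    if b_start < a_start and a_end < b_end:
        return "during"
    if a_start < b_start and b_end < a_end:
        return "contains"
    # Only partial overlap remains; the earlier start decides the direction.
    return "overlaps" if a_start < b_start else "overlapped-by"


if __name__ == "__main__":
    # Hypothetical eventualities anchored to hours of a day.
    breakfast = (8, 9)
    meeting = (9, 11)
    phone_call = (10, 12)
    print(allen_relation(breakfast, meeting))
    print(allen_relation(meeting, phone_call))
    print(allen_relation(phone_call, breakfast))
```

The function returns exactly one of the thirteen relations for any pair of well-formed intervals, which is the property that makes the fine-grained set hard to apply when the text leaves the endpoints underspecified.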

    Normalisation of imprecise temporal expressions extracted from text

    Advisor: Prof. Dr. Marcos Didonet Del Fabro. Co-advisor: Prof. Dr. Angus Roberts. Thesis (doctorate) - Universidade Federal do Paraná, Setor de Ciências Exatas, Programa de Pós-Graduação em Informática. Defense: Curitiba, 05/04/2016. Includes references: f. 95-105.
Abstract: Information Extraction systems and techniques are able to deal with the increasing amount of unstructured data available nowadays. Time is amongst the different kinds of information that may be extracted from such unstructured data sources, including text documents. Time describes changes which happen through the occurrence of events, and provides a way to record, order, and measure the duration of such occurrences. The inability to identify and extract temporal information from text makes it difficult to understand how the events are organized in chronological order. Moreover, in many situations, the meaning of temporal expressions is imprecise, and cannot be accurately described, leading to interpretation errors. Existing solutions provide alternative ways of representing imprecise temporal expressions, though they are specific and hard to generalise. Furthermore, the analysis of temporal data may be particularly inefficient in the presence of spelling errors. Existing approaches use string similarity methods to search for valid words within a text. However, they are not rich enough to process misspellings in an efficient way. 
In this thesis, we present a methodology to analyse and normalise imprecise temporal expressions, in which, after collecting and pre-processing data on how people interpret vague descriptions of time in text, we compare different techniques in order to create and select the most appropriate normalisation model for different kinds of imprecise expressions. We also compare how a rule-based system and a machine learning approach perform when identifying temporal expressions in text, and we analyse the process of producing gold standards, identifying possible sources of issues and giving some recommendations to be considered in future manual annotation efforts. Finally, we propose a phonetic map and evaluate how encoding phonetic information could be used to assist similarity search methods and improve information extraction quality.
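The idea of encoding phonetic information to help similarity search cope with misspellings can be sketched in Python. The thesis's own phonetic map is not described in this abstract, so the sketch below substitutes the classic Soundex code as a stand-in; the lexicon of weekday and month names and the helper `match_temporal_word` are likewise assumptions made for illustration.

```python
# Soundex digit for each consonant; vowels, y, h and w have no digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """Return the four-character Soundex code of a word ("" if no letters)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "hw":
            continue  # h and w do not break a run of identical codes
        code = CODES.get(letter, "")
        if code and code != previous:
            encoded += code
        previous = code  # a vowel resets the run because its code is ""
    return (encoded + "000")[:4]


# Toy lexicon of valid temporal words (assumption).
LEXICON = [
    "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday",
    "Sunday", "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# Index the lexicon by phonetic code so that lookup is a dictionary access.
PHONETIC_INDEX = {}
for entry in LEXICON:
    PHONETIC_INDEX.setdefault(soundex(entry), []).append(entry)


def match_temporal_word(token):
    """Return lexicon entries that share the token's Soundex code."""
    return PHONETIC_INDEX.get(soundex(token), [])


if __name__ == "__main__":
    for token in ("Tuseday", "Saterday", "Desember", "Febuary"):
        print(token, match_temporal_word(token))
```

With this stand-in, "Tuseday", "Saterday" and "Desember" each retrieve the intended word, while "Febuary" retrieves nothing because dropping the first "r" changes its Soundex code. The sketch therefore shows the lookup mechanism only, not the coverage a purpose-built phonetic map might reach.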