64 research outputs found
Supervised Machine Learning Techniques to Detect TimeML Events in French and English
International audienceIdentifying events from texts is an information extraction task necessary for many NLP applications. Through the TimeML specifications and TempEval challenges, it has received some attention in the last years; yet, no reference result is available for French. In this paper, we try to fill this gap by proposing several event extraction systems, combining for instance Conditional Random Fields, language modeling and k-nearest-neighbors. These systems are evaluated on French corpora and compared with state-of-the-art methods on English. The very good results obtained on both languages validate our whole approach
Extracting Temporal and Causal Relations between Events
Structured information resulting from temporal information processing is
crucial for a variety of natural language processing tasks, for instance to
generate timeline summarization of events from news documents, or to answer
temporal/causal-related questions about some events. In this thesis we present
a framework for an integrated temporal and causal relation extraction system.
We first develop a robust extraction component for each type of relations, i.e.
temporal order and causality. We then combine the two extraction components
into an integrated relation extraction system, CATENA---CAusal and Temporal
relation Extraction from NAtural language texts---, by utilizing the
presumption about event precedence in causality, that causing events must
happened BEFORE resulting events. Several resources and techniques to improve
our relation extraction systems are also discussed, including word embeddings
and training data expansion. Finally, we report our adaptation efforts of
temporal information processing for languages other than English, namely
Italian and Indonesian.Comment: PhD Thesi
Textual Information Extraction
Material completo EI
Processing temporal information in unstructured documents
Tese de doutoramento, Informática (Ciência da Computação), Universidade de Lisboa, Faculdade de Ciências, 2013Temporal information processing has received substantial attention in the last few years, due to the appearance of evaluation challenges focused on the extraction of temporal information from texts written in natural language. This research area belongs to the broader field of information extraction, which aims to automatically find specific pieces of information in texts, producing structured representations of that information, which can then be easily used by other computer applications. It has the potential to be useful in several applications that deal with natural language, given that many languages, among which we find Portuguese, extensively refer to time. Despite that, temporal processing is still incipient for many language, Portuguese being one of them. The present dissertation has various goals. On one hand, it addresses this current gap, by developing and making available resources that support the development of tools for this task, employing this language, and also by developing precisely this kind of tools. On the other hand, its purpose is also to report on important results of the research on this area of temporal processing. This work shows how temporal processing requires and benefits from modeling different kinds of knowledge: grammatical knowledge, logical knowledge, knowledge about the world, etc. Additionally, both machine learning methods and rule-based approaches are explored and used in the development of hybrid systems that are capable of taking advantage of the strengths of each of these two types of approach.O processamento de informação temporal tem recebido bastante atenção nos últimos anos, devido ao surgimento de desafios de avaliação focados na extração de informação temporal de textos escritos em linguagem natural. Esta área de investigação enquadra-se no campo mais lato da extração de informação, que visa encontrar automaticamente informação específica presente em textos, produzindo representações estruturadas da mesma, que podem depois ser facilmente utilizadas por outras aplicações computacionais. Tem o potencial de ser útil em diversas aplicações que lidam com linguagem natural, dado o caráter quase ubíquo da referência ao tempo cronólogico em muitas línguas, entre as quais o Português. Apesar de tudo, o processamento temporal encontra-se ainda incipiente para bastantes línguas, sendo o Português uma delas. A presente dissertação tem vários objetivos. Por um lado vem colmatar esta lacuna existente, desenvolvendo e disponibilizando recursos que suportam o desenvolvimento de ferramentas para esta tarefa, utilizando esta língua, e desenvolvendo também precisamente este tipo de ferramentas. Por outro serve também para relatar resultados importantes da pesquisa nesta área do processamento temporal. Neste trabalho, mostra- -se como o processamento temporal requer e beneficia da modelação de conhecimento de diversos níveis: gramatical, lógico, acerca do mundo, etc. Adicionalmente, são explorados tanto métodos de aprendizagem automática como abordagens baseadas em regras, desenvolvendo-se sistemas híbridos capazes de tirar partido das vantagens de cada um destes dois tipos de abordagem.Fundação para a Ciência e a Tecnologia (FCT, SFRH/BD/40140/2007
Recommended from our members
Content Selection for Timeline Generation from Single History Articles
This thesis investigates the problem of content selection for timeline generation from single history articles. While the task of timeline generation has been addressed before, most previous approaches assume the existence of a large corpus of history articles from the same era. They exploit the fact that salient information is likely to be mentioned multiple times in such corpora. However, large resources of this kind are only available for historical events that happened in the most recent decades. In this thesis, I present approaches which can be used to create history timelines for any historical period, even for eras such as the Middle Ages, for which no large corpora of supplementary text exist.
The thesis first presents a system that selects relevant historical figures in a given article, a task which is substantially easier than full timeline generation.
I show that a supervised approach which uses linguistic, structural and semantic features outperforms a competitive baseline on this task.
Based on the observations made in this initial study, I then develop approaches for timeline generation. I find that an unsupervised approach that takes into account the article's subject area outperforms several supervised and unsupervised baselines.
A main focus of this thesis is the development of evaluation methodologies and resources, as no suitable corpora existed when work began.
For the initial experiment on important historical figures, I construct a corpus of existing timelines and textual articles, and devise a method for evaluating algorithms based on this resource.
For timeline generation, I present a comprehensive evaluation methodology which is based on the interpretation of the task as a special form of single-document summarisation. This methodology scores algorithms based on meaning units rather than surface similarity. Unlike previous semantic-units-based evaluation methods for summarisation, my evaluation method does not require any manual annotation of system timelines. Once an evaluation resource has been created, which involves only annotation of the input texts, new timeline generation algorithms can be tested at no cost. This crucial advantage should make my new evaluation methodology attractive for the evaluation of general single-document summaries beyond timelines.
I also present an evaluation resource which is based on this methodology. It was constructed using gold-standard timelines elicited from 30 human timeline writers, and has been made publicly available.
This thesis concentrates on the content selection stage of timeline generation, and leaves the surface realisation step for future work. However, my evaluation methodology is designed in such a way that it can in principle also quantify the degree to which surface realisation is successful
Extracting Temporal Expressions from Unstructured Open Resources
AETAS is an end-to-end system with SOA approach that retrieves plain text data from web and blog news and represents and stores them in RDF, with a special focus on their temporal dimension. The system allows users to acquire, browse and query Linked Data obtained from unstructured sources
Disambiguating Discourse Connectives for Statistical Machine Translation
This paper shows that the automatic labeling of discourse connectives with the relations they signal, prior to machine translation (MT), can be used by phrase-based statistical MT systems to improve their translations. This improvement is demonstrated here when translating from English to four target languages - French, German, Italian and Arabic - using several test sets from recent MT evaluation campaigns. Using automatically labeled data for training, tuning and testing MT systems is beneficial on condition that labels are sufficiently accurate, typically above 70%. To reach such an accuracy, a large array of features for discourse connective labeling (morpho-syntactic, semantic and discursive) are extracted using state-of-the-art tools and exploited in factored MT models. The translation of connectives is improved significantly, between 0.7% and 10% as measured with the dedicated ACT metric. The improvements depend mainly on the level of ambiguity of the connectives in the test sets
- …