4 research outputs found

    Processing temporal information in unstructured documents

    Doctoral thesis, Informatics (Computer Science), Universidade de Lisboa, Faculdade de Ciências, 2013. Temporal information processing has received substantial attention in the last few years, owing to the appearance of evaluation challenges focused on extracting temporal information from texts written in natural language. This research area belongs to the broader field of information extraction, which aims to automatically find specific pieces of information in texts and produce structured representations of that information, which other computer applications can then easily use. It has the potential to be useful in many applications that deal with natural language, given that reference to chronological time is almost ubiquitous in many languages, Portuguese among them. Despite that, temporal processing is still incipient for many languages, Portuguese being one of them. The present dissertation has several goals. On the one hand, it addresses this gap by developing and making available resources that support the development of tools for this task in Portuguese, and by developing tools of precisely this kind. On the other hand, it reports important results of research in this area of temporal processing. This work shows how temporal processing requires and benefits from modeling different kinds of knowledge: grammatical knowledge, logical knowledge, knowledge about the world, etc. Additionally, both machine learning methods and rule-based approaches are explored and used to develop hybrid systems that take advantage of the strengths of each of these two types of approach. Fundação para a Ciência e a Tecnologia (FCT, SFRH/BD/40140/2007)

    Automatic recognition and normalization of temporal expressions in Serbian unstructured newspaper and medical texts

    In everyday life, people use time as a universal reference system within which events or states are sequenced one after another, their duration is established, and the moment an event occurred is stated. The meaning of time and the way humans perceive it are reflected in communication, above all in linguistic expressions frequently used in everyday speech. Temporal expressions, as natural language phrases that refer directly to time, provide information on when something happened, how long it lasted, or how often it occurs. As the information society develops, the amount of freely available digital information increases. This makes it more likely that the necessary information can be found, but it also makes the search more complex, requiring advanced computer tools and more powerful methods for processing natural language text. Since the meaning of most electronic information changes depending on the time expressed in it, successful understanding of natural language text requires tools that can both automatically mark the information that refers to time and enable the chronological order of the described events to be established. It is therefore necessary to develop tools for extracting temporal expressions that achieve high precision and recall and that can be adapted quickly and easily to new requirements or to texts from a different domain. Such a system can greatly improve the performance of many other language technology applications (information extraction, information retrieval, question answering, text summarization, etc.), and can also contribute to the preservation of the Serbian language in the contemporary digital environment.

    Tune your Brown clustering, please

    Brown clustering, an unsupervised hierarchical clustering technique based on n-gram mutual information, has proven useful in many NLP applications. However, most uses of Brown clustering employ the same default configuration; the appropriateness of this configuration has gone predominantly unexplored. Accordingly, we present information for practitioners on the behaviour of Brown clustering in order to assist hyper-parameter tuning, in the form of a theoretical model of Brown clustering utility. This model is then evaluated empirically in two sequence labelling tasks over two text types. We explore the dynamic between the input corpus size, the chosen number of classes, and the quality of the resulting clusters, which affects any approach using Brown clustering. In every scenario that we examine, our results reveal that the values most commonly used for the clustering are sub-optimal.