    A constraint-based hypergraph partitioning approach to coreference resolution

    This thesis focuses on machine learning for coreference resolution, a natural language processing task that consists of determining which expressions in a discourse mention or refer to the same real-world entity. The task has a direct effect on text mining and on many natural language tasks that require discourse interpretation, such as summarization, question answering, and machine translation. Resolving coreferences is essential for "understanding" a text or a discourse.
Specifically, the research objectives cover the following areas:
+ Classification models: The most common classification models in the state of the art are based on the independent classification of pairs of mentions. More recently, models that classify groups of mentions have appeared. One objective of the thesis is to incorporate the entity-mention model into the developed approach.
+ Problem representation: There is not yet a definitive representation of the problem. This thesis presents a hypergraph representation.
+ Resolution algorithms: Depending on the problem representation and the classification model, resolution algorithms can be very diverse. One objective of this thesis is to find a resolution algorithm able to use the classification models within the hypergraph representation.
+ Knowledge representation: Managing knowledge from diverse sources requires a symbolic and expressive representation of that knowledge. This thesis proposes the use of constraints.
+ Incorporation of world knowledge: Some coreferences cannot be resolved with linguistic information alone; common sense and world knowledge are often needed. This thesis proposes a method to extract world knowledge from Wikipedia and incorporate it into the resolution system.
The main contributions of this thesis are (i) a new approach to coreference resolution based on constraint satisfaction, which represents the problem as a hypergraph and solves it by relaxation labeling; and (ii) research towards improving coreference resolution performance using world knowledge extracted from Wikipedia. The approach can combine the mention-pair and entity-mention classification models, the latter being more expressive than pair-based ones, and so overcomes weaknesses of previous state-of-the-art approaches such as contradictions between independent classifications, classifications made without context, and lack of information when evaluating pairs. Furthermore, the approach allows new information to be incorporated by adding constraints, and research has been done on using world knowledge to improve performance. RelaxCor, the system implemented during the thesis to experiment with the approach, achieved results comparable to the best in the state of the art and participated in two international competitions, SemEval-2010 and CoNLL-2011, achieving second position in CoNLL-2011.
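
To make the idea of solving weighted constraints by relaxation labeling more concrete, here is a minimal sketch. It is not RelaxCor's implementation: it uses only pairwise constraints (no hyperedges for entity-mention constraints), the weights are invented for illustration, and the function name `relaxation_labeling` is hypothetical. It assumes the classic update rule, in which each label probability is scaled by one plus its support and then renormalized, and it assumes mention i may only take labels 0..i so that the initial uniform state is not symmetric.

```python
def relaxation_labeling(n_mentions, weights, n_iters=200, eps=1e-9):
    """Assign each mention a partition label from pairwise weighted constraints.

    weights maps a mention pair (i, k) to a value in [-1, 1]: positive values
    favour putting both mentions in the same partition, negative values oppose it.
    All weights here are illustrative assumptions, not learned constraints.
    """
    # Build a symmetric lookup table for the pairwise weights.
    w = {}
    for (i, k), value in weights.items():
        w[(i, k)] = value
        w[(k, i)] = value

    # Assumption: mention i may take labels 0..i, initialised uniformly.
    probs = [[1.0 / (i + 1)] * (i + 1) for i in range(n_mentions)]

    for _ in range(n_iters):
        new_probs = []
        for i in range(n_mentions):
            others = [k for k in range(n_mentions) if k != i]
            # Normalise the support into [-1, 1] so that 1 + support >= 0.
            norm = sum(abs(w.get((i, k), 0.0)) for k in others) or 1.0
            updated = []
            for label in range(i + 1):
                support = sum(
                    w.get((i, k), 0.0) * probs[k][label]
                    for k in others
                    if label < len(probs[k])
                ) / norm
                updated.append(probs[i][label] * (1.0 + support))
            total = sum(updated)
            new_probs.append([u / total for u in updated] if total > 0 else probs[i])
        change = max(
            abs(a - b)
            for old, new in zip(probs, new_probs)
            for a, b in zip(old, new)
        )
        probs = new_probs
        if change < eps:
            break

    # Each mention takes its most probable label.
    return [max(range(len(p)), key=p.__getitem__) for p in probs]


# Hypothetical mentions: 0 "Barack Obama", 1 "the president", 2 "Michelle", 3 "he".
example_weights = {
    (0, 1): 0.8, (0, 3): 0.6, (1, 3): 0.5,     # compatible pairs
    (0, 2): -0.9, (1, 2): -0.5, (2, 3): -0.7,  # incompatible pairs
}
labels = relaxation_labeling(4, example_weights)
print(labels)
```

Mentions that end up with the same label form one entity. Because all mentions are updated jointly from the same weighted constraints, a link between two mentions cannot contradict the links each of them has with a third, which is the kind of weakness of independent pair classification that the abstract refers to.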

    Coreference Resolution in Freeling 4.0

    This paper presents the integration of RelaxCor into FreeLing. RelaxCor is a coreference resolution system based on constraint satisfaction that ranked second in the CoNLL-2011 shared task. FreeLing is an open-source library for NLP with more than fifteen years of existence and a widespread user community. We present the difficulties found in porting RelaxCor from a shared task scenario to a production environment, as well as the solutions devised. We present two strategies for this integration and a rough evaluation of the obtained results.

    An Application of Natural Language Processing for Triangulation of Cognitive Load Assessments in Third Level Education

    Work on measuring Mental Workload (MWL) has mainly drawn on applications related to ergonomics, human factors, and Machine Learning. The influence of Machine Learning reflects the increased use of new technologies in areas conventionally dominated by theoretical approaches. However, MWL research rarely makes use of Natural Language Processing techniques. The objective of this research is therefore to use Natural Language Processing techniques to analyse the relationship between subjective Mental Workload measures and the Relative Frequency Ratios of keywords gathered during pre-tasks and post-tasks of MWL activities in third-level sessions covering different topics and instructional designs. This research employs secondary, empirical and inductive methods to investigate Cognitive Load theory, instructional designs, Mental Workload foundations and measures, and Natural Language Processing techniques. Then, NASA-TLX, Workload Profile and Relative Frequency Ratios are calculated. Finally, the relationship between the two workload measures (NASA-TLX and Workload Profile) and the Relative Frequency Ratios is analysed using parametric and non-parametric statistical techniques. Results show that Mental Workload and the Relative Frequency Ratios of keywords are at most moderately correlated, and in some cases not correlated at all. Furthermore, instructional designs based on hearing and seeing, combined with interaction between participants, were found to outperform other approaches, such as those that use videos supported with images and text, or a lecturer's speech supported with slides.
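
As an illustration of the keyword measure mentioned above, the sketch below computes a Relative Frequency Ratio under a common definition: a word's relative frequency in a target text divided by its relative frequency in a reference text. The abstract does not state the exact formula used in the study, so this definition, the add-one smoothing, the function name `relative_frequency_ratios`, and the sample tokens are all assumptions.

```python
from collections import Counter


def relative_frequency_ratios(target_tokens, reference_tokens, smoothing=1.0):
    """Return {word: RFR}, where values above 1 mean the word is relatively
    more frequent in the target text than in the reference text.

    Add-`smoothing` counts (an assumption of this sketch) avoid division by
    zero for words that are missing from one of the two texts.
    """
    target_counts = Counter(target_tokens)
    reference_counts = Counter(reference_tokens)
    vocab = set(target_counts) | set(reference_counts)

    target_total = len(target_tokens) + smoothing * len(vocab)
    reference_total = len(reference_tokens) + smoothing * len(vocab)

    return {
        word: ((target_counts[word] + smoothing) / target_total)
        / ((reference_counts[word] + smoothing) / reference_total)
        for word in vocab
    }


# Hypothetical keyword lists written by a participant after and before a session.
post_task = "schema memory load schema working memory".split()
pre_task = "memory test load lecture".split()
ratios = relative_frequency_ratios(post_task, pre_task)
for word, ratio in sorted(ratios.items(), key=lambda item: -item[1]):
    print(f"{word}: {ratio:.2f}")
```

Per-participant ratios of this kind could then be compared against NASA-TLX or Workload Profile scores with a correlation test, which is the type of analysis the abstract describes.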

    Distance learning of anaphora in English and Spanish as foreign languages

    This doctoral thesis investigates the distance learning of anaphora in English and Spanish as foreign languages. It analyses how native speakers of Portuguese who are learners of English or Spanish understand and produce anaphora with nominal antecedents in written texts, and how different distance learning modalities can contribute to the learning of this discursive mechanism. In total, 11 articles were written and distributed across 4 sections. The first section focuses on investigating ambiguity resolution based on an online questionnaire distributed to learners and native speakers of Portuguese, English, and Spanish. The first paper presents a pilot study conducted in Portugal, the second includes data from Brazil, and the third was written after the data collection was completed. In the questionnaires, it was possible to control several variables to analyse how speakers resolved anaphoric ambiguity. The second section reviews the literature on the teaching and learning of anaphora, the theories and methods focused on language teaching, and the different teaching modalities. These studies informed the conceptual design of the experiment carried out later. The third section of the thesis presents that experiment, which consisted of offering a course on anaphora in synchronous and asynchronous distance learning modalities, with monitoring of learning over time. The first article explains how the course was planned; the second presents the groups' results in the comprehension tests; and the third evaluates the course qualitatively. The fourth section presents the learner corpora compiled, BRANEN and BRANES, and the analysis of the anaphoric relations produced by the students over four tests (a pre-test, an intermediate test, an immediate post-test, and a retention test after one month).
The thesis ends with a synopsis of the results obtained, their discussion, and a conclusion looking towards future lines of research.