26 research outputs found

    Coreference Resolution via Hypergraph Partitioning

    Coreference resolution is one of the most fundamental natural language processing tasks; it aims to identify the coreference relation in texts. The task is to group mentions (i.e. phrases of interest) into sets so that all mentions in one set refer to the same entity (i.e. a real-world object). Mentions are conventionally proper names, common nouns and pronouns. Lately, the coreference task has been extended to deal with verb phrases too; however, we only work with noun phrase mentions in this thesis. Linking mentions together in a document not only recovers entities but also connects different fragments of the context, which leads to better text understanding. Coreference resolution is important to many applications, such as text summarization and information extraction. In this thesis, we propose a novel coreference model based on hypergraph partitioning. Our system is named COPA, standing for Coreference Partitioner. Given a raw document, COPA represents it as a hypergraph, upon which hypergraph partitioning algorithms are applied to derive coreference sets directly. The coreference relation is a high-dimensional relation, because it depends on multiple types of basic relations (e.g. string similarities and semantic relatedness). Most previous work on coreference resolution combines the basic relations between mentions into single ones and derives the coreference sets afterward. Since it is relatively expensive to learn the combination of the basic relations, we propose a novel hypergraph representation model for coreference resolution. In our model, the mentions are taken as vertices in the hypergraph and the relational features derived from the basic relations as hyperedges. The hypergraph allows for multiple edges between vertices, so it suits the high-dimensional nature of the coreference relation. Moreover, in a hypergraph one hyperedge can connect more than two vertices.
    As a result, the hypergraph directly represents the relations between sets of mentions, as required for the coreference resolution task. Since the basic relations are incorporated in an overlapping manner, COPA needs only a few training documents to achieve competitive performance. This weakly supervised nature makes COPA a good candidate for application to different domains or languages, or when only limited training data is available. Inference in coreference resolution deals with sets of mentions: it needs to capture the relations between multiple mentions in order to derive the final coreference sets. Therefore, we consider coreference resolution as a set problem. Most previous coreference models address the set problem by dividing the resolution into two steps: a classification step and a clustering step. The classification step decides, for each pair of mentions, whether they are coreferent or not. Based on the pairwise decisions, the clustering step then groups mentions into the final sets. Because of this two-step division, classification performance is not necessarily positively correlated with the final evaluation numbers. It is difficult to track error propagation and hard to optimize with respect to the final coreference sets. Moreover, since the coreference decisions are made between pairs of mentions independently, global context information is missing in those models. In this thesis, we propose a global coreference model via hypergraph partitioning. We design two algorithms based on the spectral clustering technique: a hierarchical R2 partitioner and a flat k-way flatK partitioner. We also propose extensions to the clustering algorithms of COPA that include constraints to enforce cluster-level consistency. The constrained COPA is the first attempt towards a better learning scheme for our system.
    It solves the cluster-level inconsistency problem and at the same time contributes to research in the constrained graph clustering field. Since COPA is an end-to-end coreference system, this thesis also addresses the important implementation issues encountered when applying clustering algorithms in practice. For instance, the existing evaluation metrics become problematic when the automatically identified mentions do not align with those in the ground truth. In this thesis, we propose variants of the coreference evaluation metrics to tackle this problem. COPA outperforms several baseline systems in fair settings, using the same features and the same mentions and comparing only the effectiveness of the models themselves. It also performs competitively with state-of-the-art systems across different evaluation metrics, data sets and domains.
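
    The hypergraph representation and spectral partitioning described in the abstract above can be illustrated with a minimal sketch. It is not COPA's implementation: the six toy mentions, the hyperedge weights, the choice of the normalized hypergraph Laplacian and the single sign-based split are all assumptions made for illustration, and the function names are hypothetical.

```python
import numpy as np

# Toy mentions (illustrative data, not taken from the thesis).
mentions = ["Barack Obama", "Obama", "he", "Angela Merkel", "Merkel", "she"]

# One hyperedge per relational feature: (member mention indices, weight).
# A hyperedge may connect more than two mentions. Weights are assumed.
hyperedges = [
    ({0, 1}, 1.0),              # string match on "Obama"
    ({3, 4}, 1.0),              # string match on "Merkel"
    ({0, 1, 2}, 0.5),           # gender agreement (male)
    ({3, 4, 5}, 0.5),           # gender agreement (female)
    ({0, 1, 2, 3, 4, 5}, 0.1),  # weak feature shared by all (singular number)
]


def incidence_matrix(n_vertices, edges):
    """Return H with H[v, e] = 1 when vertex v belongs to hyperedge e."""
    H = np.zeros((n_vertices, len(edges)))
    for e, (members, _weight) in enumerate(edges):
        H[sorted(members), e] = 1.0
    return H


def hypergraph_laplacian(H, w):
    """Normalized Laplacian I - Dv^-1/2 H W De^-1 H^T Dv^-1/2."""
    dv = H @ w            # weighted degree of each vertex
    de = H.sum(axis=0)    # number of vertices in each hyperedge
    A = (H * (w / de)) @ H.T
    scale = 1.0 / np.sqrt(dv)
    return np.eye(H.shape[0]) - scale[:, None] * A * scale[None, :]


def bipartition(L):
    """Split vertices by the sign of the second-smallest eigenvector."""
    _values, vectors = np.linalg.eigh(L)  # eigenvalues in ascending order
    return vectors[:, 1] >= 0


H = incidence_matrix(len(mentions), hyperedges)
w = np.array([weight for _members, weight in hyperedges])
labels = bipartition(hypergraph_laplacian(H, w))

for mention, side in zip(mentions, labels):
    print(f"{mention!r}: cluster {int(side)}")
```

    In this toy example the split separates the three Obama mentions from the three Merkel mentions; which side is labelled 0 or 1 depends on the sign of the eigenvector and carries no meaning. A recursive variant would apply the same split again inside each part, while a flat k-way variant would cluster several eigenvectors at once.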

    A constraint-based hypergraph partitioning approach to coreference resolution

    The objectives of this thesis are focused on research in machine learning for coreference resolution. Coreference resolution is a natural language processing task that consists of determining the expressions in a discourse that mention or refer to the same entity. The main contributions of this thesis are (i) a new approach to coreference resolution based on constraint satisfaction, using a hypergraph to represent the problem and solving it by relaxation labeling; and (ii) research towards improving coreference resolution performance using world knowledge extracted from Wikipedia. The developed approach can use an entity-mention classification model, which is more expressive than pair-based ones, and overcomes weaknesses of previous state-of-the-art approaches such as linking contradictions, classifications without context and lack of information when evaluating pairs. Furthermore, the approach allows new information to be incorporated by adding constraints, and research has been carried out on using world knowledge to improve performance. RelaxCor, the implementation of the approach, achieved state-of-the-art results and participated in international competitions: SemEval-2010 and CoNLL-2011. RelaxCor achieved second position in CoNLL-2011.
    Coreference resolution is a natural language processing task that consists of determining the expressions in a discourse that refer to the same real-world entity. The task has a direct effect on text mining as well as on many natural language tasks that require discourse interpretation, such as summarization, question answering or machine translation. Resolving coreferences is essential in order to "understand" a text or a discourse. The objectives of this thesis focus on research in coreference resolution with machine learning. Specifically, the research objectives focus on the following areas:
    + Classification models: The most common classification models in the state of the art are based on the independent classification of pairs of mentions. More recently, models that classify groups of mentions have appeared. One of the objectives of the thesis is to incorporate the entity-mention model into the developed approach.
    + Problem representation: There is not yet a definitive representation of the problem. This thesis presents a hypergraph representation.
    + Resolution algorithms: Depending on the problem representation and the classification model, resolution algorithms can be very diverse. One of the objectives of this thesis is to find a resolution algorithm capable of using the classification models in the hypergraph representation.
    + Knowledge representation: To manage knowledge from diverse sources, a symbolic and expressive representation of that knowledge is needed. This thesis proposes the use of constraints.
    + Incorporation of world knowledge: Some coreferences cannot be resolved with linguistic information alone. Common sense and world knowledge are often needed to resolve coreferences. This thesis proposes a method to extract world knowledge from Wikipedia and incorporate it into the resolution system.
    The main contributions of this thesis are (i) a new approach to the coreference resolution problem based on constraint satisfaction, using a hypergraph to represent the problem and solving it with the relaxation labeling algorithm; and (ii) research to improve the results by adding world information extracted from Wikipedia. The presented approach can use the mention-pair and entity-mention models in combination, thus avoiding problems that many other state-of-the-art approaches encounter, such as contradictions between independent classifications, lack of context and lack of information. In addition, the presented approach allows information to be incorporated by adding constraints, and research has been done on adding world information that improves the results. RelaxCor, the system implemented during the thesis to experiment with the proposed approach, achieved results comparable to the best in the state of the art. It took part in the international competitions SemEval-2010 and CoNLL-2011. RelaxCor obtained second position in CoNLL-2011.
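
    The relaxation labeling step mentioned in the abstract above can be sketched as an iterative update of per-mention label probabilities. This is a simplified illustration, not RelaxCor's implementation: the pairwise compatibility matrix `W`, its values, the two-label setup and the function name are assumptions, and the weights are assumed to be scaled so that the support stays strictly above -1.

```python
import numpy as np


def relaxation_labeling(W, P, iterations=50):
    """Iteratively update label probabilities from constraint support.

    W[i, j] > 0 encourages mentions i and j to share a label,
    W[i, j] < 0 discourages it. P[i, l] is the current probability
    that mention i takes label l (each row sums to 1).
    """
    P = P.copy()
    for _ in range(iterations):
        # Support for each (mention, label) pair, clipped to [-1, 1].
        S = np.clip(W @ P, -1.0, 1.0)
        # Reward supported labels, penalize unsupported ones, renormalize.
        P = P * (1.0 + S)
        P /= P.sum(axis=1, keepdims=True)
    return P


# Four toy mentions: 0 and 1 are compatible, 2 and 3 are compatible,
# and the two groups are weakly incompatible (assumed weights).
W = np.array([
    [0.0, 0.8, -0.5, 0.0],
    [0.8, 0.0, 0.0, -0.5],
    [-0.5, 0.0, 0.0, 0.8],
    [0.0, -0.5, 0.8, 0.0],
])

# Slightly biased starting point so the symmetric tie is broken.
P0 = np.array([
    [0.6, 0.4],
    [0.5, 0.5],
    [0.4, 0.6],
    [0.5, 0.5],
])

P = relaxation_labeling(W, P0)
labels = P.argmax(axis=1)
print(labels)
```

    Here each constraint is reduced to a single signed weight between two mentions; richer constraints, including ones that look at groups of mentions, would change only how the support matrix `S` is computed.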

    Joint Anaphoricity Detection and Coreference Resolution with Constrained Latent Structures

    This paper introduces a new structured model for learning anaphoricity detection and coreference resolution in a joint fashion. Specifically, we use a latent tree to represent the full coreference and anaphoric structure of a document at a global level, and we jointly learn the parameters of the two models using a version of the structured perceptron algorithm. Our joint structured model is further refined by the use of pairwise constraints, which help the model to accurately capture certain patterns of coreference. Our experiments on the CoNLL-2012 English datasets show large improvements in both coreference resolution and anaphoricity detection, compared to various competing architectures. Our best coreference system obtains a CoNLL score of 81.97 on gold mentions, which is to date the best score reported on this setting.

    Coreference Resolution in Freeling 4.0

    This paper presents the integration of RelaxCor into FreeLing. RelaxCor is a coreference resolution system based on constraint satisfaction that ranked second in the CoNLL-2011 shared task. FreeLing is an open-source library for NLP with more than fifteen years of existence and a widespread user community. We present the difficulties found in porting RelaxCor from a shared task scenario to a production environment, as well as the solutions devised. We present two strategies for this integration and a rough evaluation of the obtained results. Peer reviewed. Postprint (published version).

    Clustering Spectral avec Contraintes de Paires réglées par Noyaux Gaussiens [Spectral Clustering with Pairwise Constraints Tuned by Gaussian Kernels]

    We consider the problem of spectral clustering partially supervised by constraints of the form "must-link" and "cannot-link". Such constraints appear frequently in various problems, such as coreference resolution in natural language processing. The approach developed in this paper consists of learning a new representation of the space for the data, as well as a new distance in this space. This representation is obtained via a linear transformation of the spectral embedding of the data. The constraints are expressed with Gaussian functions that locally readjust the similarities between objects. A global, non-convex optimization problem is then obtained, and the model is learned using gradient descent techniques. We evaluate our algorithm on standard datasets and compare it with various state-of-the-art algorithms, such as [14,18,32]. The results on these datasets, as well as on the dataset of the CoNLL-2012 coreference task, show that our algorithm significantly improves the quality of the clusters obtained by previous approaches and scales more robustly.

    Fast Gaussian Pairwise Constrained Spectral Clustering

    We consider the problem of spectral clustering with partial supervision in the form of must-link and cannot-link constraints. Such pairwise constraints are common in problems like coreference resolution in natural language processing. The approach developed in this paper is to learn a new representation space for the data together with a distance in this new space. The representation space is obtained through a constraint-driven linear transformation of a spectral embedding of the data. Constraints are expressed with a Gaussian function that locally reweights the similarities in the projected space. A global, non-convex optimization objective is then derived and the model is learned via gradient descent techniques. Our algorithm is evaluated on standard datasets and compared with state-of-the-art algorithms, like [14,18,31]. Results on these datasets, as well as on the CoNLL-2012 coreference resolution shared task dataset, show that our algorithm significantly outperforms related approaches and is also much more scalable.
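
    The idea of scoring must-link and cannot-link constraints with a Gaussian function of distance can be sketched as follows. This is only an illustration of the general idea, not the objective defined in the paper: the penalty form, the bandwidth `sigma`, the toy embedding and the function name are all assumptions.

```python
import numpy as np


def gaussian_constraint_penalty(X, must_link, cannot_link, sigma=1.0):
    """Penalty that is low when an embedding respects the constraints.

    X is an (n_points, n_dims) embedding. A Gaussian similarity close
    to 1 means two points are near each other; close to 0 means far.
    """
    def similarity(i, j):
        squared_distance = np.sum((X[i] - X[j]) ** 2)
        return float(np.exp(-squared_distance / (2.0 * sigma ** 2)))

    # Must-link pairs should be similar, cannot-link pairs dissimilar.
    must_cost = sum(1.0 - similarity(i, j) for i, j in must_link)
    cannot_cost = sum(similarity(i, j) for i, j in cannot_link)
    return must_cost + cannot_cost


# Toy embedding with two tight groups: points {0, 1} and points {2, 3}.
X = np.array([[0.0, 0.0], [0.1, 0.0], [3.0, 0.0], [3.1, 0.0]])

# Constraints that agree with the embedding.
good = gaussian_constraint_penalty(X, [(0, 1), (2, 3)], [(0, 2), (1, 3)])
# The same pairs with the constraint types swapped.
bad = gaussian_constraint_penalty(X, [(0, 2), (1, 3)], [(0, 1), (2, 3)])

print(f"penalty when constraints are satisfied: {good:.4f}")
print(f"penalty when constraints are violated:  {bad:.4f}")
```

    A learning procedure in this spirit would adjust a linear transformation of the embedding by gradient descent so that such a penalty decreases; the sketch only evaluates the penalty for a fixed embedding.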

    Apprentissage d'une hiérarchie de modèles à paires spécialisés pour la résolution de la coréférence [Learning a Hierarchy of Specialized Pairwise Models for Coreference Resolution]

    We propose a new method to significantly improve the performance of mention-pair models for coreference resolution. Given a set of indicators, our method learns to best separate types of mention pairs into equivalence classes, each of which gives rise to a specific classification model. The proposed algorithmic procedure finds the best feature space (created from combinations of elementary features and indicators) for discriminating coreferential mention pairs. Although our approach explores a very large set of feature spaces, it remains efficient by exploiting the structure of the hierarchies built from the indicators. Our experiments on the English data of the CoNLL-2012 Shared Task indicate that our method yields performance gains over the initial model using only the elementary features, regardless of the chain formation method or evaluation metric chosen. Our best system obtains an average of 67.2 in MUC, B3 and CEAF F1-measure, which, despite its simplicity, places it among the best systems tested on these data.

    Review of coreference resolution in English and Persian

    Coreference resolution (CR) is one of the most challenging areas of natural language processing. This task seeks to identify all textual references to the same real-world entity. Research in this field is divided into coreference resolution and anaphora resolution. The field has attracted considerable interest because of its application in textual comprehension and its utility in other tasks such as information extraction, document summarization and machine translation; consequently, it has a significant effect on the quality of these systems. This article reviews the existing corpora and evaluation metrics in this field. Then, an overview of coreference algorithms, from rule-based methods to the latest deep learning techniques, is provided. Finally, coreference resolution and pronoun resolution systems in Persian are investigated. Comment: 44 pages, 11 figures, 5 tables.