13 research outputs found
Keyword Detection in Text Summarization
Summarization is the process of reducing a text document to create a summary that retains the most important points of the original. As the problem of information overload has grown with the increasing quantity of data, so has interest in automatic summarization. Extractive summarization works on the given text to extract the sentences that best convey its message. Most extractive summarization techniques revolve around indexing keywords and extracting the sentences that contain more keywords than the rest. Keyword extraction is usually done by selecting important words that occur more frequently than others, with the stress on important. However, current techniques handle this importance with a stop list, which may itself contain words that are critically important to the text. In this thesis, I present work in progress on an algorithm to extract truly significant keywords that would lose their significance under current keyword extraction algorithms.
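The frequency-based pipeline this abstract describes can be sketched roughly as follows; the tokenizer, the (deliberately tiny) stop list, and the sentence-scoring rule are illustrative assumptions, not the thesis's actual algorithm:

```python
import re
from collections import Counter

# Illustrative stop list; a real system would use a much larger one,
# which is exactly where the thesis argues important words can be lost.
STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "that"}

def extract_keywords(text, top_n=5):
    """Rank non-stop-list words by raw frequency (a common baseline)."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [w for w, _ in counts.most_common(top_n)]

def score_sentence(sentence, keywords):
    """Score a sentence by how many of the keywords it contains."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    return sum(1 for k in keywords if k in words)
```

An extractive summarizer of this kind would then keep the highest-scoring sentences; the thesis's point is that a word filtered out by the stop list never gets the chance to score.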
Which Factors Contribute to Resolving Coreference Chains with Bayesian Networks?
This paper describes coreference chain resolution with Bayesian networks. Several factors in the resolution of coreference chains may greatly affect final performance. While the choice of machine learning algorithm and of the features the learner relies on is largely addressed by the community, other factors involved in the resolution, such as noisy features, anaphoricity resolution, or the search window, have been studied less, and their importance remains unclear. In this article, we describe a mention-pair resolver using Bayesian networks, targeting coreference resolution in discharge summaries. We present a study of the contributions of the factors involved in the resolution using the 2011 i2b2/VA challenge data set. The results of our study indicate that, besides the use of noisy features, anaphoricity resolution has the biggest effect on coreference chain resolution performance.
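The mention-pair setup underlying this kind of resolver can be sketched as follows; the features, the window, and the deterministic linking rule are simplifications invented for illustration (the paper learns a Bayesian network over its features rather than applying hand-coded rules):

```python
from dataclasses import dataclass

@dataclass
class Mention:
    text: str
    sent_idx: int
    gender: str  # "m", "f", or "n"
    number: str  # "sg" or "pl"

def pair_features(antecedent, anaphor):
    """Toy feature vector for one mention pair (illustrative, not the paper's)."""
    return {
        "gender_agree": antecedent.gender == anaphor.gender,
        "number_agree": antecedent.number == anaphor.number,
        "sent_dist": anaphor.sent_idx - antecedent.sent_idx,
    }

def resolve(mentions, window=2):
    """Link each mention to the closest compatible antecedent in the window.

    A learned model would score each pair instead of this hard rule."""
    links = {}
    for i, m in enumerate(mentions):
        for a in reversed(mentions[:i]):
            f = pair_features(a, m)
            if f["sent_dist"] <= window and f["gender_agree"] and f["number_agree"]:
                links[m.text] = a.text
                break
    return links
```

The search window and the agreement features are exactly the kind of "other factors" whose contribution the paper measures.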
Review of coreference resolution in English and Persian
Coreference resolution (CR) is one of the most challenging areas of natural
language processing. This task seeks to identify all textual references to the
same real-world entity. Research in this field is divided into coreference
resolution and anaphora resolution. Due to its application in textual
comprehension and its utility in other tasks such as information extraction
systems, document summarization, and machine translation, this field has
attracted considerable interest. Consequently, it has a significant effect on
the quality of these systems. This article reviews the existing corpora and
evaluation metrics in this field. Then, an overview of the coreference
algorithms, from rule-based methods to the latest deep learning techniques, is
provided. Finally, coreference resolution and pronoun resolution systems in
Persian are investigated. (Comment: 44 pages, 11 figures, 5 tables)
Coreference resolution survey
This survey is an extended summary of the state of the art in coreference resolution. The key concepts related to coreference and anaphora are presented, the most relevant approaches to coreference resolution are discussed, and existing systems are classified and compared. Finally, the evaluation methods shared by researchers in the area and the commonly used corpora are presented and compared. (Postprint, published version)
Coreference resolution with and for Wikipedia
Wikipedia is a resource embedded in many natural language processing applications, yet we are not aware of recent attempts to adapt coreference resolution to this resource, a preliminary step toward understanding Wikipedia texts. The first part of this master's thesis builds an English coreference corpus whose documents all come from the English version of Wikipedia. We annotated each markable with coreference type, mention type, and, where possible, a link to the equivalent Freebase topic. Our corpus places no restriction on the topics of the documents being annotated, and documents of various sizes were considered, the goal being a balanced corpus covering articles of diverse subjects and sizes. Our annotation scheme follows that of OntoNotes with a few disparities. In the second part, we propose a testbed for evaluating state-of-the-art coreference systems on the simple task of identifying the mentions of the concept described in a Wikipedia page (e.g., the mentions of President Obama in the Wikipedia page dedicated to that person). We show that by exploiting the Wikipedia markup of a document (categories, redirects, infoboxes, etc.), as well as links to external knowledge bases such as Freebase (gender and number information, types of relationships with other entities, etc.), we can acquire useful information on entities that helps to classify mentions as coreferent or not.
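A minimal illustration of how such encyclopedic metadata could feed a coreference decision; the alias table, the attribute names, and the pronoun rule are invented for this sketch and do not reproduce the thesis's actual features:

```python
# Hypothetical per-entity records, as might be assembled from Wikipedia
# redirects and a knowledge base such as Freebase.
ENTITY_INFO = {
    "Barack Obama": {
        "aliases": {"obama", "barack obama", "president obama"},
        "gender": "male",
    },
}

def compatible(mention, entity):
    """True if a mention string could corefer with the page's main entity."""
    info = ENTITY_INFO[entity]
    m = mention.lower()
    # Alias match derived from redirects.
    if m in info["aliases"]:
        return True
    # Pronoun consistent with the knowledge-base gender attribute.
    if info["gender"] == "male" and m in {"he", "him", "his"}:
        return True
    return False
```

A real system would use such compatibility signals as features for a classifier rather than as a hard yes/no test.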
Linguistics parameters for zero anaphora resolution
Master's dissertation, Natural Language Processing and Human Language Technology, Univ. do Algarve, 2009. This dissertation describes and proposes a set of linguistically motivated rules for zero
anaphora resolution in the context of a natural language processing chain developed for
Portuguese. Some languages, like Portuguese, allow noun phrase (NP) deletion (or zeroing)
in several syntactic contexts in order to avoid the redundancy that would result from
repetition of previously mentioned words. The co-reference relation between the zeroed
element and its antecedent (or previous mention) in the discourse is here called zero
anaphora (Mitkov, 2002). In Computational Linguistics, zero anaphora resolution may be
viewed as a subtask of anaphora resolution and has an essential role in various Natural
Language Processing applications such as information extraction, automatic abstracting,
dialog systems, machine translation and question answering. The main goal of this
dissertation is to describe the grammatical rules imposing subject NP deletion and the referential
constraints in Brazilian Portuguese, in order to allow correct identification of the
antecedent of the deleted subject NP. Some of these rules were then formalized into the
Xerox Incremental Parser or XIP (Ait-Mokhtar et al., 2002: 121-144) in order to constitute a
module of the Portuguese grammar (Mamede et al. 2010) developed at Spoken Language
Laboratory (L2F). Using this rule-based approach we expected to improve the performance
of the Portuguese grammar namely by producing better dependency structures with
(reconstructed) zeroed NPs for the syntactic-semantic interface. Because of the complexity
of the task, the scope of this dissertation had to be limited to: (a) subject NP deletion; (b) within
sentence boundaries; and (c) with an explicit antecedent; in addition, (d) rules were formalized
based solely on the results of the shallow parser (or chunks), that is, with minimal syntactic
(and no semantic) knowledge. A corpus of different text genres was manually annotated for
zero anaphors and other zero-shaped, usually indefinite, subjects. The rule-based
approach is evaluated, and the results are presented and discussed.
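One of the simplest rules of this kind, sketched here in plain Python rather than in XIP's formalism, fills a zeroed subject from the explicit subject of a preceding clause in the same sentence; the chunk representation and field names are invented for illustration:

```python
def resolve_zero_subjects(clauses):
    """Fill in missing subjects from the nearest preceding explicit subject
    within the sentence (a deliberately naive stand-in for the
    dissertation's linguistically motivated rules)."""
    resolved = []
    last_subject = None
    for clause in clauses:
        subject = clause.get("subject")
        if subject is None and last_subject is not None:
            # Reconstruct the zeroed subject NP from its antecedent.
            clause = {**clause, "subject": last_subject, "reconstructed": True}
        else:
            last_subject = subject
        resolved.append(clause)
    return resolved
```

The dissertation's actual rules additionally condition on syntactic context (coordination, subordination, verb agreement), which this sketch ignores.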
Korreferentzia-ebazpena euskarazko testuetan (Coreference Resolution in Basque Texts)
203 p. Nowadays, automatic coreference resolution can be regarded as key to understanding texts; consequently, it is essential in many Natural Language Processing (NLP) tasks that require deep discourse understanding. When two textual expressions in a text denote or refer to the same object, a coreference relation is said to hold between them. The task of resolving the coreference relations between the textual expressions that may appear in a text is called coreference resolution. This thesis belongs to the field of computational linguistics and aims at automatic coreference resolution for texts written in Basque; more precisely, it aims to fill the gap in the resources and tools available for automatic coreference resolution in Basque. The thesis first describes the rule-based tool we developed to automatically identify the textual expressions that may appear in Basque texts. It then presents how the rule-based coreference resolution system designed for English at Stanford University was adapted to the characteristics of Basque and how we improved it using semantic knowledge bases. Finally, it describes the work carried out to adapt the machine-learning-based coreference resolution system BART to Basque and to improve it.