3,515 research outputs found
Corpus-based identification of non-anaphoric noun phrases
Journal ArticleCoreference resolution involves finding antecedents for anaphoric discourse entities, such as definite noun phrases. But many definite noun phrases are not anaphoric because their meaning can be understood from general world knowledge (e.g., "the White House" or "the news media"). We have developed a corpus-based algorithm for automatically identifying definite noun phrases that are non-anaphoric, which has the potential to improve the efficiency and accuracy of coreference resolution systems. Our algorithm generates lists of nonanaphoric noun phrases and noun phrase patterns from a training corpus and uses them to recognize non-anaphoric noun phrases in new texts. Using 1600 MIX -1 terrorism news articles as the training corpus, our approach achieved 78% recall and 87% precision at identifying such noun phrases in 50 test documents
A Corpus-Based Investigation of Definite Description Use
We present the results of a study of definite descriptions use in written
texts aimed at assessing the feasibility of annotating corpora with information
about definite description interpretation. We ran two experiments, in which
subjects were asked to classify the uses of definite descriptions in a corpus
of 33 newspaper articles, containing a total of 1412 definite descriptions. We
measured the agreement among annotators about the classes assigned to definite
descriptions, as well as the agreement about the antecedent assigned to those
definites that the annotators classified as being related to an antecedent in
the text. The most interesting result of this study from a corpus annotation
perspective was the rather low agreement (K=0.63) that we obtained using
versions of Hawkins' and Prince's classification schemes; better results
(K=0.76) were obtained using the simplified scheme proposed by Fraurud that
includes only two classes, first-mention and subsequent-mention. The agreement
about antecedents was also not complete. These findings raise questions
concerning the strategy of evaluating systems for definite description
interpretation by comparing their results with a standardized annotation. From
a linguistic point of view, the most interesting observations were the great
number of discourse-new definites in our corpus (in one of our experiments,
about 50% of the definites in the collection were classified as discourse-new,
30% as anaphoric, and 18% as associative/bridging) and the presence of
definites which did not seem to require a complete disambiguation.Comment: 47 pages, uses fullname.sty and palatino.st
Recommended from our members
Coreference resolution in clinical discharge summaries, progress notes, surgical and pathology reports: a unified lexical approach
We developed a lexical rule-based system that uses a unified approach to resolving coreference across a wide variety of clinical records comprising discharge summaries, progress notes, pathology, radiology and surgical reports from two corpora (Ontology Development and Information Extraction (ODIE) and i2b2/VA) provided for the fifth i2b2/VA shared task. Taking the unweighted mean between 4 coreference metrics, validation of the system against the i2b2/VA corpus attained an overall F-score of 87.7% across all mention classes, with a maximum of 93.1% for coreference of persons, and a minimum of 77.2% for coreference of tests. For the ODIE corpus the overall F-score across all mention classes was 79.4%, with a maximum of 82.0% for coreference of persons and a minimum of 13.1% for coreference of diagnostic reagents. For the ODIE corpus our results are comparable to the mean reported inter-annotator agreement with the gold standard. We discuss the four categories of errors we identified, and how these might be addressed. The system uses a number of reusable modules and techniques that may be of benefit to the research community
Comparing knowledge sources for nominal anaphora resolution
We compare two ways of obtaining lexical knowledge for antecedent selection in other-anaphora
and definite noun phrase coreference. Specifically, we compare an algorithm that relies on links
encoded in the manually created lexical hierarchy WordNet and an algorithm that mines corpora
by means of shallow lexico-semantic patterns. As corpora we use the British National
Corpus (BNC), as well as the Web, which has not been previously used for this task. Our
results show that (a) the knowledge encoded in WordNet is often insufficient, especially for
anaphor-antecedent relations that exploit subjective or context-dependent knowledge; (b) for
other-anaphora, the Web-based method outperforms the WordNet-based method; (c) for definite
NP coreference, the Web-based method yields results comparable to those obtained using
WordNet over the whole dataset and outperforms the WordNet-based method on subsets of the
dataset; (d) in both case studies, the BNC-based method is worse than the other methods because
of data sparseness. Thus, in our studies, the Web-based method alleviated the lexical knowledge
gap often encountered in anaphora resolution, and handled examples with context-dependent relations
between anaphor and antecedent. Because it is inexpensive and needs no hand-modelling
of lexical knowledge, it is a promising knowledge source to integrate in anaphora resolution systems
Recommended from our members
Lexical patterns, features and knowledge resources for coreference resolution in clinical notes
Generation of entity coreference chains provides a means to extract linked narrative events from clinical notes, but despite being a well-researched topic in natural language processing, general- purpose coreference tools perform poorly on clinical texts. This paper presents a knowledge-centric and pattern-based approach to resolving coreference across a wide variety of clinical records comprising discharge summaries, progress notes, pathology, radiology and surgical reports from two corpora (Ontology Development and Information Extraction (ODIE) and i2b2/VA). In addition, a method for generating coreference chains using progressively pruned linked lists is demonstrated that reduces the search space and facilitates evaluation by a number of metrics. Independent evaluation results show an F-measure for each corpus of 79.2% and 87.5%, respectively, which offers performance at least as good as human annotators, greatly increased performance over general- purpose tools, and improvement on previously reported clinical coreference systems. The system uses a number of open-source components that are available to download
Non-situational functions of demonstrative noun phrases in Lingala (Bantu)
This paper examines the non-situational (i.e., non-exophoric) pragmatic functions of the three adnominal demonstratives, oyo, wand, and yango in the Bantu language Lingala. An examination of natural language corpora reveals that, although native-speaker intuitions sanction the use of oyo as an anaphor in demonstrative NPs, this demonstrative is hardly ever used in that role. It also reveals that wand, which has both situational and discourse-referential capacities, is used more frequently than the exclusively anaphoric demonstrative yango. It is explained that wand appears in a wide range of non-coreferential expression types, in coreferential expression types involving low-salience referents, and in coreferential expression types that both involve highly salient referents and include the speaker's desire to signal a shift in the mental representation of the referent towards a pejorative reading. The use of yango, on the other hand, is only licensed in cases of coreferentiality involving highly salient referents and implying continuation of the same mental representation of the referent. A specific section is devoted to charting the possible grammaticalization paths followed by the demonstratives. Conclusions are drawn for pragmatic theory formation in terms of the relation between form (yango vs. wand) and function (coreferentiality vs. non-coreferentiality)
A Framework for Interpreting Bridging Anaphora
In this paper we present a novel framework for resolving bridging anaphora.We argue that anaphora, particularly bridging anaphora, is used as a shortcut device similar to the use of compound nouns. Hence, the two natural language usage phenomena would have to be based on the same theoretical framework. We use an existing theory on compound nouns to test its validity for anaphora usages. To do this, we used hu- man annotators to interpret indirect anaphora from naturally occurring discourses. The annotators were asked to classify the relations between anaphor-antecedent pairs into relation types that have been previously used to describe the relations between a modi er and the head noun of a compound noun. We obtained very encouraging results with an average Fleiss's value of 0.66 for inter-annotation agreement. The results were evaluated against other similar natural language interpretation annota- tion experiments and were found to compare well. In order to determine the prevalence of the proposed set of anaphora relations we did a detailed analysis of a subset 20 newspaper articles. The results obtained from this also indicated that a majority (98%) of the relations could be described by the relations in the framework. The results from this analysis also showed the distribution of the relation types in the genre of news paper article discourses
- …