Annotation of phenotypes using ontologies: a gold standard for the training and evaluation of natural language processing systems
Natural language descriptions of organismal phenotypes, a principal object of study in biology, are abundant in the biological literature. Expressing these phenotypes as logical statements using ontologies would enable large-scale analysis of phenotypic information from diverse systems. However, considerable human effort is required to make these phenotype descriptions amenable to machine reasoning. Natural language processing tools have been developed to facilitate this task, and the training and evaluation of these tools depend on the availability of high-quality, manually annotated gold standard data sets. We describe the development of an expert-curated gold standard data set of annotated phenotypes for evolutionary biology. The gold standard was developed for the curation of complex comparative phenotypes for the Phenoscape project. It was created by consensus among three curators and consists of entity-quality expressions of varying complexity. We use the gold standard to evaluate annotations created by human curators and those generated by the Semantic CharaParser tool. Using four annotation accuracy metrics that can account for any level of relationship between terms from two phenotype annotations, we found that machine-human consistency, or similarity, was significantly lower than inter-curator (human-human) consistency. Surprisingly, allowing curators access to external information did not significantly increase the similarity of their annotations to the gold standard or have a significant effect on inter-curator consistency. We found that the similarity of machine annotations to the gold standard increased after new relevant ontology terms had been added. Evaluation by the original authors of the character descriptions indicated that the gold standard annotations came closer to representing their intended meaning than did either the curator or machine annotations. These findings point toward ways to better design software to augment human curators, and the use of the gold standard corpus will allow training and assessment of new tools to improve phenotype annotation accuracy at scale.
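As a rough illustration of the kind of ontology-aware comparison underlying such annotation accuracy metrics, the sketch below scores two ontology terms by the Jaccard overlap of their ancestor sets, so that related but non-identical terms still receive partial credit. It is not one of the four metrics used in the study; the term names and ontology edges are invented for the example.

```python
# Minimal sketch: ontology-aware similarity between two annotation terms,
# computed as Jaccard overlap of their ancestor sets. Terms and edges are
# hypothetical, not taken from the Phenoscape gold standard.

def ancestors(term, parents):
    """Return the set containing `term` and all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def jaccard_similarity(term_a, term_b, parents):
    """Score in [0, 1]: 1.0 for identical terms, >0 for related terms."""
    a, b = ancestors(term_a, parents), ancestors(term_b, parents)
    return len(a & b) / len(a | b)

# Hypothetical fragment of an anatomy ontology: child -> list of parents.
parents = {
    "dorsal_fin": ["fin"],
    "anal_fin": ["fin"],
    "fin": ["appendage"],
    "appendage": [],
}
print(jaccard_similarity("dorsal_fin", "anal_fin", parents))  # 0.5
```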
Web Relation Extraction with Distant Supervision
Being able to find relevant information about prominent entities quickly is the main reason to use a search engine. However, with large quantities of information on the World Wide Web, real-time search over billions of Web pages can waste resources and the end user's time. One solution is to store the answers to frequently asked general knowledge queries, such as the albums released by a musical artist, in a more accessible format: a knowledge base. Knowledge bases can be created and maintained automatically by using information extraction methods, particularly methods to extract relations between proper names (named entities). One family of approaches that has become popular in recent years is distant supervision, which allows relation extractors to be trained without text-bound annotation by heuristically aligning known relations from a knowledge base with a large textual corpus from an appropriate domain. This thesis focuses on researching distant supervision for the Web domain. A new setting for creating training and testing data for distant supervision from the Web with entity-specific search queries is introduced, and the resulting corpus is published. Methods to recognise noisy training examples, as well as methods to combine extractions based on statistics derived from the background knowledge base, are researched. Using co-reference resolution methods to extract relations from sentences which do not contain a direct mention of the subject of the relation is also investigated. One bottleneck for distant supervision for Web data is identified to be named entity recognition and classification (NERC), since relation extraction methods rely on it for identifying relation arguments. Typically, existing pre-trained tools are used, which fail in diverse genres with non-standard language, such as the Web genre. The thesis explores what can cause NERC methods to fail in diverse genres and quantifies different reasons for NERC failure. Finally, a novel method for NERC for relation extraction is proposed, based on the idea of jointly training the named entity classifier and the relation extractor with imitation learning to reduce the reliance on external NERC tools. This thesis improves the state of the art in distant supervision for knowledge base population, and sheds light on and proposes solutions for issues arising in information extraction for domains that have not traditionally been studied.
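A minimal sketch of the distant-supervision labelling heuristic described above: sentences mentioning both arguments of a known knowledge-base fact are treated as (noisy) positive training examples for that relation. The toy facts and sentences are invented, and naive string matching stands in for the NERC step discussed in the thesis.

```python
# Minimal sketch of distant supervision: align known KB facts with a corpus
# to generate (noisy) relation-labelled training examples.

known_facts = {
    ("The Beatles", "Abbey Road"): "album_by",
    ("The Beatles", "Let It Be"): "album_by",
}

sentences = [
    "Abbey Road was recorded by The Beatles in 1969.",
    "The Beatles formed in Liverpool.",
    "Let It Be was the final studio album released by The Beatles.",
]

training_examples = []
for (subj, obj), relation in known_facts.items():
    for sent in sentences:
        # Naive string matching stands in for named entity recognition here.
        if subj in sent and obj in sent:
            training_examples.append((sent, subj, obj, relation))

for example in training_examples:
    print(example)
```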
Completeness and Consistency Analysis for Evolving Knowledge Bases
Assessing the quality of an evolving knowledge base is a challenging task, as it often requires identifying suitable quality assessment procedures. Since data are often derived from autonomous, and increasingly large, data sources, it is impractical to manually curate the data, and challenging to continuously and automatically assess their quality. In this paper, we explore two main areas of quality assessment related to evolving knowledge bases: (i) identification of completeness issues using knowledge base evolution analysis, and (ii) identification of consistency issues based on integrity constraints, such as minimum and maximum cardinality, and range constraints. For completeness analysis, we use data profiling information from consecutive knowledge base releases to estimate completeness measures that allow predicting quality issues. Then, we perform consistency checks to validate the results of the completeness analysis using integrity constraints and learning models. The approach has been tested both quantitatively and qualitatively using a subset of datasets from both the DBpedia and 3cixty knowledge bases. The performance of the approach is evaluated using precision, recall, and F1 score. From completeness analysis, we observe a 94% precision for the English DBpedia KB and 95% precision for the 3cixty Nice KB. We also assessed the performance of our consistency analysis by using five learning models over three sub-tasks, namely minimum cardinality, maximum cardinality, and range constraints. We observed that the best-performing model in our experimental setup is the Random Forest, reaching an F1 score greater than 90% for minimum and maximum cardinality and 84% for range constraints.
Comment: Accepted for the Journal of Web Semantics
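The sketch below illustrates, under assumed data structures, the two kinds of checks described above: a simple completeness signal derived from instance counts in consecutive releases (a class whose count shrinks is flagged for inspection), and a min/max cardinality check over triples. The class names, property names, counts, and thresholds are illustrative only, not the measures or datasets used in the paper.

```python
# Minimal sketch of (i) a release-to-release completeness signal and
# (ii) a cardinality-based consistency check. All names and numbers are
# illustrative, not taken from the DBpedia or 3cixty experiments.

def completeness_flags(counts_prev, counts_curr):
    """Flag classes whose instance counts dropped between two KB releases."""
    return {cls: counts_curr.get(cls, 0) < n for cls, n in counts_prev.items()}

def cardinality_violations(triples, prop, min_card=1, max_card=1):
    """Return subjects that violate min/max cardinality for `prop`."""
    per_subject = {}
    for s, p, o in triples:
        if p == prop:
            per_subject[s] = per_subject.get(s, 0) + 1
    return [s for s, n in per_subject.items() if n < min_card or n > max_card]

release_2015 = {"dbo:Place": 816_000, "dbo:Person": 1_450_000}
release_2016 = {"dbo:Place": 802_000, "dbo:Person": 1_520_000}
print(completeness_flags(release_2015, release_2016))  # dbo:Place flagged

triples = [("ex:e1", "ex:atPlace", "ex:p1"),
           ("ex:e1", "ex:atPlace", "ex:p2"),
           ("ex:e2", "rdfs:label", "Nice Jazz Festival")]
print(cardinality_violations(triples, "ex:atPlace"))  # ['ex:e1'] exceeds max_card
```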
Investigating Citation Linkage Between Research Articles
In recent years, there has been a dramatic increase in scientific publications across the globe. To help navigate this overabundance of information, methods have been devised to find papers with related content, but they lack the ability to provide the specific information a researcher may need without reading hundreds of linked papers. The search and browsing capabilities of online domain-specific scientific repositories are limited to finding a paper citing other papers, but do not point to the specific text that is being cited. Providing this capability to the research community would reduce the time required to acquire the background information needed to undertake new research. In this thesis, we present our effort to develop a citation linkage framework for finding those sentences in a cited article that are the focus of a citation in a citing paper. This undertaking has involved the construction of datasets and corpora that are required to build models for focused information extraction, text classification and information retrieval. In the first part of this thesis, two preprocessing steps that are deemed to assist with the citation linkage task are explored: method mention extraction and rhetorical categorization of scientific discourse. In the second part of this thesis, two methodologies for achieving the citation linkage goal are investigated. Firstly, regression techniques are used to predict the degree of similarity between citation sentences and their equivalent target sentences, achieving a moderate Pearson correlation between predicted and expected values; the resulting learning models are then used to rank sentences in the cited paper by their predicted scores. Secondly, search-engine-like retrieval techniques are used to rank sentences in the cited paper based on the words contained in the citation sentence. Our experiments show that it is possible to find the set of sentences that a citation refers to in a cited paper with reasonable performance. Possible applications of this work include the creation of better science paper repository navigation tools, the development of scientific argumentation across research articles, and multi-document summarization of science articles.
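A minimal sketch of the retrieval-style variant described above: sentences of a cited paper are ranked by TF-IDF cosine similarity against the citing sentence. This is only an illustration of the general setup, not the regression or retrieval models evaluated in the thesis; the sentences are invented and scikit-learn is used as a convenient stand-in.

```python
# Minimal sketch: rank cited-paper sentences by TF-IDF cosine similarity
# to the citation sentence. Sentences are invented for illustration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

citation_sentence = "Smith et al. used gel electrophoresis to separate the proteins."
cited_paper_sentences = [
    "Proteins were separated by gel electrophoresis as described previously.",
    "The weather during the collection period was unusually warm.",
    "Samples were stored at -80 degrees before analysis.",
]

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform([citation_sentence] + cited_paper_sentences)
scores = cosine_similarity(matrix[0], matrix[1:]).ravel()

# Highest-scoring sentences are the most likely linkage targets.
for score, sentence in sorted(zip(scores, cited_paper_sentences), reverse=True):
    print(f"{score:.2f}  {sentence}")
```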
A modular, open-source information extraction framework for identifying clinical concepts and processes of care in clinical narratives
In this thesis, a synthesis is presented of the knowledge models required by clinical information systems that provide decision support for longitudinal processes of care. Qualitative research techniques and thematic analysis are applied in a novel way to a systematic review of the literature on the challenges in implementing such systems, leading to the development of an original conceptual framework. The thesis demonstrates how these process-oriented systems make use of a knowledge base derived from workflow models and clinical guidelines, and argues that one of the major barriers to implementation is the need to extract explicit and implicit information from diverse resources in order to construct the knowledge base. Moreover, concepts in both the knowledge base and in the electronic health record (EHR) must be mapped to a common ontological model. However, the majority of clinical guideline information remains in text form, and much of the useful clinical information in the EHR resides in the free-text fields of progress notes and laboratory reports. In this thesis, it is shown how natural language processing and information extraction techniques provide a means to identify and formalise the knowledge components required by the knowledge base. Original contributions are made in the development of lexico-syntactic patterns and the use of external domain knowledge resources to tackle a variety of information extraction tasks in the clinical domain, such as recognition of clinical concepts, events, temporal relations, term disambiguation and abbreviation expansion. Methods are developed for adapting existing tools and resources in the biomedical domain to the processing of clinical texts, and approaches to improving the scalability of these tools are proposed and evaluated. These tools and techniques are then combined in the creation of a novel approach to identifying processes of care in the clinical narrative. It is demonstrated that resolution of coreferential and anaphoric relations as narratively and temporally ordered chains provides a means to extract linked narrative events and processes of care from clinical notes. Coreference performance in discharge summaries and progress notes is largely dependent on correct identification of protagonist chains (patient, clinician, family relation), pronominal resolution, and string matching that takes account of experiencer, temporal, spatial, and anatomical context; whereas for laboratory reports additional, external domain knowledge is required. The types of external knowledge and their effects on system performance are identified and evaluated. Results are compared against existing systems for solving these tasks and are found to improve on them, or to approach the performance of recently reported, state-of-the-art systems. Software artefacts developed in this research have been made available as open-source components within the General Architecture for Text Engineering framework.
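As a rough illustration of the kind of lexico-syntactic pattern mentioned above, the sketch below extracts medication mentions with dose and route from a clinical-style sentence using a regular expression. The pattern and the example note are invented and far simpler than the patterns and domain knowledge resources developed in the thesis.

```python
# Minimal sketch: a lexico-syntactic pattern for medication mentions
# (drug, dose, unit, route). Pattern and note are invented for illustration.

import re

MEDICATION_PATTERN = re.compile(
    r"(?P<drug>[A-Za-z]+)\s+(?P<dose>\d+(?:\.\d+)?)\s*(?P<unit>mg|mcg|g)\s+"
    r"(?P<route>orally|IV|subcutaneously)",
    re.IGNORECASE,
)

note = ("Patient was started on metoprolol 25 mg orally twice daily and "
        "given furosemide 40 mg IV on admission.")

for match in MEDICATION_PATTERN.finditer(note):
    print(match.groupdict())
```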
Grounding event references in news
Events are frequently discussed in natural language, and their accurate identification is central to language understanding. Yet they are diverse and complex in ontology and reference; computational processing hence proves challenging. News provides a shared basis for communication by reporting events. We perform several studies into news event reference. One annotation study characterises each news report in terms of its update and topic events, but finds that topic is better considered through explicit references to background events. In this context, we propose the event linking task which, analogous to named entity linking or disambiguation, models the grounding of references to notable events. It defines the disambiguation of an event reference as a link to the archival article that first reports it. When two references are linked to the same article, they need not be references to the same event. Event linking hopes to provide an intuitive approximation to coreference, erring on the side of over-generation in contrast with the literature. The task is also distinguished in considering event references from multiple perspectives over time. We diagnostically evaluate the task by first linking references to past, newsworthy events in news and opinion pieces to an archive of the Sydney Morning Herald. The intensive annotation results in only a small corpus of 229 distinct links. However, we observe that a number of hyperlinks targeting online news correspond to event links. We thus acquire two large corpora of hyperlinks at very low cost. From these we learn weights for temporal and term overlap features in a retrieval system. These noisy data lead to significant performance gains over a bag-of-words baseline. While our initial system can accurately predict many event links, most will require deep linguistic processing for their disambiguation.
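The sketch below shows one simple way to combine a term-overlap feature with a temporal-proximity feature when scoring candidate archive articles for an event reference, in the spirit of the retrieval system described above. The feature weights, dates, article texts, and identifiers are invented; the thesis learns its weights from hyperlink data rather than fixing them by hand.

```python
# Minimal sketch: score candidate archive articles for an event reference
# using term overlap plus temporal proximity. All values are illustrative.

from datetime import date

def score(reference_terms, reference_date, article):
    terms = set(article["text"].lower().split())
    overlap = len(set(reference_terms) & terms) / max(len(reference_terms), 1)
    days_apart = abs((reference_date - article["date"]).days)
    recency = 1.0 / (1.0 + days_apart)  # favour articles near the referenced time
    return 0.7 * overlap + 0.3 * recency  # hypothetical hand-set weights

archive = [
    {"id": "smh-2001-01", "date": date(2001, 1, 12),
     "text": "Flooding forces thousands from their homes in northern NSW"},
    {"id": "smh-2003-06", "date": date(2003, 6, 2),
     "text": "Drought declared across much of rural New South Wales"},
]

reference = "the 2001 floods in northern NSW".lower().split()
best = max(archive, key=lambda a: score(reference, date(2001, 2, 1), a))
print(best["id"])  # smh-2001-01
```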
Toponym Resolution in Text
Background. In the area of Geographic Information Systems (GIS), a shared discipline between informatics and geography, the term geo-parsing is used to describe the process of identifying names in text, which in computational linguistics is known as named entity recognition and classification (NERC). The term geo-coding is used for the task of mapping from implicitly geo-referenced datasets (such as structured address records) to explicitly geo-referenced representations (e.g., using latitude and longitude). However, present-day GIS systems provide no automatic geo-coding functionality for unstructured text.
In Information Extraction (IE), processing of named entities in text has traditionally been seen as a two-step process comprising a flat text span recognition sub-task and an atomic classification sub-task; relating the text span to a model of the world has been ignored by evaluations such as MUC or ACE (Chinchor (1998); U.S. NIST (2003)).
However, spatial and temporal expressions refer to events in space-time, and the grounding of events is a precondition for accurate reasoning. Thus, automatic grounding can improve many applications such as automatic map drawing (e.g. for choosing a focus) and question answering (e.g., for questions like How far is London from Edinburgh?, given a story in which both occur and can be resolved). Whereas temporal grounding has received considerable attention in the recent past (Mani and Wilson (2000); Setzer (2001)), robust spatial grounding has long been neglected.
Concentrating on geographic names for populated places, I define the task of automatic Toponym Resolution (TR) as computing the mapping from occurrences of names for places as found in a text to a representation of the extensional semantics of the location referred to (its referent), such as a geographic latitude/longitude footprint. The task of mapping from names to locations is hard due to insufficient and noisy databases and a large degree of ambiguity: common words need to be distinguished from proper names (geo/non-geo ambiguity), and the mapping between names and locations is ambiguous (London can refer to the capital of the UK or to London, Ontario, Canada, or to about forty other Londons on earth). In addition, names of places and the boundaries referred to change over time, and databases are incomplete.
Objective. I investigate how referentially ambiguous spatial named entities can be grounded, or resolved, with respect to an extensional coordinate model robustly on open-domain news text. I begin by comparing the few algorithms proposed in the literature, and, comparing semi-formal, reconstructed descriptions of them, I factor out a shared repertoire of linguistic heuristics (e.g. rules, patterns) and extra-linguistic knowledge sources (e.g. population sizes). I then investigate how to combine these sources of evidence to obtain a superior method. I also investigate the noise effect introduced by the named entity tagging step that toponym resolution relies on in a sequential system pipeline architecture.
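As a rough illustration of one heuristic from the kind of inventory discussed above, the sketch below resolves an ambiguous toponym to the gazetteer candidate with the largest population. It is only one bias among the several the thesis catalogues and combines; the gazetteer entries and population figures are rough, illustrative approximations.

```python
# Minimal sketch: resolve a toponym to the candidate with the largest
# population. Gazetteer entries below are illustrative approximations.

gazetteer = {
    "London": [
        {"lat": 51.51, "lon": -0.13, "path": "London > UK", "population": 8_900_000},
        {"lat": 42.98, "lon": -81.25, "path": "London > Ontario > Canada", "population": 404_000},
    ],
    "Edinburgh": [
        {"lat": 55.95, "lon": -3.19, "path": "Edinburgh > UK", "population": 525_000},
    ],
}

def resolve_by_population(toponym):
    candidates = gazetteer.get(toponym, [])
    return max(candidates, key=lambda c: c["population"]) if candidates else None

print(resolve_by_population("London")["path"])  # London > UK
```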
Scope. In this thesis, I investigate a present-day snapshot of terrestrial geography as represented in the gazetteer defined and, accordingly, a collection of present-day news text. I limit the investigation to populated places; geo-coding of artifact names (e.g. airports or bridges) and compositional geographic descriptions (e.g. 40 miles SW of London, near Berlin), for instance, is not attempted. Historic change is a major factor affecting gazetteer construction and, ultimately, toponym resolution. However, this is beyond the scope of this thesis.
Method. While a small number of previous attempts have been made to solve the toponym resolution problem, these were either not evaluated, or evaluation was done by manual inspection of system output instead of curating a reusable reference corpus. Since the relevant literature is scattered across several disciplines (GIS, digital libraries, information retrieval, natural language processing) and descriptions of algorithms are mostly given in informal prose, I attempt to describe them systematically and aim at a reconstruction in a uniform, semi-formal pseudo-code notation for easier re-implementation. A systematic comparison leads to an inventory of heuristics and other sources of evidence.
In order to carry out a comparative evaluation procedure, an evaluation resource is required. Unfortunately, to date no gold standard has been curated in the research community. To this end, a reference gazetteer and an associated novel reference corpus with human-labeled referent annotation are created. These are subsequently used to benchmark a selection of the reconstructed algorithms and a novel re-combination of the heuristics catalogued in the inventory.
I then compare the performance of the same TR algorithms under three different conditions, namely applying them to (i) the output of human named entity annotation, (ii) automatic annotation using an existing Maximum Entropy sequence tagging model, and (iii) a naïve toponym lookup procedure in a gazetteer.
Evaluation. The algorithms implemented in this thesis are evaluated in an intrinsic or component evaluation. To this end, we define a task-specific matching criterion to be used with traditional Precision (P) and Recall (R) evaluation metrics. This matching criterion is lenient with respect to numerical gazetteer imprecision in situations where one toponym instance is marked up with different gazetteer entries in the gold standard and the test set, respectively, but where these refer to the same candidate referent, caused by multiple near-duplicate entries in the reference gazetteer.
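The sketch below gives one possible reading of such a lenient matching criterion: a predicted referent counts as correct if it is the same gazetteer entry as the gold referent, or a near-duplicate whose coordinates fall within a small tolerance. The tolerance value and the entry identifiers are illustrative assumptions, not those used in the thesis.

```python
# Minimal sketch of a lenient referent-matching criterion tolerant of
# near-duplicate gazetteer entries. Tolerance and IDs are illustrative.

def referents_match(gold, predicted, tolerance_deg=0.05):
    same_entry = gold["id"] == predicted["id"]
    near_duplicate = (abs(gold["lat"] - predicted["lat"]) <= tolerance_deg and
                      abs(gold["lon"] - predicted["lon"]) <= tolerance_deg)
    return same_entry or near_duplicate

gold = {"id": "gaz-a:london-uk", "lat": 51.509, "lon": -0.126}
pred = {"id": "gaz-b:london-uk", "lat": 51.507, "lon": -0.128}
print(referents_match(gold, pred))  # True: near-duplicate entries, same place
```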
Main Contributions. The major contributions of this thesis are as follows:
• A new reference corpus in which instances of location named entities have been manually annotated with spatial grounding information for populated places, and an associated reference gazetteer, from which the assigned candidate referents are chosen. This reference gazetteer provides numerical latitude/longitude coordinates (such as 51°32′ North, 0°5′ West) as well as hierarchical path descriptions (such as London > UK) with respect to a worldwide-coverage geographic taxonomy constructed by combining several large, but noisy gazetteers. The corpus contains news stories and comprises two sub-corpora, a subset of the REUTERS RCV1 news corpus used for the CoNLL shared task (Tjong Kim Sang and De Meulder (2003)), and a subset of the Fourth Message Understanding Contest (MUC-4; Chinchor (1995)), both available pre-annotated with gold-standard annotations. This corpus will be made available as a reference evaluation resource;
• a new method and implemented system to resolve toponyms that is capable of robustly processing unseen text (open-domain online newswire text) and grounding toponym instances in an extensional model using longitude and latitude coordinates and hierarchical path descriptions, using internal (textual) and external (gazetteer) evidence;
• an empirical analysis of the relative utility of various heuristic biases and other sources of evidence with respect to the toponym resolution task when analysing free news genre text;
• a comparison between a replicated method as described in the literature, which functions as a baseline, and a novel algorithm based on minimality heuristics; and
• several exemplary prototypical applications showing how the resulting toponym resolution methods can be used to create visual surrogates for news stories, a geographic exploration tool for news browsing, geographically-aware document retrieval, and to answer spatial questions (How far...?) in an open-domain question answering system. These applications have only a demonstrative character, as a thorough quantitative, task-based (extrinsic) evaluation of the utility of automatic toponym resolution is beyond the scope of this thesis and is left for future work.
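As a small illustration of how grounded toponyms support the spatial question answering application mentioned above (e.g. How far is London from Edinburgh?), the sketch below computes the great-circle distance between two resolved referents with the haversine formula. The coordinates are approximate and the helper is an assumption for illustration, not a component of the thesis system.

```python
# Minimal sketch: once two toponyms are resolved to coordinates, answer a
# "How far...?" question with the great-circle (haversine) distance.

from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

london = (51.51, -0.13)      # resolved referent for "London" (capital of the UK)
edinburgh = (55.95, -3.19)   # resolved referent for "Edinburgh"
print(round(haversine_km(*london, *edinburgh)))  # roughly 530 km
```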