An approach to describing and analysing bulk biological annotation quality: a case study using UniProtKB
Motivation: Annotations are a key feature of many biological databases, used
to convey our knowledge of a sequence to the reader. Ideally, annotations are
curated manually; however, manual curation is costly, time-consuming and
requires expert knowledge and training. Given these issues and the exponential
increase of data, many databases implement automated annotation pipelines in an
attempt to avoid un-annotated entries. Both manual and automated annotations
vary in quality between databases and annotators, making assessment of
annotation reliability problematic for users. The community lacks a generic
measure for determining annotation quality and correctness, which we look at
addressing within this article. Specifically, we investigate word reuse within
bulk textual annotations and relate this to Zipf's Principle of Least Effort.
We use UniProt Knowledge Base (UniProtKB) as a case study to demonstrate this
approach since it allows us to compare annotation change, both over time and
between automated and manually curated annotations.
Results: By applying power-law distributions to word reuse in annotation, we
show clear trends in UniProtKB over time, which are consistent with existing
studies of quality on free text English. Further, we show a clear distinction
between manual and automated analysis and investigate cohorts of protein
records as they mature. These results suggest that this approach holds distinct
promise as a mechanism for judging annotation quality.
Availability: Source code is available at the authors' website:
http://homepages.cs.ncl.ac.uk/m.j.bell1/annotation.
Contact: [email protected]
Paper accepted at The European Conference on Computational Biology
2012 (ECCB'12). It will subsequently be published in a special issue of the
journal Bioinformatics. Paper consists of 8 pages, made up of 5 figures
Provenance, propagation and quality of biological annotation
PhD Thesis
Biological databases have become an integral part of the life sciences, being used
to store, organise and share ever-increasing quantities and types of data. Biological
databases are typically centred around raw data, with individual entries being
assigned to a single piece of biological data, such as a DNA sequence. Although essential,
a reader can obtain little information from the raw data alone. Therefore,
many databases aim to supplement their entries with annotation, allowing the current
knowledge about the underlying data to be conveyed to a reader. Although annotations
come in many different forms, most databases provide some form of free text
annotation.
Given that annotations can form the foundations of future work, it is important that a
user is able to evaluate the quality and correctness of an annotation. However, this is
rarely straightforward. The amount of annotation, and the way in which it is curated,
varies between databases. For example, the production of an annotation in some
databases is entirely automated, without any manual intervention. Further, sections
of annotations may be reused, being propagated between entries and, potentially,
external databases. This provenance and curation information is not always apparent
to a user.
The work described within this thesis explores issues relating to biological annotation
quality. While the most valuable annotation is often contained within free text, its lack
of structure makes it hard to assess. Initially, this work describes a generic approach
that allows textual annotations to be quantitatively measured. This approach is based
upon the application of Zipf's Law to words within textual annotation, resulting in a
single value, α. The relationship between the α value and Zipf's principle of least effort
provides an indication as to the annotation's quality, whilst also allowing annotations
to be quantitatively compared.
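
As a rough illustration of how word reuse in a set of annotations can be reduced to a single power-law exponent, the following Python sketch counts word frequencies and applies a standard approximate maximum-likelihood estimator for a discrete power law. The function names, the tokenisation rule and the default cut-off `x_min=1` are assumptions made for this example; they are not taken from the thesis, and the approximation is known to be coarse for small `x_min`.

```python
import math
import re
from collections import Counter


def word_counts(annotations):
    """Count how often each lower-cased word occurs across all annotations."""
    counts = Counter()
    for text in annotations:
        # Assumed tokenisation: runs of letters and digits count as words.
        counts.update(re.findall(r"[a-z0-9]+", text.lower()))
    return counts


def estimate_alpha(frequencies, x_min=1):
    """Approximate MLE for the exponent of a discrete power law.

    Uses alpha ~= 1 + n / sum(ln(x / (x_min - 0.5))) over all x >= x_min.
    This is an illustrative estimator, not necessarily the fitting
    procedure used in the work described above.
    """
    tail = [x for x in frequencies if x >= x_min]
    log_sum = sum(math.log(x / (x_min - 0.5)) for x in tail)
    if not tail or log_sum <= 0:
        raise ValueError("need at least one frequency >= x_min")
    return 1.0 + len(tail) / log_sum


# Toy input: three made-up annotation strings.
annotations = ["binds ATP", "binds DNA", "binds ATP and DNA"]
counts = word_counts(annotations)
alpha = estimate_alpha(counts.values())
```

Computing the value separately for different database releases, or for manually and automatically curated subsets, would then allow the exponents to be compared side by side.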
Secondly, the thesis focuses on determining annotation provenance and tracking any
subsequent propagation. This is achieved through the development of a visualisation
framework, which exploits the reuse of sentences within annotations. Utilising this
framework, a number of propagation patterns were identified, which on analysis appear
to indicate low quality and erroneous annotation.
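
The idea of tracing provenance through sentence reuse can be sketched in a few lines of Python. The example below records, for each sentence, the releases and entries in which it appears, and keeps those sentences that occur in more than one entry. The input layout of `(release, entry_id, annotation_text)` tuples, the naive sentence splitter and all function names are assumptions for illustration only; the sketch does not reproduce the visualisation framework itself.

```python
import re
from collections import defaultdict


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def sentence_occurrences(entries):
    """Map each sentence to the sorted (release, entry_id) pairs containing it.

    `entries` is an iterable of (release, entry_id, annotation_text) tuples
    (a hypothetical layout chosen for this sketch).
    """
    seen = defaultdict(set)
    for release, entry_id, text in entries:
        for sentence in split_sentences(text):
            seen[sentence].add((release, entry_id))
    return {s: sorted(occ) for s, occ in seen.items()}


def propagated(occurrences):
    """Keep sentences found in more than one entry.

    Because the pairs are sorted, the first pair is the earliest sighting.
    """
    return {
        s: occ
        for s, occ in occurrences.items()
        if len({entry_id for _, entry_id in occ}) > 1
    }


# Toy input: made-up entries across two releases.
entries = [
    (1, "P1", "Binds ATP. Found in liver."),
    (2, "P2", "Binds ATP."),
    (2, "P1", "Binds ATP. Found in liver."),
]
reused = propagated(sentence_occurrences(entries))
```

Here the earliest pair for a reused sentence gives a candidate origin, and the later pairs show where the sentence subsequently appeared.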
Together, these approaches increase our understanding of the textual characteristics
of biological annotation, and suggest that this understanding can be used to increase
the overall quality of these resources.