217 research outputs found
Provenance, propagation and quality of biological annotation
PhD Thesis
Biological databases have become an integral part of the life sciences, being used
to store, organise and share ever-increasing quantities and types of data. Biological
databases are typically centred around raw data, with individual entries being
assigned to a single piece of biological data, such as a DNA sequence. Although essential,
a reader can obtain little information from the raw data alone. Therefore,
many databases aim to supplement their entries with annotation, allowing the current
knowledge about the underlying data to be conveyed to a reader. Although annotations
come in many different forms, most databases provide some form of free text
annotation.
Given that annotations can form the foundations of future work, it is important that a
user is able to evaluate the quality and correctness of an annotation. However, this is
rarely straightforward. The amount of annotation, and the way in which it is curated,
varies between databases. For example, the production of an annotation in some
databases is entirely automated, without any manual intervention. Further, sections
of annotations may be reused, being propagated between entries and, potentially,
external databases. This provenance and curation information is not always apparent
to a user.
The work described within this thesis explores issues relating to biological annotation
quality. While the most valuable annotation is often contained within free text, its lack
of structure makes it hard to assess. Initially, this work describes a generic approach
that allows textual annotations to be quantitatively measured. This approach is based
upon the application of Zipf's Law to words within textual annotation, resulting in a
single value. The relationship between this value and Zipf's principle of least effort
provides an indication as to the annotation's quality, whilst also allowing annotations
to be quantitatively compared.
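As a rough illustration of reducing a word-frequency distribution to a single value, the sketch below fits a straight line to log(frequency) against log(rank) and reports the negated slope. The least-squares fit, the tokenisation rule and the function name `zipf_exponent` are assumptions made for this example; the abstract does not state which estimator the thesis uses.

```python
import math
import re
from collections import Counter


def zipf_exponent(text):
    """Estimate a Zipf-style exponent for the words in `text`.

    Assumption: ordinary least squares on log(rank) vs log(frequency).
    This is an illustrative estimator, not necessarily the thesis's method.
    """
    # Lower-case the text and keep runs of letters/digits as words.
    words = re.findall(r"[a-z0-9]+", text.lower())
    # Word frequencies in descending order; position in the list gives the rank.
    freqs = sorted(Counter(words).values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct words")

    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Zipf's law predicts frequency proportional to rank ** -exponent,
    # so the exponent is the negated slope of the fitted line.
    return -sxy / sxx
```

Calling `zipf_exponent` on the free-text annotation of two database releases would give two numbers that can be compared directly, which is the kind of quantitative comparison the abstract describes.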
Secondly, the thesis focuses on determining annotation provenance and tracking any
subsequent propagation. This is achieved through the development of a visualisation
framework, which exploits the reuse of sentences within annotations. Utilising this
framework a number of propagation patterns were identified, which on analysis appear
to indicate low quality and erroneous annotation.
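The idea of exploiting sentence reuse can be sketched as an index from each sentence to the entries that contain it. The entry identifiers, the sentence-splitting rule and the function name `sentence_reuse` below are hypothetical; the thesis's visualisation framework itself is not described in enough detail here to reproduce.

```python
import re
from collections import defaultdict


def sentence_reuse(entries):
    """Return sentences shared by more than one entry.

    `entries` maps an entry ID to its free-text annotation (hypothetical layout).
    The result maps each normalised sentence to the sorted IDs that contain it.
    """
    index = defaultdict(set)
    for entry_id, annotation in entries.items():
        # Naive split: break after '.', '!' or '?' followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", annotation.strip()):
            # Normalise case and spacing so trivial differences do not hide reuse.
            key = " ".join(sentence.lower().split())
            if key:
                index[key].add(entry_id)
    # Keep only sentences that occur in at least two entries.
    return {s: sorted(ids) for s, ids in index.items() if len(ids) > 1}
```

Running the same index over successive database releases would show in which entry a sentence first appears and where it later spreads, which is one way propagation patterns could be surfaced.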
Together, these approaches increase our understanding of the textual characteristics
of biological annotation, and suggest that this understanding can be used to increase
the overall quality of these resources.
Pre-validation of a MALDI MS proteomics-based method for the reliable detection of blood and blood provenance
Abstract: The reliable identification of blood, as well as the determination of its origin (human or animal) is of great importance in a forensic investigation. Whilst presumptive tests are rapid and deployed in situ, their very nature requires confirmatory tests to be performed remotely. However, only serological tests can determine blood provenance. The present study improves on a previously devised Matrix Assisted Laser Desorption Ionisation Mass Spectrometry (MALDI MS)—proteomics based method for the reliable detection of blood by enabling the determination of blood provenance. The overall protocol was developed to be more specific than presumptive tests and faster/easier than the gold standard liquid chromatography (LC) MS/MS analysis. This is considered a pre-validation study that has investigated stains and fingermarks made in blood, other biofluids and substances that can elicit a false-positive response to colorimetric or presumptive tests, in a blind fashion. Stains and marks were either untreated or enhanced with a range of presumptive tests. Human and animal blood were correctly discriminated from other biofluids and non-biofluid related matrices; animal species determination was also possible within the system investigated. The procedure is compatible with the prior application of presumptive tests. The refined strategy resulting from iterative improvements through a trial and error study of 56 samples was applied to a final set of 13 blind samples. This final study yielded 12/13 correct identifications with the 13th sample being correctly identified as animal blood but with no species attribution. This body of work will contribute towards the validation of MALDI MS based methods and deployment in violent crimes involving bloodshed
Resolution of the type material of the Asian elephant, Elephas maximus Linnaeus, 1758 (Proboscidea, Elephantidae)
The understanding of Earth’s biodiversity depends critically on the accurate identification and nomenclature of
species. Many species were described centuries ago, and in a surprising number of cases their nomenclature or type
material remain unclear or inconsistent. A prime example is provided by Elephas maximus, one of the most iconic
and well-known mammalian species, described and named by Linnaeus (1758) and today designating the Asian
elephant. We used morphological, ancient DNA (aDNA), and high-throughput ancient proteomic analyses to
demonstrate that a widely discussed syntype specimen of E. maximus, a complete foetus preserved in ethanol, is
actually an African elephant, genus Loxodonta. We further discovered that an additional E. maximus syntype,
mentioned in a description by John Ray (1693) cited by Linnaeus, has been preserved as an almost complete skeleton
at the Natural History Museum of the University of Florence. Having confirmed its identity as an Asian elephant
through both morphological and ancient DNA analyses, we designate this specimen as the lectotype of E. maximus.
Protein Bioinformatics Infrastructure for the Integration and Analysis of Multiple High-Throughput “omics” Data
High-throughput “omics” technologies bring new opportunities for biological and biomedical researchers to ask complex questions and gain new scientific insights. However, the voluminous, complex, and context-dependent data being maintained in heterogeneous and distributed environments, together with the lack of well-defined data standards and standardized nomenclature, pose a major challenge that requires advanced computational methods and bioinformatics infrastructures for integration, mining, visualization, and comparative analysis to facilitate data-driven hypothesis generation and biological knowledge discovery. In this paper, we present the challenges in high-throughput “omics” data integration and analysis, introduce a protein-centric approach for systems integration of large and heterogeneous high-throughput “omics” data including microarray, mass spectrometry, protein sequence, protein structure, and protein interaction data, and use a scientific case study to illustrate how one can use varied “omics” data from different laboratories to make useful connections that could lead to new biological knowledge.
Scop3P : a comprehensive resource of human phosphosites within their full context
Protein phosphorylation is a key post-translational modification in many biological processes and is associated with human diseases such as cancer and metabolic disorders. The accurate identification, annotation, and functional analysis of phosphosites are therefore crucial to understand their various roles. Phosphosites are mainly analyzed through phosphoproteomics, which has led to increasing amounts of publicly available phosphoproteomics data. Several resources have been built around the resulting phosphosite information, but these are usually restricted to the protein sequence and basic site metadata. What is often missing from these resources, however, is context, including protein structure mapping, experimental provenance information, and biophysical predictions. We therefore developed Scop3P: a comprehensive database of human phosphosites within their full context. Scop3P integrates sequences (UniProtKB/Swiss-Prot), structures (PDB), and uniformly reprocessed phosphoproteomics data (PRIDE) to annotate all known human phosphosites. Furthermore, these sites are put into biophysical context by annotating each phosphoprotein with per-residue structural propensity, solvent accessibility, disordered probability, and early folding information. Scop3P, available at https://iomics.ugent.be/scop3p, presents a unique resource for visualization and analysis of phosphosites and for understanding of phosphosite structure–function relationships.
UniCarbKB: building a knowledge platform for glycoproteomics
The UniCarb KnowledgeBase (UniCarbKB; http://unicarbkb.org) offers public access to a growing, curated database of information on the glycan structures of glycoproteins. UniCarbKB is an international effort that aims to further our understanding of structures, pathways and networks involved in glycosylation and glyco-mediated processes by integrating structural, experimental and functional glycoscience information. This initiative builds upon the success of the glycan structure database GlycoSuiteDB, together with the informatic standards introduced by EUROCarbDB, to provide a high-quality and updated resource to support glycomics and glycoproteomics research. UniCarbKB provides comprehensive information concerning glycan structures, and published glycoprotein information including global and site-specific attachment information. For the first release over 890 references, 3740 glycan structure entries and 400 glycoproteins have been curated. Further, 598 protein glycosylation sites have been annotated with experimentally confirmed glycan structures from the literature. Among these are 35 glycoproteins, 502 structures and 60 publications previously not included in GlycoSuiteDB. This article provides an update on the transformation of GlycoSuiteDB (featured in previous NAR Database issues and hosted by ExPASy since 2009) to UniCarbKB and its integration with UniProtKB and GlycoMod. Here, we introduce a refactored database, supported by substantial new curated data collections and intuitive user-interfaces that improve database searching.
The neXtProt knowledgebase on human proteins: current status
neXtProt (http://www.nextprot.org) is a human protein-centric knowledgebase developed at the SIB Swiss Institute of Bioinformatics. Focused solely on human proteins, neXtProt aims to provide a state-of-the-art resource for the representation of human biology by capturing a wide range of data, precise annotations, fully traceable data provenance and a web interface which enables researchers to find and view information in a comprehensive manner. Since the introductory neXtProt publication, significant advances have been made on three main aspects: the representation of proteomics data, an extended representation of human variants and the development of an advanced search capability built around semantic technologies. These changes are presented in the current neXtProt update.
Towards comprehensive annotation of Drosophila melanogaster enzymes in FlyBase.
The catalytic activities of enzymes can be described using Gene Ontology (GO) terms and Enzyme Commission (EC) numbers. These annotations are available from numerous biological databases and are routinely accessed by researchers and bioinformaticians to direct their work. However, enzyme data may not be congruent between different resources, while the origin, quality and genomic coverage of these data within any one resource are often unclear. GO/EC annotations are assigned either manually by expert curators or inferred computationally, and there is potential for errors in both types of annotation. If such errors remain unchecked, false positive annotations may be propagated across multiple resources, significantly degrading the quality and usefulness of these data. Similarly, the absence of annotations (false negatives) from any one resource can lead to incorrect inferences or conclusions. We are systematically reviewing and enhancing the functional annotation of the enzymes of Drosophila melanogaster, focusing on improvements within the FlyBase (www.flybase.org) database. We have reviewed four major enzyme groups to date: oxidoreductases, lyases, isomerases and ligases. Herein, we describe our review workflow, the improvement in the quality and coverage of enzyme annotations within FlyBase and the wider impact of our work on other related databases
cisPath: an R/Bioconductor package for cloud users for visualization and management of functional protein interaction networks
Background: With the burgeoning development of cloud technology and services, an increasing number of users prefer to run their applications in the cloud. All software and associated data are hosted on the cloud, allowing users to access them via a web browser from any computer, anywhere. This paper presents cisPath, an R/Bioconductor package deployed on cloud servers for client users to visualize, manage, and share functional protein interaction networks. Results: With this R package, users can easily integrate downloaded protein-protein interaction information from different online databases with private data to construct new and personalized interaction networks. Additional functions allow users to generate specific networks based on private databases. Since the results produced with the use of this package are in the form of web pages, cloud users can easily view and edit the network graphs via the browser, using a mouse or touch screen, without the need to download them to a local computer. This package can also be installed and run on a local desktop computer. Depending on user preference, results can be publicized or shared by uploading to a web server or cloud driver, allowing other users to directly access results via a web browser. Conclusions: This package can be installed and run on a variety of platforms. Since all network views are shown in web pages, the package is particularly useful for cloud users. The easy installation and operation are attractive qualities for R beginners and users with no previous experience with cloud services.