217 research outputs found

    Provenance, propagation and quality of biological annotation

    PhD Thesis. Biological databases have become an integral part of the life sciences, being used to store, organise and share ever-increasing quantities and types of data. Biological databases are typically centred around raw data, with individual entries being assigned to a single piece of biological data, such as a DNA sequence. Although essential, a reader can obtain little information from the raw data alone. Therefore, many databases aim to supplement their entries with annotation, allowing the current knowledge about the underlying data to be conveyed to a reader. Although annotations come in many different forms, most databases provide some form of free text annotation. Given that annotations can form the foundations of future work, it is important that a user is able to evaluate the quality and correctness of an annotation. However, this is rarely straightforward. The amount of annotation, and the way in which it is curated, varies between databases. For example, the production of an annotation in some databases is entirely automated, without any manual intervention. Further, sections of annotations may be reused, being propagated between entries and, potentially, external databases. This provenance and curation information is not always apparent to a user. The work described within this thesis explores issues relating to biological annotation quality. While the most valuable annotation is often contained within free text, its lack of structure makes it hard to assess. Initially, this work describes a generic approach that allows textual annotations to be quantitatively measured. This approach is based upon the application of Zipf's Law to words within textual annotation, resulting in a single value, α. The relationship between the α value and Zipf's principle of least effort provides an indication as to the annotation's quality, whilst also allowing annotations to be quantitatively compared. Secondly, the thesis focuses on determining annotation provenance and tracking any subsequent propagation. This is achieved through the development of a visualisation framework, which exploits the reuse of sentences within annotations. Utilising this framework, a number of propagation patterns were identified, which on analysis appear to indicate low quality and erroneous annotation. Together, these approaches increase our understanding of the textual characteristics of biological annotation, and suggest that this understanding can be used to increase the overall quality of these resources.
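
    As a rough illustration of how word frequencies in an annotation can be reduced to a single Zipf exponent, the sketch below fits a straight line to log(frequency) against log(rank). It is a minimal sketch under stated assumptions: the least-squares fit, the tokenisation rule and the function name `zipf_exponent` are illustrative choices, not the fitting procedure used in the thesis.

```python
import math
import re
from collections import Counter


def zipf_exponent(text: str) -> float:
    """Estimate alpha in frequency ~ rank**(-alpha).

    Assumption: a plain least-squares fit on the log-log rank/frequency
    plot; the thesis may use a different estimator.
    """
    # Hypothetical tokenisation: lowercase runs of letters, digits, apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    # Word frequencies sorted from most to least common (rank 1, 2, ...).
    freqs = sorted(Counter(words).values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct words")

    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # The fitted slope is -alpha, so negate it.
    return -sxy / sxx
```

    Computing this value for two annotation texts gives one number per text, which is the kind of quantitative comparison the abstract describes.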

    Pre-validation of a MALDI MS proteomics-based method for the reliable detection of blood and blood provenance

    Abstract: The reliable identification of blood, as well as the determination of its origin (human or animal) is of great importance in a forensic investigation. Whilst presumptive tests are rapid and deployed in situ, their very nature requires confirmatory tests to be performed remotely. However, only serological tests can determine blood provenance. The present study improves on a previously devised Matrix Assisted Laser Desorption Ionisation Mass Spectrometry (MALDI MS)—proteomics based method for the reliable detection of blood by enabling the determination of blood provenance. The overall protocol was developed to be more specific than presumptive tests and faster/easier than the gold standard liquid chromatography (LC) MS/MS analysis. This is considered a pre-validation study that has investigated stains and fingermarks made in blood, other biofluids and substances that can elicit a false-positive response to colorimetric or presumptive tests, in a blind fashion. Stains and marks were either untreated or enhanced with a range of presumptive tests. Human and animal blood were correctly discriminated from other biofluids and non-biofluid related matrices; animal species determination was also possible within the system investigated. The procedure is compatible with the prior application of presumptive tests. The refined strategy resulting from iterative improvements through a trial and error study of 56 samples was applied to a final set of 13 blind samples. This final study yielded 12/13 correct identifications with the 13th sample being correctly identified as animal blood but with no species attribution. This body of work will contribute towards the validation of MALDI MS based methods and deployment in violent crimes involving bloodshed

    Resolution of the type material of the Asian elephant, Elephas maximus Linnaeus, 1758 (Proboscidea, Elephantidae)

    The understanding of Earth’s biodiversity depends critically on the accurate identification and nomenclature of species. Many species were described centuries ago, and in a surprising number of cases their nomenclature or type material remain unclear or inconsistent. A prime example is provided by Elephas maximus, one of the most iconic and well-known mammalian species, described and named by Linnaeus (1758) and today designating the Asian elephant. We used morphological, ancient DNA (aDNA), and high-throughput ancient proteomic analyses to demonstrate that a widely discussed syntype specimen of E. maximus, a complete foetus preserved in ethanol, is actually an African elephant, genus Loxodonta. We further discovered that an additional E. maximus syntype, mentioned in a description by John Ray (1693) cited by Linnaeus, has been preserved as an almost complete skeleton at the Natural History Museum of the University of Florence. Having confirmed its identity as an Asian elephant through both morphological and ancient DNA analyses, we designate this specimen as the lectotype of E. maximus

    Protein Bioinformatics Infrastructure for the Integration and Analysis of Multiple High-Throughput “omics” Data

    High-throughput “omics” technologies bring new opportunities for biological and biomedical researchers to ask complex questions and gain new scientific insights. However, the voluminous, complex, and context-dependent data being maintained in heterogeneous and distributed environments, together with the lack of well-defined data standards and standardized nomenclature, pose a major challenge. Meeting it requires advanced computational methods and bioinformatics infrastructures for integration, mining, visualization, and comparative analysis to facilitate data-driven hypothesis generation and biological knowledge discovery. In this paper, we present the challenges in high-throughput “omics” data integration and analysis, introduce a protein-centric approach for systems integration of large and heterogeneous high-throughput “omics” data including microarray, mass spectrometry, protein sequence, protein structure, and protein interaction data, and use a scientific case study to illustrate how one can use varied “omics” data from different laboratories to make useful connections that could lead to new biological knowledge.

    Scop3P : a comprehensive resource of human phosphosites within their full context

    Protein phosphorylation is a key post-translational modification in many biological processes and is associated with human diseases such as cancer and metabolic disorders. The accurate identification, annotation, and functional analysis of phosphosites are therefore crucial to understand their various roles. Phosphosites are mainly analyzed through phosphoproteomics, which has led to increasing amounts of publicly available phosphoproteomics data. Several resources have been built around the resulting phosphosite information, but these are usually restricted to the protein sequence and basic site metadata. What is often missing from these resources, however, is context, including protein structure mapping, experimental provenance information, and biophysical predictions. We therefore developed Scop3P: a comprehensive database of human phosphosites within their full context. Scop3P integrates sequences (UniProtKB/Swiss-Prot), structures (PDB), and uniformly reprocessed phosphoproteomics data (PRIDE) to annotate all known human phosphosites. Furthermore, these sites are put into biophysical context by annotating each phosphoprotein with per-residue structural propensity, solvent accessibility, disorder probability, and early folding information. Scop3P, available at https://iomics.ugent.be/scop3p, presents a unique resource for visualization and analysis of phosphosites and for understanding of phosphosite structure–function relationships.

    UniCarbKB: building a knowledge platform for glycoproteomics

    The UniCarb KnowledgeBase (UniCarbKB; http://unicarbkb.org) offers public access to a growing, curated database of information on the glycan structures of glycoproteins. UniCarbKB is an international effort that aims to further our understanding of structures, pathways and networks involved in glycosylation and glyco-mediated processes by integrating structural, experimental and functional glycoscience information. This initiative builds upon the success of the glycan structure database GlycoSuiteDB, together with the informatic standards introduced by EUROCarbDB, to provide a high-quality and updated resource to support glycomics and glycoproteomics research. UniCarbKB provides comprehensive information concerning glycan structures, and published glycoprotein information including global and site-specific attachment information. For the first release over 890 references, 3740 glycan structure entries and 400 glycoproteins have been curated. Further, 598 protein glycosylation sites have been annotated with experimentally confirmed glycan structures from the literature. Among these are 35 glycoproteins, 502 structures and 60 publications previously not included in GlycoSuiteDB. This article provides an update on the transformation of GlycoSuiteDB (featured in previous NAR Database issues and hosted by ExPASy since 2009) to UniCarbKB and its integration with UniProtKB and GlycoMod. Here, we introduce a refactored database, supported by substantial new curated data collections and intuitive user-interfaces that improve database searching

    The neXtProt knowledgebase on human proteins: current status

    neXtProt (http://www.nextprot.org) is a human protein-centric knowledgebase developed at the SIB Swiss Institute of Bioinformatics. Focused solely on human proteins, neXtProt aims to provide a state-of-the-art resource for the representation of human biology by capturing a wide range of data, precise annotations, fully traceable data provenance and a web interface which enables researchers to find and view information in a comprehensive manner. Since the introductory neXtProt publication, significant advances have been made on three main aspects: the representation of proteomics data, an extended representation of human variants and the development of an advanced search capability built around semantic technologies. These changes are presented in the current neXtProt update

    cisPath: an R/Bioconductor package for cloud users for visualization and management of functional protein interaction networks

    Background: With the burgeoning development of cloud technology and services, an increasing number of users prefer to run their applications in the cloud. All software and associated data are hosted on the cloud, allowing users to access them via a web browser from any computer, anywhere. This paper presents cisPath, an R/Bioconductor package deployed on cloud servers for client users to visualize, manage, and share functional protein interaction networks. Results: With this R package, users can easily integrate downloaded protein-protein interaction information from different online databases with private data to construct new and personalized interaction networks. Additional functions allow users to generate specific networks based on private databases. Since the results produced with the use of this package are in the form of web pages, cloud users can easily view and edit the network graphs via the browser, using a mouse or touch screen, without the need to download them to a local computer. This package can also be installed and run on a local desktop computer. Depending on user preference, results can be publicized or shared by uploading to a web server or cloud driver, allowing other users to directly access results via a web browser. Conclusions: This package can be installed and run on a variety of platforms. Since all network views are shown in web pages, the package is particularly useful for cloud users. The easy installation and operation are attractive qualities for R beginners and users with no previous experience with cloud services.