6,961 research outputs found

    An exploration of alternative visualisations of the basic helix-loop-helix protein interaction network

    Background: Alternative representations of biochemical networks emphasise different aspects of the data and contribute to the understanding of complex biological systems. In this study we present a variety of automated methods for visualisation of a protein-protein interaction network, using the basic helix-loop-helix (bHLH) family of transcription factors as an example. Results: Network representations that arrange nodes (proteins) according to either continuous or discrete information are investigated, revealing the existence of protein sub-families and the retention of interactions following gene duplication events. Methods of network visualisation in conjunction with a phylogenetic tree are presented, highlighting the evolutionary relationships between proteins, and clarifying the context of network hubs and interaction clusters. Finally, an optimisation technique is used to create a three-dimensional layout of the phylogenetic tree upon which the protein-protein interactions may be projected. Conclusion: We show that by incorporating secondary genomic, functional or phylogenetic information into network visualisation, it is possible to move beyond simple layout algorithms based on network topology towards more biologically meaningful representations. These new visualisations can give structure to complex networks and will greatly help in interpreting their evolutionary origins and functional implications. Three open source software packages (InterView, TVi and OptiMage) implementing our methods are available.

    A Poisson hierarchical modelling approach to detecting copy number variation in sequence coverage data.

    BACKGROUND: The advent of next generation sequencing technology has accelerated efforts to map and catalogue copy number variation (CNV) in genomes of important micro-organisms for public health. A typical analysis of the sequence data involves mapping reads onto a reference genome, calculating the respective coverage, and detecting regions with too-low or too-high coverage (deletions and amplifications, respectively). Current CNV detection methods rely on statistical assumptions (e.g., a Poisson model) that may not hold in general, or require fine-tuning the underlying algorithms to detect known hits. We propose a new CNV detection methodology based on two Poisson hierarchical models, the Poisson-Gamma and Poisson-Lognormal, with the advantage of being sufficiently flexible to describe different data patterns, whilst robust against deviations from the often assumed Poisson model. RESULTS: Using sequence coverage data of 7 Plasmodium falciparum malaria genomes (3D7 reference strain, HB3, DD2, 7G8, GB4, OX005, and OX006), we showed that empirical coverage distributions are intrinsically asymmetric and overdispersed in relation to the Poisson model. We also demonstrated a low baseline false positive rate for the proposed methodology using 3D7 resequencing data and simulation. When applied to the non-reference isolate data, our approach detected known CNV hits, including an amplification of the PfMDR1 locus in DD2 and a large deletion in the CLAG3.2 gene in GB4, and putative novel CNV regions. When compared to the recently available FREEC and cn.MOPS approaches, our findings were more concordant with putative hits from the highest quality array data for the 7G8 and GB4 isolates. CONCLUSIONS: In summary, the proposed methodology brings an increase in flexibility, robustness, accuracy and statistical rigour to CNV detection using sequence coverage data
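The overdispersion point in the abstract above can be illustrated with a minimal sketch. It assumes a Gamma(shape, scale) prior on the Poisson rate, which is the standard Poisson-Gamma construction, not necessarily the exact parameterisation used by the authors. The function names and the coverage values are hypothetical and chosen only for illustration.

```python
from statistics import mean, pvariance


def dispersion_index(counts):
    """Variance-to-mean ratio of a sample of counts.

    Close to 1 for Poisson-like data, clearly above 1 for overdispersed data.
    """
    m = mean(counts)
    if m == 0:
        raise ValueError("dispersion index is undefined when the mean is zero")
    return pvariance(counts) / m


def poisson_gamma_moments(shape, scale):
    """Marginal mean and variance of a count whose Poisson rate is
    Gamma(shape, scale) distributed (assumed parameterisation)."""
    mu = shape * scale
    # The extra term shape * scale**2 is what a plain Poisson model lacks.
    return mu, mu + shape * scale ** 2


# Hypothetical per-window read coverage, not taken from the paper.
coverage = [18, 25, 31, 12, 40, 22, 9, 35]
print(dispersion_index(coverage))
print(poisson_gamma_moments(4, 6))
```

A plain Poisson model forces the variance to equal the mean, so a dispersion index well above 1 in empirical coverage is the kind of deviation the hierarchical models are meant to absorb.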

    PolyTB: a genomic variation map for Mycobacterium tuberculosis

    Tuberculosis (TB) caused by Mycobacterium tuberculosis (Mtb) is the second major cause of death from an infectious disease worldwide. Recent advances in DNA sequencing are leading to the ability to generate whole genome information in clinical isolates of M. tuberculosis complex (MTBC). The identification of informative genetic variants such as phylogenetic markers and those associated with drug resistance or virulence will help barcode Mtb in the context of epidemiological, diagnostic and clinical studies. Mtb genomic datasets are increasingly available as raw sequences, which are potentially difficult and computer intensive to process, and compare across studies. Here we have processed the raw sequence data (>1500 isolates, eight studies) to compile a catalogue of SNPs (n = 74,039, 63% non-synonymous, 51.1% in more than one isolate, i.e. non-private), small indels (n = 4810) and larger structural variants (n = 800). We have developed the PolyTB web-based tool (http://pathogenseq.lshtm.ac.uk/polytb) to visualise the resulting variation and important meta-data (e.g. in silico inferred strain-types, location) within geographical map and phylogenetic views. This resource will allow researchers to identify polymorphisms within candidate genes of interest, as well as examine the genomic diversity and distribution of strains. PolyTB source code is freely available to researchers wishing to develop similar tools for their pathogen of interest

    GrapeTree: visualization of core genomic relationships among 100,000 bacterial pathogens

    Current methods struggle to reconstruct and visualise the genomic relationships of ≥100,000 bacterial genomes. GrapeTree facilitates the analyses of allelic profiles from 10,000s of core genomes within a web browser window. GrapeTree implements a novel minimum spanning tree algorithm to reconstruct genetic relationships despite missing data, together with a static "GrapeTree Layout" algorithm to render interactive visualisations of large trees. GrapeTree is a stand-alone package for investigating Newick trees plus associated metadata and is also integrated into EnteroBase to facilitate cutting-edge navigation of genomic relationships among >160,000 genomes from bacterial pathogens. The GrapeTree package was released under the GPL v3.0 Licence.
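As a rough illustration of building a minimum spanning tree from allelic profiles with missing data, the sketch below uses classical Prim's algorithm with a distance that skips loci where either profile has no call. This is not GrapeTree's own algorithm, which the abstract describes as novel. The use of `None` for a missing allele, the function names, and the example profiles are all assumptions made for this example.

```python
def allelic_distance(p, q):
    """Count loci where both profiles have a call and the calls differ.

    Loci with a missing call (None, an assumed convention) are ignored.
    """
    return sum(
        1 for a, b in zip(p, q)
        if a is not None and b is not None and a != b
    )


def minimum_spanning_tree(profiles):
    """Prim's algorithm over all pairs; returns (parent, child, distance) edges.

    Cubic in the number of profiles, so suitable only for small examples.
    """
    names = list(profiles)
    if not names:
        return []
    in_tree = {names[0]}
    edges = []
    while len(in_tree) < len(names):
        # Cheapest edge from the tree to any profile not yet in it.
        dist, u, v = min(
            (allelic_distance(profiles[u], profiles[v]), u, v)
            for u in in_tree
            for v in names
            if v not in in_tree
        )
        edges.append((u, v, dist))
        in_tree.add(v)
    return edges


# Hypothetical four-locus profiles; D lacks a call at the first locus.
profiles = {
    "A": [1, 1, 1, 1],
    "B": [1, 1, 1, 2],
    "C": [1, 1, 2, 2],
    "D": [None, 1, 2, 2],
}
for edge in minimum_spanning_tree(profiles):
    print(edge)
```

Skipping missing loci makes incomplete profiles look closer to everything else, which is one reason a dedicated treatment of missing data matters at scale.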

    Using sound to understand protein sequence data: New sonification algorithms for protein sequences and multiple sequence alignments

    Funding: This work was supported by the UKRI Biotechnology and Biological Sciences Research Council (BBSRC) grant number BB/M010996/1. Background: The use of sound to represent sequence data (sonification) has great potential as an alternative and complement to visual representation, exploiting features of human psychoacoustic intuitions to convey nuance more effectively. We have created five parameter-mapping sonification algorithms that aim to improve knowledge discovery from protein sequences and small protein multiple sequence alignments. For two of these algorithms, we investigated their effectiveness at conveying information. To do this we focussed on subjective assessments of user experience. This entailed a focus group session and survey research by questionnaire of individuals engaged in bioinformatics research. Results: For single protein sequences, the success of our sonifications for conveying features was supported by both the survey and focus group findings. For protein multiple sequence alignments, there was limited evidence that the sonifications successfully conveyed information. Additional work is required to identify effective algorithms to render multiple sequence alignment sonification useful to researchers. Feedback from both our survey and focus groups suggests future directions for sonification of multiple alignments: animated visualisation indicating the column in the multiple alignment as the sonification progresses, user control of sequence navigation, and customisation of the sound parameters. Conclusions: Sonification approaches undertaken in this work have shown some success in conveying information from protein sequence data. Feedback points out future directions to build on the sonification approaches outlined in this paper. The effectiveness assessment process implemented in this work proved useful, giving detailed feedback and key approaches for improvement based on end-user input. The uptake of similar user experience focussed effectiveness assessments could also help with other areas of bioinformatics, for example in visualisation.

    PlasmoView: a web-based resource to visualise global Plasmodium falciparum genomic variation.

    Malaria is a global public health challenge, with drug resistance a major barrier to disease control and elimination. To meet the urgent need for better treatments and vaccines, a deeper knowledge of Plasmodium biology and malaria epidemiology is required. An improved understanding of the genomic variation of malaria parasites, especially the most virulent Plasmodium falciparum (Pf) species, has the potential to yield new insights in these areas. High-throughput sequencing and genotyping are generating large amounts of genomic data across multiple parasite populations. The resulting ability to identify informative variants, particularly single-nucleotide polymorphisms (SNPs), will lead to the discovery of intra- and inter-population differences and thus enable the development of genetic barcodes for diagnostic assays and clinical studies. Knowledge of genetic variability underlying drug resistance and other differential phenotypes will also facilitate the identification of novel mutations and contribute to surveillance and stratified medicine applications. The PlasmoView interactive web-browsing tool enables the research community to visualise genomic variation and annotation (e.g., biological function) in a geographic setting. The first release contains over 600,000 high-quality SNPs in 631 Pf isolates from laboratory strains and four malaria-endemic regions (West Africa, East Africa, Southeast Asia and Oceania).

    Provenance, propagation and quality of biological annotation

    PhD Thesis. Biological databases have become an integral part of the life sciences, being used to store, organise and share ever-increasing quantities and types of data. Biological databases are typically centred around raw data, with individual entries being assigned to a single piece of biological data, such as a DNA sequence. Although essential, a reader can obtain little information from the raw data alone. Therefore, many databases aim to supplement their entries with annotation, allowing the current knowledge about the underlying data to be conveyed to a reader. Although annotations come in many different forms, most databases provide some form of free text annotation. Given that annotations can form the foundations of future work, it is important that a user is able to evaluate the quality and correctness of an annotation. However, this is rarely straightforward. The amount of annotation, and the way in which it is curated, varies between databases. For example, the production of an annotation in some databases is entirely automated, without any manual intervention. Further, sections of annotations may be reused, being propagated between entries and, potentially, external databases. This provenance and curation information is not always apparent to a user. The work described within this thesis explores issues relating to biological annotation quality. While the most valuable annotation is often contained within free text, its lack of structure makes it hard to assess. Initially, this work describes a generic approach that allows textual annotations to be quantitatively measured. This approach is based upon the application of Zipf's Law to words within textual annotation, resulting in a single value, α. The relationship between the α value and Zipf's principle of least effort provides an indication as to the annotation's quality, whilst also allowing annotations to be quantitatively compared. Secondly, the thesis focuses on determining annotation provenance and tracking any subsequent propagation. This is achieved through the development of a visualisation framework, which exploits the reuse of sentences within annotations. Utilising this framework a number of propagation patterns were identified, which on analysis appear to indicate low quality and erroneous annotation. Together, these approaches increase our understanding of the textual characteristics of biological annotation, and suggest that this understanding can be used to increase the overall quality of these resources.
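To make the idea of reducing free text to a single Zipf-style value concrete, here is a minimal sketch. It assumes a least-squares fit of log frequency against log rank, which is a simple stand-in and not necessarily the fitting procedure used in the thesis. The tokenisation rule and the function name are hypothetical.

```python
import math
import re
from collections import Counter


def zipf_exponent(text):
    """Estimate a Zipf exponent as the negated slope of log(frequency)
    against log(rank), fitted by ordinary least squares."""
    words = re.findall(r"[a-z']+", text.lower())  # assumed tokenisation
    freqs = sorted(Counter(words).values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct words to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return -sxy / sxx


# Toy text whose word counts (12, 6, 4, 3) are proportional to 1 / rank.
toy = "a " * 12 + "b " * 6 + "c " * 4 + "d " * 3
print(zipf_exponent(toy))
```

Comparing the fitted value across annotation sets gives a single number per corpus, which is the kind of quantitative comparison the abstract describes. Least-squares fits on log-log data are known to be biased for real text, so a production analysis would likely use a more careful estimator.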