104 research outputs found

    Bioinformatics on the Cloud Computing Platform Azure

    We discuss the applicability of the Microsoft cloud computing platform, Azure, for bioinformatics. We focus on the usability of the resource rather than its performance. We provide an example of how R can be used on Azure to analyse a large amount of microarray expression data deposited at the public database ArrayExpress. We provide a walk-through in Appendix S1 to demonstrate explicitly how Azure can be used to perform these analyses, and we offer a comparison with a local computation. We note that the Platform as a Service (PaaS) offering of Azure can present a steep learning curve for bioinformatics developers, who will usually have a Linux and scripting language background. On the other hand, the presence of an additional set of libraries makes it easier to deploy software in a parallel (scalable) fashion and to explicitly manage such a production run with only a few hundred lines of code, most of which can be incorporated from a template. We propose that this environment is best suited for running stable bioinformatics software by users not involved with its development. © 2014 Shanahan et al.

    Processing and analyzing multiple genomes alignments with MafFilter

    As the number of available genome sequences from both closely related species and individuals within species increased, theoretical and methodological convergences between the fields of phylogenomics and population genomics emerged. Population genomics typically focuses on the analysis of variants, while phylogenomics heavily relies on genome alignments. However, these are playing an increasingly important role in studies at the population level. Multiple genome alignments of individuals are used when structural variation is of primary interest and when genome architecture permits the de novo assembly of genome sequences. Here I describe MafFilter, a command-line-driven program for processing genome alignments in the Multiple Alignment Format (MAF). Using concrete examples based on publicly available datasets, I demonstrate how MafFilter can be used to develop efficient and reproducible pipelines with quality assurance for downstream analyses. I further show how MafFilter can be used to perform both basic and advanced population genomic analyses in order to infer the patterns of nucleotide diversity along genomes.

    Short Term Evolution of a Highly Transmissible Methicillin-Resistant Staphylococcus aureus Clone (ST228) in a Tertiary Care Hospital

    Staphylococcus aureus is recognized as one of the major human pathogens and is by far one of the most common nosocomial organisms. The genetic basis for the emergence of highly epidemic strains remains mysterious. Studying the microevolution of the different clones of S. aureus is essential for identifying the forces driving pathogen emergence and spread. The aim of the present study was to determine the genetic changes characterizing a lineage belonging to the South German clone (ST228) that spread over ten years in a tertiary care hospital in Switzerland. For this reason, we compared the whole genome of eight isolates recovered between 2001 and 2008 at the Lausanne hospital. The genetic comparison of these isolates revealed that their genomes are extremely closely related. Yet, a few more important genetic changes, such as the replacement of a plasmid, the loss of large fragments of DNA, or the insertion of transposases, were observed. These transfers of mobile genetic elements shaped the evolution of the ST228 lineage that spread within the Lausanne hospital. Nevertheless, although the strains analyzed differed in their dynamics, we have not been able to link a particular genetic element with spreading success. Finally, the present study showed that new sequencing technologies improve considerably the quality and quantity of information obtained for a single strain; but this information is still difficult to interpret and important investments are required for the technology to become accessible for routine investigations

    Whole genome sequencing to investigate the emergence of clonal complex 23 Neisseria meningitidis serogroup Y disease in the United States

    In the United States, serogroup Y, ST-23 clonal complex Neisseria meningitidis was responsible for an increase in meningococcal disease incidence during the 1990s. This increase was accompanied by antigenic shift of three outer membrane proteins, with a decrease in the population that predominated in the early 1990s as a different population emerged later in that decade. To understand factors that may have been responsible for the emergence of serogroup Y disease, we used whole genome pyrosequencing to investigate genetic differences between isolates from early and late N. meningitidis populations, obtained from meningococcal disease cases in Maryland in the 1990s. The genomes of isolates from the early and late populations were highly similar, with 1231 of 1776 shared genes exhibiting 100% amino acid identity and an average πN = 0.0033 and average πS = 0.0216. However, differences were found in predicted proteins that affect pilin structure and antigen profile and in predicted proteins involved in iron acquisition and uptake. The observed changes are consistent with acquisition of new alleles through horizontal gene transfer. Changes in antigen profile due to the genetic differences found in this study likely allowed the late population to emerge due to escape from population immunity. These findings may predict which antigenic factors are important in the cyclic epidemiology of meningococcal disease

    Annotation Error in Public Databases: Misannotation of Molecular Function in Enzyme Superfamilies

    Due to the rapid release of new data from genome sequencing projects, the majority of protein sequences in public databases have not been experimentally characterized; rather, sequences are annotated using computational analysis. The level of misannotation and the types of misannotation in large public databases are currently unknown and have not been analyzed in depth. We have investigated the misannotation levels for molecular function in four public protein sequence databases (UniProtKB/Swiss-Prot, GenBank NR, UniProtKB/TrEMBL, and KEGG) for a model set of 37 enzyme families for which extensive experimental information is available. The manually curated database Swiss-Prot shows the lowest annotation error levels (close to 0% for most families); the two other protein sequence databases (GenBank NR and TrEMBL) and the protein sequences in the KEGG pathways database exhibit similar and surprisingly high levels of misannotation that average 5%–63% across the six superfamilies studied. For 10 of the 37 families examined, the level of misannotation in one or more of these databases is >80%. Examination of the NR database over time shows that misannotation has increased from 1993 to 2005. The types of misannotation that were found fall into several categories, most associated with “overprediction” of molecular function. These results suggest that misannotation in enzyme superfamilies containing multiple families that catalyze different reactions is a larger problem than has been recognized. Strategies are suggested for addressing some of the systematic problems contributing to these high levels of misannotation

    A High-Resolution View of Genome-Wide Pneumococcal Transformation

    Transformation is an important mechanism of microbial evolution through which bacteria have been observed to rapidly adapt in response to clinical interventions; examples include facilitating vaccine evasion and the development of penicillin resistance in the major respiratory pathogen Streptococcus pneumoniae. To characterise the process in detail, the genomes of 124 S. pneumoniae isolates produced through in vitro transformation were sequenced and recombination events detected. Those recombinations importing the selected marker were independent of unselected events elsewhere in the genome, the positions of which were not significantly affected by local sequence similarity between donor and recipient or mismatch repair processes. However, both types of recombinations were sometimes mosaic, with multiple non-contiguous segments originating from the same molecule of donor DNA. The lengths of the unselected events were exponentially distributed with a mean of 2.3 kb, implying that recombinations are stochastically resolved with a fixed per base probability of 4.4×10⁻⁴ bp⁻¹. This distribution of recombination sizes, coupled with an observed under-representation of large insertions within transferred sequence, suggests transformation has the potential to reduce the size of bacterial genomes, and is unlikely to act as an efficient mechanism for the uptake of accessory genomic loci.
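
    The link between the mean event length and the per-base probability quoted in this abstract can be checked with a little arithmetic. The sketch below is only an illustration and is not taken from the paper: it assumes that, for an exponential length distribution, the per-base resolution probability is the reciprocal of the mean length, and it uses the rounded mean of 2.3 kb. The function names are hypothetical. With the rounded mean the result is about 4.35×10⁻⁴ per bp, close to the reported 4.4×10⁻⁴ per bp; the small gap is plausibly due to rounding of the mean.

```python
import math


def per_base_resolution_probability(mean_length_bp: float) -> float:
    """Rate of an exponential length distribution with the given mean.

    Assumption (not from the paper): the per-base probability is 1 / mean.
    """
    if mean_length_bp <= 0:
        raise ValueError("mean length must be positive")
    return 1.0 / mean_length_bp


def fraction_longer_than(length_bp: float, mean_length_bp: float) -> float:
    """Survival function of the exponential: P(event length > length_bp)."""
    return math.exp(-length_bp / mean_length_bp)


# Rounded mean from the abstract: 2.3 kb = 2300 bp.
MEAN_LENGTH_BP = 2300.0

p = per_base_resolution_probability(MEAN_LENGTH_BP)
tail_10kb = fraction_longer_than(10_000, MEAN_LENGTH_BP)

print(f"per-base probability: {p:.2e} per bp")
print(f"fraction of events longer than 10 kb: {tail_10kb:.4f}")
```

    Under these assumptions only about 1.3% of events would exceed 10 kb, which illustrates why long imports are rare under such a distribution.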

    Bioinformatics for the human microbiome project

    Microbes inhabit virtually all sites of the human body, yet we know very little about the role they play in our health. In recent years, there has been increasing interest in studying human-associated microbial communities, particularly since microbial dysbioses have now been implicated in a number of human diseases [1]–[3]. Dysbiosis, the disruption of the normal microbial community structure, however, is impossible to define without first establishing what “normal microbial community structure” means within the healthy human microbiome. Recent advances in sequencing technologies have made it feasible to perform large-scale studies of microbial communities, providing the tools necessary to begin to address this question [4], [5]. This led to the implementation of the Human Microbiome Project (HMP) in 2007, an initiative funded by the National Institutes of Health Roadmap for Biomedical Research and constructed as a large, genome-scale community research project [6]. Any such project must plan for data analysis, computational methods development, and the public availability of tools and data; here, we provide an overview of the corresponding bioinformatics organization, history, and results from the HMP (Figure 1). Funding: National Institutes of Health (U.S.) (NIH U54HG004969); National Institutes of Health (U.S.) (grant R01HG004885); National Institutes of Health (U.S.) (grant R01HG005975); National Institutes of Health (U.S.) (grant R01HG005969)