22,268 research outputs found

    The Parallelism Motifs of Genomic Data Analysis

    Get PDF
    Genomic data sets are growing dramatically as the cost of sequencing continues to decline and small sequencing devices become available. Enormous community databases store and share this data with the research community, but some of these genomic data analysis problems require large scale computational platforms to meet both the memory and computational requirements. These applications differ from scientific simulations that dominate the workload on high end parallel systems today and place different requirements on programming support, software libraries, and parallel architectural design. For example, they involve irregular communication patterns such as asynchronous updates to shared data structures. We consider several problems in high performance genomics analysis, including alignment, profiling, clustering, and assembly for both single genomes and metagenomes. We identify some of the common computational patterns or motifs that help inform parallelization strategies and compare our motifs to some of the established lists, arguing that at least two key patterns, sorting and hashing, are missing

    Progressive Mauve: Multiple alignment of genomes with gene flux and rearrangement

    Full text link
    Multiple genome alignment remains a challenging problem. Effects of recombination including rearrangement, segmental duplication, gain, and loss can create a mosaic pattern of homology even among closely related organisms. We describe a method to align two or more genomes that have undergone large-scale recombination, particularly genomes that have undergone substantial amounts of gene gain and loss (gene flux). The method utilizes a novel alignment objective score, referred to as a sum-of-pairs breakpoint score. We also apply a probabilistic alignment filtering method to remove erroneous alignments of unrelated sequences, which are commonly observed in other genome alignment methods. We describe new metrics for quantifying genome alignment accuracy which measure the quality of rearrangement breakpoint predictions and indel predictions. The progressive genome alignment algorithm demonstrates markedly improved accuracy over previous approaches in situations where genomes have undergone realistic amounts of genome rearrangement, gene gain, loss, and duplication. We apply the progressive genome alignment algorithm to a set of 23 completely sequenced genomes from the genera Escherichia, Shigella, and Salmonella. The 23 enterobacteria have an estimated 2.46Mbp of genomic content conserved among all taxa and total unique content of 15.2Mbp. We document substantial population-level variability among these organisms driven by homologous recombination, gene gain, and gene loss. Free, open-source software implementing the described genome alignment approach is available from http://gel.ahabs.wisc.edu/mauve .Comment: Revision dated June 19, 200

    Metric learning pairwise kernel for graph inference

    Full text link
    Much recent work in bioinformatics has focused on the inference of various types of biological networks, representing gene regulation, metabolic processes, protein-protein interactions, etc. A common setting involves inferring network edges in a supervised fashion from a set of high-confidence edges, possibly characterized by multiple, heterogeneous data sets (protein sequence, gene expression, etc.). Here, we distinguish between two modes of inference in this setting: direct inference based upon similarities between nodes joined by an edge, and indirect inference based upon similarities between one pair of nodes and another pair of nodes. We propose a supervised approach for the direct case by translating it into a distance metric learning problem. A relaxation of the resulting convex optimization problem leads to the support vector machine (SVM) algorithm with a particular kernel for pairs, which we call the metric learning pairwise kernel (MLPK). We demonstrate, using several real biological networks, that this direct approach often improves upon the state-of-the-art SVM for indirect inference with the tensor product pairwise kernel

    Global Functional Atlas of \u3cem\u3eEscherichia coli\u3c/em\u3e Encompassing Previously Uncharacterized Proteins

    Get PDF
    One-third of the 4,225 protein-coding genes of Escherichia coli K-12 remain functionally unannotated (orphans). Many map to distant clades such as Archaea, suggesting involvement in basic prokaryotic traits, whereas others appear restricted to E. coli, including pathogenic strains. To elucidate the orphans’ biological roles, we performed an extensive proteomic survey using affinity-tagged E. coli strains and generated comprehensive genomic context inferences to derive a high-confidence compendium for virtually the entire proteome consisting of 5,993 putative physical interactions and 74,776 putative functional associations, most of which are novel. Clustering of the respective probabilistic networks revealed putative orphan membership in discrete multiprotein complexes and functional modules together with annotated gene products, whereas a machine-learning strategy based on network integration implicated the orphans in specific biological processes. We provide additional experimental evidence supporting orphan participation in protein synthesis, amino acid metabolism, biofilm formation, motility, and assembly of the bacterial cell envelope. This resource provides a “systems-wide” functional blueprint of a model microbe, with insights into the biological and evolutionary significance of previously uncharacterized proteins

    PhylOTU: a high-throughput procedure quantifies microbial community diversity and resolves novel taxa from metagenomic data.

    Get PDF
    Microbial diversity is typically characterized by clustering ribosomal RNA (SSU-rRNA) sequences into operational taxonomic units (OTUs). Targeted sequencing of environmental SSU-rRNA markers via PCR may fail to detect OTUs due to biases in priming and amplification. Analysis of shotgun sequenced environmental DNA, known as metagenomics, avoids amplification bias but generates fragmentary, non-overlapping sequence reads that cannot be clustered by existing OTU-finding methods. To circumvent these limitations, we developed PhylOTU, a computational workflow that identifies OTUs from metagenomic SSU-rRNA sequence data through the use of phylogenetic principles and probabilistic sequence profiles. Using simulated metagenomic data, we quantified the accuracy with which PhylOTU clusters reads into OTUs. Comparisons of PCR and shotgun sequenced SSU-rRNA markers derived from the global open ocean revealed that while PCR libraries identify more OTUs per sequenced residue, metagenomic libraries recover a greater taxonomic diversity of OTUs. In addition, we discover novel species, genera and families in the metagenomic libraries, including OTUs from phyla missed by analysis of PCR sequences. Taken together, these results suggest that PhylOTU enables characterization of part of the biosphere currently hidden from PCR-based surveys of diversity
    corecore