
    SWAPHI: Smith-Waterman Protein Database Search on Xeon Phi Coprocessors

    The maximal sensitivity of the Smith-Waterman (SW) algorithm has enabled its wide use in biological sequence database search. Unfortunately, the high sensitivity comes at the expense of quadratic time complexity, which makes the algorithm computationally demanding for big databases. In this paper, we present SWAPHI, the first parallelized algorithm employing Xeon Phi coprocessors to accelerate SW protein database search. SWAPHI is designed based on the scale-and-vectorize approach, i.e. it boosts alignment speed by effectively utilizing both the coarse-grained parallelism from the many co-processing cores (scale) and the fine-grained parallelism from the 512-bit wide single instruction, multiple data (SIMD) vectors within each core (vectorize). By searching against the large UniProtKB/TrEMBL protein database, SWAPHI achieves a performance of up to 58.8 billion cell updates per second (GCUPS) on one coprocessor and up to 228.4 GCUPS on four coprocessors. Furthermore, it demonstrates good parallel scalability on a varying number of coprocessors and, when using four coprocessors, is superior to both SWIPE on 16 high-end CPU cores and BLAST+ on 8 cores, with maximum speedups of 1.52 and 1.86, respectively. SWAPHI is written in C++ (with a set of SIMD intrinsics) and is freely available at http://swaphi.sourceforge.net.
    Comment: A short version of this paper has been accepted by the IEEE ASAP 2014 conference.
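
    To make the "quadratic time complexity" and the GCUPS figures in the SWAPHI abstract concrete, here is a minimal scalar sketch of Smith-Waterman scoring. It is not SWAPHI's vectorized C++ implementation: the function names, the linear gap penalty, and the match/mismatch scores are illustrative assumptions, whereas protein searches normally use a substitution matrix and affine gap penalties.

```python
def smith_waterman_score(query, target, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of two sequences.

    Assumption: simple match/mismatch scores and a linear gap penalty,
    chosen only to keep the sketch short.
    """
    prev = [0] * (len(target) + 1)  # previous row of the DP matrix
    best = 0
    for q in query:
        curr = [0]  # column 0 is always 0 for local alignment
        for j, t in enumerate(target, start=1):
            diag = prev[j - 1] + (match if q == t else mismatch)
            # One "cell update": every (query, target) residue pair is
            # visited once, hence the quadratic running time.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def gcups(query_length, database_residues, seconds):
    """Billion cell updates per second for one query against a database."""
    return query_length * database_residues / seconds / 1e9
```

    Under these assumptions, one query of length m against a database with n residues in total costs m * n cell updates, which is the quantity that GCUPS normalizes by wall-clock time.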

    Fuse: Multiple Network Alignment via Data Fusion


    A methodology for determining amino-acid substitution matrices from set covers

    We introduce a new methodology for the determination of amino-acid substitution matrices for use in the alignment of proteins. The new methodology is based on a pre-existing set cover on the set of residues and on the undirected graph that describes residue exchangeability given the set cover. For fixed functional forms indicating how to obtain edge weights from the set cover and, after that, substitution-matrix elements from weighted distances on the graph, the resulting substitution matrix can be checked for performance against some known set of reference alignments and for given gap costs. Finding the appropriate functional forms and gap costs can then be formulated as an optimization problem that seeks to maximize the performance of the substitution matrix on the reference alignment set. We give computational results on the BAliBASE suite using a genetic algorithm for optimization. Our results indicate that it is possible to obtain substitution matrices whose performance is either comparable to or surpasses that of several others, depending on the particular scenario under consideration.

    Global Network Alignment

    Motivation: High-throughput methods for detecting molecular interactions have led to a plethora of biological network data, with much more yet to come, stimulating the development of techniques for biological network alignment. Analogous to sequence alignment, efficient and reliable network alignment methods will improve our understanding of biological systems. Network alignment is computationally hard. Hence, devising efficient network alignment heuristics is currently one of the foremost challenges in computational biology.

    Results: We present a superior heuristic network alignment algorithm, called Matching-based GRAph ALigner (M-GRAAL), which can process and integrate any number and type of similarity measures between network nodes (e.g., proteins), including, but not limited to, any topological network similarity measure, sequence similarity, functional similarity, and structural similarity. It is efficient at resolving ties in similarity measures and at finding a combination of similarity measures yielding the largest biologically sound alignments. When used to align protein-protein interaction (PPI) networks of various species, M-GRAAL exposes the largest known functional and contiguous regions of network similarity. Hence, we use M-GRAAL's alignments to predict functions of un-annotated proteins in yeast, human, and the bacteria _C. jejuni_ and _E. coli_. Furthermore, using M-GRAAL to compare PPI networks of different herpes viruses, we reconstruct their phylogenetic relationship; our phylogenetic tree is the same as the sequence-based one.

    A MOSAIC of methods: Improving ortholog detection through integration of algorithmic diversity

    Ortholog detection (OD) is a critical step for comparative genomic analysis of protein-coding sequences. In this paper, we begin with a comprehensive comparison of four popular, methodologically diverse OD methods: MultiParanoid, Blat, Multiz, and OMA. In head-to-head comparisons, these methods are shown to significantly outperform one another 12-30% of the time. This high complementarity motivates the presentation of the first tool for integrating methodologically diverse OD methods. We term this program MOSAIC, or Multiple Orthologous Sequence Analysis and Integration by Cluster optimization. Relative to component and competing methods, we demonstrate that MOSAIC more than quintuples the number of alignments for which all species are present, while simultaneously maintaining or improving functional-, phylogenetic-, and sequence identity-based measures of ortholog quality. Further, we demonstrate that this improvement in alignment quality yields 40-280% more confidently aligned sites. Combined, these factors translate to higher estimated levels of overall conservation, while at the same time allowing for the detection of up to 180% more positively selected sites. MOSAIC is available as a Python package. MOSAIC alignments, source code, and full documentation are available at http://pythonhosted.org/bio-MOSAIC

    Family-specific degenerate primer design: a tool to design consensus degenerated oligonucleotides

    Designing degenerate PCR primers for templates of unknown nucleotide sequence may be a very difficult task. In this paper, we present a new method to design degenerate primers, implemented in the family-specific degenerate primer design (FAS-DPD) computer software, for which the starting point is a multiple alignment of related amino acid or nucleotide sequences. To assess their efficiency, four different genome collections were used, covering a wide range of genomic lengths: Arenavirus ( nucleotides), Baculovirus ( to  bp), Lactobacillus sp. ( to  bp), and Pseudomonas sp. ( to  bp). In each case, FAS-DPD-designed primers were tested computationally to measure specificity. Designed primers for Arenavirus and Baculovirus were tested experimentally. The method presented here is useful for designing degenerate primers on collections of related protein sequences, allowing detection of new family members.
    Affiliation: Iserte, Javier Alonso. Universidad Nacional de Quilmes. Departamento de Ciencia y Tecnología. Laboratorio de Ingeniería Genética y Biología Molecular y Celular. Área de Virosis Emergentes y Zoonótica; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina
    Affiliation: Stephan, Betina Inés. Universidad Nacional de Quilmes. Departamento de Ciencia y Tecnología. Laboratorio de Ingeniería Genética y Biología Molecular y Celular. Área de Virosis Emergentes y Zoonótica; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina
    Affiliation: Goñi, Sandra Elizabeth. Universidad Nacional de Quilmes. Departamento de Ciencia y Tecnología. Laboratorio de Ingeniería Genética y Biología Molecular y Celular. Área de Virosis Emergentes y Zoonótica; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina
    Affiliation: Borio, Cristina Silvia. Universidad Nacional de Quilmes. Departamento de Ciencia y Tecnología. Laboratorio de Ingeniería Genética y Biología Molecular y Celular. Área de Virosis Emergentes y Zoonótica; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina
    Affiliation: Ghiringhelli, Pablo Daniel. Universidad Nacional de Quilmes. Departamento de Ciencia y Tecnología. Laboratorio de Ingeniería Genética y Biología Molecular y Celular. Área Virus de Insectos; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina
    Affiliation: Lozano, Mario Enrique. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina. Universidad Nacional de Quilmes. Departamento de Ciencia y Tecnología. Laboratorio de Ingeniería Genética y Biología Molecular y Celular. Área de Virosis Emergentes y Zoonótica; Argentina
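
    The idea of a consensus degenerate oligonucleotide mentioned in the FAS-DPD entry above can be illustrated with a small sketch. This is not the FAS-DPD algorithm, whose scoring and primer selection the abstract does not describe. The function name and the choice to reject gapped columns are assumptions. The sketch only collapses each column of a gap-free nucleotide alignment into its standard IUPAC ambiguity code and reports the resulting degeneracy (the number of distinct sequences the consensus encodes).

```python
# Standard IUPAC nucleotide ambiguity codes, keyed by the sorted set of bases.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}


def degenerate_consensus(aligned_seqs):
    """Collapse a gap-free nucleotide alignment into an IUPAC consensus.

    Returns (consensus, degeneracy). Hypothetical helper for illustration:
    gaps and already-ambiguous input symbols are rejected, not handled.
    """
    if len({len(s) for s in aligned_seqs}) != 1:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    consensus = []
    degeneracy = 1
    for column in zip(*aligned_seqs):
        bases = "".join(sorted({b.upper() for b in column}))
        if bases not in IUPAC_CODES:
            raise ValueError(f"unsupported symbols in column: {bases!r}")
        consensus.append(IUPAC_CODES[bases])
        degeneracy *= len(bases)  # each column multiplies the variant count
    return "".join(consensus), degeneracy
```

    A practical tool would additionally have to choose windows of the alignment that keep the degeneracy low, which is the part this sketch leaves out.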

    Dissecting the Specificity of Protein-Protein Interaction in Bacterial Two-Component Signaling: Orphans and Crosstalks

    Predictive understanding of the myriad signal transduction pathways in a cell is an outstanding challenge of systems biology. Such pathways are primarily mediated by specific but transient protein-protein interactions, which are difficult to study experimentally. In this study, we dissect the specificity of protein-protein interactions governing two-component signaling (TCS) systems ubiquitously used in bacteria. Exploiting the large number of sequenced bacterial genomes and an operon structure which packages many pairs of interacting TCS proteins together, we developed a computational approach to extract a molecular interaction code capturing the preferences of a small but critical number of directly interacting residue pairs. This code is found to reflect physical interaction mechanisms, with the strongest signal coming from charged amino acids. It is used to predict the specificity of TCS interaction: our results compare favorably to most available experimental results, including the prediction of 7 (out of 8 known) interaction partners of orphan signaling proteins in Caulobacter crescentus. A survey of the available bacterial genomes suggests that 15-25% of the TCS proteins could participate in out-of-operon "crosstalks". Additionally, we predict clusters of crosstalking candidates, expanding from the anecdotally known examples in model organisms. The tools and results presented here can be used to guide experimental studies towards a system-level understanding of two-component signaling.
    Comment: Supplementary information available on http://www.plosone.org/article/info:doi/10.1371/journal.pone.001972

    Exploring deep phylogenies using protein structure : a dissertation submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Biochemistry, Institute of Natural and Mathematical Sciences, Massey University, Auckland, New Zealand

    Recent times have seen an exponential growth in protein sequence and structure data. The most popular way of characterising newly determined protein sequences is to compare them to well characterised sequences and predict the function of novel sequences based on homology. This practice has been highly successful for a majority of proteins. However, these sequence-based methods struggle with certain deeply diverging proteins and hence cannot always recover evolutionary histories. Another feature of proteins, namely their structures, has been shown to retain evolutionary signals over longer time scales than the sequences that encode them. The structure therefore presents an opportunity to uncover the evolutionary signal that otherwise escapes conventional sequence-based methods. Structural phylogenetics refers to the comparison of protein structures to extract evolutionary relationships. The area of structural phylogenetics has been around for a number of years and multiple approaches exist to delineate evolutionary relationships from protein structures. However, once the relationships have been recovered from protein structural data, no methods exist at present to verify the robustness of these relationships. Because of the nature of the structural data, conventional sequence-based methods, e.g. bootstrapping, cannot be applied. This work introduces the first use of a molecular dynamics (MD)-based bootstrap method, which can add a measure of significance to the relationships inferred from the structure-based analysis. This work begins in Chapter 2 by thoroughly investigating the use of a protein structural comparison metric, Qscore, which has previously been used to generate structural phylogenies, and highlights its strengths and weaknesses.
    The mechanistic exploration of the structural comparison metric reveals that the protein structures being compared must differ in size by no more than 5-10% for accurate phylogenetic inference to be made. Chapter 2 also explores the MD-based bootstrap method to offer an interpretation of the significance values recovered. Two protein structural datasets, one relatively more conserved at the sequence level than the other and with different levels of structural conservation, are used as controls to simplify the interpretation of the statistics recovered from the MD-based bootstrap method. Chapter 3 then sees the application of the Qscore metric to the aminoacyl-tRNA synthetases. The aminoacyl-tRNA synthetases are believed to have been present at the dawn of life, making them one of the most ancient protein families. Due to the important functional role they play, these proteins are conserved at both sequence and structural levels and well-characterised using both sequence and structure-based comparative methods. This family therefore offered inferences which could be informed with structural analysis using an automated method. Successful recovery of known relationships raised confidence in the ability of structural phylogenetic analysis based on Qscore to detect evolutionary signals. In Chapter 4, a structural phylogeny was created for a protein structural dataset presenting either the histone fold or its ancestral precursor. This structural dataset comprised proteins that were significantly diverged at the sequence level but shared a common structural motif. The structural phylogeny recovered the split between bacterial and non-bacterial proteins. Furthermore, TATA protein associated factors were found to have multiple points of origin. Moreover, some mismatch was found between the classifications of these proteins in SCOP and Pfam, which also did not agree with the results from this work.
    Using the structural phylogeny, a model outlining the evolution of these proteins was proposed. The structural phylogeny of the Ferritin-like superfamily has previously been generated using the Qscore metric and supported qualitatively. Chapter 5 recovers the structural phylogeny of the Ferritin-like superfamily and finds quantitative support for the inferred relationships from the first implementation of the MD-based bootstrap method. The use of the MD-based bootstrap method simultaneously allows for the resolution of polytomies in structural databases. Some limitations of the MD-based bootstrap method, highlighted in Chapter 2, are revisited in Chapter 5. This work indicates that evolutionary signals can be successfully extracted from protein structures for deeply diverging proteins and that the MD-based bootstrap method can be used to gauge the robustness of relationships inferred.