12 research outputs found

Massively Parallel Algorithm for Multiple Sequence Alignment Based on Artificial Bee Colony

Author: Plamenka Borovska

In silico biological sequence processing is a key task in molecular biology. This scientific area requires powerful computing resources for exploring large sets of biological data. Parallel in silico simulations, based on methods and algorithms for analyzing biological data with high-performance distributed computing, are essential for accelerating research and reducing investment. Multiple sequence alignment is a widely used method for biological sequence processing. The goal of this method is to align DNA and protein sequences. This paper presents an innovative parallel algorithm, MSA_BG, for multiple alignment of biological sequences that is highly scalable and locality aware. The MSA_BG algorithm is iterative and is based on the Artificial Bee Colony (ABC) metaheuristic and on the concept of correlating algorithmic and architectural spaces. The metaphor of the ABC metaheuristic has been constructed and the functionalities of the agents have been defined. The conceptual parallel model of computation has been designed and the algorithmic framework of the parallel algorithm constructed. Experimental simulations based on a parallel implementation of the MSA_BG algorithm for multiple sequence alignment on a heterogeneous compact computer cluster and the BlueGene/P supercomputer have been carried out for a case study investigating influenza virus variability. The performance estimation and profiling analyses have shown that the parallel system is well balanced with respect to both workload and machine size.

ZENODO

JCoDA: a tool for detecting evolutionary selection

Author: Dannenfelser Ruth
Hayes James E
Laucius Christopher D
Nayak Sudhir
Steinway Steven N
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: The incorporation of annotated sequence information from multiple related species in commonly used databases (Ensembl, Flybase, Saccharomyces Genome Database, Wormbase, etc.) has increased dramatically over the last few years. This influx of information has provided a considerable amount of raw material for evaluation of evolutionary relationships. To aid in the process, we have developed JCoDA (Java Codon Delimited Alignment) as a simple-to-use visualization tool for the detection of site-specific and regional positive/negative evolutionary selection amongst homologous coding sequences. Results: JCoDA accepts user-inputted unaligned or pre-aligned coding sequences, performs a codon-delimited alignment using ClustalW, and carries out the dN/dS calculations using PAML (Phylogenetic Analysis Using Maximum Likelihood, yn00 and codeml) in order to identify regions and sites under evolutionary selection. The JCoDA package includes a graphical interface for Phylip (Phylogeny Inference Package) to generate phylogenetic trees, manages formatting of all required file types, and streamlines passage of information between underlying programs. The raw data are output to user-configurable graphs with sliding window options for straightforward visualization of pairwise or gene family comparisons. Additionally, codon-delimited alignments are output in a variety of common formats and all dN/dS calculations can be output in comma-separated value (CSV) format for downstream analysis. To illustrate the types of analyses that are facilitated by JCoDA, we have taken advantage of the well-studied sex determination pathway in nematodes as well as the extensive sequence information available to identify genes under positive selection, examples of regional positive selection, and differences in selection based on the role of genes in the sex determination pathway. Conclusions: JCoDA is a configurable, open source, user-friendly visualization tool for performing evolutionary analysis on homologous coding sequences. JCoDA can be used to rapidly screen for genes and regions of genes under selection using PAML. It can be freely downloaded at http://www.tcnj.edu/~nayaklab/jcoda.
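The sliding window option mentioned in the JCoDA abstract can be pictured with a small sketch. The abstract does not say how JCoDA computes its windows, so the averaging scheme, the function name `sliding_window_means`, and the sample values below are all assumptions made purely for illustration.

```python
def sliding_window_means(values, window, step=1):
    """Return (start_index, mean) for every full window over per-codon values.

    Illustrative only: averaging per-codon estimates is an assumption,
    not a description of how JCoDA or PAML treat windows.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    means = []
    # Only full windows are reported; a trailing partial window is dropped.
    for start in range(0, len(values) - window + 1, step):
        chunk = values[start:start + window]
        means.append((start, sum(chunk) / window))
    return means


# Hypothetical per-codon dN/dS estimates, invented for this example.
example = [0.2, 0.4, 1.5, 1.7, 0.3, 0.1]
for start, mean in sliding_window_means(example, window=3):
    print(f"codons {start}-{start + 2}: mean dN/dS = {mean:.2f}")
```

Smoothing of this kind makes regional trends easier to see than single-site values; dN/dS above 1 is conventionally read as a sign of positive selection.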

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Alignment of Multiple DNA Sequences by Using Improved GA Operators

Author: Manish Kumar
Publication date: 11/04/2020

ABSTRACT: One of the most fundamental operations in biological sequence analysis is multiple sequence alignment (MSA). It is a critical tool for biologists to identify the relationships between species and possibly to predict the structure and functionality of biological sequences. The general multiple sequence alignment problem is known to be NP-hard, and hence the problem of finding the best possible multiple sequence alignment is intractable. Therefore, a genetic algorithm based approach has been designed to solve the multiple DNA sequence alignment problem by using different genetic operators. Experimental results for DNA sequences of different lengths are detailed in this paper. It is also shown how an increase in length affects the overall quality of the alignment. Extensive experiments on a wide range of datasets show the effectiveness of the proposed approach in aligning multiple DNA sequences. KEYWORDS: Multiple Sequence Alignment, Genetic Algorithms (GAs), DNA Sequences. INTRODUCTION: The main components of the biochemical processes of life are proteins and nucleic acids. There are two types of nucleic acids, deoxyribonucleic acid (DNA) and ribonucleic acid (RNA). DNA sequences are long biomolecular strands composed of four types of nucleotide bases: adenine (A), guanine (G), cytosine (C), and thymine (T). DNA actually occurs as a double strand of such bases. The strands are held together by hydrogen bonds between complementary bases: A-T and G-C. DNA sequences, which consist of hundreds of millions of nucleotides, define the genome of a particular species. Recent advances in bioinformatics have generated volumes of genome data for biomedical research. For example, many immunity genes in the fruit fly genome have nucleotide sequences that are reminiscent of TCGGGGATTTC

CiteSeerX

Alignment Metric Accuracy

Author: Myers Eugene W.
Pachter Lior
Schwartz Ariel S.
Publication date: 27/10/2005

We propose a metric for the space of multiple sequence alignments that can be used to compare two alignments to each other. In the case where one of the alignments is a reference alignment, the resulting accuracy measure improves upon previous approaches, and provides a balanced assessment of the fidelity of both matches and gaps. Furthermore, in the case where a reference alignment is not available, we provide empirical evidence that the distance from an alignment produced by one program to predicted alignments from other programs can be used as a control for multiple alignment experiments. In particular, we show that low accuracy alignments can be effectively identified and discarded. We also show that in the case of pairwise sequence alignment, it is possible to find an alignment that maximizes the expected value of our accuracy measure. Unlike previous approaches based on expected accuracy alignment that tend to maximize sensitivity at the expense of specificity, our method is able to identify unalignable sequence, thereby increasing overall accuracy. In addition, the algorithm allows for control of the sensitivity/specificity tradeoff via the adjustment of a single parameter. These results are confirmed with simulation studies that show that unalignable regions can be distinguished from homologous, conserved sequences. Finally, we propose an extension of the pairwise alignment method to multiple alignment. Our method, which we call AMAP, outperforms existing protein sequence multiple alignment programs on benchmark datasets. A webserver and software downloads are available at http://bio.math.berkeley.edu/amap/
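To make the idea of scoring both matches and gaps concrete, here is a minimal sketch for the pairwise case only. It counts the fraction of residues that are treated the same way (aligned to the same partner, or left against a gap) in a predicted and a reference alignment. This is a simplified reading of the abstract and not the AMAP implementation; the function names and the example alignments are assumptions, and the paper should be consulted for the exact definition.

```python
def partners(row_x, row_y):
    """For each residue, record its partner's index in the other sequence,
    or None if the residue is aligned to a gap."""
    px, py = [], []
    i = j = 0  # residue counters for sequence x and sequence y
    for a, b in zip(row_x, row_y):
        if a != "-" and b != "-":
            px.append(j)
            py.append(i)
            i += 1
            j += 1
        elif a != "-":
            px.append(None)
            i += 1
        elif b != "-":
            py.append(None)
            j += 1
        # gap-gap columns carry no information and are skipped
    return px, py


def pairwise_agreement(predicted, reference):
    """Fraction of residues handled identically in both pairwise alignments."""
    px, py = partners(*predicted)
    rx, ry = partners(*reference)
    if len(px) != len(rx) or len(py) != len(ry):
        raise ValueError("alignments must contain the same two sequences")
    total = len(rx) + len(ry)
    same = sum(p == r for p, r in zip(px + py, rx + ry))
    return same / total if total else 1.0


# Hypothetical alignments of the sequences ACGT and AGT.
reference = ("ACGT", "A-GT")
predicted = ("ACGT", "AG-T")
print(pairwise_agreement(predicted, reference))
```

Because a residue placed against a gap counts as correct when the reference also leaves it unaligned, a measure of this shape does not reward over-aligning unrelated sequence.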

arXiv.org e-Print Archive

Caltech Authors

Grammar-based distance in progressive multiple sequence alignment

Author: AY Mitrophanov
C Notredame
CB Do
David J Russell
DJ Lipman
GH Gonnet
Hasan H Otu
HH Otu
J Stoye
J Ziv
JD Thompson
K Katoh
Khalid Sayood
MO Albertson
P Clote
R Durbin
RC Edgar
S Henikoff
S Sze
SB Needleman
VD Gusev
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: We propose a multiple sequence alignment (MSA) algorithm and compare the alignment-quality and execution-time of the proposed algorithm with that of existing algorithms. The proposed progressive alignment algorithm uses a grammar-based distance metric to determine the order in which biological sequences are to be pairwise aligned. The progressive alignment occurs via pairwise aligning new sequences with an ensemble of the sequences previously aligned. Results: The performance of the proposed algorithm is validated via comparison to popular progressive multiple alignment approaches, ClustalW and T-Coffee, and to the more recently developed algorithms MAFFT, MUSCLE, Kalign, and PSAlign using the BAliBASE 3.0 database of amino acid alignment files and a set of longer sequences generated by Rose software. The proposed algorithm has successfully built multiple alignments comparable to other programs with significant improvements in running time. The results are especially striking for large datasets. Conclusion: We introduce a computationally efficient progressive alignment algorithm using a grammar based sequence distance particularly useful in aligning large datasets

Crossref

DigitalCommons@University of Nebraska

Harvard University - DASH

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Finding conserved patterns in biological sequences, networks and genomes

Author: Yang Qingwu
Publication date: 15/05/2009

Biological patterns are widely used for identifying biologically interesting regions within macromolecules, classifying biological objects, predicting functions and studying evolution. Good pattern finding algorithms will help biologists to formulate and validate hypotheses in an attempt to obtain important insights into the complex mechanisms of living things. In this dissertation, we aim to improve and develop algorithms for five biological pattern finding problems. For the multiple sequence alignment problem, we propose an alternative formulation in which a final alignment is obtained by preserving pairwise alignments specified by edges of a given tree. In contrast with traditional NP-hard formulations, our preserving alignment formulation can be solved in polynomial time without using a heuristic, while having very good accuracy. For the path matching problem, we take advantage of the linearity of the query path to reduce the problem to finding a longest weighted path in a directed acyclic graph. We can find k paths with top scores in a network from the query path in polynomial time. As many biological pathways are not linear, our graph matching approach allows a non-linear graph query to be given. Our graph matching formulation overcomes the common weakness of previous approaches, namely that there is no guarantee on the quality of the results. For the gene cluster finding problem, we investigate a formulation based on constraining the overall size of a cluster and develop statistical significance estimates that allow direct comparisons of clusters of different sizes. We explore both a restricted version which requires that orthologous genes are strictly ordered within each cluster, and the unrestricted problem that allows paralogous genes within a genome and clusters that may not appear in every genome. We solve the first problem in polynomial time and develop practical exact algorithms for the second one. 
In the gene cluster querying problem, based on a querying strategy, we propose an efficient approach for investigating clustering of related genes across multiple genomes for a given gene cluster. By analyzing gene clustering in 400 bacterial genomes, we show that our algorithm is efficient enough to study gene clusters across hundreds of genomes

Texas A&M Repository

Protein multiple sequence alignment by hybrid bio-inspired algorithms

Author: Cutello Vincenzo
Nicosia Giuseppe
Pavone Mario
Prizzi Igor
Publication venue: Oxford University Press
Publication date: 01/01/2010

This article presents an immune-inspired algorithm to tackle the Multiple Sequence Alignment (MSA) problem. MSA is one of the most important tasks in biological sequence analysis. Although this paper focuses on protein alignments, most of the discussion and methodology may also be applied to DNA alignments. The problem of finding the multiple alignment was investigated in the studies by Bonizzoni and Vedova and by Wang and Jiang, and proved to be an NP-hard (non-deterministic polynomial-time hard) problem. The presented algorithm, called Immunological Multiple Sequence Alignment Algorithm (IMSA), incorporates two new strategies to create the initial population and specific ad hoc mutation operators. It uses the ‘weighted sum of pairs’ as the objective function to evaluate a given candidate alignment. IMSA was tested using the classical BAliBASE benchmarks (versions 1.0, 2.0 and 3.0), and experimental results indicate that it is comparable with state-of-the-art multiple alignment algorithms in terms of quality of alignments, weighted Sums-of-Pairs (SP) and Column Score (CS) values. The main novelty of IMSA is its ability to generate more than a single suboptimal alignment for every MSA instance; this behaviour is due to the stochastic nature of the algorithm and of the populations evolved during the convergence process. This feature will help the decision maker to assess and select a biologically relevant multiple sequence alignment. Finally, the designed algorithm can be used as a local search procedure to properly explore promising alignments of the search space.
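As a rough illustration of the "weighted sum of pairs" objective named in the abstract, the sketch below scores a candidate alignment by comparing every pair of rows column by column. The match, mismatch, and gap values, the default pair weight of 1.0, and the function name `sum_of_pairs` are assumptions chosen for the example; they are not taken from IMSA, which would normally use a substitution matrix and its own gap model.

```python
from itertools import combinations


def sum_of_pairs(alignment, weights=None, match=1, mismatch=-1, gap=-2):
    """Weighted sum-of-pairs score of an alignment.

    alignment: list of equal-length gapped strings, '-' marking a gap.
    weights:   optional dict mapping (i, j) row pairs (i < j) to a weight;
               pairs that are not listed default to 1.0.
    """
    if not alignment:
        return 0.0
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all rows must have the same length")
    weights = weights or {}
    total = 0.0
    for i, j in combinations(range(len(alignment)), 2):
        pair_score = 0
        for a, b in zip(alignment[i], alignment[j]):
            if a == "-" and b == "-":
                continue  # gap against gap contributes nothing
            if a == "-" or b == "-":
                pair_score += gap
            elif a == b:
                pair_score += match
            else:
                pair_score += mismatch
        total += weights.get((i, j), 1.0) * pair_score
    return total


# Hypothetical three-row alignment.
rows = ["AC-T", "ACGT", "A-GT"]
print(sum_of_pairs(rows))
print(sum_of_pairs(rows, weights={(0, 1): 2.0}))
```

A population-based search such as the one described can call a function like this on each candidate and keep the higher-scoring alignments.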

CiteSeerX

PubMed Central

A Domain Decomposition Strategy for Alignment of Multiple Biological Sequences on Multiprocessor Platforms

Author: Ashfaq Khokhar
Berger
Cline
Crandall
Do
Edgar
Fahad Saeed
Hambrusch
Hanmao
Jones
Kaddoura
Kumar
Lassmann
Mikhailov
Morgenstern
Muller
Notredame
Pilkington
Ronaghi
Saeed
Sauder
Schmollinger
Schwartz
SF
Smith
Stoye
Sze
Thompson
Wang
Willebeek-LeMair
Publication venue: Elsevier BV
Publication date: 11/05/2009

Multiple Sequence Alignment (MSA) of biological sequences is a fundamental problem in computational biology due to its critical significance in wide-ranging applications including haplotype reconstruction, sequence homology, phylogenetic analysis, and prediction of evolutionary origins. The MSA problem is considered NP-hard, and known heuristics for the problem do not scale well with an increasing number of sequences. On the other hand, with the advent of a new breed of fast sequencing techniques, it is now possible to generate thousands of sequences very quickly. For rapid sequence analysis, it is therefore desirable to develop fast MSA algorithms that scale well with the increase in dataset size. In this paper, we present a novel domain decomposition based technique to solve the MSA problem on multiprocessing platforms. The domain decomposition based technique, in addition to yielding better quality, gives an enormous advantage in terms of execution time and memory requirements. The proposed strategy makes it possible to decrease the time complexity of any known heuristic of O(N)^x complexity by a factor of O(1/p)^x, where N is the number of sequences, x depends on the underlying heuristic approach, and p is the number of processing nodes. In particular, we propose a highly scalable algorithm, Sample-Align-D, for aligning biological sequences using the Muscle system as the underlying heuristic. The proposed algorithm has been implemented on a cluster of workstations using the MPI library. Experimental results for different problem sizes are analyzed in terms of quality of alignment, execution time and speed-up. Comment: 36 pages, 17 figures; accepted manuscript in the Journal of Parallel and Distributed Computing (JPDC).
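The complexity claim in the abstract can be checked with a little arithmetic. The sketch below assumes the simplest reading: the N sequences are split evenly over p nodes, each node runs a heuristic costing on the order of N^x on its share, and sampling, communication, and merging costs are ignored. The function name and the sample values are assumptions for illustration, not part of Sample-Align-D.

```python
def per_node_work_ratio(n_sequences, n_nodes, x):
    """Ratio of per-node work (N/p)^x to single-node work N^x.

    Under the even-split assumption this equals (1/p)^x.
    """
    if n_sequences <= 0 or n_nodes <= 0:
        raise ValueError("n_sequences and n_nodes must be positive")
    per_node = (n_sequences / n_nodes) ** x
    single_node = n_sequences ** x
    return per_node / single_node


# Hypothetical setting: 1024 sequences and a quadratic heuristic (x = 2).
for p in (1, 4, 16):
    print(f"p = {p:2d}: per-node work is {per_node_work_ratio(1024, p, 2):.6f} of sequential")
```

With x = 2, quadrupling the node count cuts the per-node work by a factor of 16 under this idealized model, which is why the gain grows with the exponent of the underlying heuristic.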

arXiv.org e-Print Archive

Crossref

Improving the quality of multiple sequence alignment

Author: Lu Yue
Publication date: 15/05/2009

Multiple sequence alignment is an important bioinformatics problem, with applications in diverse types of biological analysis, such as structure prediction, phylogenetic analysis and identification of critical sites. In recent years, the quality of multiple sequence alignment has improved considerably thanks to newly developed methods, although constructing accurate alignments remains difficult, especially for divergent sequences. In this dissertation, we propose three new methods (PSAlign, ISPAlign, and NRAlign) for further improving the quality of multiple sequence alignment. In PSAlign, we propose an alternative formulation of multiple sequence alignment based on the idea of finding a multiple alignment which preserves all the pairwise alignments specified by edges of a given tree. In contrast with traditional NP-hard formulations, our preserving alignment formulation can be solved in polynomial time without using a heuristic, while still retaining very good performance when compared to traditional heuristics. In ISPAlign, by using additional hits from a database search of the input sequences, a few strategies have been proposed to significantly improve alignment accuracy, including the construction of profiles from the hits while performing profile alignment, the inclusion of high-scoring hits in the input sequences, the use of intermediate sequence search to link distant homologs, and the use of secondary structure information. In NRAlign, we observe that it is possible to further improve alignment accuracy by taking into account the alignment of neighboring residues when aligning two residues, thus making better use of horizontal information. By modifying existing multiple alignment algorithms to make use of horizontal information, we show that this strategy is able to consistently improve over existing algorithms on all the benchmarks that are commonly used to measure alignment accuracy.

Texas A&M Repository

MSACompro: protein multiple sequence alignment using predicted secondary structure, solvent accessibility, and residue-residue contacts

Author: A Krogh
AN Tegge
C Notredame
CB Do
DF Feng
DG Higgins
F Jeanmougin
F Wilcoxon
G Pollastri
GH Gonnet
GJ Barton
GP Raghava
HY Zhou
J Cheng
J Heringa
J Pei
J Söding
JD Thompson
Jianlin Cheng
K Katoh
M Brudno
M Larkin
NK Kim
NS Boutonnet
O Poirot
PHA Sneath
R Chenna
R Durbin
RC Edgar
RK Bradley
RS Amarendran
S Chikkagoudar
SE Brenner
SH Sze
T Kawabata
TL Bailey
U Roshan
V Walle
Xin Deng
YC Liu
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: Multiple Sequence Alignment (MSA) is a basic tool for bioinformatics research and analysis. It has been used in almost all bioinformatics tasks, such as protein structure modeling, gene and protein function prediction, DNA motif recognition, and phylogenetic analysis. Therefore, improving the accuracy of multiple sequence alignment is important for advancing many bioinformatics fields. Results: We designed and developed a new method, MSACompro, to synergistically incorporate predicted secondary structure, relative solvent accessibility, and residue-residue contact information into the currently most accurate posterior probability-based MSA methods to improve the accuracy of multiple sequence alignments. The method differs from multiple sequence alignment methods (e.g. 3D-Coffee) that use the tertiary structure information of some sequences, since the structural information in our method is predicted entirely from sequences. To the best of our knowledge, applying predicted relative solvent accessibility and contact maps to multiple sequence alignment is novel. Rigorous benchmarking of our method on standard benchmarks (i.e. BAliBASE, SABmark and OXBENCH) clearly demonstrated that incorporating predicted protein structural information improves multiple sequence alignment accuracy over leading multiple protein sequence alignment tools that do not use this information, such as MSAProbs, ProbCons, Probalign, T-coffee, MAFFT and MUSCLE. The performance of the method is comparable to, though slightly lower than, that of the state-of-the-art method PROMALS, which uses structural features and additional homologous sequences. Conclusion: MSACompro is an efficient and reliable multiple protein sequence alignment tool that can effectively incorporate predicted protein structural information into multiple sequence alignment. The software is available at http://sysbio.rnet.missouri.edu/multicom_toolbox/.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central