118,600 research outputs found

    Detection of recombination in DNA multiple alignments with hidden markov models

    Conventional phylogenetic tree estimation methods assume that all sites in a DNA multiple alignment have the same evolutionary history. This assumption is violated in data sets from certain bacteria and viruses due to recombination, a process that leads to the creation of mosaic sequences from different strains and, if undetected, causes systematic errors in phylogenetic tree estimation. In the current work, a hidden Markov model (HMM) is employed to detect recombination events in multiple alignments of DNA sequences. The emission probabilities in a given state are determined by the branching order (topology) and the branch lengths of the respective phylogenetic tree, while the transition probabilities depend on the global recombination probability. The present study improves on an earlier heuristic parameter optimization scheme and shows how the branch lengths and the recombination probability can be optimized in a maximum likelihood sense by applying the expectation maximization (EM) algorithm. The novel algorithm is tested on a synthetic benchmark problem and is found to clearly outperform the earlier heuristic approach. The paper concludes with an application of this scheme to a DNA sequence alignment of the argF gene from four Neisseria strains, where a likely recombination event is clearly detected.
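    As a rough illustration of the model structure described in this abstract, the sketch below decodes the most probable sequence of tree topologies along an alignment with the Viterbi algorithm. It is not the authors' implementation: the function name `viterbi_topologies`, the uniform initial distribution, and the even split of the recombination probability across the other topologies are assumptions made here, and the per-site log-likelihoods are hypothetical inputs (in the paper they would follow from each tree's topology and branch lengths). The EM optimization of branch lengths and recombination probability is not covered.

```python
import math


def viterbi_topologies(site_loglik, recomb_prob):
    """Return the most probable topology index for every alignment column.

    site_loglik[t][k] is the log-likelihood of column t under topology k
    (assumed to be precomputed elsewhere). recomb_prob is the probability
    of switching topology between adjacent columns; it must lie strictly
    between 0 and 1, and at least two topologies are required.
    """
    n_states = len(site_loglik[0])
    log_stay = math.log(1.0 - recomb_prob)
    # Assumption: a switch is equally likely to land on any other topology.
    log_switch = math.log(recomb_prob / (n_states - 1))

    # Assumption: uniform initial distribution over topologies.
    score = [math.log(1.0 / n_states) + ll for ll in site_loglik[0]]
    backpointers = []

    for column in site_loglik[1:]:
        new_score, pointers = [], []
        for k in range(n_states):
            # Best predecessor state for ending in topology k at this column.
            candidates = [
                score[j] + (log_stay if j == k else log_switch)
                for j in range(n_states)
            ]
            best_prev = max(range(n_states), key=lambda j: candidates[j])
            new_score.append(candidates[best_prev] + column[k])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)

    # Trace the best path backwards from the best final state.
    state = max(range(n_states), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical toy input: three topologies (four taxa admit three unrooted
# trees); the first five columns favour topology 0, the last five topology 1.
toy_loglik = [[-1.0, -3.0, -3.0]] * 5 + [[-3.0, -1.0, -3.0]] * 5
toy_path = viterbi_topologies(toy_loglik, recomb_prob=0.1)
```

    A change of state along the decoded path marks a candidate recombination breakpoint. A small recombination probability makes switches expensive, so a single column that weakly favours another topology does not by itself trigger a breakpoint.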

    Efficient Two-Level Swarm Intelligence Approach for Multiple Sequence Alignment

    This paper proposes two-level particle swarm optimization (TL-PSO), an efficient PSO variant that addresses two levels of an optimization problem. Level one optimizes the dimension for the entire swarm, whereas level two optimizes each particle's position. The problem addressed here, multiple sequence alignment (MSA), is among the most challenging. TL-PSO deals with the arduous task of determining the exact sequence length together with the most suitable gap positions in MSA. The two levels considered here are: obtaining the optimal sequence length in level one, and attaining the optimum gap positions for maximal alignment score in level two. The performance of TL-PSO has been assessed through a comparative study on two kinds of benchmark dataset, DNA and RNA. The efficiency of the proposed approach is evaluated with four popular scoring schemes at specific parameters. TL-PSO alignments are compared with four PSO variants, i.e. S-PSO, M-PSO, ED-MPSO and CPSO-Sk, and two leading alignment programs, i.e. ClustalW and T-Coffee, at different alignment scores. The results obtained demonstrate the competence of TL-PSO in terms of accuracy and indicate which scoring scheme performs better.

    ASPIC: a novel method to predict the exon-intron structure of a gene that is optimally compatible to a set of transcript sequences

    BACKGROUND: Currently available methods to predict splice sites are mainly based on the independent and progressive alignment of transcript data (mostly ESTs) to the genomic sequence. Apart from often being computationally expensive, this approach is vulnerable to several problems, hence the need to develop novel strategies. RESULTS: We propose a method, based on a novel multiple genome-EST alignment algorithm, for the detection of splice sites. To avoid the limitations of splice site prediction (mainly over-prediction) that arise from independent single EST alignments to the genomic sequence, our approach performs a multiple alignment of transcript data to the genomic sequence based on the combined analysis of all available data. We recast the problem of predicting constitutive and alternative splicing as an optimization problem, where the optimal multiple transcript alignment minimizes the number of exons and hence of splice site observations. We have implemented a splice site predictor based on this algorithm in the software tool ASPIC (Alternative Splicing PredICtion). It is distinguished from other methods based on BLAST-like tools by the incorporation of entirely new ad hoc procedures for accurate and computationally efficient transcript alignment, and it adopts dynamic programming for the refinement of intron boundaries. ASPIC also provides the minimal set of non-mergeable transcript isoforms compatible with the detected splicing events. The ASPIC web resource is dynamically interconnected with the Ensembl and Unigene databases and also implements an upload facility. CONCLUSION: Extensive benchmarking shows that ASPIC outperforms other existing methods in the detection of novel splicing isoforms and in the minimization of over-predictions. ASPIC also requires a lower computation time for processing a single gene and an EST cluster. The ASPIC web resource is available at

    Solving multiple sequence alignment problems by using a swarm intelligent optimization based approach

    In this article, the alignment of multiple sequences is examined using an improved particle swarm optimization (PSO) based on swarm intelligence. PSO has recently emerged as a randomized heuristic technique for solving discrete optimization problems and realistic estimation. The PSO approach is a nature-inspired technique based on intelligence and swarm movement. Each solution is encoded as a “chromosome”, as in the genetic algorithm (GA). Based on the optimization of the objective function, the fitness function is designed to maximize the suitable components of the sequence and reduce the unsuitable components of the sequence. A public benchmark data set, BAliBASE, is used to assess the performance of the proposed system, with the potential for PSO to reveal problems in adapting to better performance. The proposed system is compared with a few existing approaches such as deoxyribonucleic acid (DNA) or ribonucleic acid (RNA) alignment (DIALIGN), PILEUP8, hidden Markov model training (HMMT), rubber band technique-genetic algorithm (RBT-GA) and ML-PIMA. In many cases, the proposed system yields better experimental results than the other existing approaches.

    Gapped alignment of protein sequence motifs through Monte Carlo optimization of a hidden Markov model

    BACKGROUND: Certain protein families are highly conserved across distantly related organisms and belong to large and functionally diverse superfamilies. The patterns of conservation present in these protein sequences presumably are due to selective constraints maintaining important but unknown structural mechanisms with some constraints specific to each family and others shared by a larger subset or by the entire superfamily. To exploit these patterns as a source of functional information, we recently devised a statistically based approach called contrast hierarchical alignment and interaction network (CHAIN) analysis, which infers the strengths of various categories of selective constraints from co-conserved patterns in a multiple alignment. The power of this approach strongly depends on the quality of the multiple alignments, which thus motivated development of theoretical concepts and strategies to improve alignment of conserved motifs within large sets of distantly related sequences. RESULTS: Here we describe a hidden Markov model (HMM), an algebraic system, and Markov chain Monte Carlo (MCMC) sampling strategies for alignment of multiple sequence motifs. The MCMC sampling strategies are useful both for alignment optimization and for adjusting position specific background amino acid frequencies for alignment uncertainties. Associated statistical formulations provide an objective measure of alignment quality as well as automatic gap penalty optimization. Improved alignments obtained in this way are compared with PSI-BLAST based alignments within the context of CHAIN analysis of three protein families: G(iα) subunits, prolyl oligopeptidases, and transitional endoplasmic reticulum (p97) AAA+ ATPases. CONCLUSION: While not entirely replacing PSI-BLAST based alignments, which likewise may be optimized for CHAIN analysis using this approach, these motif-based methods often more accurately align very distantly related sequences and thus can provide a better measure of selective constraints. In some instances, these new approaches also provide a better understanding of family-specific constraints, as we illustrate for p97 ATPases. Programs implementing these procedures and supplementary information are available from the authors.

    A Multi-Objective Approach to the Optimization of Multiple Sequence Alignment (MSA)

    Multiple Sequence Alignment (MSA), one of the main topics in the bioinformatics domain, consists in finding an optimal alignment for three or more biological sequences with the maximum number of conserved zones or totally aligned columns. Different scores to assess the quality of the alignments have been proposed, so the problem can be formulated and solved as a Multi-Objective Optimization Problem (MOP). For this reason we have carried out a performance study solving the MSA problem under a multi-objective approach, considering two popular metrics as objectives to be optimized: the weighted Sum-Of-Pairs with affine gap penalties (wSOP) and the Totally Aligned Columns (TC), with three state-of-the-art Multi-Objective Optimization algorithms: NSGA-II, SPEA2 and MOCell. Our experiments reveal that the classic metaheuristic NSGA-II provides the best overall performance in solving some problems provided by the benchmark BAliBASE (v3.0), under a multi-objective and biological approach.
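    To make the two objectives concrete, here is a simplified sketch of how they might be computed for a candidate alignment. It is an illustration under stated assumptions, not the scoring used in the study: the sum-of-pairs below is unweighted, uses a flat per-pair gap penalty rather than affine penalties, and the match, mismatch and gap values are arbitrary; the column count treats a column as totally aligned when it holds one identical residue and no gaps. The function names are hypothetical.

```python
from itertools import combinations


def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Simplified sum-of-pairs score over all columns of an alignment.

    alignment is a list of equal-length strings using '-' for gaps.
    Assumptions: no sequence weights, flat gap penalty (not affine),
    and gap-gap pairs contribute nothing.
    """
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue
            if a == "-" or b == "-":
                total += gap
            elif a == b:
                total += match
            else:
                total += mismatch
    return total


def totally_aligned_columns(alignment):
    """Count columns that contain a single identical residue and no gaps."""
    return sum(
        1
        for column in zip(*alignment)
        if "-" not in column and len(set(column)) == 1
    )


# Hypothetical toy alignment of three sequences.
toy_alignment = ["ACG-T", "ACGAT", "A-GAT"]
objectives = (sum_of_pairs(toy_alignment), totally_aligned_columns(toy_alignment))
```

    A multi-objective optimizer such as NSGA-II would treat the resulting pair as one point in objective space and keep the alignments that no other candidate beats on both scores at once.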

    Cooperative Metaheuristics for Exploring Proteomic Data

    Most combinatorial optimization problems cannot be solved exactly. A class of methods, called metaheuristics, has proved its efficiency to give good approximated solutions in a reasonable time. Cooperative metaheuristics are a sub-set of metaheuristics, which implies a parallel exploration of the search space by several entities with information exchange between them. The importance of information exchange in the optimization process is related to the building block hypothesis of evolutionary algorithms, which is based on these two questions: what is the pertinent information of a given potential solution and how this information can be shared? A classification of cooperative metaheuristics methods depending on the nature of cooperation involved is presented and the specific properties of each class, as well as a way to combine them, is discussed. Several improvements in the field of metaheuristics are also given. In particular, a method to regulate the use of classical genetic operators and to define new more pertinent ones is proposed, taking advantage of a building block structured representation of the explored space. A hierarchical approach resting on multiple levels of cooperative metaheuristics is finally presented, leading to the definition of a complete concerted cooperation strategy. Some applications of these concepts to difficult proteomics problems, including automatic protein identification, biological motif inference and multiple sequence alignment are presented. For each application, an innovative method based on the cooperation concept is given and compared with classical approaches. In the protein identification problem, a first level of cooperation using swarm intelligence is applied to the comparison of mass spectrometric data with biological sequence database, followed by a genetic programming method to discover an optimal scoring function. The multiple sequence alignment problem is decomposed in three steps involving several evolutionary processes to infer different kind of biological motifs and a concerted cooperation strategy to build the sequence alignment according to their motif content.

    Using motif databases to help improve multiple sequence alignment

    Current progress in genome research projects has generated a huge amount of data. As a result, the analysis of these data is now a bottleneck in bioinformatics. Multiple sequence alignment is an important step in this kind of analysis. It compares unknown sequences with well-studied ones, and thus infers functional and structural information about the unknown sequences. However, because multiple sequence alignment is NP-complete, exhaustive search is unrealistic. Current algorithms use heuristic approaches to obtain a result close to the global optimum. As a consequence, any specific program may encounter certain cases that it does not handle well. In this work, we use protein motif databases to improve the alignment. The basic idea is to detect possible occurrences of motifs in the sequences, and force those parts to be aligned together. Unlike existing programs, this method uses biological information instead of treating alignment purely as an optimization problem. It also reduces the search space. Experiments show that using motif databases can generate good results.

    Aligning Multiple Sequences with Genetic Algorithm

    The alignment of biological sequences is a crucial tool in molecular biology and genome analysis. It helps to build a phylogenetic tree of related DNA sequences and also to predict the function and structure of unknown protein sequences by aligning them with other sequences whose function and structure are already known. However, finding an optimal multiple sequence alignment takes time and space that grow exponentially as the length or number of sequences increases. Genetic Algorithms (GAs) are random search strategies that optimize an objective function, here a measure of alignment quality (distance), and are capable of both exploratory search through the solution space and exploitation of current results.