
    Improved genetic algorithm for multiple sequence alignment using segment profiles (GASP)

    This paper presents a novel genetic algorithm (GA) for multiple sequence alignment in protein analysis. The most significant improvement afforded by this algorithm results from its use of segment profiles to generate a diversified initial population and to prevent the destruction of conserved regions by crossover and mutation operations. Segment profiles contain rich local information, thereby speeding up convergence. Secondly, it introduces the use of the norMD function in a genetic algorithm to measure multiple alignment quality. Finally, as an approach to the premature convergence problem, an improved progressive method is used to optimize the highest-scoring individual of each new generation. The new algorithm is compared with the ClustalX and T-Coffee programs on several data cases from the BAliBASE benchmark alignment database. The experimental results show that it can yield better performance on data sets with long sequences, regardless of similarity.

    REDHORSE-REcombination and Double crossover detection in Haploid Organisms using next-geneRation SEquencing data

    BACKGROUND: Next-generation sequencing technology provides a means to study genetic exchange at a higher resolution than was possible using earlier technologies. However, this improvement presents challenges, as the alignments of next-generation sequence data to a reference genome cannot be directly used as input to existing detection algorithms, which instead typically use multiple sequence alignments as input. We therefore designed a software suite called REDHORSE that uses genomic alignments, extracts genetic markers, and generates multiple sequence alignments that can be used as input to existing recombination detection algorithms. In addition, REDHORSE implements a custom recombination detection algorithm that makes use of sequence information and genomic positions to accurately detect crossovers. REDHORSE is a portable and platform-independent suite that provides efficient analysis of genetic crosses based on next-generation sequencing data. RESULTS: We demonstrated the utility of REDHORSE using simulated data and real next-generation sequencing data. The simulated dataset mimicked recombination between two known haploid parental strains and allowed comparison of detected break points against known true break points to assess the performance of recombination detection algorithms. A newly generated NGS dataset from a genetic cross of Toxoplasma gondii allowed us to demonstrate our pipeline. REDHORSE successfully extracted the relevant genetic markers and was able to transform the read alignments from NGS to the genome to generate multiple sequence alignments. The recombination detection algorithm in REDHORSE was able to detect conventional crossovers and double crossovers typically associated with gene conversions, whilst filtering out artifacts that might have been introduced during sequencing or alignment. REDHORSE outperformed other commonly used recombination detection algorithms in finding conventional crossovers. In addition, REDHORSE was the only algorithm that was able to detect double crossovers. CONCLUSION: REDHORSE is an efficient analytical pipeline that serves as a bridge between genomic alignments and existing recombination detection algorithms. Moreover, REDHORSE is equipped with a recombination detection algorithm specifically designed for next-generation sequencing data. REDHORSE is a portable, platform-independent, Java-based utility that provides efficient analysis of genetic crosses based on next-generation sequencing data. REDHORSE is available at http://redhorse.sourceforge.net/. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12864-015-1309-7) contains supplementary material, which is available to authorized users.

    Aligning Multiple Sequences with Genetic Algorithm

    The alignment of biological sequences is a crucial tool in molecular biology and genome analysis. It helps to build a phylogenetic tree of related DNA sequences and also to predict the function and structure of unknown protein sequences by aligning them with other sequences whose function and structure are already known. However, finding an optimal multiple sequence alignment requires time and space that grow exponentially as the length or number of sequences increases. Genetic Algorithms (GAs) are random search strategies that optimize an objective function, here a measure of alignment quality (distance), and are capable of both exploratory search through the solution space and exploitation of current results.

    Detection of recombination in DNA multiple alignments with hidden markov models

    Conventional phylogenetic tree estimation methods assume that all sites in a DNA multiple alignment have the same evolutionary history. This assumption is violated in data sets from certain bacteria and viruses due to recombination, a process that leads to the creation of mosaic sequences from different strains and, if undetected, causes systematic errors in phylogenetic tree estimation. In the current work, a hidden Markov model (HMM) is employed to detect recombination events in multiple alignments of DNA sequences. The emission probabilities in a given state are determined by the branching order (topology) and the branch lengths of the respective phylogenetic tree, while the transition probabilities depend on the global recombination probability. The present study improves on an earlier heuristic parameter optimization scheme and shows how the branch lengths and the recombination probability can be optimized in a maximum likelihood sense by applying the expectation maximization (EM) algorithm. The novel algorithm is tested on a synthetic benchmark problem and is found to clearly outperform the earlier heuristic approach. The paper concludes with an application of this scheme to a DNA sequence alignment of the argF gene from four Neisseria strains, where a likely recombination event is clearly detected.

    Higher accuracy protein Multiple Sequence Alignment by Stochastic Algorithm

    Multiple Sequence Alignment gives insight into evolutionary, structural and functional relationships among proteins. Here, a novel Protein Alignment by Stochastic Algorithm (PASA) is developed. Evolutionary operators of a genetic algorithm, namely mutation and selection, are utilized in combining the output of two most important sequence alignment programs and then developing an optimized new algorithm. The accuracy of protein alignments is evaluated in terms of the Total Column score, which is equal to the number of correctly aligned columns between a test alignment and the reference alignment divided by the total number of columns in the reference alignment. The PASA optimizer achieves, on average, significantly better alignments than the well-known individual bioinformatics tools. According to the authors, PASA is statistically the most accurate protein alignment method today. It may have applications in drug discovery processes in the biotechnology industry.
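
The Total Column score described in this abstract can be illustrated with a minimal Python sketch. It is not taken from the PASA paper: the function names, the gap character `-`, and the rule that a reference column counts as correctly aligned only when the test alignment contains a column pairing exactly the same residue positions are all assumptions made for illustration.

```python
def alignment_columns(alignment, gap="-"):
    """Convert an alignment (list of equal-length strings) into columns.

    Each column is a tuple holding, per sequence, the index of the residue
    placed in that column, or None where the sequence has a gap.
    """
    counters = [0] * len(alignment)  # next residue index for each sequence
    columns = []
    for j in range(len(alignment[0])):
        column = []
        for i, row in enumerate(alignment):
            if row[j] == gap:
                column.append(None)
            else:
                column.append(counters[i])
                counters[i] += 1
        columns.append(tuple(column))
    return columns


def total_column_score(test_alignment, reference_alignment, gap="-"):
    """Fraction of reference columns reproduced exactly in the test alignment.

    Assumption: a column is "correctly aligned" only if every sequence
    contributes the same residue (or a gap) as in the reference column.
    """
    test_cols = set(alignment_columns(test_alignment, gap))
    ref_cols = alignment_columns(reference_alignment, gap)
    correct = sum(1 for col in ref_cols if col in test_cols)
    return correct / len(ref_cols)


# Hypothetical toy alignments of the same two sequences (ACGT and ACAGT).
reference = ["AC-GT", "ACAGT"]
test = ["ACG-T", "ACAGT"]
score = total_column_score(test, reference)  # 3 of 5 reference columns match
```

In this toy case the first two columns and the last column agree, so the score is 3/5. Benchmark suites may restrict the count to designated core regions, which this sketch does not attempt.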

    Establishing the precise evolutionary history of a gene improves prediction of disease-causing missense mutations

    PURPOSE: Predicting the phenotypic effects of mutations has become an important application in clinical genetic diagnostics. Computational tools evaluate the behavior of the variant over evolutionary time and assume that variations seen during the course of evolution are probably benign in humans. However, current tools do not take into account orthologous/paralogous relationships. Paralogs have dramatically different roles in Mendelian diseases. For example, whereas inactivating mutations in the NPC1 gene cause the neurodegenerative disorder Niemann-Pick C, inactivating mutations in its paralog NPC1L1 are not disease-causing and, moreover, are implicated in protection from coronary heart disease. METHODS: We identified major events in NPC1 evolution and revealed and compared orthologs and paralogs of the human NPC1 gene through phylogenetic and protein sequence analyses. We predicted whether an amino acid substitution affects protein function by reducing the organism’s fitness. RESULTS: Removing the paralogs and distant homologs improved the overall performance of categorizing disease-causing and benign amino acid substitutions. CONCLUSION: The results show that a thorough evolutionary analysis followed by identification of orthologs improves the accuracy in predicting disease-causing missense mutations. We anticipate that this approach will be used as a reference in the interpretation of variants in other genetic diseases as well. Genet Med 18 10, 1029–1036

    Genetic Sequence Matching Using D4M Big Data Approaches

    Recent technological advances in Next Generation Sequencing tools have led to increasing speeds of DNA sample collection, preparation, and sequencing. One instrument can produce over 600 Gb of genetic sequence data in a single run. This creates new opportunities to efficiently handle the increasing workload. We propose a new method of fast genetic sequence analysis using the Dynamic Distributed Dimensional Data Model (D4M), an associative array environment for MATLAB developed at MIT Lincoln Laboratory. Based on mathematical and statistical properties, the method leverages big data techniques and the implementation of an Apache Accumulo database to accelerate computations one hundredfold over other methods. Comparisons of the D4M method with the current gold standard for sequence analysis, BLAST, show the two are comparable in the alignments they find. This paper will present an overview of the D4M genetic sequence algorithm and statistical comparisons with BLAST. Comment: 6 pages; to appear in IEEE High Performance Extreme Computing (HPEC) 201