127 research outputs found

    ALIGNMENT-FREE METHODS AND ITS APPLICATIONS

    Comparing biological sequences remains one of the most vital activities in bioinformatics. Sequence comparison addresses the relatedness between species and finds similar structures that may point to similar functions. Sequence alignment is the default method and has been used in the domain for over four decades. It has earned a great deal of trust, but limitations and even failures have been reported, especially with newly generated genomes. These genomes are larger and, to some extent, suffer from errors, which arise mainly from the sequencing machines. Such errors should be considered when sequences are submitted to GenBank, yet during sequence comparison they are often hard to address or even to trace. Alignment-based methods can fail on such data, and although biologists still trust them, these failures have been documented. The poor results of alignment-based methods on error-prone sequences motivated researchers in the domain to look for alternatives: alignment-free methods, which aim to overcome the shortcomings of alignment-based methods. This thesis is based on alignment-free methods; it conducts an in-depth study to evaluate them and to find the right application domain for them, namely by applying them to data subjected to manufactured errors and testing whether they provide better comparison results on data with naturally severe errors. The two techniques used in this work are compression-based and motif-based (also called k-mer-based or signal-based). We also address the selection of the motifs used in the second technique, and how to improve the results by selecting specific motifs that enhance the quality of the comparison. In addition, we apply an alignment-free method to a different problem, gene prediction, using it to speed up the delivery of high-quality results and to predict accurate stretches of a DNA sequence that can be considered parts of genes.
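
    To make the two techniques concrete, here is a minimal sketch of both: a k-mer (motif) frequency distance and a compression-based distance (the normalized compression distance, NCD, computed here with zlib). This is an illustrative reconstruction under our own naming, not the thesis's actual implementation.

```python
# Minimal sketches of the two alignment-free techniques discussed above:
# a k-mer (motif) frequency distance and a compression-based distance
# (normalized compression distance, NCD). Illustrative only.
import zlib
from collections import Counter
from math import sqrt

def kmer_profile(seq: str, k: int = 4) -> Counter:
    """Count all overlapping k-mers of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_distance(a: str, b: str, k: int = 4) -> float:
    """Euclidean distance between the normalized k-mer frequency vectors."""
    pa, pb = kmer_profile(a, k), kmer_profile(b, k)
    na, nb = sum(pa.values()), sum(pb.values())
    return sqrt(sum((pa[m] / na - pb[m] / nb) ** 2 for m in set(pa) | set(pb)))

def ncd(a: str, b: str) -> float:
    """Normalized compression distance: near 0 for closely related sequences."""
    ca = len(zlib.compress(a.encode()))
    cb = len(zlib.compress(b.encode()))
    cab = len(zlib.compress((a + b).encode()))
    return (cab - min(ca, cb)) / max(ca, cb)

print(kmer_distance("ACGTACGTAC", "ACGTACGAAC"))
print(ncd("ACGTACGTAC" * 10, "ACGTACGAAC" * 10))
```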

    Algorithms for Efficient Alignment-Free Sequence Comparison

    Sequence comparison is an essential tool in modern biology. It is used to identify homologous regions between sequences, and to detect evolutionary relationships between organisms. Sequence comparison is usually based on alignments. However, aligning whole genomes is computationally difficult. As an alternative approach, alignment-free sequence comparison can be used. In my thesis, I concentrate on two problems that can be solved without alignment: (i) estimation of substitution rates between nucleotide sequences, and (ii) detection of local sequence homology. In the first part of my thesis, I developed and implemented a new algorithm for the efficient alignment-free computation of the number of nucleotide substitutions per site, and applied it to the analysis of large data sets of complete genomes. In the second part of my thesis, I developed and implemented a new algorithm for detecting matching regions between nucleotide sequences. I applied this solution to the classification of circulating recombinant forms of HIV, and to the analysis of bacterial genomes subject to horizontal gene transfer.

    Table of Contents
    1. GENERAL INTRODUCTION
       1.1. Suffix trees and other index data structures used in biological sequence analysis
          1.1.1. Suffix Tree
          1.1.2. The space and the time complexity of the algorithms for the suffix tree construction
          1.1.3. Suffix Array
          1.1.4. The space and the time complexity of the algorithms for suffix array construction
          1.1.5. Enhanced Suffix Array
          1.1.6. The 64-bit implementation of the lightweight suffix array construction algorithm
          1.1.7. Self-index
          1.1.8. Burrows-Wheeler transform
          1.1.9. The FM-Index and the backward search algorithm
          1.1.10. The space and the time complexity of the FM-index
    2. EFFICIENT ESTIMATION OF PAIRWISE DISTANCES BETWEEN GENOMES
       2.1. Introduction
       2.2. Methods
          2.2.1. Definition of an alignment-free estimator of the rate of substitution, Kr
          2.2.2. Problem statement
          2.2.3. Time complexity analysis of the previous approach (kr 1)
          2.2.4. Time complexity analysis of the new approach (kr 2)
          2.2.5. Algorithm 1: Computation of all Kr values during the traversal of a generalized suffix tree of n sequences
          2.2.6. The implementation of kr version 2
       2.3. Analysis of Kr on simulated data sets
          2.3.1. Auxiliary programs
          2.3.2. Consistency of Kr
          2.3.3. The effect of horizontal gene transfer on the accuracy of Kr
          2.3.4. The effect of genome duplication on the accuracy of Kr
          2.3.5. Run time comparison of kr 1 and kr 2
       2.4. Application of kr version 2
          2.4.1. Auxiliary software used for the analysis of real data sets
          2.4.2. The analysis of 12 Drosophila genomes
          2.4.3. The analysis of 13 Escherichia coli and Shigella genomes
          2.4.4. The analysis of 825 HIV-1 pure subtype genomes
       2.5. Discussion
    3. EFFICIENT ALIGNMENT-FREE DETECTION OF LOCAL SEQUENCE HOMOLOGY
       3.1. Introduction
       3.2. Methods
          3.2.1. Problem statement – determining subtype(s) of a query sequence
          3.2.2. Construction of locally homologous segments
          3.2.3. Time complexity of computing a list of intervals Ii
          3.2.4. Algorithm 2: Construction of an interval tree
          3.2.5. Computing a list of segments Gi
       3.3. Analysis of st on simulated data sets
          3.3.1. Run-time and memory usage analysis of st
          3.3.2. Consistency of st
          3.3.3. Comparison to SCUEAL on simulated data sets
       3.4. Application of st
          3.4.1. The analysis of Neisseria meningitidis
          3.4.2. The analysis of a recombinant form of HIV-1
          3.4.3. The analysis of circulating recombinant forms of HIV-1
          3.4.4. The analysis of an avian pathogenic Escherichia coli strain
       3.5. Discussion
    4. CONCLUSION
    5. REFERENCES
    6. ELECTRONIC SOURCES
    7. LIST OF ABBREVIATIONS AND SYMBOLS
    ABSTRACT
    SAĆœETAK
    CURRICULUM VITAE
    ĆœIVOTOPIS
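
    Both algorithms in this thesis rest on suffix-based index structures. The toy sketch below shows the core primitive in its simplest form: a naively built suffix array plus a binary-search lookup of the longest prefix of a query occurring in the indexed text. The thesis uses enhanced suffix arrays with far more efficient construction; all names here are ours.

```python
# Toy version of the index primitive underlying both algorithms: a naively
# built suffix array and a lookup of the longest prefix of a query that
# occurs in the text. Naive construction is O(n^2 log n); the enhanced
# suffix arrays described above are built far more efficiently.
def suffix_array(text: str) -> list[int]:
    """Start positions of all suffixes of text, in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def longest_match(text: str, sa: list[int], query: str) -> int:
    """Length of the longest prefix of query occurring anywhere in text."""
    lo, hi = 0, len(sa)
    while lo < hi:                      # binary search for query's rank
        mid = (lo + hi) // 2
        if text[sa[mid]:] < query:
            lo = mid + 1
        else:
            hi = mid
    best = 0
    for i in (lo - 1, lo):              # answer neighbors the insertion point
        if 0 <= i < len(sa):
            suffix = text[sa[i]:]
            k = 0
            while k < min(len(suffix), len(query)) and suffix[k] == query[k]:
                k += 1
            best = max(best, k)
    return best

genome = "ACGTACGTGACC"
sa = suffix_array(genome)
print(longest_match(genome, sa, "CGTGAT"))  # -> 5 ("CGTGA")
```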

    A Lossy Compression Technique Enabling Duplication-Aware Sequence Alignment

    In spite of the recognized importance of tandem duplications in genome evolution, commonly adopted sequence comparison algorithms do not take into account complex mutation events involving more than one residue at a time, since such events are not compliant with the underlying assumption of statistical independence of adjacent residues. As a consequence, the presence of tandem repeats in sequences under comparison may impair the biological significance of the resulting alignment. Although solutions have been proposed, repeat-aware sequence alignment is still considered an open problem, and new efficient and effective methods have been advocated. The present paper describes an alternative lossy compression scheme for genomic sequences which iteratively collapses repeats of increasing length. The resulting approximate representations contain no tandem duplications, while retaining enough information to make their comparison even more significant than the edit distance between the original sequences. This allows us to apply traditional alignment algorithms directly to the compressed sequences. Results confirm the validity of the proposed approach for the problem of duplication-aware sequence alignment.
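
    As a rough illustration of the collapsing idea (not the paper's exact scheme), the sketch below repeatedly removes tandem duplications, two adjacent copies of the same block, for block lengths 1, 2, 3, and so on; the collapsed strings can then be compared with ordinary alignment or edit-distance routines.

```python
# Rough sketch of the collapsing idea: for block lengths 1, 2, 3, ...,
# repeatedly replace two adjacent copies of the same block by a single copy,
# until no tandem duplication of that length remains. Illustrative only.
def collapse_tandem_repeats(seq: str, max_unit: int = 8) -> str:
    for unit in range(1, max_unit + 1):
        changed = True
        while changed:              # repeat until no tandem of this unit is left
            changed = False
            out, i = [], 0
            while i < len(seq):
                block = seq[i:i + unit]
                if len(block) == unit and block == seq[i + unit:i + 2 * unit]:
                    i += unit       # skip one copy of the duplicated block
                    changed = True
                else:
                    out.append(seq[i])
                    i += 1
            seq = "".join(out)
    return seq

print(collapse_tandem_repeats("GATTACATTACATTACAGATT"))
```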

    CAD Tools for DNA Micro-Array Design, Manufacture and Application

    Motivation: As the Human Genome Project progressed and several microbial and eukaryotic genomes were completed, numerous biotechnological processes have attracted an increasing number of biologists, bioengineers, and computer scientists. These processes deeply involve the production and analysis of high-throughput experimental data. Numerous sequence libraries of DNA and protein structures for a large number of micro-organisms, and a variety of other databases related to biology and chemistry, are available. For example, microarray technology, a novel biotechnology, promises to monitor the whole genome at once, so that researchers can study the whole genome at the global level and obtain a better picture of the expression of millions of genes simultaneously. Today it is widely used in many fields: disease diagnosis, gene classification, gene regulatory networks, and drug discovery. Designing an organism-specific microarray and analyzing the experimental data require combining heterogeneous computational tools that usually differ in data format, such as GeneMark for ORF extraction, Promide for DNA probe selection, Chip for probe placement on the microarray chip, BLAST to compare sequences, MEGA for phylogenetic analysis, and ClustalX for multiple alignments. Solution: Surprisingly, despite the huge research effort invested in DNA array applications, very few works are devoted to computer-aided optimization of DNA array design and manufacturing. Current design practices are dominated by ad-hoc heuristics incorporated in proprietary tools with unknown suboptimality. This will soon become a bottleneck for the new generation of high-density arrays, such as the ones currently being designed at Perlegen [109]. The goal of the research accomplished so far was to develop highly scalable tools, with predictable runtime and quality, for cost-effective, computer-aided design and manufacture of DNA probe arrays. We illustrate the utility of our approach with a concrete example: combining the design tools of microarray technology for herpes B virus DNA data.

    A list of parameterized problems in bioinformatics

    In this report we present a list of problems that originated in bioinformatics. Our aim is to collect information on such problems that have been analyzed from the point of view of parameterized complexity. For every problem we give its definition and biological motivation, together with known complexity results.

    High Performance Computing for DNA Sequence Alignment and Assembly

    Recent advances in DNA sequencing technology have dramatically increased the scale and scope of DNA sequencing. These data are used for a wide variety of important biological analyses, including genome sequencing, comparative genomics, transcriptome analysis, and personalized medicine, but such analyses are complicated by the volume and complexity of the data involved. Given the massive size of these datasets, computational biology must draw on advances in high-performance computing. Two fundamental computations in computational biology are read alignment and genome assembly. Read alignment maps short DNA sequences to a reference genome to discover conserved and polymorphic regions of the genome. Genome assembly computes the sequence of a genome from many short DNA sequences. Both computations benefit from recent advances in high-performance computing to efficiently process the huge datasets involved, including the use of highly parallel graphics processing units (GPUs) as high-performance desktop processors, and the MapReduce framework coupled with cloud computing to parallelize computation across large compute grids. This dissertation demonstrates how these technologies can be used to accelerate these computations by orders of magnitude, and how they have the potential to make otherwise infeasible computations practical.
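
    For a sense of what read alignment involves at its core, here is a minimal seed-and-extend sketch: index every k-mer of the reference, anchor each read at exact seed hits, and count mismatches at the implied position. Real aligners add banded dynamic programming, base qualities, and the GPU/MapReduce parallelism described above; all names here are ours.

```python
# Minimal seed-and-extend sketch of short-read alignment. Illustrative only.
from collections import defaultdict

def build_index(reference: str, k: int = 12) -> dict:
    """Map each k-mer of the reference to its list of start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def align_read(read: str, reference: str, index: dict, k: int = 12,
               max_mismatches: int = 2):
    """Best (position, mismatches) for the read, or None if nothing is good enough."""
    best = None
    for offset in range(0, len(read) - k + 1, k):      # non-overlapping seeds
        for pos in index.get(read[offset:offset + k], ()):
            start = pos - offset                       # read's implied start
            if start < 0 or start + len(read) > len(reference):
                continue
            mm = sum(a != b for a, b in zip(read, reference[start:start + len(read)]))
            if mm <= max_mismatches and (best is None or mm < best[1]):
                best = (start, mm)
    return best

ref = "ACGTACGTGACCTTAGGCATCGATTACA"
idx = build_index(ref, k=6)
print(align_read("GACCTTAGGCTTCG", ref, idx, k=6))  # -> (8, 1): one mismatch
```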

    Oligonucleotide Design for Whole Genome Tiling Arrays

    Oligonucleotides are short, single-stranded fragments of DNA or RNA, designed to readily bind to a unique part of a target sequence. They have many important applications, including PCR (polymerase chain reaction) amplification, microarrays, and FISH (fluorescence in situ hybridization) probes. While traditional microarrays are commonly used to measure gene expression levels by probing for sequences of known and predicted genes, high-density whole-genome tiling arrays probe intensively for sequences known to exist in a contiguous region. Current programs for designing oligonucleotides for tiling arrays are unable to produce results close to optimal, since they allow oligonucleotides that are too similar to non-targets, thus enabling unwanted cross-hybridization. We present a new program, BOND-tile, that produces much better tiling arrays, as shown by extensive comparison with leading programs.
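
    The cross-hybridization criterion can be illustrated with a deliberately crude filter: slide a window along the target and keep only candidate oligos that share no long exact k-mer with any non-target sequence. This is only a stand-in for the stricter similarity checks argued for above, not BOND-tile's actual method.

```python
# Crude illustration of a cross-hybridization filter for tiling probes.
# A stand-in for stricter similarity/thermodynamic checks; names are ours.
def select_tiling_probes(target: str, non_targets: list, probe_len: int = 50,
                         step: int = 10, k: int = 16) -> list:
    # every k-mer occurring anywhere in the non-target pool
    bad_kmers = {nt[i:i + k] for nt in non_targets
                 for i in range(len(nt) - k + 1)}
    probes = []
    for start in range(0, len(target) - probe_len + 1, step):
        probe = target[start:start + probe_len]
        # reject the probe if any of its k-mers also occurs in a non-target
        if all(probe[i:i + k] not in bad_kmers
               for i in range(len(probe) - k + 1)):
            probes.append((start, probe))
    return probes
```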

    Efficient estimation of evolutionary distances

    The advent of high-throughput sequencers has led to a dramatic increase in the size of available genomic data. Standard methods, which have worked well for many years, are not suitable for the analysis of big data sets, due to their reliance on a time-consuming alignment step. In this thesis, a new alignment-free approach for phylogeny reconstruction is introduced. The corresponding program, andi, is orders of magnitude faster than classical approaches and also superior to comparable alignment-free methods. The central data structure in andi is the enhanced suffix array, which is used to find long exact matches between sequences. In this thesis, various approaches to the construction of enhanced suffix arrays, including novel ones, are evaluated with respect to performance. Additionally, a new parallel algorithm for the computation of suffix arrays is introduced.
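
    The end product of such a tool is an evolutionary distance. A toy version of the usual estimate, assuming two already aligned, equal-length sequences, counts the mismatch fraction p and applies the Jukes-Cantor correction d = -(3/4) ln(1 - 4p/3); andi itself derives its estimate from long exact matches found in the enhanced suffix array rather than a site-by-site scan.

```python
# Toy Jukes-Cantor distance from pre-aligned, equal-length sequences.
# Illustrates the kind of corrected distance such tools report.
from math import log

def jukes_cantor_distance(a: str, b: str) -> float:
    assert len(a) == len(b), "toy version needs equal-length input"
    p = sum(x != y for x, y in zip(a, b)) / len(a)   # mismatch fraction
    if p >= 0.75:
        raise ValueError("sequences too diverged for the Jukes-Cantor model")
    return -0.75 * log(1 - 4 * p / 3)

print(jukes_cantor_distance("ACGTACGTAC", "ACGTACGAAC"))  # ~0.107 for p = 0.1
```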

    Accelerating dynamic programming

    Thesis (Ph.D.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009. By Oren Weimann. Dynamic Programming (DP) is a fundamental problem-solving technique that has been widely used for solving a broad range of search and optimization problems. While DP can be invoked when more specialized methods fail, this generality often incurs a cost in efficiency. We explore a unifying toolkit for speeding up DP, and algorithms that use DP as subroutines. Our methods and results can be summarized as follows.
    - Acceleration via Compression. Compression is traditionally used to efficiently store data. We use compression in order to identify repeats in the table that imply a redundant computation. Utilizing these repeats requires a new DP, and often different DPs for different compression schemes. We present the first provable speedup of the celebrated Viterbi algorithm (1967), which is used for the decoding and training of Hidden Markov Models (HMMs). Our speedup relies on the compression of the HMM's observable sequence.
    - Totally Monotone Matrices. It is well known that a wide variety of DPs can be reduced to the problem of finding row minima in totally monotone matrices. We introduce this scheme in the context of planar graph problems. In particular, we show that planar graph problems such as shortest paths, feasible flow, bipartite perfect matching, and replacement paths can be accelerated by DPs that exploit a total-monotonicity property of the shortest paths.
    - Combining Compression and Total Monotonicity. We introduce a method for accelerating string edit distance computation by combining compression and totally monotone matrices. At the heart of this method are algorithms for computing the edit distance between two straight-line programs. These enable us to exploit the compressibility of strings, even if each string is compressed using a different compression scheme.
    - Partial Tables. In typical DP settings, a table is filled in its entirety, where each cell corresponds to some subproblem. In some cases, by changing the DP, it is possible to compute asymptotically fewer cells of the table. We show that Θ(nÂł) subproblems are both necessary and sufficient for computing the similarity between two trees. This improves all known solutions and brings the idea of partial tables to its full extent.
    - Fractional Subproblems. In some DPs, the solution to a subproblem is a data structure rather than a single value. The entire data structure of a subproblem is then processed and used to construct the data structure of larger subproblems. We suggest a method for reusing parts of a subproblem's data structure. In some cases, such fractional parts remain unchanged when constructing the data structure of larger subproblems. In these cases, it is possible to copy this part of the data structure to the larger subproblem using only a constant number of pointer changes. We show how this idea can be used for finding the optimal tree-searching strategy in linear time. This is a generalization of the well-known binary search technique from arrays to trees.
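
    For reference, this is the kind of table all four techniques accelerate: the classic O(nm) edit-distance dynamic program, shown here with the standard rolling-row space optimization.

```python
# The baseline these techniques accelerate: the classic O(nm) edit-distance
# dynamic program, with the standard rolling-row space optimization.
def edit_distance(a: str, b: str) -> int:
    n, m = len(a), len(b)
    prev = list(range(m + 1))              # row 0: "" -> b[:j] costs j
    for i in range(1, n + 1):
        cur = [i] + [0] * m                # column 0: a[:i] -> "" costs i
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,          # delete a[i-1]
                         cur[j - 1] + 1,       # insert b[j-1]
                         prev[j - 1] + cost)   # substitute (or match)
        prev = cur
    return prev[m]

print(edit_distance("kitten", "sitting"))  # -> 3
```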

    High-Performance Computing Frameworks for Large-Scale Genome Assembly

    Genome sequencing technology has witnessed tremendous progress in terms of throughput and cost per base pair, resulting in an explosion in the size of data. Typical de Bruijn graph-based assembly tools demand a lot of processing power and memory, and cannot assemble big datasets unless running on a scaled-up server with terabytes of RAM or a scaled-out cluster with several dozen nodes. In the first part of this work, we present a distributed next-generation sequencing (NGS) assembler called Lazer, which achieves both scalability and memory efficiency by using partitioned de Bruijn graphs. By enhancing memory-to-disk swapping and reducing network communication in the cluster, we can assemble large sequences such as human genomes (~400 GB) on just two nodes in 14.5 hours, and also scale up to 128 nodes in 23 minutes. We also assemble a synthetic wheat genome with 1.1 TB of raw reads on 8 nodes in 18.5 hours and on 128 nodes in 1.25 hours. In the second part, we present a new distributed GPU-accelerated NGS assembler called LaSAGNA, which can assemble large-scale sequence datasets on a single GPU by building string graphs from approximate all-pair overlaps in quasi-linear time. To use the limited memory on GPUs efficiently, LaSAGNA uses a two-level semi-streaming approach, from disk through host memory to device memory, with restricted access patterns on both disk and host memory. Using LaSAGNA, we can assemble the human genome dataset on a single NVIDIA K40 GPU in 17 hours, and in a little over 5 hours on an 8-node cluster of NVIDIA K20s. In the third part, we present the first distributed third-generation sequencing (3GS) assembler, which uses a map-reduce computing paradigm and a distributed hash map, both built on high-performance networking middleware. Using this assembler, we assembled an Oxford Nanopore human genome dataset (~150 GB) in just over half an hour on 128 nodes, whereas existing 3GS assemblers could not assemble it because of memory and/or time limitations.
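
    A single-machine toy of the structure the first two assemblers distribute: a de Bruijn graph whose nodes are (k-1)-mers and whose edges are the k-mers observed in the reads, plus a helper that follows an unambiguous chain of successors to spell out one contig. This is an illustration, not the systems' partitioned implementations.

```python
# Toy de Bruijn graph: (k-1)-mer nodes, k-mer edges, and a helper that
# compresses a chain of unique successors into a contig. Illustrative only.
from collections import defaultdict

def de_bruijn_graph(reads: list, k: int) -> dict:
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])   # edge: (k-1)-prefix -> (k-1)-suffix
    return graph

def extend_contig(graph: dict, start: str) -> str:
    """Walk the unique-successor chain from start, appending one base per step."""
    contig, node, seen = start, start, {start}
    while len(graph[node]) == 1:
        nxt = next(iter(graph[node]))
        if nxt in seen:                      # guard against cycles
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig

reads = ["ACGTACG", "GTACGGA", "ACGGATT"]
g = de_bruijn_graph(reads, k=4)
print(extend_contig(g, "CGG"))  # -> "CGGATT"
```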