95 research outputs found

    BLAMM : BLAS-based algorithm for finding position weight matrix occurrences in DNA sequences on CPUs and GPUs

    Background: The identification of all matches of a large set of position weight matrices (PWMs) in long DNA sequences requires significant computational resources, for which a number of efficient yet complex algorithms have been proposed. Results: We propose BLAMM, a simple and efficient tool inspired by high-performance computing techniques. The workload is expressed in terms of matrix-matrix products that are evaluated with high efficiency using optimized BLAS library implementations. The algorithm is easy to parallelize and implement on CPUs and GPUs and has a runtime that is independent of the selected p-value. In terms of single-core performance, it is competitive with state-of-the-art software for PWM matching while being much more efficient when using multithreading. Additionally, BLAMM requires negligible memory. For example, both strands of the entire human genome can be scanned for 1404 PWMs in the JASPAR database in 13 min with a p-value of 10^-4 using a 36-core machine. On a dual-GPU system, the same task can be performed in under 5 min. Conclusions: BLAMM is an efficient tool for identifying PWM matches in large DNA sequences. Its C++ source code is available under the GNU General Public License Version 3 at https://github.com/biointec/blamm
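The idea of expressing PWM scanning as a matrix-matrix product can be illustrated with a small sketch. This is not BLAMM's code: the function names (`one_hot_windows`, `flatten_pwm`, `matmul`, `scan`) are hypothetical, all PWMs are assumed to have the same length, and a naive pure-Python product stands in for the optimized BLAS routine that the abstract refers to.

```python
# Illustrative sketch only; not BLAMM's actual implementation.
BASES = "ACGT"

def one_hot_windows(seq, m):
    """Row i is the flattened one-hot encoding (length 4*m) of seq[i:i+m]."""
    rows = []
    for i in range(len(seq) - m + 1):
        row = [0.0] * (4 * m)
        for j, base in enumerate(seq[i:i + m]):
            row[4 * j + BASES.index(base)] = 1.0
        rows.append(row)
    return rows

def flatten_pwm(pwm):
    """pwm[j][b] is the score of base b at motif position j; flatten to length 4*m."""
    return [pwm[j][b] for j in range(len(pwm)) for b in BASES]

def matmul(a, b):
    """Naive dense product of a (n x k) and b (k x p); a BLAS GEMM call would replace this."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def scan(seq, pwms, threshold):
    """Return (position, pwm_index, score) for every window scoring >= threshold."""
    m = len(pwms[0])  # assumption: all PWMs share the same length m
    windows = one_hot_windows(seq, m)                 # (n - m + 1) x 4m
    columns = [flatten_pwm(p) for p in pwms]          # one flattened PWM per entry
    pwm_matrix = [list(r) for r in zip(*columns)]     # 4m x (number of PWMs)
    scores = matmul(windows, pwm_matrix)              # one score per (window, PWM)
    return [(i, k, s)
            for i, row in enumerate(scores)
            for k, s in enumerate(row)
            if s >= threshold]
```

Each row of the left matrix encodes one sequence window and each column of the right matrix encodes one PWM, so a single product scores every window against every PWM. The cost of that product does not depend on the threshold, which is consistent with the abstract's remark that the runtime is independent of the p-value.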

    Parallel Position Weight Matrices Algorithms

    Position Weight Matrices (PWMs) are broadly used in computational biology. The basic problems, Scan and MultipleScan, aim to find all the occurrences of a given PWM or a set of PWMs in long sequences. Some other PWM tasks share a common NP-hard subproblem, ScoreDistribution. The existing algorithms rely on the enumeration of a large set of scores or words, and they are mostly not suitable for parallelization. We propose a new algorithm, BucketScoreDistribution, that is both very efficient and suitable for parallelization. We bound the error induced by this algorithm. We implemented a GPU prototype for Scan, MultipleScan and BucketScoreDistribution with the CUDA libraries, and report, for the different problems, speedups larger than 10× on several Nvidia cards.
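To make the score distribution problem concrete, the sketch below computes an approximate distribution of PWM scores over random words by rounding column scores into fixed-width buckets. It is a generic rounded-score dynamic program and may differ from the BucketScoreDistribution algorithm of the paper; the names `score_distribution`, `p_value`, and `bucket_width` are assumptions made for illustration.

```python
from collections import defaultdict

def score_distribution(pwm, background, bucket_width):
    """Approximate distribution of PWM scores over random words.

    pwm: list of dicts mapping base -> score, one dict per motif position.
    background: dict mapping base -> probability (independent positions assumed).
    Each column score is rounded to a whole number of buckets, which is
    where the approximation error comes from.
    Returns a dict mapping bucket index -> probability.
    """
    dist = {0: 1.0}  # before any column, the score is 0 with probability 1
    for column in pwm:
        nxt = defaultdict(float)
        for bucket, prob in dist.items():
            for base, score in column.items():
                nxt[bucket + round(score / bucket_width)] += prob * background[base]
        dist = dict(nxt)
    return dist

def p_value(dist, bucket_width, threshold):
    """Probability that a random word scores at least `threshold`, up to rounding error."""
    return sum(p for b, p in dist.items() if b * bucket_width >= threshold)
```

The number of distinct buckets grows with the motif length and the score range divided by `bucket_width`, rather than with the number of words, which is what makes a bucketed approach tractable compared with enumerating all words.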

    MOODS: fast search for position weight matrix matches in DNA sequences

    Summary: MOODS (MOtif Occurrence Detection Suite) is a software package for matching position weight matrices against DNA sequences. MOODS implements state-of-the-art online matching algorithms, achieving considerably faster scanning speed than with a simple brute-force search. MOODS is written in C++, with bindings for the popular BioPerl and Biopython toolkits. It can easily be adapted for different purposes and integrated into existing workflows. It can also be used as a C++ library

    Prospects and limitations of full-text index structures in genome analysis

    The combination of incessant advances in sequencing technology producing large amounts of data and innovative bioinformatics approaches, designed to cope with this data flood, has led to new interesting results in the life sciences. Given the magnitude of sequence data to be processed, many bioinformatics tools rely on efficient solutions to a variety of complex string problems. These solutions include fast heuristic algorithms and advanced data structures, generally referred to as index structures. Although the importance of index structures is generally known to the bioinformatics community, the design and potency of these data structures, as well as their properties and limitations, are less understood. Moreover, the last decade has seen a boom in the number of variant index structures featuring complex and diverse memory-time trade-offs. This article brings a comprehensive state-of-the-art overview of the most popular index structures and their recently developed variants. Their features, interrelationships, the trade-offs they impose, but also their practical limitations, are explained and compared

    Techniques for searching, parsing, and matching (fourth edition)

    These lecture notes present some basic techniques for: (i) exploring search spaces, (ii) parsing context-free languages, and (iii) matching patterns in strings. These techniques are taught in a course on Automata, Languages, and Translators at the University of Roma "Tor Vergata". We assume that the reader is familiar with the basic notions of Automata Theory and Formal Languages. These notions can be found in many books such as [Har78,HoU79,Pet13a]. Some of the algorithms we have presented in these notes are written in Java 1.5 and some others in Prolog. For the Java language the reader may refer to the Java Tutorial at http://java.sun.com/docs/books/tutorial/. (Recall that this Java version allows the use of parameterized types, also called generics.) All Java programs have been compiled using the Java compiler 1.5.0-13 running under Mac OS X 10.4.11 Darwin 8.11.1. For the Prolog language the reader may refer to [ClM84]. The Prolog language incorporates a backtracking mechanism which is useful for exploring search spaces and solving parsing and matching problems.

    Algorithms for efficient sequence comparison without alignment (Algoritmi za učinkovitu usporedbu sekvenci bez korištenja sravnjivanja)

    Sequence comparison is an essential tool in modern biology. It is used to identify homologous regions between sequences, and to detect evolutionary relationships between organisms. Sequence comparison is usually based on alignments. However, aligning whole genomes is computationally difficult. As an alternative approach, alignment-free sequence comparison can be used. In my thesis, I concentrate on two problems that can be solved without alignment: (i) estimation of substitution rates between nucleotide sequences, and (ii) detection of local sequence homology. In the first part of my thesis, I developed and implemented a new algorithm for the efficient alignment-free computation of the number of nucleotide substitutions per site, and applied it to the analysis of large data sets of complete genomes. In the second part of my thesis, I developed and implemented a new algorithm for detecting matching regions between nucleotide sequences. I applied this solution to the classification of circulating recombinant forms of HIV, and to the analysis of bacterial genomes subject to horizontal gene transfer.

    Table of Contents
    1. General Introduction
    1.1. Suffix trees and other index data structures used in biological sequence analysis
    1.1.1. Suffix Tree
    1.1.2. The space and the time complexity of the algorithms for the suffix tree construction
    1.1.3. Suffix Array
    1.1.4. The space and the time complexity of the algorithms for suffix array construction
    1.1.5. Enhanced Suffix Array
    1.1.6. The 64-bit implementation of the lightweight suffix array construction algorithm
    1.1.7. Self-index
    1.1.8. Burrows-Wheeler transform
    1.1.9. The FM-Index and the backward search algorithm
    1.1.10. The space and the time complexity of the FM-index
    2. Efficient Estimation of Pairwise Distances Between Genomes
    2.1. Introduction
    2.2. Methods
    2.2.1. Definition of an alignment-free estimator of the rate of substitution, Kr
    2.2.2. Problem statement
    2.2.3. Time complexity analysis of the previous approach (kr 1)
    2.2.4. Time complexity analysis of the new approach (kr 2)
    2.2.5. Algorithm 1: Computation of all Kr values during the traversal of a generalized suffix tree of n sequences
    2.2.6. The implementation of kr version 2
    2.3. Analysis of Kr on simulated data sets
    2.3.1. Auxiliary programs
    2.3.2. Consistency of Kr
    2.3.3. The effect of horizontal gene transfer on the accuracy of Kr
    2.3.4. The effect of genome duplication on the accuracy of Kr
    2.3.5. Run time comparison of kr 1 and kr 2
    2.4. Application of kr version 2
    2.4.1. Auxiliary software used for the analysis of real data sets
    2.4.2. The analysis of 12 Drosophila genomes
    2.4.3. The analysis of 13 Escherichia coli and Shigella genomes
    2.4.4. The analysis of 825 HIV-1 pure subtype genomes
    2.5. Discussion
    3. Efficient Alignment-Free Detection of Local Sequence Homology
    3.1. Introduction
    3.2. Methods
    3.2.1. Problem statement: determining subtype(s) of a query sequence
    3.2.2. Construction of locally homologous segments
    3.2.3. Time complexity of computing a list of intervals Ii
    3.2.4. Algorithm 2: Construction of an interval tree
    3.2.5. Computing a list of segments Gi
    3.3. Analysis of st on simulated data sets
    3.3.1. Run-time and memory usage analysis of st
    3.3.2. Consistency of st
    3.3.3. Comparison to SCUEAL on simulated data sets
    3.4. Application of st
    3.4.1. The analysis of Neisseria meningitidis
    3.4.2. The analysis of a recombinant form of HIV-1
    3.4.3. The analysis of circulating recombinant forms of HIV-1
    3.4.4. The analysis of an avian pathogenic Escherichia coli strain
    3.5. Discussion
    4. Conclusion
    5. References
    6. Electronic Sources
    7. List of Abbreviations and Symbols
    Abstract
    Sažetak
    Curriculum Vitae
    Životopis

    Transform Based And Search Aware Text Compression Schemes And Compressed Domain Text Retrieval

    In recent times, we have witnessed an unprecedented growth of textual information via the Internet, digital libraries and archival text in many applications. While a good fraction of this information is of transient interest, useful information of archival value will continue to accumulate. We need ways to manage, organize and transport this data from one point to the other on data communications links with limited bandwidth. We must also have means to speedily find the information we need from this huge mass of data. Sometimes, a single site may also contain large collections of data such as a library database, thereby requiring an efficient search mechanism even to search within the local data. To facilitate information retrieval, XML, an emerging ad hoc standard for uncompressed text, annotates the text with additional user-defined metadata, such as DTDs or hyperlinks, to enable searching with better efficiency and effectiveness. This increases the file size considerably, underscoring the importance of applying text compression. On account of efficiency (in terms of both space and time), there is a need to keep the data in compressed form as much as possible. Text compression is concerned with techniques for representing digital text data in alternate representations that take less space. Not only does it help conserve storage space for archival and online data, it also helps system performance by requiring fewer secondary storage (disk or CD-ROM) accesses, and it improves network bandwidth utilization by reducing the transmission time. Unlike static images or video, there is no international standard for text compression, although compressed formats like .zip, .gz, .Z files are increasingly being used. In general, data compression methods are classified as lossless or lossy. Lossless compression allows the original data to be recovered exactly. 
Although used primarily for text data, lossless compression algorithms are useful in special classes of images such as medical imaging, fingerprint data, astronomical images and databases containing mostly vital numerical data, tables and text information. Many lossy algorithms use lossless methods at the final stage of encoding, underscoring the importance of lossless methods for both lossy and lossless compression applications. In order to be able to effectively utilize the full potential of compression techniques for future retrieval systems, we need efficient information retrieval in the compressed domain. This means that techniques must be developed to search the compressed text without decompression or with only partial decompression, independent of whether the search is done on the text or on some inversion table corresponding to a set of keywords for the text. In this dissertation, we make the following contributions: (1) Star family compression algorithms: We have proposed an approach to develop a reversible transformation that can be applied to a source text and that improves existing algorithms' ability to compress. We use a static dictionary to convert the English words into predefined symbol sequences. These transformed sequences create additional context information that is superior to the original text. Thus we achieve some compression at the preprocessing stage. We have a series of transforms which improve the performance. The Star transform requires a static dictionary of a certain size. To avoid the considerable complexity of conversion, we employ the ternary tree data structure, which efficiently converts the words in the text to the words in the star dictionary in linear time. (2) Exact and approximate pattern matching in Burrows-Wheeler transformed (BWT) files: We proposed a method to extract the useful context information in linear time from the BWT transformed text. 
The auxiliary arrays obtained from the BWT inverse transform yield logarithmic search time. Meanwhile, approximate pattern matching can be performed based on the results of exact pattern matching, which are used to extract the possible candidates for the approximate pattern matching. A fast verification algorithm can then be applied to those candidates, which may be just small parts of the original text. We present algorithms for both k-mismatch and k-approximate pattern matching in BWT compressed text. A typical compression system based on BWT has Move-to-Front and Huffman coding stages after the transformation. We propose a novel approach to replace the Move-to-Front stage in order to extend compressed domain search capability all the way to the entropy coding stage. A modification to the Move-to-Front stage makes it possible to randomly access any part of the compressed text without referring to the part before the access point. (3) Modified LZW algorithm that allows random access and partial decoding for compressed text retrieval: Although many compression algorithms provide good compression ratio and/or time complexity, LZW is the first one studied for compressed pattern matching because of its simplicity and efficiency. Modifications to the LZW algorithm provide the extra advantage of fast random access and partial decoding, which is especially useful for text retrieval systems. Based on this algorithm, we can provide a dynamic hierarchical semantic structure for the text, so that the text search can be performed at the expected level of granularity. For example, a user can choose to retrieve a single line, a paragraph, or a file, etc. that contains the keywords. More importantly, we will show that parallel encoding and decoding are trivial with the modified LZW. Both encoding and decoding can be performed easily with multiple processors, and the encoding and decoding processes are independent of the number of processors.
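As a rough illustration of exact pattern matching on Burrows-Wheeler transformed text, the sketch below counts pattern occurrences using the standard backward search over the transform. This is a swapped-in textbook technique, not the dissertation's auxiliary-array method; the function names `bwt` and `count_occurrences` are hypothetical, the sentinel `$` is assumed not to occur in the input, and the rank query is a linear scan kept simple for readability.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations; '$' is assumed absent from text."""
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def count_occurrences(bwt_text, pattern):
    """Count exact occurrences of pattern using backward search on the BWT."""
    # first_row_start[c]: number of characters in the text strictly smaller than c
    counts = {}
    for ch in bwt_text:
        counts[ch] = counts.get(ch, 0) + 1
    first_row_start, total = {}, 0
    for ch in sorted(counts):
        first_row_start[ch] = total
        total += counts[ch]

    def occ(ch, i):
        # Occurrences of ch in bwt_text[:i]; a real index would precompute ranks.
        return bwt_text[:i].count(ch)

    lo, hi = 0, len(bwt_text)  # half-open range of sorted rotations
    for ch in reversed(pattern):
        if ch not in first_row_start:
            return 0
        lo = first_row_start[ch] + occ(ch, lo)
        hi = first_row_start[ch] + occ(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo
```

For example, `count_occurrences(bwt("banana"), "ana")` narrows the rotation range one pattern character at a time, from right to left, without ever reconstructing the original text.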

    Proceedings of the Eindhoven FASTAR Days 2004 : Eindhoven, The Netherlands, September 3-4, 2004

    The Eindhoven FASTAR Days (EFD) 2004 were organized by the Software Construction group of the Department of Mathematics and Computer Science at the Technische Universiteit Eindhoven. On September 3rd and 4th 2004, over thirty participants, hailing from the Czech Republic, Finland, France, The Netherlands, Poland and South Africa, gathered at the Department to attend the EFD. The EFD were organized in connection with the research on finite automata by the FASTAR Research Group, which is centered in Eindhoven and at the University of Pretoria, South Africa. FASTAR (Finite Automata Systems: Theoretical and Applied Research) is an international research group that aims to lead in all areas related to finite state systems. The work in FASTAR includes both core and applied parts of this field. The EFD therefore focused on the field of finite automata, with an emphasis on practical aspects and applications. Eighteen presentations, mostly on subjects within this field, were given by researchers as well as students from participating universities and industrial research facilities. This report contains the proceedings of the conference, in the form of papers for twelve of the presentations at the EFD. Most of them were initially reviewed and distributed as handouts during the EFD. After the EFD took place, the papers were revised for publication in these proceedings. We would like to thank the participants for their attendance and presentations, making the EFD 2004 as successful as they were. Based on this success, it is our intention to make the EFD into a recurring event. Eindhoven, December 2004. Loek Cleophas, Bruce W. Watso