148 research outputs found

    SeqAn: An efficient, generic C++ library for sequence analysis

    Background: The use of novel algorithmic techniques is pivotal to many important problems in life science. For example, the sequencing of the human genome [1] would not have been possible without advanced assembly algorithms. However, owing to the high speed of technological progress and the urgent need for bioinformatics tools, there is a widening gap between state-of-the-art algorithmic techniques and the actual algorithmic components of tools that are in widespread use. Results: To remedy this trend we propose the use of SeqAn, a library of efficient data types and algorithms for sequence analysis in computational biology. SeqAn comprises implementations of existing, practical state-of-the-art algorithmic components to provide a sound basis for algorithm testing and development. In this paper we describe the design and content of SeqAn and demonstrate its use by giving two examples. In the first example we show an application of SeqAn as an experimental platform by comparing different exact string matching algorithms. The second example is a simple version of the well-known MUMmer tool rewritten in SeqAn. Results indicate that our implementation is very efficient and versatile to use. Conclusion: We anticipate that SeqAn greatly simplifies the rapid development of new bioinformatics tools by providing a collection of readily usable, well-designed algorithmic components which are fundamental for the field of sequence analysis. This leverages not only the implementation of new algorithms, but also enables a sound analysis and comparison of existing algorithms.
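As a rough illustration of the kind of experiment this abstract describes (comparing exact string matching algorithms), the sketch below contrasts a naive scan with Horspool's algorithm. It is written in Python and does not use SeqAn's C++ API. The function names `naive_search` and `horspool_search` are hypothetical, and the abstract does not say which algorithms the paper actually compared.

```python
def naive_search(text, pattern):
    """Return all start positions of pattern in text by checking every offset."""
    m = len(pattern)
    if m == 0:
        return []
    return [i for i in range(len(text) - m + 1) if text[i:i + m] == pattern]


def horspool_search(text, pattern):
    """Return all start positions of pattern in text using Horspool's shift rule."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # For each character in pattern[:-1], store the distance from its
    # rightmost occurrence to the end of the pattern.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    hits = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            hits.append(pos)
        # Shift according to the text character aligned with the pattern's
        # last position; characters absent from the table allow a full jump.
        pos += shift.get(text[pos + m - 1], m)
    return hits
```

Both functions report the same hits (for example, positions 2 and 9 for the pattern `TTAC` in `GATTACAGATTACA`), so a comparison can time them on identical inputs and attribute any difference to the search strategy alone.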

    Rust-Bio - a fast and safe bioinformatics library

    We present Rust-Bio, the first general purpose bioinformatics library for the innovative Rust programming language. Rust-Bio leverages the unique combination of speed, memory safety and high-level syntax offered by Rust to provide a fast and safe set of bioinformatics algorithms and data structures with a focus on sequence analysis

    Evaluating the Relationship Between Running Times and DNA Sequence Sizes using a Generic-Based Filtering Program.

    Generic programming depends on the decomposition of programs into simpler components which may be developed separately and combined arbitrarily, subject only to well-defined interfaces. Bioinformatics deals with the application of computational techniques to data present in the Biological sciences. A genetic sequence is a succession of letters which represents the basic structure of a hypothetical DNA molecule, with the capacity to carry information. This research article studied the relationship between the running times of a generic-based filtering program and different samples of genetic sequences in an increasing order of magnitude. A graphical result was obtained to adequately depict this relationship. It was also discovered that the complexity of the generic tree program was O(log2 N). This research article provided one of the systematic approaches of generic programming to Bioinformatics, which could be instrumental in elucidating major discoveries in Bioinformatics, as regards efficient data management and analysis

    Segment-based multiple sequence alignment

    Motivation: Many multiple sequence alignment tools have been developed in the past, progressing either in speed or alignment accuracy. Given the importance and wide-spread use of alignment tools, progress in both categories is a contribution to the community and has driven research in the field so far. Results: We introduce a graph-based extension to the consistency-based, progressive alignment strategy. We apply the consistency notion to segments instead of single characters. The main problem we solve in this context is to define segments of the sequences in such a way that a graph-based alignment is possible. We implemented the algorithm using the SeqAn library and report results on amino acid and DNA sequences. The benefit of our approach is threefold: (1) sequences with conserved blocks can be rapidly aligned, (2) the implementation is conceptually easy, generic and fast and (3) the consistency idea can be extended to align multiple genomic sequences. Availability: The segment-based multiple sequence alignment tool can be downloaded from http://www.seqan.de/projects/msa.html. A novel version of T-Coffee interfaced with the tool is available from http://www.tcoffee.org. The usage of the tool is described in both documentations. Contact: [email protected]

    LaRA 2: parallel and vectorized program for sequence–structure alignment of RNA sequences

    Background The function of non-coding RNA sequences is largely determined by their spatial conformation, namely the secondary structure of the molecule, formed by Watson–Crick interactions between nucleotides. Hence, modern RNA alignment algorithms routinely take structural information into account. In order to discover yet unknown RNA families and infer their possible functions, the structural alignment of RNAs is an essential task. This task demands a lot of computational resources, especially for aligning many long sequences, and it therefore requires efficient algorithms that utilize modern hardware when available. A subset of the secondary structures contains overlapping interactions (called pseudoknots), which add additional complexity to the problem and are often ignored in available software. Results We present the SeqAn-based software LaRA 2 that is significantly faster than comparable software for accurate pairwise and multiple alignments of structured RNA sequences. In contrast to other programs our approach can handle arbitrary pseudoknots. As an improved re-implementation of the LaRA tool for structural alignments, LaRA 2 uses multi-threading and vectorization for parallel execution and a new heuristic for computing a lower boundary of the solution. Our algorithmic improvements yield a program that is up to 130 times faster than the previous version. Conclusions With LaRA 2 we provide a tool to analyse large sets of RNA secondary structures in relatively short time, based on structural alignment. The produced alignments can be used to derive structural motifs for the search in genomic databases

    Evaluating and Improving the Efficiency of Software and Algorithms for Sequence Data Analysis

    With the ever-growing size of sequence data sets, data processing and analysis are an increasingly large portion of the time and money spent on nucleic acid sequencing projects. Correspondingly, the performance of the software and algorithms used to perform that analysis has a direct effect on the time and expense involved. Although the analytical methods are widely varied, certain types of software and algorithms are applicable to a number of areas. Targeting improvements to these common elements has the potential for wide reaching rewards. This dissertation research consisted of several projects to characterize and improve upon the efficiency of several common elements of sequence data analysis software and algorithms. The first project sought to improve the efficiency of the short read mapping process, as mapping is the most time consuming step in many data analysis pipelines. The result was a new short read mapping algorithm and software, demonstrated to be more computationally efficient than existing software and enabling more of the raw data to be utilized. While developing this software, it was discovered that a widely used bioinformatics software library introduced a great deal of inefficiency into the application. Given the potential impact of similar libraries to other applications, and because little research had been done to evaluate library efficiency, the second project evaluated the efficiency of seven of the most popular bioinformatics software libraries, written in C++, Java, Python, and Perl. This evaluation showed that two of the libraries written in the most popular language, Java, were an order of magnitude slower and used more memory than expected based on the language in which they were implemented. The third and final project, therefore, was the development of a new general-purpose bioinformatics software library for Java. This library, known as BioMojo, incorporated a new design approach resulting in vastly improved efficiency. Assessing the performance of this new library using the benchmark methods developed for the second project showed that BioMojo outperformed all of the other libraries across all benchmark tasks, being up to 30 times more CPU efficient than existing Java libraries

    miR-SEA: miRNA Seed Extension based Aligner Pipeline for NGS Expression Level Extraction

    The advent of Next Generation Sequencing (NGS) technology has enabled a major new approach to micro RNA (miRNA) expression profiling through so-called RNA-Sequencing (RNA-Seq). Different tools have been developed in recent years to detect and quantify miRNAs, especially in pathological samples, starting from the large amount of data deriving from RNA sequencing. These tools, usually relying on general purpose alignment algorithms, are however characterized by different sensitivity and accuracy levels and in most cases provide non-overlapping predictions. To overcome these limitations we propose a novel pipeline for miRNA detection and quantification in RNA-Seq samples, miRNA Seed Extension Aligner (miR-SEA), based on experimental evidence concerning miRNA structure. The proposed pipeline was tested on three Colorectal Cancer (CRC) RNA-Seq samples and the obtained results were compared with those provided by two well-known miRNA detection tools, showing a good ability to perform detection and quantification more adherent to miRNA structure

    LOGAN: High-Performance GPU-Based X-Drop Long-Read Alignment

    Pairwise sequence alignment is one of the most computationally intensive kernels in genomic data analysis, accounting for more than 90% of the runtime for key bioinformatics applications. This method is particularly expensive for third-generation sequences due to the high computational cost of analyzing sequences of length between 1Kb and 1Mb. Given the quadratic overhead of exact pairwise algorithms for long alignments, the community primarily relies on approximate algorithms that search only for high-quality alignments and stop early when one is not found. In this work, we present the first GPU optimization of the popular X-drop alignment algorithm, which we named LOGAN. Results show that our high-performance multi-GPU implementation achieves up to 181.6 GCUPS and speed-ups up to 6.6x and 30.7x using 1 and 6 NVIDIA Tesla V100, respectively, over the state-of-the-art software running on two IBM Power9 processors using 168 CPU threads, with equivalent accuracy. We also demonstrate a 2.3x LOGAN speed-up versus ksw2, a state-of-the-art vectorized algorithm for sequence alignment implemented in minimap2, a long-read mapping software. To highlight the impact of our work on a real-world application, we couple LOGAN with a many-to-many long-read alignment software called BELLA, and demonstrate that our implementation improves the overall BELLA runtime by up to 10.6x. Finally, we adapt the Roofline model for LOGAN and demonstrate that our implementation is near-optimal on the NVIDIA Tesla V100s
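The early-termination idea behind X-drop, as described in this abstract, can be sketched as follows. This is a simplified, ungapped Python illustration under an assumed scoring scheme (+1 for a match, -1 for a mismatch). It is not LOGAN's gapped GPU implementation, and the function name `xdrop_extend` and its parameters are hypothetical.

```python
def xdrop_extend(a, b, x, match=1, mismatch=-1):
    """Extend an ungapped alignment from the start of sequences a and b.

    Returns (best_score, best_length) for the best-scoring common-length
    prefix seen before the extension is abandoned. The extension stops as
    soon as the running score falls more than x below the best score so far.
    """
    best_score = 0
    best_len = 0
    score = 0
    for i, (ca, cb) in enumerate(zip(a, b), start=1):
        score += match if ca == cb else mismatch
        if score > best_score:
            best_score, best_len = score, i
        elif best_score - score > x:
            break  # X-drop condition: this extension is no longer promising
    return best_score, best_len
```

With `a = "AAAACCCAAAAAAAA"` and `b = "AAAAGGGAAAAAAAA"`, a threshold of `x=2` stops inside the mismatching block and returns `(4, 4)`, while `x=5` tolerates the block and returns `(9, 15)`. A larger threshold explores more of the sequence at the cost of more work, which is the trade-off the X-drop parameter controls.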