Indices and Applications in High-Throughput Sequencing
Recent advances in sequencing technology make it possible to produce billions of base pairs per day in the form of reads of length 100 bp and longer, and current developments promise the personal $1,000 genome in a couple of years. The analysis of these unprecedented amounts of data demands efficient data structures and algorithms. One such data structure is the substring index, which represents all substrings, or all substrings up to a certain length, contained in a given text.
In this thesis we propose 3 substring indices, which we extend to be applicable to millions of sequences. We devise internal and external memory construction algorithms and a uniform framework for accessing the generalized suffix tree. Additionally we propose different index-based applications, e.g. exact and approximate pattern matching and different repeat search algorithms.
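To make the idea of a substring index concrete, the following minimal sketch uses a plain suffix array with naive construction and binary search for exact pattern matching. It is an illustration only, not one of the three indices proposed in the thesis; the function names `build_suffix_array` and `find_occurrences` are hypothetical.

```python
def build_suffix_array(text):
    # Naive construction: sort suffix start positions by the suffix text.
    # Fine for short strings; real indices use far more efficient algorithms.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    # Suffixes sharing the prefix `pattern` form one contiguous block in the
    # suffix array, so two binary searches delimit all occurrences.
    m = len(pattern)

    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    hi = len(sa)
    while lo < hi:  # first suffix whose m-prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


text = "banana"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "ana"))
```

Each query costs a logarithmic number of string comparisons once the array is built, which is why such indices pay off when many patterns are searched in the same text.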
Second, we present the read mapping tool RazerS, which aligns millions of single or paired-end reads of arbitrary lengths to their potential genomic origin using either Hamming or edit distance. Our tool can work either lossless or with a user-defined loss rate at higher speeds. Given the loss rate, we present a novel approach that guarantees not to lose more reads than specified. This enables the user to adapt to the problem at hand and provides a seamless tradeoff between sensitivity and running time. We compare RazerS with other state-of-the-art read mappers and show that it has the highest sensitivity and a comparable performance on various real-world datasets.
Finally, we propose a general approach for frequency-based string mining, which has many applications, e.g. in contrast data mining. Our contribution is a novel and lightweight algorithm that is faster and uses less memory than the best available algorithms. We show its applicability for mining multiple databases with a variety of frequency constraints. To this end, we use the notion of entropy from information theory to generalize the emerging substring mining problem to multiple databases. To demonstrate the improvement of our algorithm, we compared it to recent approaches in real-world experiments on various string domains, e.g. natural language, DNA, or protein sequences.
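The frequency-constrained mining problem can be illustrated with a deliberately naive sketch: enumerate all substrings up to a length bound, count in how many strings of each database they occur, and keep those that are frequent in one database and rare in the other. This is not the algorithm of the thesis, which is index-based and far more efficient; the names `substring_support` and `emerging_substrings` and the `max_len` bound are assumptions made for the example.

```python
def substring_support(db, max_len):
    # For each substring of length <= max_len, count the strings in `db`
    # that contain it at least once.
    support = {}
    for s in db:
        seen = {s[i:j]
                for i in range(len(s))
                for j in range(i + 1, min(len(s), i + max_len) + 1)}
        for sub in seen:
            support[sub] = support.get(sub, 0) + 1
    return support


def emerging_substrings(pos, neg, min_pos, max_neg, max_len=8):
    # Keep substrings with relative frequency >= min_pos in `pos`
    # and <= max_neg in `neg`.
    sp = substring_support(pos, max_len)
    sn = substring_support(neg, max_len)
    return sorted(sub for sub, count in sp.items()
                  if count / len(pos) >= min_pos
                  and sn.get(sub, 0) / len(neg) <= max_neg)


pos = ["ACGT", "ACGA"]
neg = ["TTGT", "GGTA"]
print(emerging_substrings(pos, neg, min_pos=1.0, max_neg=0.0))
```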
Fast and accurate read mapping with approximate seeds and multiple backtracking
We present Masai, a read mapper representing the state-of-the-art in terms of speed and accuracy. Our tool is an order of magnitude faster than RazerS 3 and mrFAST, 2-4 times faster and more accurate than Bowtie 2 and BWA. The novelties of our read mapper are filtration with approximate seeds and a method for multiple backtracking. Approximate seeds, compared with exact seeds, increase filtration specificity while preserving sensitivity. Multiple backtracking amortizes the cost of searching a large set of seeds by taking advantage of the repetitiveness of next-generation sequencing data. Combined together, these two methods significantly speed up approximate search on genomic data sets. Masai is implemented in C++ using the SeqAn library. The source code is distributed under the BSD license and binaries for Linux, Mac OS X and Windows can be freely downloaded from http://www.seqan.de/projects/masai
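The trade-off between exact and approximate seeds follows from the pigeonhole principle: if a read matches with at most k errors and is cut into s non-overlapping seeds, at least one seed matches with at most floor(k/s) errors. The sketch below only shows this bookkeeping; it is not Masai's implementation, and the helper names `split_into_seeds` and `seed_error_threshold` are hypothetical.

```python
def split_into_seeds(read, num_seeds):
    # Cut the read into `num_seeds` non-overlapping pieces of near-equal
    # length; each entry is (offset in read, seed string).
    n = len(read)
    bounds = [i * n // num_seeds for i in range(num_seeds + 1)]
    return [(bounds[i], read[bounds[i]:bounds[i + 1]])
            for i in range(num_seeds)]


def seed_error_threshold(max_errors, num_seeds):
    # Pigeonhole: with at most `max_errors` errors spread over `num_seeds`
    # seeds, some seed carries at most floor(max_errors / num_seeds) errors.
    return max_errors // num_seeds


read = "ACGTACGTAC"
# Exact seeds: k + 1 short seeds, each searched with 0 errors.
print(split_into_seeds(read, 6), seed_error_threshold(5, 6))
# Approximate seeds: fewer, longer seeds, each searched with up to 1 error.
print(split_into_seeds(read, 3), seed_error_threshold(5, 3))
```

Longer seeds occur less often by chance in the reference, which is one way to read the abstract's claim that approximate seeds raise filtration specificity without losing sensitivity.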
RazerS 3: Faster, fully sensitive read mapping
Motivation: In recent years, next-generation sequencing (NGS) has become a key technology for many applications in the biomedical sciences. Throughput continues to increase and new protocols provide longer reads than currently available. In almost all applications, read mapping is a first step. Hence, it is crucial to have algorithms and implementations that are fast, highly sensitive, and able to deal with long reads and a large absolute number of indels.
Results: RazerS is a read mapping program with adjustable sensitivity based on counting q-grams. In this work we propose the successor RazerS 3, which now supports shared-memory parallelism, an additional seed-based filter with adjustable sensitivity, a much faster, banded version of Myers' bit-vector algorithm for verification, memory-saving measures, and the SAM output format. This leads to much improved performance for mapping reads, in particular long reads with many errors. We extensively compare RazerS 3 with other popular read mappers and show that its results are often superior in terms of sensitivity while exhibiting practical and often competitive run times. In addition, RazerS 3 works without a precomputed index.
Availability and Implementation: Source code and binaries are freely available for download at http://www.seqan.de/projects/razers. RazerS 3 is implemented in C++ and OpenMP under a GPL license using the SeqAn library and supports Linux, Mac OS X, and Windows
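Filtering by q-gram counting rests on the q-gram lemma: a read of length n that matches a reference substring with at most k errors shares at least n + 1 - (k + 1)q q-grams with it. The following sketch applies that threshold to a single candidate window; it is a simplified illustration under these assumptions, not the RazerS 3 filter, and the function names are hypothetical.

```python
from collections import Counter


def qgram_threshold(read_len, max_errors, q):
    # q-gram lemma: minimum number of shared q-grams for a match with at
    # most `max_errors` errors. A value <= 0 means the filter cannot prune.
    return read_len + 1 - (max_errors + 1) * q


def shared_qgrams(read, window, q):
    # Count common q-grams with multiplicity (multiset intersection).
    rc = Counter(read[i:i + q] for i in range(len(read) - q + 1))
    wc = Counter(window[i:i + q] for i in range(len(window) - q + 1))
    return sum((rc & wc).values())


def passes_filter(read, window, max_errors, q):
    # Windows below the threshold cannot contain a valid match and are
    # skipped; windows at or above it still need verification.
    return shared_qgrams(read, window, q) >= qgram_threshold(
        len(read), max_errors, q)


read = "ACGTACGTAC"
print(passes_filter(read, "ACGTTCGTAC", max_errors=1, q=3))  # one mismatch
print(passes_filter(read, "TTTTTTTTTT", max_errors=1, q=3))  # unrelated
```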
RazerS - Fast Read Mapping with Sensitivity Control
Second-generation sequencing technologies deliver DNA sequence data at unprecedented high throughput. Common to most biological applications is a mapping of the reads to an almost identical or highly similar reference genome. Due to the large amounts of data, efficient algorithms and implementations are crucial for this task. We present an efficient read mapping tool called RazerS. It allows the user to align sequencing reads of arbitrary length using either the Hamming distance or the edit distance. Our tool can work either lossless or with a user-defined loss rate at higher speeds. Given the loss rate, we present an approach that guarantees not to lose more reads than specified. This enables the user to adapt to the problem at hand and provides a seamless tradeoff between sensitivity and running time
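The two distance measures mentioned above differ in what they count: Hamming distance allows only mismatches between equal-length strings, while edit distance also allows insertions and deletions. A textbook sketch of both follows; it is not RazerS code, and the function names are chosen for the example.

```python
def hamming_distance(a, b):
    # Number of mismatching positions; defined only for equal lengths.
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a, b):
    # Classic dynamic programming over one rolling row: the minimum number
    # of substitutions, insertions and deletions turning `a` into `b`.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or mismatch
        prev = cur
    return prev[-1]


# A one-base shift is costly under Hamming distance but cheap under
# edit distance.
print(hamming_distance("ACGTACGT", "CGTACGTA"))
print(edit_distance("ACGTACGT", "CGTACGTA"))
```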
Segment-based multiple sequence alignment
Motivation: Many multiple sequence alignment tools have been developed in the past, progressing either in speed or alignment accuracy. Given the importance and wide-spread use of alignment tools, progress in both categories is a contribution to the community and has driven research in the field so far.
Results: We introduce a graph-based extension to the consistency-based, progressive alignment strategy. We apply the consistency notion to segments instead of single characters. The main problem we solve in this context is to define segments of the sequences in such a way that a graph-based alignment is possible. We implemented the algorithm using the SeqAn library and report results on amino acid and DNA sequences. The benefit of our approach is threefold: (1) sequences with conserved blocks can be rapidly aligned, (2) the implementation is conceptually easy, generic and fast and (3) the consistency idea can be extended to align multiple genomic sequences.
Availability: The segment-based multiple sequence alignment tool can be downloaded from http://www.seqan.de/projects/msa.html. A novel version of T-Coffee interfaced with the tool is available from http://www.tcoffee.org. The usage of the tool is described in both documentations.
Contact: [email protected]
Bis(μ-dimesitylborinato-κ2 O:O)bis[(2-methylpyridine-κN)lithium]
The title compound, [Li2(C18H22BO)2(C6H7N)2], is a lithium dimesitylboroxide dimer in which the lithium cation is also coordinated by one molecule of 2-methylpyridine. At the core of the structure is an Li2O2 four-membered ring. The structure is centrosymmetric with an inversion centre midway between two Li atoms. Intermolecular C—H⋯π interactions and π–π interactions between the 2-methylpyridine rings exist [centroid–centroid distance = 3.6312 (16) Å]
Detecting genomic indel variants with exact breakpoints in single- and paired-end sequencing data using SplazerS
Motivation: The reliable detection of genomic variation in resequencing data is still a major challenge, especially for variants larger than a few base pairs. Sequencing reads crossing boundaries of structural variation carry the potential for their identification, but are difficult to map.
Results: Here we present a method for ‘split’ read mapping, where prefix and suffix match of a read may be interrupted by a longer gap in the read-to-reference alignment. We use this method to accurately detect medium-sized insertions and long deletions with precise breakpoints in genomic resequencing data. Compared with alternative split mapping methods, SplazerS significantly improves sensitivity for detecting large indel events, especially in variant-rich regions. Our method is robust in the presence of sequencing errors as well as alignment errors due to genomic mutations/divergence, and can be used on reads of variable lengths. Our analysis shows that SplazerS is a versatile tool applicable to unanchored or single-end as well as anchored paired-end reads. In addition, application of SplazerS to targeted resequencing data led to the interesting discovery of a complete, possibly functional gene retrocopy variant.
Availability: SplazerS is available from http://www.seqan.de/projects/splazers
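The core idea of split mapping, a read prefix and suffix that match the reference on either side of a gap, can be sketched with exact string search. This toy version handles only deletions in the read, tolerates no errors, and uses only the first occurrence of each prefix; it is not the SplazerS algorithm, and `split_map` and `min_anchor` are hypothetical names.

```python
def split_map(read, ref, min_anchor=4):
    # Try every split point that leaves at least `min_anchor` bases on each
    # side. Returns (reference start, split position in read, gap length)
    # for the first split whose prefix and suffix both match exactly, with
    # the suffix located at or after the end of the prefix match.
    for p in range(min_anchor, len(read) - min_anchor + 1):
        prefix, suffix = read[:p], read[p:]
        i = ref.find(prefix)
        if i < 0:
            continue
        j = ref.find(suffix, i + p)
        if j >= 0:
            return i, p, j - (i + p)
    return None


ref = "GATTACAGGC" + "CTTTTTT" + "CATGCAAT"
read = "TTACAGGC" + "CATGCAAT"  # spans a 7 bp deletion relative to ref
print(split_map(read, ref))
```

In this example the breakpoint is ambiguous by one base, because the deleted segment and the sequence following it both start with C; the sketch simply reports the leftmost consistent split.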
Exploring the Physical, Chemical and Thermal Characteristics of a New Potentially Insensitive High Explosive: RX-55-AE-5
Current work at the Energetic Materials Center, EMC, at Lawrence Livermore National Laboratory (LLNL) includes both understanding properties of old explosives and measuring properties of new ones [1]. The necessity to know and understand the properties of energetic materials is driven by the need to improve performance and enhance stability to various stimuli, such as thermal, friction and impact insult. This review will concentrate on the physical properties of RX-55-AE-5, which is formulated from heterocyclic explosive, 2,6-diamino-3,5-dinitropyrazine-1-oxide, LLM-105, and 2.5% Viton A. Differential scanning calorimetry (DSC) was used to measure a specific heat capacity, C_p, of ≈ 0.950 J/g·°C and a thermal conductivity, κ, of ≈ 0.475 W/m·°C. The LLNL kinetics modeling code Kinetics05 and the Advanced Kinetics and Technology Solutions (AKTS) code Thermokinetics were both used to calculate Arrhenius kinetics for decomposition of LLM-105. Both obtained an activation energy barrier E ≈ 180 kJ mol⁻¹ for mass loss in an open pan. Thermal mechanical analysis, TMA, was used to measure the coefficient of thermal expansion (CTE). The CTE for this formulation was calculated to be ≈ 61 μm/m·°C. Impact, spark, friction are also reported