
Indices and Applications in High-Throughput Sequencing

Author: Weese D.
Publication date: 05/06/2013

Recent advances in sequencing technology make it possible to produce billions of base pairs per day in the form of reads of length 100 bp and longer, and current developments promise the personal $1,000 genome in a couple of years. The analysis of these unprecedented amounts of data demands efficient data structures and algorithms. One such data structure is the substring index, which represents all substrings, or all substrings up to a certain length, contained in a given text. In this thesis we propose 3 substring indices, which we extend to be applicable to millions of sequences. We devise internal and external memory construction algorithms and a uniform framework for accessing the generalized suffix tree. Additionally, we propose different index-based applications, e.g. exact and approximate pattern matching and different repeat search algorithms. Second, we present the read mapping tool RazerS, which aligns millions of single or paired-end reads of arbitrary lengths to their potential genomic origin using either Hamming or edit distance. Our tool can work either lossless or with a user-defined loss rate at higher speeds. Given the loss rate, we present a novel approach that guarantees not to lose more reads than specified. This enables the user to adapt to the problem at hand and provides a seamless tradeoff between sensitivity and running time. We compare RazerS with other state-of-the-art read mappers and show that it has the highest sensitivity and a comparable performance on various real-world datasets. Finally, we propose a general approach for frequency-based string mining, which has many applications, e.g. in contrast data mining. Our contribution is a novel and lightweight algorithm that is faster and uses less memory than the best available algorithms. We show its applicability for mining multiple databases with a variety of frequency constraints. As such, we use the notion of entropy from information theory to generalize the emerging substring mining problem to multiple databases. To demonstrate the improvement of our algorithm, we compared it to recent approaches in real-world experiments on various string domains, e.g. natural language, DNA, or protein sequences.
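To make the idea of a substring index concrete, the following is a minimal sketch of exact pattern matching with a suffix array. It is an illustration only and not the thesis's implementation: the function names are hypothetical, the construction is the naive sort-based one, and the thesis itself works with more elaborate indices and construction algorithms.

```python
def build_suffix_array(text: str) -> list[int]:
    """Naive construction: sort all suffix start positions lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, suffix_array: list[int], pattern: str) -> list[int]:
    """Return the sorted start positions of pattern in text.

    All suffixes starting with pattern form one contiguous block of the
    suffix array, so two binary searches locate the block boundaries.
    """
    m = len(pattern)

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    block_start = lo

    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[block_start:lo])


if __name__ == "__main__":
    text = "GATTACATTA"
    sa = build_suffix_array(text)
    print(find_occurrences(text, sa, "TTA"))
```

Each query costs a logarithmic number of string comparisons once the index is built, which is the property that makes such indices attractive for repeated searches over the same text.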

Repository: Freie Universität Berlin (FU), Math Department (fu_mi_publications)

Recommendations for approaches to meticillin-resistant staphylococcal infections of small animals: diagnosis, therapeutic considerations and preventative measures

Author: Davis M F
Guardabassi L
Loeffler A
Morris D O
Weese J S
Publication venue: 'Wiley'
Publication date: 17/05/2017

Fast and accurate read mapping with approximate seeds and multiple backtracking

Author: Reinert K.
Siragusa E.
Weese D.
Publication venue: 'Oxford University Press (OUP)'
Publication date: 28/01/2013

We present Masai, a read mapper representing the state-of-the-art in terms of speed and accuracy. Our tool is an order of magnitude faster than RazerS 3 and mrFAST, 2-4 times faster and more accurate than Bowtie 2 and BWA. The novelties of our read mapper are filtration with approximate seeds and a method for multiple backtracking. Approximate seeds, compared with exact seeds, increase filtration specificity while preserving sensitivity. Multiple backtracking amortizes the cost of searching a large set of seeds by taking advantage of the repetitiveness of next-generation sequencing data. Combined together, these two methods significantly speed up approximate search on genomic data sets. Masai is implemented in C++ using the SeqAn library. The source code is distributed under the BSD license and binaries for Linux, Mac OS X and Windows can be freely downloaded from http://www.seqan.de/projects/masai
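The filtration idea behind approximate seeds can be sketched with the pigeonhole principle: if a read matches with at most k errors and is split into s non-overlapping seeds, at least one seed matches with at most floor(k / s) errors. The sketch below is an assumption-laden illustration, not Masai's code: it uses Hamming distance only, scans the reference naively instead of using an index, omits multiple backtracking entirely, and all function names are hypothetical.

```python
def split_into_seeds(read: str, num_seeds: int) -> list[tuple[int, str]]:
    """Partition read into num_seeds non-overlapping seeds of near-equal length.

    Returns (offset_in_read, seed) pairs.
    """
    base, extra = divmod(len(read), num_seeds)
    seeds, pos = [], 0
    for i in range(num_seeds):
        length = base + (1 if i < extra else 0)
        seeds.append((pos, read[pos:pos + length]))
        pos += length
    return seeds


def seed_error_budget(max_errors: int, num_seeds: int) -> int:
    """Pigeonhole bound: some seed carries at most floor(k / s) errors."""
    return max_errors // num_seeds


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def candidate_positions(reference: str, read: str,
                        max_errors: int, num_seeds: int) -> list[int]:
    """Read start positions implied by any seed hit within the seed budget.

    Candidates still need verification against the full error threshold.
    """
    budget = seed_error_budget(max_errors, num_seeds)
    hits = set()
    for offset, seed in split_into_seeds(read, num_seeds):
        for start in range(len(reference) - len(seed) + 1):
            if hamming(reference[start:start + len(seed)], seed) <= budget:
                candidate = start - offset
                if 0 <= candidate <= len(reference) - len(read):
                    hits.add(candidate)
    return sorted(hits)


if __name__ == "__main__":
    reference = "TTTTACGTACGTTTTT"
    read = "ACGAACGT"  # one mismatch relative to reference[4:12]
    print(candidate_positions(reference, read, max_errors=1, num_seeds=2))
```

Using fewer, longer seeds with a nonzero error budget is what distinguishes approximate seeds from exact seeds: longer seeds hit fewer random locations, which is consistent with the abstract's claim of higher specificity at unchanged sensitivity.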

Repository: Freie Universität Berlin (FU), Math Department (fu_mi_publications)

PubMed Central

MPG.PuRe

RazerS 3: Faster, fully sensitive read mapping

Author: Holtgrewe M.
Reinert K.
Weese D.
Publication venue: 'Oxford University Press (OUP)'
Publication date: 24/08/2012

Motivation: In recent years, next-generation sequencing (NGS) has become a key technology for many applications in the biomedical sciences. Throughput continues to increase and new protocols provide longer reads than currently available. In almost all applications, read mapping is a first step. Hence, it is crucial to have algorithms and implementations that perform fast, with high sensitivity, and are able to deal with long reads and a large absolute number of indels. Results: RazerS is a read mapping program with adjustable sensitivity based on counting q-grams. In this work we propose the successor RazerS 3, which now supports shared-memory parallelism, an additional seed-based filter with adjustable sensitivity, a much faster, banded version of Myers’ bit-vector algorithm for verification, memory-saving measures and support for the SAM output format. This leads to a much improved performance for mapping reads, in particular long reads with many errors. We extensively compare RazerS 3 with other popular read mappers and show that its results are often superior in terms of sensitivity while exhibiting practical and often competitive run times. In addition, RazerS 3 works without a precomputed index. Availability and Implementation: Source code and binaries are freely available for download at http://www.seqan.de/projects/razers. RazerS 3 is implemented in C++ and OpenMP under a GPL license using the SeqAn library and supports Linux, Mac OS X, and Windows.
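Filtering by counting q-grams rests on the classical q-gram lemma: a read of length m and any occurrence with at most k errors share at least m + 1 - (k + 1)q q-grams. The sketch below illustrates that threshold for the Hamming case only. It is not RazerS code; the function names are hypothetical and the position-wise comparison is a simplification of real q-gram counting over an index.

```python
def qgram_threshold(read_length: int, q: int, max_errors: int) -> int:
    """Minimum number of shared q-grams guaranteed by the q-gram lemma."""
    return read_length + 1 - (max_errors + 1) * q


def shared_qgrams_hamming(read: str, window: str, q: int) -> int:
    """Count q-grams that agree at the same offset in read and window.

    Assumes read and window have equal length (mismatches only, no indels).
    """
    return sum(read[i:i + q] == window[i:i + q]
               for i in range(len(read) - q + 1))


def passes_filter(read: str, window: str, q: int, max_errors: int) -> bool:
    """Keep the window as a candidate if it reaches the lemma's threshold."""
    threshold = qgram_threshold(len(read), q, max_errors)
    return shared_qgrams_hamming(read, window, q) >= threshold


if __name__ == "__main__":
    read = "ACGTACGTAC"
    window = "ACGTAGGTAC"  # one mismatch at position 5
    print(qgram_threshold(len(read), 3, 1),
          shared_qgrams_hamming(read, window, 3),
          passes_filter(read, window, 3, 1))
```

A single mismatch destroys at most q of the m - q + 1 q-grams, which is where the threshold comes from. Windows that pass the filter are only candidates and still require verification, for example with the bit-vector algorithm the abstract mentions.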

Crossref

Repository: Freie Universität Berlin (FU), Math Department (fu_mi_publications)

Entwurf und Implementierung eines generischen Substring-Index

Author: Weese D.
Publication date: 02/05/2006

Repository: Freie Universität Berlin (FU), Math Department (fu_mi_publications)

RazerS - Fast Read Mapping with Sensitivity Control

Author: Döring A.
Emde A.-K.
Rausch T.
Reinert K.
Weese D.
Publication venue: 'Cold Spring Harbor Laboratory'
Publication date: 10/07/2009

Second-generation sequencing technologies deliver DNA sequence data at unprecedented high throughput. Common to most biological applications is a mapping of the reads to an almost identical or highly similar reference genome. Due to the large amounts of data, efficient algorithms and implementations are crucial for this task. We present an efficient read mapping tool called RazerS. It allows the user to align sequencing reads of arbitrary length using either the Hamming distance or the edit distance. Our tool can work either lossless or with a user-defined loss rate at higher speeds. Given the loss rate, we present an approach that guarantees not to lose more reads than specified. This enables the user to adapt to the problem at hand and provides a seamless tradeoff between sensitivity and running time.
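The two distance measures named in the abstract differ in what counts as an error: Hamming distance allows only substitutions, whereas edit distance also allows insertions and deletions. The following is a plain textbook sketch of both, with hypothetical function names; it is not taken from RazerS, which uses far faster verification routines.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of substitutions between two strings of equal length."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # A one-base shift is expensive under Hamming but cheap under edit distance.
    print(hamming_distance("ACGTACGT", "CGTACGTA"))
    print(edit_distance("ACGTACGT", "CGTACGTA"))
```

The shifted example shows why edit distance matters for reads containing indels: a single deleted base misaligns every following position under Hamming distance.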

Repository: Freie Universität Berlin (FU), Math Department (fu_mi_publications)

PubMed Central

Segment-based multiple sequence alignment

Author: Emde A.-K.
Notredame C.
Rausch T.
Reinert K.
Weese D.
Publication date: 01/01/2008

Motivation: Many multiple sequence alignment tools have been developed in the past, progressing either in speed or alignment accuracy. Given the importance and widespread use of alignment tools, progress in both categories is a contribution to the community and has driven research in the field so far. Results: We introduce a graph-based extension to the consistency-based, progressive alignment strategy. We apply the consistency notion to segments instead of single characters. The main problem we solve in this context is to define segments of the sequences in such a way that a graph-based alignment is possible. We implemented the algorithm using the SeqAn library and report results on amino acid and DNA sequences. The benefit of our approach is threefold: (1) sequences with conserved blocks can be rapidly aligned, (2) the implementation is conceptually easy, generic and fast and (3) the consistency idea can be extended to align multiple genomic sequences. Availability: The segment-based multiple sequence alignment tool can be downloaded from http://www.seqan.de/projects/msa.html. A novel version of T-Coffee interfaced with the tool is available from http://www.tcoffee.org. The usage of the tool is described in both documentations.

Repository: Freie Universität Berlin (FU), Math Department (fu_mi_publications)

Bis(μ-dimesitylborinato-κ²O:O)bis[(2-methylpyridine-κN)lithium]

Author: Cole
Farrugia
Gibson
James D. Hoefelmeyer
Jung-Ho Son
K. T. Pillai Saravana
Sheldrick
Spek
Weese
Publication venue: International Union of Crystallography
Publication date: 01/02/2009

The title compound, [Li2(C18H22BO)2(C6H7N)2], is a lithium dimesitylboroxide dimer in which the lithium cation is also coordinated by one molecule of 2-methylpyridine. At the core of the structure is an Li2O2 four-membered ring. The structure is centrosymmetric with an inversion centre midway between two Li atoms. Intermolecular C—H⋯π interactions and π–π interactions between the 2-methylpyridine rings exist [centroid–centroid distance = 3.6312 (16) Å]

Crossref

Directory of Open Access Journals

PubMed Central

Detecting genomic indel variants with exact breakpoints in single- and paired-end sequencing data using SplazerS

Author: Emde A.-K.
Haas S. A.
Kalscheuer V. M.
Reinert K.
Schulz M. H.
Sun R.
Vingron M.
Weese D.
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/01/2012

Motivation: The reliable detection of genomic variation in resequencing data is still a major challenge, especially for variants larger than a few base pairs. Sequencing reads crossing boundaries of structural variation carry the potential for their identification, but are difficult to map. Results: Here we present a method for ‘split’ read mapping, where prefix and suffix match of a read may be interrupted by a longer gap in the read-to-reference alignment. We use this method to accurately detect medium-sized insertions and long deletions with precise breakpoints in genomic resequencing data. Compared with alternative split mapping methods, SplazerS significantly improves sensitivity for detecting large indel events, especially in variant-rich regions. Our method is robust in the presence of sequencing errors as well as alignment errors due to genomic mutations/divergence, and can be used on reads of variable lengths. Our analysis shows that SplazerS is a versatile tool applicable to unanchored or single-end as well as anchored paired-end reads. In addition, application of SplazerS to targeted resequencing data led to the interesting discovery of a complete, possibly functional gene retrocopy variant. Availability: SplazerS is available from http://www.seqan.de/projects/splazers
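The core of split-read mapping, a prefix match and a suffix match separated by a gap in the reference, can be illustrated with a deliberately simplified sketch. It assumes exact matching and handles deletions only, whereas the method described above tolerates errors and also detects insertions. The function name and the min_anchor parameter are hypothetical.

```python
from typing import Optional


def find_deletion_split(reference: str, read: str,
                        min_anchor: int = 4) -> Optional[tuple[int, int, int]]:
    """Find a split of read whose two parts flank a deletion in the read.

    Returns (prefix_start, split, deletion_length) for the first split found,
    or None. Each part must be at least min_anchor long and match exactly.
    """
    for split in range(min_anchor, len(read) - min_anchor + 1):
        prefix, suffix = read[:split], read[split:]
        prefix_start = reference.find(prefix)
        while prefix_start != -1:
            prefix_end = prefix_start + split
            # The suffix must start after a gap of at least one base.
            suffix_start = reference.find(suffix, prefix_end + 1)
            if suffix_start != -1:
                return prefix_start, split, suffix_start - prefix_end
            prefix_start = reference.find(prefix, prefix_start + 1)
    return None


if __name__ == "__main__":
    left, deleted, right = "GATTACAG", "TTTTTTTTTT", "CCGGAACT"
    reference = "AC" + left + deleted + right + "GT"
    read = left + right  # the read lacks the 10 deleted bases
    print(find_deletion_split(reference, read))
```

In repetitive sequence the breakpoint position can be ambiguous, and this sketch simply reports the first valid split it encounters.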

Repository: Freie Universität Berlin (FU), Math Department (fu_mi_publications)

MPG.PuRe


Exploring the Physical, Chemical and Thermal Characteristics of a New Potentially Insensitive High Explosive: RX-55-AE-5

Author: Burnham A K
Tran T D
Turner H C
Weese R K
Publication venue: Lawrence Livermore National Laboratory
Publication date: 05/06/2006

Current work at the Energetic Materials Center (EMC) at Lawrence Livermore National Laboratory (LLNL) includes both understanding properties of old explosives and measuring properties of new ones [1]. The necessity to know and understand the properties of energetic materials is driven by the need to improve performance and enhance stability to various stimuli, such as thermal, friction and impact insult. This review will concentrate on the physical properties of RX-55-AE-5, which is formulated from the heterocyclic explosive 2,6-diamino-3,5-dinitropyrazine-1-oxide (LLM-105) and 2.5% Viton A. Differential scanning calorimetry (DSC) was used to measure a specific heat capacity, C_p, of ≈ 0.950 J/g·°C and a thermal conductivity, κ, of ≈ 0.475 W/m·°C. The LLNL kinetics modeling code Kinetics05 and the Advanced Kinetics and Technology Solutions (AKTS) code Thermokinetics were both used to calculate Arrhenius kinetics for decomposition of LLM-105. Both obtained an activation energy barrier E ≈ 180 kJ mol⁻¹ for mass loss in an open pan. Thermal mechanical analysis (TMA) was used to measure the coefficient of thermal expansion (CTE). The CTE for this formulation was calculated to be ≈ 61 µm/m·°C. Impact, spark, and friction results are also reported.
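To give a feel for what an activation energy of about 180 kJ/mol implies, the sketch below evaluates the ratio of Arrhenius rate constants at two temperatures, k(T2) / k(T1) = exp[(E / R)(1/T1 - 1/T2)]. The pre-exponential factor cancels in the ratio, so none is assumed. The two temperatures are hypothetical and chosen purely for illustration; only E comes from the abstract.

```python
import math

GAS_CONSTANT = 8.314462618  # J/(mol*K)


def arrhenius_rate_ratio(activation_energy_j_per_mol: float,
                         t1_kelvin: float, t2_kelvin: float) -> float:
    """Return k(T2) / k(T1) for k = A * exp(-E / (R * T)).

    The prefactor A cancels, so only E and the temperatures are needed.
    """
    exponent = (activation_energy_j_per_mol / GAS_CONSTANT
                * (1.0 / t1_kelvin - 1.0 / t2_kelvin))
    return math.exp(exponent)


if __name__ == "__main__":
    activation_energy = 180e3  # J/mol, the value reported in the abstract
    # Hypothetical temperatures, not taken from the source.
    print(arrhenius_rate_ratio(activation_energy, 500.0, 510.0))
```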

UNT Digital Library