Analytical model of peptide mass cluster centres with applications
BACKGROUND: The elemental composition of peptides results in the formation of distinct, equidistantly spaced clusters across the mass range. This peptide mass clustering is used to calibrate peptide mass lists, to identify and remove non-peptide peaks, and for data reduction. RESULTS: We developed an analytical model of the peptide mass cluster centres. Inputs to the model included the amino acid frequencies in the sequence database, the average length of the proteins in the database, the cleavage specificity of the proteolytic enzyme used, and the cleavage probability. We examined the accuracy of our model by comparing it with the model based on an in silico sequence database digest. To identify the crucial parameters, we analysed how the cluster centre location depends on the inputs. The distance to the nearest cluster was used to calibrate mass spectrometric peptide peak-lists and to identify non-peptide peaks. CONCLUSION: The model introduced here enables us to predict the location of the peptide mass cluster centres. It explains how the location of the cluster centres depends on the input parameters. Fast and efficient calibration and filtering of non-peptide peaks are achieved by a distance measure suggested by Wool and Smilansky.
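The distance-to-nearest-cluster idea can be illustrated with a minimal sketch. It assumes, for simplicity, that cluster centres lie at integer multiples of a fixed spacing `LAMBDA`; the value 1.000495 Da is an assumed spacing and does not come from the abstract, where the centres are instead derived from amino acid frequencies, protein length, and cleavage parameters. The function names and the tolerance are hypothetical, and this is not necessarily the distance measure of Wool and Smilansky.

```python
LAMBDA = 1.000495  # assumed cluster spacing in Da (illustrative, not from the abstract)


def cluster_distance(mass, spacing=LAMBDA):
    """Signed distance (Da) from `mass` to the nearest multiple of `spacing`."""
    nearest_index = round(mass / spacing)
    return mass - nearest_index * spacing


def filter_peptide_like(masses, tolerance=0.2, spacing=LAMBDA):
    """Keep masses lying within `tolerance` Da of an assumed cluster centre."""
    return [m for m in masses if abs(cluster_distance(m, spacing)) <= tolerance]
```

Under these assumptions, a peak far from every centre (for example 0.4 Da away) would be flagged as non-peptide, while the signed distances of the remaining peaks could feed a calibration fit.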
A novel and well-defined benchmarking method for second generation read mapping
Background: Second-generation sequencing technologies yield DNA sequence data
at ultra high-throughput. Common to most biological applications is a mapping
of the reads to an almost identical or highly similar reference genome. The
assessment of the quality of read mapping results is not straightforward and
has not been formalized so far. Hence, it has not been easy to compare
different read mapping approaches in a unified way and to determine which
program is the best for which task. Results: We present a new benchmark method,
called Rabema (Read Alignment BEnchMArk), for read mappers. It consists of a
strict definition of the read mapping problem and of tools to evaluate the
result of arbitrary read mappers supporting the SAM output format. Conclusions:
We show the usefulness of the benchmark program by performing a comparison of
popular read mappers. The tools supporting the benchmark are licensed under
the GPL and available from http://www.seqan.de/projects/rabema.html
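As a rough sketch of how a mapper's output could be scored against a strict problem definition, the example below assumes a gold standard that lists, for each read, the intervals of acceptable alignment end positions. The data layout and the function name `fraction_found` are hypothetical; this is not Rabema's actual interface or file format.

```python
def fraction_found(gold, reported):
    """Fraction of gold-standard intervals hit by at least one reported position.

    gold:     {read_id: [(contig, start, end), ...]}  inclusive end-position intervals
    reported: {read_id: [(contig, pos), ...]}         end positions from a mapper
    """
    total = 0
    found = 0
    for read_id, intervals in gold.items():
        hits = reported.get(read_id, [])
        for contig, start, end in intervals:
            total += 1
            # An interval counts as found if any reported position falls inside it.
            if any(c == contig and start <= p <= end for c, p in hits):
                found += 1
    return found / total if total else 0.0
```

In practice the reported positions would be parsed from the mapper's SAM output; that parsing step is omitted here.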
From read mapping to the detection of genomic variations
Next-generation sequencing (NGS) has brought about a revolution in sequence
analysis with its broad spectrum of applications ranging from genome
resequencing to transcriptomics or metagenomics, and from fundamental research
to diagnostics. The tremendous amounts of data necessitate highly efficient
computational analysis tools for the wide variety of NGS applications. This
thesis addresses a broad range of key computational aspects of resequencing
applications, where a reference genome sequence is known and heavily used for
interpretation of the newly sequenced sample. It presents tools for read
mapping and benchmarking, for partial read mapping of small RNA reads and for
structural variant/indel detection, and finally tools for detecting and
genotyping SNVs and short indels. Our tools efficiently scale to large NGS
data sets and are well-suited for advances in sequencing technology, since
their generic algorithm design allows handling of arbitrary read lengths and
variable error rates. Furthermore, they are implemented within the robust C++
library SeqAn, making them open-source, easily available, and potentially
adaptable for the bioinformatics community. Among other applications, our
tools have been integrated into a large-scale analysis pipeline and have been
applied to large datasets, leading to interesting discoveries of human
retrocopy variants and insights into the genetic causes of X-linked
intellectual disabilities.
The latest DNA sequencing technologies (NGS technologies for short) enable
revolutionary new applications, ranging from genome resequencing through
transcriptome sequencing to metagenomics, and from basic research to
diagnostics. The resulting flood of data poses a major challenge for
bioinformatics. Highly efficient analysis software is of enormous importance
for the broad spectrum of NGS applications. This thesis addresses several key
aspects of the analysis of resequencing data, in which an already sequenced
reference genome serves as the basis for interpreting a newly sequenced data
set. It presents algorithms and programs for the so-called read mapping
problem and for evaluating the quality of its solution, for partial read
mapping, which is used in miRNA studies and in the search for structural
variations, and finally for detecting and genotyping base mutations and short
insertions/deletions in the genome. The presented algorithms are efficient and
designed to remain applicable and scalable as sequencing technologies advance.
In addition, they are implemented in the robust C++ library SeqAn, which makes
them easily accessible and adaptable. Among other applications, our tools were
integrated into a high-throughput analysis pipeline and applied to large data
sets, yielding interesting biological insights (above all in connection with
X-linked intellectual disability).
Segment-based multiple sequence alignment
Motivation: Many multiple sequence alignment tools have been developed in the past, each improving either speed or alignment accuracy. Given the importance and widespread use of alignment tools, progress in both categories is a contribution to the community and has driven research in the field so far.
Results: We introduce a graph-based extension to the consistency-based, progressive alignment strategy. We apply the consistency notion to segments instead of single characters. The main problem we solve in this context is to define segments of the sequences in such a way that a graph-based alignment is possible. We implemented the algorithm using the SeqAn library and report results on amino acid and DNA sequences. The benefit of our approach is threefold: (1) sequences with conserved blocks can be rapidly aligned, (2) the implementation is conceptually easy, generic and fast and (3) the consistency idea can be extended to align multiple genomic sequences
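One way to picture the segment-definition problem is as a refinement of overlapping pairwise matches: each sequence is cut so that every segment is either matched as a whole or not at all. The sketch below is a simplified illustration under the assumption of ungapped matches; the input layout and the name `refine_cuts` are hypothetical, and it is not the SeqAn implementation described in the abstract.

```python
def refine_cuts(matches):
    """Compute cut positions so that overlapping matches share segment borders.

    matches: list of ((seq_a, start_a), (seq_b, start_b), length), ungapped.
    Returns {seq: sorted list of cut positions}.
    """
    cuts = {}
    work = []

    def add(seq, pos):
        # Record a new cut and schedule it for projection through the matches.
        if pos not in cuts.setdefault(seq, set()):
            cuts[seq].add(pos)
            work.append((seq, pos))

    # Every match begin and end is a cut on both sequences.
    for (a, sa), (b, sb), n in matches:
        for seq, start in ((a, sa), (b, sb)):
            add(seq, start)
            add(seq, start + n)

    # A cut strictly inside a match must be mirrored on the partner sequence.
    while work:
        seq, pos = work.pop()
        for (a, sa), (b, sb), n in matches:
            if a == seq and sa < pos < sa + n:
                add(b, sb + pos - sa)
            if b == seq and sb < pos < sb + n:
                add(a, sa + pos - sb)

    return {seq: sorted(positions) for seq, positions in cuts.items()}
```

After refinement, the segments between consecutive cuts could serve as vertices of an alignment graph, with the refined matches as edges.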
Detecting genomic indel variants with exact breakpoints in single- and paired-end sequencing data using SplazerS
MOTIVATION: The reliable detection of genomic variation in resequencing data is still a major challenge, especially for variants larger than a few base pairs. Sequencing reads crossing boundaries of structural variation carry the potential for their identification, but are difficult to map. RESULTS: Here we present a method for 'split' read mapping, where prefix and suffix match of a read may be interrupted by a longer gap in the read-to-reference alignment. We use this method to accurately detect medium-sized insertions and long deletions with precise breakpoints in genomic resequencing data. Compared with alternative split mapping methods, SplazerS significantly improves sensitivity for detecting large indel events, especially in variant-rich regions. Our method is robust in the presence of sequencing errors as well as alignment errors due to genomic mutations/divergence, and can be used on reads of variable lengths. Our analysis shows that SplazerS is a versatile tool applicable to unanchored or single-end as well as anchored paired-end reads. In addition, application of SplazerS to targeted resequencing data led to the interesting discovery of a complete, possibly functional gene retrocopy variant. AVAILABILITY: SplazerS is available from http://www.seqan.de/projects/splazers. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
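The split-mapping idea for a deletion can be sketched in a few lines: the read prefix matches at one reference position and the suffix matches further to the right, with the skipped reference bases forming the deletion. This toy version assumes exact matching and is not the SplazerS algorithm, which tolerates sequencing and alignment errors; the function name and the `min_anchor` parameter are hypothetical.

```python
def split_map_deletion(read, ref, min_anchor=3):
    """Place read[:k] and read[k:] on `ref` with a gap between them.

    Returns (prefix_start, k, suffix_start, deletion_length) for the first
    split found (smallest k, leftmost prefix hit), or None if none exists.
    """
    n = len(read)
    for k in range(min_anchor, n - min_anchor + 1):
        prefix, suffix = read[:k], read[k:]
        p = ref.find(prefix)
        while p != -1:
            # The suffix must start at least one base after the prefix ends.
            s = ref.find(suffix, p + k + 1)
            if s != -1:
                return p, k, s, s - (p + k)
            p = ref.find(prefix, p + 1)
    return None
```

With `ref = "GATTACAGGGGGGCTTCATG"` and `read = "GATTACACTTCATG"`, the prefix `GATTACA` anchors at position 0 and the suffix `CTTCATG` at position 13, so the call reports a deletion of 6 bases. In repetitive sequence the breakpoint can be ambiguous, and this sketch simply returns the first split it finds.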
Data S1. Human PGBD5 DNA transposase promotes site-specific oncogenic mutations in rhabdoid tumors
Supplementary data S1 for Henssen et al. "Human PGBD5 DNA transposase promotes site-specific oncogenic mutations in rhabdoid tumors"