15 research outputs found

    Analytical model of peptide mass cluster centres with applications

    BACKGROUND: The elemental composition of peptides results in the formation of distinct, equidistantly spaced clusters across the mass range. This peptide mass clustering is used to calibrate peptide mass lists, to identify and remove non-peptide peaks, and for data reduction. RESULTS: We developed an analytical model of the peptide mass cluster centres. Inputs to the model included the amino acid frequencies in the sequence database, the average length of the proteins in the database, the cleavage specificity of the proteolytic enzyme used, and the cleavage probability. We examined the accuracy of our model by comparing it with the model based on an in silico sequence database digest. To identify the crucial parameters, we analysed how the cluster centre location depends on the inputs. The distance to the nearest cluster was used to calibrate mass spectrometric peptide peak-lists and to identify non-peptide peaks. CONCLUSION: The model introduced here enables us to predict the location of the peptide mass cluster centres. It explains how the location of the cluster centres depends on the input parameters. Fast and efficient calibration and filtering of non-peptide peaks is achieved by a distance measure suggested by Wool and Smilansky.
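    The idea of filtering by distance to the nearest cluster centre can be illustrated with a minimal Python sketch. It assumes the cluster centres lie on a simple linear grid, offset + k * spacing; the function names, the default spacing of 1.000495 Da (an illustrative value often quoted for peptide monoisotopic masses, not taken from the abstract) and the fixed threshold are all assumptions. It is neither the analytical model of the paper nor the exact distance measure of Wool and Smilansky.

```python
def distance_to_nearest_cluster(mass, spacing=1.000495, offset=0.0):
    """Signed distance (in Da) from `mass` to the nearest cluster centre.

    Cluster centres are modelled as offset + k * spacing for integer k.
    Both parameters are illustrative assumptions, not fitted values.
    """
    k = round((mass - offset) / spacing)
    return mass - (offset + k * spacing)


def filter_peptide_like(masses, max_distance=0.2, spacing=1.000495, offset=0.0):
    """Keep only masses lying within `max_distance` Da of a cluster centre."""
    return [
        m for m in masses
        if abs(distance_to_nearest_cluster(m, spacing, offset)) <= max_distance
    ]


if __name__ == "__main__":
    peaks = [1000.495, 1000.995, 1500.74]
    for peak in peaks:
        print(peak, round(distance_to_nearest_cluster(peak), 4))
    print(filter_peptide_like(peaks))
```

    In practice the width of the clusters grows with mass, so a single fixed threshold is a simplification; a mass-dependent threshold would be a natural refinement.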

    A novel and well-defined benchmarking method for second generation read mapping

    Background: Second generation sequencing technologies yield DNA sequence data at ultra high-throughput. Common to most biological applications is a mapping of the reads to an almost identical or highly similar reference genome. The assessment of the quality of read mapping results is not straightforward and has not been formalized so far. Hence, it has not been easy to compare different read mapping approaches in a unified way and to determine which program is the best for what task. Results: We present a new benchmark method, called Rabema (Read Alignment BEnchMArk), for read mappers. It consists of a strict definition of the read mapping problem and of tools to evaluate the result of arbitrary read mappers supporting the SAM output format. Conclusions: We show the usefulness of the benchmark program by performing a comparison of popular read mappers. The tools supporting the benchmark are licensed under the GPL and available from http://www.seqan.de/projects/rabema.html

    From read mapping to the detection of genomic variations

    Next-generation sequencing (NGS) has brought on a revolution in sequence analysis with its broad spectrum of applications ranging from genome resequencing to transcriptomics or metagenomics, and from fundamental research to diagnostics. The tremendous amounts of data necessitate highly efficient computational analysis tools for the wide variety of NGS applications. This thesis addresses a broad range of key computational aspects of resequencing applications, where a reference genome sequence is known and heavily used for interpretation of the newly sequenced sample. It presents tools for read mapping and benchmarking, for partial read mapping of small RNA reads and for structural variant/indel detection, and finally tools for detecting and genotyping SNVs and short indels. Our tools efficiently scale to large NGS data sets and are well-suited for advances in sequencing technology, since their generic algorithm design allows handling of arbitrary read lengths and variable error rates. Furthermore, they are implemented within the robust C++ library SeqAn, making them open-source, easily available, and potentially adaptable for the bioinformatics community. Among other applications, our tools have been integrated into a large-scale analysis pipeline and have been applied to large datasets, leading to interesting discoveries of human retrocopy variants and insights into the genetic causes of X-linked intellectual disabilities.

    Segment-based multiple sequence alignment

    Motivation: Many multiple sequence alignment tools have been developed in the past, progressing either in speed or alignment accuracy. Given the importance and widespread use of alignment tools, progress in both categories is a contribution to the community and has driven research in the field so far. Results: We introduce a graph-based extension to the consistency-based, progressive alignment strategy. We apply the consistency notion to segments instead of single characters. The main problem we solve in this context is to define segments of the sequences in such a way that a graph-based alignment is possible. We implemented the algorithm using the SeqAn library and report results on amino acid and DNA sequences. The benefit of our approach is threefold: (1) sequences with conserved blocks can be rapidly aligned, (2) the implementation is conceptually easy, generic and fast, and (3) the consistency idea can be extended to align multiple genomic sequences.

    Robust consensus computation

    Detecting genomic indel variants with exact breakpoints in single- and paired-end sequencing data using SplazerS

    MOTIVATION: The reliable detection of genomic variation in resequencing data is still a major challenge, especially for variants larger than a few base pairs. Sequencing reads crossing boundaries of structural variation carry the potential for their identification, but are difficult to map. RESULTS: Here we present a method for 'split' read mapping, where prefix and suffix match of a read may be interrupted by a longer gap in the read-to-reference alignment. We use this method to accurately detect medium-sized insertions and long deletions with precise breakpoints in genomic resequencing data. Compared with alternative split mapping methods, SplazerS significantly improves sensitivity for detecting large indel events, especially in variant-rich regions. Our method is robust in the presence of sequencing errors as well as alignment errors due to genomic mutations/divergence, and can be used on reads of variable lengths. Our analysis shows that SplazerS is a versatile tool applicable to unanchored or single-end as well as anchored paired-end reads. In addition, application of SplazerS to targeted resequencing data led to the interesting discovery of a complete, possibly functional gene retrocopy variant. AVAILABILITY: SplazerS is available from http://www.seqan.de/projects/splazers. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
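    The notion of a split read, whose prefix and suffix match the reference on either side of a longer gap, can be illustrated with a toy Python sketch. The function name and the `min_anchor` parameter are hypothetical, and the sketch uses exact string matching only; it is not the SplazerS algorithm, which tolerates sequencing errors and also handles insertions and paired-end anchoring.

```python
def find_deletion_split(read, reference, min_anchor=4):
    """Return (prefix_pos, split, gap_len) for an exact split alignment, or None.

    read[:split] matches the reference at prefix_pos, and read[split:] matches
    again after skipping gap_len >= 1 reference bases (a deletion in the sample).
    Exact matching only; the first split found is reported.
    """
    for split in range(min_anchor, len(read) - min_anchor + 1):
        prefix, suffix = read[:split], read[split:]
        prefix_pos = reference.find(prefix)
        while prefix_pos != -1:
            prefix_end = prefix_pos + split
            # The suffix must start strictly after the prefix ends (gap >= 1).
            suffix_pos = reference.find(suffix, prefix_end + 1)
            if suffix_pos != -1:
                return prefix_pos, split, suffix_pos - prefix_end
            prefix_pos = reference.find(prefix, prefix_pos + 1)
    return None


if __name__ == "__main__":
    left, deleted, right = "GATTACAGGC", "T" * 8, "CCGATAGC"
    reference = left + deleted + right
    read = left + right  # read spanning the deletion breakpoint
    print(find_deletion_split(read, reference))
```

    Requiring a minimum anchor length on both sides is a simple guard against spurious short matches; real split mappers also have to deal with breakpoints that are ambiguous when the flanking sequence repeats.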