Performance analysis of a parallel, multi-node pipeline for DNA sequencing
Post-sequencing DNA analysis typically consists of read mapping followed by variant calling and is very time-consuming, even on a multi-core machine. Recently, we proposed Halvade, a parallel, multi-node implementation of a DNA sequencing pipeline according to the GATK Best Practices recommendations. The MapReduce programming model is used to distribute the workload among different workers. In this paper, we study the impact of different hardware configurations on the performance of Halvade. Benchmarks indicate that the lack of good multithreading capabilities in the existing tools (BWA, SAMtools, Picard, GATK) in particular causes suboptimal scaling behavior. We demonstrate that this bottleneck can be circumvented by using multiprocessing on high-memory machines rather than multithreading. Using a 15-node cluster with 360 CPU cores in total, this results in a runtime of 1 h 31 min. Compared to a single-threaded runtime of ~12 days, this corresponds to an overall parallel efficiency of 53%.
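The reported 53% parallel efficiency can be checked with a quick back-of-the-envelope calculation (a sketch; the ~12-day single-threaded baseline and the 360-core total are taken from the abstract):

```python
# Sanity-check of Halvade's reported overall parallel efficiency.
single_threaded_h = 12 * 24   # ~12-day single-threaded baseline, in hours
parallel_h = 1 + 31 / 60      # 1 h 31 min on the 15-node cluster
cores = 360                   # total CPU cores across the 15 nodes

speedup = single_threaded_h / parallel_h   # time ratio vs. one thread
efficiency = speedup / cores               # speedup achieved per core

print(f"speedup ~{speedup:.0f}x, efficiency ~{efficiency:.0%}")
```

Running this prints a speedup of roughly 190x, i.e. ~53% of the 360x ideal, matching the figure quoted in the abstract.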
SInC: An accurate and fast error-model based simulator for SNPs, Indels and CNVs coupled with a read generator for short-read sequence data
We report SInC (SNV, Indel and CNV) simulator and read generator, an open-source tool capable of simulating biological variants while taking into account a platform-specific error model. SInC can simulate and generate single- and paired-end reads with a user-defined insert size with high efficiency compared to other existing tools. Owing to its multi-threaded read generation, SInC has a low time footprint. SInC is currently optimised to work in a limited infrastructure setup and can efficiently exploit the commonly used quad-core desktop architecture to simulate short sequence reads with deep coverage for large genomes. SInC can be downloaded from https://sourceforge.net/projects/sincsimulator/.
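To illustrate what generating "paired-end reads with a user-defined insert size" involves, here is a minimal, error-free sketch (this is not SInC's implementation — SInC additionally applies a platform-specific error model — and all names here are hypothetical):

```python
import random

def revcomp(seq: str) -> str:
    """Reverse complement of a DNA string."""
    comp = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(comp[b] for b in reversed(seq))

def paired_end_reads(genome: str, read_len: int = 100, insert_size: int = 300):
    """Draw one paired-end read from a genome string.

    The two mates are the outer read_len bases of a fragment of length
    insert_size; the second mate is reverse-complemented, as on Illumina
    platforms. Illustrative only: no sequencing errors are introduced.
    """
    start = random.randrange(len(genome) - insert_size + 1)
    fragment = genome[start:start + insert_size]
    r1 = fragment[:read_len]             # forward mate from the 5' end
    r2 = revcomp(fragment[-read_len:])   # reverse mate from the 3' end
    return r1, r2

genome = "ACGT" * 200                    # 800 bp toy genome
r1, r2 = paired_end_reads(genome)
print(len(r1), len(r2))                  # both mates have length read_len
```

A real simulator like SInC would then perturb these reads according to the platform's error profile before emitting FASTQ records.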
MaxSSmap: A GPU program for mapping divergent short reads to genomes with the maximum scoring subsequence
Programs based on hash tables and the Burrows-Wheeler transform are very fast for mapping short reads to genomes but have low accuracy in the presence of mismatches and gaps. Such reads can be aligned accurately with the Smith-Waterman algorithm, but mapping millions of reads can take hours or even days, even for bacterial genomes. We introduce a GPU program called MaxSSmap with the aim of achieving accuracy comparable to Smith-Waterman but with faster runtimes. Like most programs, MaxSSmap identifies a local region of the genome followed by exact alignment. Instead of using hash tables or the Burrows-Wheeler transform in the first part, MaxSSmap calculates the maximum scoring subsequence score between the read and disjoint fragments of the genome in parallel on a GPU and selects the highest-scoring fragment for exact alignment. We evaluate MaxSSmap's accuracy and runtime when mapping simulated Illumina E. coli and human chromosome one reads of different lengths and with 10% to 30% mismatches and gaps to the E. coli genome and human chromosome one. We also demonstrate applications on real data by mapping ancient horse DNA reads to modern genomes and unmapped paired reads from NA12878 in the 1000 Genomes Project. We show that MaxSSmap attains comparably high accuracy and low error to fast Smith-Waterman programs yet has much lower runtimes. We show that MaxSSmap can map reads rejected by BWA and NextGenMap with high accuracy and low error much faster than if Smith-Waterman were used. At short read lengths of 36 and 51, both MaxSSmap and Smith-Waterman have lower accuracy than at higher lengths. On real data, MaxSSmap produces many alignments with high score and mapping quality that are not given by NextGenMap and BWA. The MaxSSmap source code is freely available from http://www.cs.njit.edu/usman/MaxSSmap.
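The first stage described above — scoring a read against disjoint genome fragments with the maximum scoring subsequence — can be sketched on the CPU with a Kadane-style scan (the real tool runs this in parallel on a GPU; the match/mismatch scores and the example sequences below are illustrative, not MaxSSmap's actual parameters):

```python
def max_scoring_subsequence(read: str, fragment: str,
                            match: int = 1, mismatch: int = -2) -> int:
    """Maximum scoring subsequence of per-position match/mismatch scores.

    A Kadane-style linear scan over the ungapped read-vs-fragment
    comparison: the running score resets to zero whenever it goes
    negative, and the best prefix-free segment score is kept.
    """
    best = cur = 0
    for r, f in zip(read, fragment):
        cur = max(0, cur + (match if r == f else mismatch))
        best = max(best, cur)
    return best

# Score a read against disjoint genome fragments and keep the best one;
# the winner would then be handed to an exact (Smith-Waterman) aligner.
fragments = ["ACGTACGT", "ACGTTTTT", "GGGGACGT"]
read = "ACGTACGA"
scores = [max_scoring_subsequence(read, f) for f in fragments]
print(scores)  # [7, 4, 3] -> the first fragment wins
```

On a GPU, each fragment's scan is independent, which is what makes this first stage embarrassingly parallel.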
Nanopore Sequencing Technology and Tools for Genome Assembly: Computational Analysis of the Current State, Bottlenecks and Future Directions
Nanopore sequencing technology has the potential to render other sequencing technologies obsolete with its ability to generate long reads and provide portability. However, the high error rates of the technology pose a challenge for generating accurate genome assemblies. The tools used for nanopore sequence analysis are of critical importance, as they must overcome the high error rates of the technology. Our goal in this work is to comprehensively analyze currently publicly available tools for nanopore sequence analysis to understand their advantages, disadvantages, and performance bottlenecks. It is important to understand where current tools fall short in order to develop better ones. To this end, we 1) analyze the multiple steps and the associated tools in the genome assembly pipeline using nanopore sequence data, and 2) provide guidelines for determining the appropriate tools for each step. We analyze various combinations of different tools and expose the tradeoffs between accuracy, performance, memory usage and scalability. We conclude that our observations can guide researchers and practitioners in making conscious and effective choices for each step of the genome assembly pipeline using nanopore sequence data. Also, with the help of the bottlenecks we have found, developers can improve current tools or build new ones that are both accurate and fast, in order to overcome the high error rates of nanopore sequencing technology.
Comment: To appear in Briefings in Bioinformatics (BIB), 201
T-lex3: An accurate tool to genotype and estimate population frequencies of transposable elements using the latest short-read whole genome sequencing data
Motivation: Transposable elements (TEs) constitute a significant proportion of the majority of genomes sequenced to date. TEs are responsible for a considerable fraction of the genetic variation within and among species. Accurate genotyping of TEs in genomes is therefore crucial for a complete identification of the genetic differences among individuals, populations and species. Results: In this work, we present a new version of T-lex, a computational pipeline that accurately genotypes and estimates the population frequencies of reference TE insertions using short-read high-throughput sequencing data. In this new version, we have re-designed the T-lex algorithm to integrate the BWA-MEM short-read aligner, which is one of the most accurate short-read mappers and can be run on longer short reads (e.g. reads >150 bp). We have added new filtering steps to increase the accuracy of the genotyping, and new parameters that allow the user to control both the minimum and maximum number of reads, and the minimum number of strains, required to genotype a TE insertion. We also show for the first time that T-lex3 provides accurate TE calls in a plant genome. Availability and implementation: To test the accuracy of T-lex3, we called 1630 individual TE insertions in Drosophila melanogaster, 1600 individual TE insertions in humans, and 3067 individual TE insertions in the rice genome. We showed that this new version of T-lex is a broadly applicable and accurate tool for genotyping and estimating TE frequencies in organisms with different genome sizes and different TE contents. T-lex3 is available on GitHub: https://github.com/GonzalezLab/T-lex3
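The read-count filtering described above can be sketched as follows (a hypothetical illustration only: the function name, the threshold defaults and the per-strain counts are not from T-lex3 itself, whose thresholds are user-configurable parameters):

```python
def genotype_te(read_counts: dict, min_reads: int = 3,
                max_reads: int = 90, min_strains: int = 2) -> bool:
    """Decide whether a TE insertion is genotypable.

    read_counts maps strain name -> number of reads supporting the
    insertion. A strain passes if its count lies within the user-set
    [min_reads, max_reads] bounds (too few reads means weak evidence,
    too many suggests a repetitive or misassembled region), and the
    call is made only if enough strains pass.
    """
    passing = [c for c in read_counts.values()
               if min_reads <= c <= max_reads]
    return len(passing) >= min_strains

counts = {"strain_A": 12, "strain_B": 1, "strain_C": 40}
print(genotype_te(counts))  # True: two strains fall within the bounds
```

Exposing both a lower and an upper read-count bound, as T-lex3 does, lets the user tune the trade-off between sensitivity and false calls in repeat-rich regions.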
Lessons Learned: Recommendations for Establishing Critical Periodic Scientific Benchmarking
The dependence of life scientists on software has steadily grown in recent years. For many tasks, researchers have to decide which of the available bioinformatics software packages is most suitable for their specific needs. Additionally, researchers should be able to objectively select the software that provides the highest accuracy, the best efficiency and the highest level of reproducibility when integrated into their research projects.
Critical benchmarking of bioinformatics methods, tools and web services is therefore an essential community service, as well as a critical component of reproducibility efforts. Unbiased and objective evaluations are challenging to set up and can only be effective when built and implemented around community-driven efforts, as demonstrated by the many ongoing community challenges in bioinformatics that followed the success of CASP. Community challenges bring the combined benefits of intense collaboration, transparency and standard harmonization. Only open systems for the continuous evaluation of methods offer a perfect complement to community challenges, giving larger communities of users, which may extend far beyond the community of developers, a window onto the development status that they can use for their specific projects. By continuous evaluation systems we mean services that are always available and that periodically update their data and/or metrics according to a predefined schedule, keeping in mind that performance must always be interpreted in the context of each research domain.
We argue here that technology is now mature enough to bring community-driven benchmarking efforts to a higher level that should allow effective interoperability of benchmarks across related methods. New technological developments make it possible to overcome the limitations of the first experiences with online benchmarking, e.g. EVA. We therefore describe OpenEBench, a novel infrastructure designed to establish a continuous automated benchmarking system for bioinformatics methods, tools and web services.
OpenEBench is being developed to cater for the needs of the bioinformatics community, especially software developers, who need an objective and quantitative way to inform their decisions, as well as the larger community of end-users, in their search for unbiased and up-to-date evaluations of bioinformatics methods. As such, OpenEBench should soon become a central place for bioinformatics software developers, community-driven benchmarking initiatives, researchers using bioinformatics methods, and funders interested in the results of method evaluations.
Cancer risk prediction with next generation sequencing data using machine learning
The use of computational biology for next generation sequencing (NGS) analysis is rapidly increasing in genomics research. However, the effectiveness of NGS data for predicting disease is as yet unclear. This research investigates the problem in the whole-exome NGS data of chronic lymphocytic leukemia (CLL) available at dbGaP. Initially, raw reads from the samples are aligned to the human reference genome using the Burrows-Wheeler Aligner (BWA). From the samples, variants, namely single nucleotide polymorphisms (SNPs) and insertions/deletions (INDELs), are identified and filtered with SAMtools as well as with the Genome Analysis Toolkit (GATK). Subsequently, the variants are encoded and feature selection is performed with the Pearson correlation coefficient (PCC) and the chi-square 2-df statistical test. Finally, 90:10 cross-validation is performed by applying the support vector machine algorithm to sets of top selected features. It is found that the variants detected with SAMtools and GATK achieve similar prediction accuracies. It is also noted that the features ranked with the PCC yield better accuracy than those ranked with the chi-square test. In all of the analyses, the SNPs are identified to have superior accuracy as compared to the INDELs or the full dataset. Later, an exome capture kit is introduced for the analysis. The SNPs ranked with the PCC, along with the exome capture kit, yield a prediction accuracy of 85.1% and an area under the curve of 0.94. Overall, this study shows the effective application of machine learning methods and the strength of NGS data for CLL risk prediction.
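The PCC-based feature ranking step can be sketched in pure Python on synthetic data (illustrative only: the study's actual dataset is access-controlled at dbGaP, the encoded values below are made up, and the SVM and 90:10 evaluation steps are only indicated in a comment):

```python
import statistics as st

def pearson(xs, ys):
    """Pearson correlation coefficient between one feature and the labels."""
    mx, my = st.fmean(xs), st.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Toy encoded-variant matrix: rows are samples, columns are variants
# (e.g. genotype encoded 0/1/2), labels are case/control status.
X = [[0, 2, 1], [1, 1, 0], [2, 0, 1], [2, 1, 0]]
y = [0, 0, 1, 1]

cols = list(zip(*X))
ranked = sorted(range(len(cols)),
                key=lambda j: abs(pearson(cols[j], y)), reverse=True)
print(ranked)  # variant indices, most label-correlated first
# The top-ranked features would then be fed to an SVM classifier
# (e.g. sklearn.svm.SVC) evaluated with a 90:10 train/test split,
# as described in the study.
```

Ranking by the absolute value of the correlation is the key detail: a strongly negative correlation is just as predictive as a strongly positive one.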