SInC: An accurate and fast error-model based simulator for SNPs, Indels and CNVs coupled with a read generator for short-read sequence data
We report SInC (SNV, Indel and CNV) simulator and read generator, an
open-source tool capable of simulating biological variants taking into account
a platform-specific error model. SInC is capable of simulating and generating
single- and paired-end reads with user-defined insert size with high efficiency
compared to the other existing tools. SInC, due to its multi-threaded
capability during read generation, has a low time footprint. SInC is currently
optimised to work in limited infrastructure setup and can efficiently exploit
the commonly used quad-core desktop architecture to simulate short sequence
reads with deep coverage for large genomes. SInC can be downloaded from
https://sourceforge.net/projects/sincsimulator/
SOAP3-dp: Fast, Accurate and Sensitive GPU-based Short Read Aligner
To tackle the exponentially increasing throughput of Next-Generation
Sequencing (NGS), most of the existing short-read aligners can be configured to
favor speed at the expense of accuracy and sensitivity. SOAP3-dp, by leveraging
the computational power of both CPU and GPU with optimized algorithms, delivers
high speed and sensitivity simultaneously. Compared with widely adopted
aligners including BWA, Bowtie2, SeqAlto, GEM and GPU-based aligners including
BarraCUDA and CUSHAW, SOAP3-dp is two to tens of times faster, while
maintaining the highest sensitivity and lowest false discovery rate (FDR) on
Illumina reads with different lengths. Transcending its predecessor SOAP3,
which does not allow gapped alignment, SOAP3-dp by default tolerates alignment
similarity as low as 60 percent. Real data evaluation using human genome
demonstrates SOAP3-dp's power to enable more authentic variants and longer
Indels to be discovered. Fosmid sequencing shows a 9.1 percent FDR on newly
discovered deletions. SOAP3-dp natively supports the BAM file format and uses
the same scoring scheme as BWA, which enables it to be integrated into existing
analysis pipelines. SOAP3-dp has been deployed on Amazon-EC2, NIH-Biowulf and
Tianhe-1A.
Comment: 21 pages, 6 figures, submitted to PLoS ONE, additional files
available at "https://www.dropbox.com/sh/bhclhxpoiubh371/O5CO_CkXQE".
Comments most welcome.
Towards Better Understanding of Artifacts in Variant Calling from High-Coverage Samples
Motivation: Whole-genome high-coverage sequencing has been widely used for
personal and cancer genomics as well as in various research areas. However, in
the absence of an unbiased whole-genome truth set, the global error rate of
variant calls and the leading causal artifacts still remain unclear even given
the great efforts in the evaluation of variant calling methods.
Results: We made ten SNP and INDEL call sets with two read mappers and five
variant callers, both on a haploid human genome and a diploid genome at a
similar coverage. By investigating false heterozygous calls in the haploid
genome, we identified the erroneous realignment in low-complexity regions and
the incomplete reference genome with respect to the sample as the two major
sources of errors, which press for continued improvements in these two areas.
We estimated that the error rate of raw genotype calls is as high as 1 in
10-15kb, but the error rate of post-filtered calls is reduced to 1 in 100-200kb
without significant compromise on the sensitivity.
Availability: BWA-MEM alignment: http://bit.ly/1g8XqRt; Scripts:
https://github.com/lh3/varcmp; Additional data:
https://figshare.com/articles/Towards_better_understanding_of_artifacts_in_variating_calling_from_high_coverage_samples/981073
Comment: Published version
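To put the quoted error rates in perspective, the short sketch below converts "1 error per N kb" into an expected number of erroneous calls genome-wide. The genome size of 3,000,000,000 bp is a round figure assumed here for illustration and is not taken from the abstract; the function name is likewise hypothetical.

```python
def expected_errors(genome_bp, kb_per_error):
    """Expected erroneous calls at a rate of one error per `kb_per_error` kilobases."""
    return genome_bp / (kb_per_error * 1000)

# Assumption: a round 3 Gb figure for a human genome (not stated in the abstract).
GENOME_BP = 3_000_000_000

# Raw genotype calls: 1 error in 10-15 kb.
raw_low = expected_errors(GENOME_BP, 15)
raw_high = expected_errors(GENOME_BP, 10)

# Post-filtered calls: 1 error in 100-200 kb.
filtered_low = expected_errors(GENOME_BP, 200)
filtered_high = expected_errors(GENOME_BP, 100)

print(f"raw calls:      {raw_low:,.0f} to {raw_high:,.0f} errors")
print(f"filtered calls: {filtered_low:,.0f} to {filtered_high:,.0f} errors")
```

Under that assumed genome size, filtering lowers the expected error count by roughly an order of magnitude, which matches the ratio between the two rates quoted in the abstract.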
TreeToReads - a pipeline for simulating raw reads from phylogenies.
Background: Using phylogenomic analysis tools for tracking pathogens has become standard practice in academia, public health agencies, and large industries. Using the same raw read genomic data as input, several different approaches are used to infer phylogenetic trees. These include many different SNP pipelines, wgMLST approaches, k-mer algorithms, whole genome alignment and others; each has advantages and disadvantages: some have been extensively validated, some are faster, some have higher resolution. A few of these analysis approaches are well integrated into the regulatory process of US Federal agencies (e.g. the FDA's SNP pipeline for tracking foodborne pathogens). However, despite extensive validation on benchmark datasets and comparison with other pipelines, we lack methods for fully exploring how the multiple parameter values in each pipeline can affect whether the correct phylogenetic tree is recovered.
Results: To resolve this problem, we offer a program, TreeToReads, which can generate raw read data from mutated genomes simulated under a known phylogeny. This simulation pipeline allows direct comparisons of simulated and observed data in a controlled environment. At each step of these simulations, researchers can vary parameters of interest (e.g., input tree topology, amount of sequence divergence, rate of indels, read coverage, distance of reference genome, etc.) to assess the effects of various parameter values on correctly calling SNPs and reconstructing an accurate tree.
Conclusions: Such critical assessments of the accuracy and robustness of analytical pipelines are essential to progress in both research and applied settings.
novoBreak: local assembly for breakpoint detection in cancer genomes.
We present novoBreak, a genome-wide local assembly algorithm that discovers somatic and germline structural variation breakpoints in whole-genome sequencing data. novoBreak consistently outperformed existing algorithms on real cancer genome data and on synthetic tumors in the ICGC-TCGA DREAM 8.5 Somatic Mutation Calling Challenge primarily because it more effectively utilized reads spanning breakpoints. novoBreak also demonstrated great sensitivity in identifying short insertions and deletions
Kevlar: A Mapping-Free Framework for Accurate Discovery of De Novo Variants.
De novo genetic variants are an important source of causative variation in complex genetic disorders. Many methods for variant discovery rely on mapping reads to a reference genome, detecting numerous inherited variants irrelevant to the phenotype of interest. To distinguish between inherited and de novo variation, sequencing of families (parents and siblings) is commonly pursued. However, standard mapping-based approaches tend to have a high false-discovery rate for de novo variant prediction. Kevlar is a mapping-free method for de novo variant discovery, based on direct comparison of sequences between related individuals. Kevlar identifies high-abundance k-mers unique to the individual of interest. Reads containing these k-mers are partitioned into disjoint sets by shared k-mer content for variant calling, and preliminary variant predictions are sorted using a probabilistic score. We evaluated Kevlar on simulated and real datasets, demonstrating its ability to detect both de novo single-nucleotide variants and indels with high accuracy
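The k-mer idea described in this abstract can be illustrated with a toy sketch: count k-mers in the reads of the individual of interest, keep those that are abundant there and absent from both parents, and select the reads that contain them. This is not Kevlar's implementation. The function names, the `min_abundance` parameter, and the toy reads are all hypothetical, reverse complements and sequencing errors are ignored, and the partitioning and probabilistic scoring steps are omitted.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer occurring in a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def novel_kmers(child_reads, parent_read_sets, k, min_abundance):
    """k-mers seen at least `min_abundance` times in the child and never in any parent."""
    child = kmer_counts(child_reads, k)
    parents = [kmer_counts(reads, k) for reads in parent_read_sets]
    return {
        kmer
        for kmer, n in child.items()
        if n >= min_abundance and all(kmer not in p for p in parents)
    }

def reads_with_novel_kmers(reads, novel, k):
    """Reads containing at least one novel k-mer (candidates for variant calling)."""
    return [
        read for read in reads
        if any(read[i:i + k] in novel for i in range(len(read) - k + 1))
    ]

# Toy data (invented): the child carries an A>T change at position 5 of the read.
parents = [["ACGTACGTAC"], ["ACGTACGTAC"]]
child = ["ACGTTCGTAC", "ACGTTCGTAC", "ACGTACGTAC"]

novel = novel_kmers(child, parents, k=5, min_abundance=2)
selected = reads_with_novel_kmers(child, novel, k=5)
print(sorted(novel))
print(selected)
```

The abundance threshold is what separates a real variant, supported by several reads, from k-mers created by isolated sequencing errors; a suitable value would depend on coverage.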
CLEVER: Clique-Enumerating Variant Finder
Next-generation sequencing techniques have facilitated a large scale analysis
of human genetic variation. Despite the advances in sequencing speeds, the
computational discovery of structural variants is not yet standard. It is
likely that many variants have remained undiscovered in most sequenced
individuals. Here we present a novel internal segment size based approach,
which organizes all reads, including concordant ones, into a read alignment
graph where max-cliques represent maximal contradiction-free groups of
alignments. A specifically engineered algorithm then enumerates all max-cliques
and statistically evaluates them for their potential to reflect insertions or
deletions (indels). For the first time in the literature, we compare a large
range of state-of-the-art approaches using simulated Illumina reads from a
fully annotated genome and present various relevant performance statistics. We
achieve superior performance rates in particular on indels of sizes 20--100,
which have been exposed as a current major challenge in the SV discovery
literature and where prior insert size based approaches have limitations. In
that size range, we outperform even split read aligners. We achieve good
results also on real data, where we make a substantial number of correct
predictions as the only tool; these complement the predictions of split-read
aligners. CLEVER is open source (GPL) and available from
http://clever-sv.googlecode.com.
Comment: 30 pages, 8 figures
GATK hard filtering: tunable parameters to improve variant calling for next generation sequencing targeted gene panel data
BACKGROUND: NGS technology is a powerful alternative to standard Sanger sequencing in clinical settings. The proprietary software packages generally used for variant calling often depend on preset parameters that may not suit every gene. GATK, which is widely used in academia, offers many parameters for variant calling. However, GATK's self-adjusting parameter calibration requires data from a large number of exomes. When these are not available, as is typical in a diagnostic laboratory, the parameters must be set by the operator (hard filtering). The aim of the present paper was to set up a procedure for determining the best parameters for GATK hard filtering. This was pursued by using classification trees on true and false variants from sequences simulated from a real dataset.
RESULTS: We simulated two datasets, with different coverages, including all the sequence alterations identified in a real dataset according to their observed frequencies. Simulated sequences were aligned with standard protocols, and regression trees were then built to identify the most reliable parameters and cutoff values for discriminating true from false variant calls. Moreover, we analyzed the flanking sequences of regions with a high rate of false positive calls and observed that such sequences have a low-complexity makeup.
CONCLUSIONS: Our results showed that GATK hard filtering parameter values can be tailored through a simulation study based on the DNA region of interest to improve the accuracy of variant calling.
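As a rough illustration of how a tree-based analysis can yield a hard-filtering cutoff, the sketch below fits the simplest possible tree, a single split on one per-call annotation, to calls labelled true or false by simulation. It is not the authors' procedure and does not call GATK. The function name, the rule "keep the call if the annotation is at or above the cutoff", and the annotation values are all invented for illustration.

```python
def best_cutoff(values, is_true):
    """Find the cutoff maximising accuracy of the rule 'keep call if value >= cutoff'.

    values  -- one annotation value per variant call
    is_true -- matching booleans, True when the simulated variant is real
    Returns (cutoff, accuracy); ties are resolved in favour of the lowest cutoff.
    """
    best = (None, -1.0)
    for cutoff in sorted(set(values)):
        # A call is classified correctly when "kept" agrees with "is a true variant".
        correct = sum((v >= cutoff) == t for v, t in zip(values, is_true))
        accuracy = correct / len(values)
        if accuracy > best[1]:
            best = (cutoff, accuracy)
    return best

# Toy annotation values and truth labels (invented, not from the paper).
annotation = [1.0, 1.5, 2.5, 3.0, 6.0, 8.0, 12.0, 15.0]
truth = [False, False, True, False, True, True, True, True]

cutoff, accuracy = best_cutoff(annotation, truth)
print(f"keep calls with annotation >= {cutoff} (accuracy {accuracy:.3f})")
```

A full classification or regression tree repeats this kind of split search recursively and across several annotations, which is how it can rank parameters as well as propose cutoff values.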
A cancer cell-line titration series for evaluating somatic classification.
Background: Accurate detection of somatic single nucleotide variants and small insertions and deletions from DNA sequencing experiments of tumour-normal pairs is a challenging task. Tumour samples are often contaminated with normal cells confounding the available evidence for the somatic variants. Furthermore, tumours are heterogeneous so sub-clonal variants are observed at reduced allele frequencies. We present here a cell-line titration series dataset that can be used to evaluate somatic variant calling pipelines with the goal of reliably calling true somatic mutations at low allele frequencies.
Results: Cell-line DNA was mixed with matched normal DNA at 8 different ratios to generate samples with known tumour cellularities, and exome sequenced on Illumina HiSeq to depths of >300×. The data was processed with several different variant calling pipelines and verification experiments were performed to assay >1500 somatic variant candidates using Ion Torrent PGM as an orthogonal technology. By examining the variants called at varying cellularities and depths of coverage, we show that the best performing pipelines are able to maintain a high level of precision at any cellularity. In addition, we estimate the number of true somatic variants undetected as cellularity and coverage decrease.
Conclusions: Our cell-line titration series dataset, along with the associated verification results, was effective for this evaluation and will serve as a valuable dataset for future somatic calling algorithm development. The data is available for further analysis at the European Genome-phenome Archive under accession number EGAS00001001016. Data access requires registration through the International Cancer Genome Consortium's Data Access Compliance Office (ICGC DACO).