Statistical distributions of sequencing by synthesis with probabilistic nucleotide incorporation
Sequencing by synthesis is used in many next-generation DNA sequencing
technologies. Some of the technologies, especially those exploring the
principle of single-molecule sequencing, allow incomplete nucleotide
incorporation in each cycle. We derive statistical distributions for sequencing
by synthesis by taking into account the possibility that nucleotide
incorporation may not be complete in each flow cycle. The statistical
distributions are expressed in terms of nucleotide probabilities of the target
sequences and the nucleotide incorporation probabilities for each nucleotide.
We give exact distributions both for fixed number of flow cycles and for fixed
sequence length. Explicit formulas are derived for the mean and variance of
these distributions. The results are generalizations of our previous work for
pyrosequencing. Incomplete nucleotide incorporation leads to significant change
in the mean and variance of the distributions, but still they can be
approximated by normal distributions with the same mean and variance. The
results are also generalized to handle sequence context dependent
incorporation. The statistical distributions will be useful for instrument and
software development for sequencing by synthesis platforms. Comment: 25 pages, 2 figures
Length distribution of sequencing by synthesis: fixed flow cycle model
Sequencing by synthesis is the underlying technology for many next-generation
DNA sequencing platforms. We developed a new model, the fixed flow cycle model,
to derive the distributions of sequence length for a given number of flow
cycles under the general conditions where the nucleotide incorporation is
probabilistic and may be incomplete, as in some single-molecule sequencing
technologies. Unlike the previous model, the new model yields the probability
distribution for the sequence length. Explicit closed form formulas are derived
for the mean and variance of the distribution. Comment: 27 pages, 5 figures
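To make the setting concrete, here is a minimal Monte Carlo sketch of sequence length after a fixed number of flow cycles. It is not the authors' closed-form derivation. It assumes, purely for illustration, that one cycle is a single pass through the four nucleotides in a fixed order, that template bases are independent draws from `base_probs`, and that each matching base is incorporated independently with probability `incorp_probs[base]`, with extension in a flow stopping at the first failure or mismatch. The names `simulate_length` and `length_stats` are hypothetical.

```python
import random
from statistics import mean, pvariance


def simulate_length(n_cycles, base_probs, incorp_probs, flow_order="ACGT", rng=random):
    """Simulate one read; return the number of bases incorporated.

    Assumes at least two nucleotides have non-zero template probability,
    otherwise a homopolymer with incorporation probability 1 never ends.
    """
    bases = list(base_probs)
    weights = [base_probs[b] for b in bases]
    # Template bases are i.i.d., so the next base can be drawn on demand.
    next_base = rng.choices(bases, weights)[0]
    length = 0
    for _ in range(n_cycles):
        for flowed in flow_order:
            # Extend while the template matches the flowed nucleotide
            # and the (possibly incomplete) incorporation succeeds.
            while next_base == flowed and rng.random() < incorp_probs[flowed]:
                length += 1
                next_base = rng.choices(bases, weights)[0]
    return length


def length_stats(n_reads, n_cycles, base_probs, incorp_probs, seed=0):
    """Empirical mean and variance of sequence length over n_reads simulations."""
    rng = random.Random(seed)
    lengths = [simulate_length(n_cycles, base_probs, incorp_probs, rng=rng)
               for _ in range(n_reads)]
    return mean(lengths), pvariance(lengths)
```

Comparing `length_stats` for incorporation probabilities of 1.0 and, say, 0.5 shows how incomplete incorporation shifts the mean and variance under these assumptions.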
Detection of microRNAs in color space
Motivation: Deep sequencing provides inexpensive opportunities to characterize the transcriptional diversity of known genomes. The AB SOLiD technology generates millions of short sequencing reads in color-space; that is, the raw data is a sequence of colors, where each color represents 2 nt and each nucleotide is represented by two consecutive colors. This strategy is purported to have several advantages, including increased ability to distinguish sequencing errors from polymorphisms. Several programs have been developed to map short reads to genomes in color space. However, a number of previously unexplored technical issues arise when using SOLiD technology to characterize microRNAs.
Results: Here we explore these technical difficulties. First, since the sequenced reads are longer than the biological sequences, every read is expected to contain linker fragments. The color-calling error rate increases toward the 3′ end of the read such that recognizing the linker sequence for removal becomes problematic. Second, mapping in color space may lead to the loss of the first nucleotide of each read. We propose a sequential trimming and mapping approach to map small RNAs. Using our strategy, we reanalyze three published insect small RNA deep sequencing datasets and characterize 22 new microRNAs.
Availability and implementation: A bash shell script to perform the sequential trimming and mapping procedure, called SeqTrimMap, is available at: http://www.mirbase.org/tools/seqtrimmap/
Contact: [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
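The color-space idea in this abstract can be illustrated with a short sketch. It uses the standard SOLiD di-base scheme, in which each color is the XOR of the 2-bit codes of two adjacent nucleotides (A=0, C=1, G=2, T=3). This is not the SeqTrimMap script itself, and the function names `encode` and `decode` are hypothetical.

```python
# 2-bit nucleotide codes; the colour of a dinucleotide is the XOR of its two codes.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"


def encode(seq):
    """Return (first_base, colours) for a nucleotide string of length >= 1."""
    colours = [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]
    return seq[0], colours


def decode(first_base, colours):
    """Rebuild the nucleotide string from a known first base and its colours."""
    bases = [first_base]
    for colour in colours:
        # XOR is its own inverse, so previous base XOR colour gives the next base.
        bases.append(BASE[CODE[bases[-1]] ^ colour])
    return "".join(bases)


first, colours = encode("ATGGC")
print(first, colours)
print(decode(first, colours))   # round-trips to the original sequence
print(decode("C", colours))     # same colours, different anchor base
```

Because every base after the first is decoded relative to its predecessor, the same color string decodes to a different sequence under each anchor base. This shows why the first nucleotide needs separate handling when working in color space.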
Clinical exome performance for reporting secondary genetic findings.
BACKGROUND: Reporting clinically actionable incidental genetic findings in the course of clinical exome testing is recommended by the American College of Medical Genetics and Genomics (ACMG). However, the performance of clinical exome methods for reporting small subsets of genes has not been previously reported.
METHODS: In this study, 57 exome data sets performed as clinical (n = 12) or research (n = 45) tests were retrospectively analyzed. Exome sequencing data was examined for adequacy in the detection of potentially pathogenic variant locations in the 56 genes described in the ACMG incidental findings recommendation. All exons of the 56 genes were examined for adequacy of sequencing coverage. In addition, nucleotide positions annotated in HGMD (Human Gene Mutation Database) were examined.
RESULTS: The 56 ACMG genes have 18336 nucleotide variants annotated in HGMD. None of the 57 exome data sets possessed a HGMD variant. The clinical exome test had inadequate coverage for >50% of HGMD variant locations in 7 genes. Six exons from 6 different genes had consistent failure across all 3 test methods; these exons had high GC content (76%–84%).
CONCLUSIONS: The use of clinical exome sequencing for the interpretation and reporting of subsets of genes requires recognition of the substantial possibility of inadequate depth and breadth of sequencing coverage at clinically relevant locations. Inadequate depth of coverage may contribute to false-negative clinical exome results.
Missense-depleted regions in population exomes implicate ras superfamily nucleotide-binding protein alteration in patients with brain malformation.
Genomic sequence interpretation can miss clinically relevant missense variants for several reasons. Rare missense variants are numerous in the exome and difficult to prioritise. Affected genes may also not have existing disease association. To improve variant prioritisation, we leverage population exome data to identify intragenic missense-depleted regions (MDRs) genome-wide that may be important in disease. We then use missense depletion analyses to help prioritise undiagnosed disease exome variants. We demonstrate application of this strategy to identify a novel gene association for human brain malformation. We identified de novo missense variants that affect the GDP/GTP-binding site of ARF1 in three unrelated patients. Corresponding functional analysis suggests ARF1 GDP/GTP-activation is affected by the specific missense mutations associated with heterotopia. These findings expand the genetic pathway underpinning neurologic disease that classically includes FLNA. ARF1 along with ARFGEF2 add further evidence implicating ARF/GEFs in the brain. Using functional ontology, top MDR-containing genes were highly enriched for nucleotide-binding function, suggesting these may be candidates for human disease. Routine consideration of MDR in the interpretation of exome data for rare diseases may help identify strong genetic factors for many severe conditions, infertility/reduction in reproductive capability, and embryonic conditions contributing to preterm loss
QQ-SNV: single nucleotide variant detection at low frequency by comparing the quality quantiles
Background: Next generation sequencing enables studying heterogeneous populations of viral infections. When the sequencing is done at high coverage depth ("deep sequencing"), low frequency variants can be detected. Here we present QQ-SNV (http://sourceforge.net/projects/qqsnv), a logistic regression classifier model developed for the Illumina sequencing platforms that uses the quantiles of the quality scores, to distinguish true single nucleotide variants from sequencing errors based on the estimated SNV probability. To train the model, we created a dataset of an in silico mixture of five HIV-1 plasmids. Testing of our method in comparison to the existing methods LoFreq, ShoRAH, and V-Phaser 2 was performed on two HIV and four HCV plasmid mixture datasets and one influenza H1N1 clinical dataset.
Results: For default application of QQ-SNV, variants were called using a SNV probability cutoff of 0.5 (QQ-SNVD). To improve the sensitivity we used a SNV probability cutoff of 0.0001 (QQ-SNVHS). To also increase specificity, SNVs called were overruled when their frequency was below the 80th percentile calculated on the distribution of error frequencies (QQ-SNVHS-P80). When comparing QQ-SNV versus the other methods on the plasmid mixture test sets, QQ-SNVD performed similarly to the existing approaches. QQ-SNVHS was more sensitive on all test sets but with more false positives. QQ-SNVHS-P80 was found to be the most accurate method over all test sets by balancing sensitivity and specificity. When applied to a paired-end HCV sequencing study, with lowest spiked-in true frequency of 0.5 %, QQ-SNVHS-P80 revealed a sensitivity of 100 % (vs. 40-60 % for the existing methods) and a specificity of 100 % (vs. 98.0-99.7 % for the existing methods). In addition, QQ-SNV required the least overall computation time to process the test sets. Finally, when testing on a clinical sample, four putative true variants with frequency below 0.5 % were consistently detected by QQ-SNVHS-P80 from different generations of Illumina sequencers.
Conclusions: We developed and successfully evaluated a novel method, called QQ-SNV, for highly efficient single nucleotide variant calling on Illumina deep sequencing virology data
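The three calling modes described in the Results can be sketched as a small decision rule. This is an illustration of the thresholds quoted in the abstract, not the QQ-SNV software. The function names, the use of `>=` at each threshold, and the nearest-rank percentile are assumptions, and the SNV probability is taken as already produced by the classifier.

```python
import math

# Probability cutoffs quoted in the abstract for each calling mode.
CUTOFFS = {"D": 0.5, "HS": 0.0001, "HS-P80": 0.0001}


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile of a non-empty sequence (an assumed definition)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def call_snv(snv_probability, frequency, mode, error_frequencies=()):
    """Return True if the candidate is called as a variant under the given mode."""
    called = snv_probability >= CUTOFFS[mode]
    if called and mode == "HS-P80":
        # Overrule calls whose frequency falls below the 80th percentile
        # of the error-frequency distribution.
        called = frequency >= percentile_nearest_rank(error_frequencies, 80)
    return called
```

Under these assumptions, a candidate with a modest SNV probability is rejected in mode `"D"`, accepted in mode `"HS"`, and accepted in mode `"HS-P80"` only if its frequency clears the percentile filter.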
De Novo Assembly of Nucleotide Sequences in a Compressed Feature Space
Sequencing technologies allow for an in-depth analysis
of biological species but the size of the generated datasets
introduces a number of analytical challenges. Recently, we
demonstrated the application of numerical sequence representations
and data transformations for the alignment of short
reads to a reference genome. Here, we expand our approach
for de novo assembly of short reads. Our results demonstrate
that highly compressed data can encapsulate the signal sufficiently
to accurately assemble reads to big contigs or complete
genomes.
Statistical inference of the generation probability of T-cell receptors from sequence repertoires
Stochastic rearrangement of germline DNA by VDJ recombination is at the
origin of immune system diversity. This process is implemented via a series of
stochastic molecular events involving gene choices and random nucleotide
insertions between, and deletions from, genes. We use large sequence
repertoires of the variable CDR3 region of human CD4+ T-cell receptor beta
chains to infer the statistical properties of these basic biochemical events.
Since any given CDR3 sequence can be produced in multiple ways, the probability
distribution of hidden recombination events cannot be inferred directly from
the observed sequences; we therefore develop a maximum likelihood inference
method to achieve this end. To separate the properties of the molecular
rearrangement mechanism from the effects of selection, we focus on
non-productive CDR3 sequences in T-cell DNA. We infer the joint distribution of
the various generative events that occur when a new T-cell receptor gene is
created. We find a rich picture of correlation (and absence thereof), providing
insight into the molecular mechanisms involved. The generative event statistics
are consistent between individuals, suggesting a universal biochemical process.
Our distribution predicts the generation probability of any specific CDR3
sequence by the primitive recombination process, allowing us to quantify the
potential diversity of the T-cell repertoire and to understand why some
sequences are shared between individuals. We argue that the use of formal
statistical inference methods, of the kind presented in this paper, will be
essential for quantitative understanding of the generation and evolution of
diversity in the adaptive immune system. Comment: 20 pages, including Appendix
BamView: visualizing and interpretation of next-generation sequencing read alignments.
So-called next-generation sequencing (NGS) has provided the ability to sequence on a massive scale at low cost, enabling biologists to perform powerful experiments and gain insight into biological processes. BamView has been developed to visualize and analyse sequence reads from NGS platforms, which have been aligned to a reference sequence. It is a desktop application for browsing the aligned or mapped reads [Ruffalo, M, LaFramboise, T, Koyutürk, M. Comparative analysis of algorithms for next-generation sequencing read alignment. Bioinformatics 2011;27:2790-6] at different levels of magnification, from nucleotide level, where the base qualities can be seen, to genome or chromosome level where overall coverage is shown. To enable in-depth investigation of NGS data, various views are provided that can be configured to highlight interesting aspects of the data. Multiple read alignment files can be overlaid to compare results from different experiments, and filters can be applied to facilitate the interpretation of the aligned reads. As well as being a standalone application, it can be used as an integrated part of the Artemis genome browser. BamView allows the user to study NGS data in the context of the sequence and annotation of the reference genome. Single nucleotide polymorphism (SNP) density and candidate SNP sites can be highlighted and investigated, and read-pair information can be used to discover large structural insertions and deletions. The application will also calculate simple analyses of the read mapping, including reporting the read counts and reads per kilobase per million mapped reads (RPKM) for genes selected by the user.
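For reference, the RPKM measure named above follows directly from its definition: reads per kilobase of gene length per million mapped reads. The sketch below shows only that arithmetic and is not BamView's code. The function name `rpkm` and its parameters are hypothetical.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads.

    RPKM = read_count / (gene_length_bp / 1e3) / (total_mapped_reads / 1e6)
         = read_count * 1e9 / (gene_length_bp * total_mapped_reads)
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# 500 reads on a 2,000 bp gene in a library of 10 million mapped reads
print(rpkm(500, 2_000, 10_000_000))  # 25.0
```

Normalising by both gene length and library size is what makes values comparable between genes and between experiments overlaid in the same view.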
