Coverage statistics for sequence census methods
Background: We study the statistical properties of fragment coverage in
genome sequencing experiments. In an extension of the classic Lander-Waterman
model, we consider the effect of the length distribution of fragments. We also
introduce the notion of the shape of a coverage function, which can be used to
detect aberrations in coverage. The probability theory underlying these
problems is essential for constructing models of current high-throughput
sequencing experiments, where both sample preparation protocols and sequencing
technology particulars can affect fragment length distributions.
Results: We show that regardless of fragment length distribution and under
the mild assumption that fragment start sites are Poisson distributed, the
fragments produced in a sequencing experiment can be viewed as resulting from a
two-dimensional spatial Poisson process. We then study the jump skeleton of
the coverage function, and show that the induced trees are Galton-Watson trees
whose parameters can be computed.
Conclusions: Our results extend standard analyses of shotgun sequencing that
focus on coverage statistics at individual sites, and provide a null model for
detecting deviations from random coverage in high-throughput sequence census
based experiments. By focusing on fragments, we are also led to a new approach
for visualizing sequencing data that should be of independent interest.
Comment: 10 pages, 4 figures
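The basic setting can be illustrated with a small simulation (a sketch, not the authors' code: the genome length, rate, and fragment-length distribution below are arbitrary choices). Fragment start sites are generated as a homogeneous Poisson process via exponential gaps, and the resulting per-base depth has expectation rate × mean fragment length, as in the classic Lander-Waterman analysis:

```python
import random

def simulate_coverage(genome_len, rate, length_dist, seed=0):
    """Simulate per-base fragment coverage on a linear genome.

    Fragment start sites form a homogeneous Poisson process with the given
    rate (fragments per base), so gaps between successive starts are
    exponential.  length_dist(rng) returns a random fragment length.
    """
    rng = random.Random(seed)
    coverage = [0] * genome_len
    pos = rng.expovariate(rate)
    while pos < genome_len:
        start = int(pos)
        length = length_dist(rng)
        for i in range(start, min(start + length, genome_len)):
            coverage[i] += 1
        pos += rng.expovariate(rate)
    return coverage

# Lander-Waterman: expected depth is rate * mean fragment length,
# here 0.02 * ~300, i.e. about 6x, regardless of the length distribution.
cov = simulate_coverage(100_000, rate=0.02,
                        length_dist=lambda r: r.randint(200, 400))
mean_depth = sum(cov) / len(cov)
```

Changing `length_dist` alters the shape of the coverage function but not its mean, which is the point of departure for the paper's distribution-dependent analysis.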
Pseudoalignment for metagenomic read assignment
Motivation: Read assignment is an important first step in many metagenomic analysis workflows, providing the basis for identification and quantification of species. However, ambiguity among the sequences of many strains makes it difficult to assign reads at the lowest level of taxonomy, and reads are typically assigned to taxonomic levels where they are unambiguous. We explore connections between metagenomic read assignment and the quantification of transcripts from RNA-Seq data in order to develop novel methods for rapid and accurate quantification of metagenomic strains.
Results: We find that the recent idea of pseudoalignment introduced in the RNA-Seq context is highly applicable in the metagenomics setting. When coupled with the Expectation-Maximization (EM) algorithm, reads can be assigned far more accurately and quickly than is currently possible with state-of-the-art software, making it possible and practical for the first time to analyze abundances of individual genomes in metagenomics projects.
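The coupling of compatibility classes with EM can be sketched in a few lines (a minimal illustration under simplifying assumptions: equal target lengths and reads already collapsed into compatibility classes; the function name and toy counts are ours, not the published implementation):

```python
def em_abundances(classes, n_targets, n_iter=200):
    """EM for target (strain) abundances from compatibility-class counts.

    classes: list of (read_count, set_of_compatible_target_ids).
    Returns estimated abundance proportions, one per target.
    """
    total = sum(count for count, _ in classes)
    alpha = [1.0 / n_targets] * n_targets
    for _ in range(n_iter):
        new = [0.0] * n_targets
        for count, targets in classes:
            denom = sum(alpha[t] for t in targets)
            for t in targets:
                new[t] += count * alpha[t] / denom  # E-step: split ambiguous reads
        alpha = [x / total for x in new]            # M-step: renormalize
    return alpha

# Toy example: 3 strains; the 40 reads in class {0, 1} are ambiguous.
classes = [(60, {0}), (40, {0, 1}), (50, {2})]
abund = em_abundances(classes, 3)
```

Because strain 1 has no uniquely assigned reads, EM drives its abundance toward zero and credits the ambiguous reads to strain 0, illustrating how shared sequence is resolved rather than discarded.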
A Study of Water Clusters Using the Effective Fragment Potential and Monte Carlo Simulated Annealing
Simulated annealing methods have been used with the effective fragment potential to locate the lowest energy structures for the water clusters (H2O)n with n=6, 8, 10, 12, 14, 16, 18, and 20. The most successful method uses a local minimization on each Monte Carlo step. The effective fragment potential method yielded interaction energies in excellent agreement with those calculated at the ab initio Hartree–Fock level and was quite successful at predicting the same energy ordering as the higher-level perturbation theory and coupled cluster methods. Analysis of the molecular interaction energies in terms of their electrostatic, polarization, and exchange-repulsion/charge-transfer components reveals that the electrostatic contribution is the dominant term in determining the energy ordering of the minima on the (H2O)n potential energy surfaces, but that differences in the polarization and repulsion components can be important in some cases.
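The "local minimization on each Monte Carlo step" strategy can be sketched on a toy one-dimensional energy function (an illustration of the generic scheme only; the fragment-potential energetics, step sizes, and cooling schedule below are invented for the example):

```python
import math
import random

def local_minimize(f, x, step=0.01, iters=400):
    """Crude derivative-free descent: step downhill, halving the step
    whenever neither direction improves."""
    for _ in range(iters):
        for dx in (step, -step):
            if f(x + dx) < f(x):
                x += dx
                break
        else:
            step /= 2
    return x

def anneal(f, x0, t0=2.0, cooling=0.97, steps=300, seed=1):
    """Metropolis Monte Carlo simulated annealing in which every trial
    configuration is locally minimized before the accept/reject test."""
    rng = random.Random(seed)
    x = local_minimize(f, x0)
    best = x
    t = t0
    for _ in range(steps):
        trial = local_minimize(f, x + rng.uniform(-3.0, 3.0))
        # f(trial) >= f(x) here implies a non-positive exponent, so no overflow.
        if f(trial) < f(x) or rng.random() < math.exp((f(x) - f(trial)) / t):
            x = trial
            if f(x) < f(best):
                best = x
        t *= cooling
    return best

# Toy double-well "energy"; both minima lie near |x| = 2.
energy = lambda x: (x * x - 4) ** 2 + 0.5 * math.sin(5 * x)
x_best = anneal(energy, x0=-3.0)
```

Minimizing each trial before the Metropolis test means the walk moves between basins of attraction rather than raw configurations, which is why this variant finds low-energy cluster structures more reliably than plain annealing.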
NNT pseudoexon activation as a novel mechanism for disease in two siblings with familial glucocorticoid deficiency
CONTEXT:
Intronic DNA frequently encodes potential exonic sequences called pseudoexons. In recent years, mutations resulting in aberrant pseudoexon inclusion have been increasingly recognized to cause disease.
OBJECTIVES:
To find the genetic cause of familial glucocorticoid deficiency (FGD) in two siblings.
PATIENTS:
The proband and his affected sibling, from nonconsanguineous parents of East Asian and South African origin, were diagnosed with FGD at the ages of 21 and 8 months, respectively.
DESIGN:
Whole exome sequencing was performed on genomic DNA (gDNA) of the siblings. Variants in genes known to cause FGD were assessed for causality. Further analysis of gDNA and cDNA was performed by PCR/RT-PCR followed by automated Sanger sequencing.
RESULTS:
Whole exome sequencing identified a single, novel heterozygous variant (p.Arg71*) in nicotinamide nucleotide transhydrogenase (NNT) in both affected individuals. Follow-up cDNA analysis in the proband identified a 69-bp pseudoexon inclusion event, and Sanger sequencing of his gDNA identified a 4-bp duplication responsible for its activation. The variants segregated with the disease: p.Arg71* was inherited from the mother, the pseudoexon change was inherited from the father, and an unaffected sibling had inherited only the p.Arg71* variant.
CONCLUSIONS:
FGD in these siblings is caused by compound heterozygous mutations in NNT; one causing pseudoexon inclusion in combination with another leading to Arg71*. Discovery of this pseudoexon activation mutation highlights the importance of identifying sequence changes in introns by cDNA analysis. The clinical implications of these findings include facilitation of antenatal genetic diagnosis, early institution of potentially lifesaving therapy, and the possibility of preventative or curative intervention.
The Mystery of Two Straight Lines in Bacterial Genome Statistics. Release 2007
In special coordinates (codon position-specific nucleotide frequencies),
bacterial genomes form two straight lines in 9-dimensional space: one line for
eubacterial genomes, another for archaeal genomes. All 348 distinct
bacterial genomes available in Genbank in April 2007 belong to these lines
with high accuracy. The main challenge now is to explain the observed high
accuracy. A new phenomenon of complementary symmetry for codon
position-specific nucleotide frequencies is observed. The results of an
analysis of several codon usage models are presented. We demonstrate that the
mean-field approximation, also known as the context-free or complete-
independence model, or as the Segre variety, can serve as a reasonable
approximation to real codon usage. The first two principal components of codon
usage correlate strongly with genomic G+C content and optimal growth
temperature, respectively. Variation of codon usage along the third component
is related to the curvature of the mean-field approximation. The first three
eigenvalues in the codon usage PCA explain 59.1%, 7.8% and 4.7% of the
variation. The codon usage of eubacterial and archaeal genomes is clearly
distributed along two third-order curves with genomic G+C content as a
parameter.
Comment: Significantly extended version with new data for all 348 distinct
bacterial genomes available in Genbank in April 2007
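The complete-independence (mean-field) model is easy to state concretely: a codon's predicted frequency is the product of its three position-specific nucleotide frequencies, so the model has 3 × (4 − 1) = 9 free parameters, matching the 9-dimensional coordinates above. A minimal sketch (function names and the uniform toy input are ours):

```python
from itertools import product

NTS = "ACGT"

def mean_field(codon_freq):
    """Complete-independence (mean-field) approximation to codon usage.

    codon_freq: dict codon -> frequency (summing to 1).
    Returns (pos_freq, approx), where pos_freq[i][n] is the marginal
    frequency of nucleotide n at codon position i, and approx is the
    product model approx[c] = pos_freq[0][c[0]] * pos_freq[1][c[1]]
                              * pos_freq[2][c[2]].
    """
    pos_freq = [{n: 0.0 for n in NTS} for _ in range(3)]
    for codon, f in codon_freq.items():
        for i in range(3):
            pos_freq[i][codon[i]] += f
    approx = {"".join(c): pos_freq[0][c[0]] * pos_freq[1][c[1]] * pos_freq[2][c[2]]
              for c in product(NTS, repeat=3)}
    return pos_freq, approx

# Sanity check: uniform usage over all 64 codons is reproduced exactly.
uniform = {"".join(c): 1 / 64 for c in product(NTS, repeat=3)}
pos, approx = mean_field(uniform)
```

Real codon usage deviates from its mean-field projection, and it is the smallness of that deviation (visible in the third principal component) that the abstract's curvature remark refers to.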
Markov basis and Groebner basis of Segre-Veronese configuration for testing independence in group-wise selections
We consider testing independence in group-wise selections with some
restrictions on combinations of choices. We present models for frequency data
of selections for which it is easy to perform conditional tests by Markov chain
Monte Carlo (MCMC) methods. When the restrictions on the combinations can be
described in terms of a Segre-Veronese configuration, an explicit form of a
Gröbner basis consisting of moves of degree two is readily available for
performing a Markov chain. We illustrate our setting with the National Center
Test for university entrance examinations in Japan. We also apply our method to
testing independence hypotheses involving genotypes at more than one locus or
haplotypes of alleles on the same chromosome.Comment: 25 pages, 5 figure
Recognizing Treelike k-Dissimilarities
A k-dissimilarity D on a finite set X, |X| >= k, is a map from the set of
size k subsets of X to the real numbers. Such maps naturally arise from
edge-weighted trees T with leaf-set X: Given a subset Y of X of size k, D(Y) is
defined to be the total length of the smallest subtree of T with leaf-set Y.
In the case k = 2, it is well known that 2-dissimilarities arising in this way
can be characterized by the so-called "4-point condition". However, in the case
k > 2, Pachter and Speyer recently posed the following question: Given an arbitrary
k-dissimilarity, how do we test whether this map comes from a tree? In this
paper, we provide an answer to this question, showing that for k >= 3 a
k-dissimilarity on a set X arises from a tree if and only if its restriction to
every 2k-element subset of X arises from some tree, and that 2k is the least
possible subset size to ensure that this is the case. As a corollary, we show
that there exists a polynomial-time algorithm to determine when a
k-dissimilarity arises from a tree. We also give a 6-point condition for
determining when a 3-dissimilarity arises from a tree, similar to the
aforementioned 4-point condition.
Comment: 18 pages, 4 figures
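For k = 2 the test in question is concrete: a dissimilarity satisfies the 4-point condition exactly when, for every quartet, the largest of the three sums obtained by pairing the four points into two pairs is attained at least twice. A small sketch with exact integer distances (function and variable names are ours; real-valued data would need a numerical tolerance in place of exact equality):

```python
from itertools import combinations

def four_point(d, points):
    """Return True if the 2-dissimilarity d satisfies the 4-point condition:
    for every quartet {w,x,y,z}, the two largest of the three sums
    d(w,x)+d(y,z), d(w,y)+d(x,z), d(w,z)+d(x,y) coincide."""
    for w, x, y, z in combinations(points, 4):
        s = sorted([d[w][x] + d[y][z], d[w][y] + d[x][z], d[w][z] + d[x][y]])
        if s[1] != s[2]:
            return False
    return True

# Distances induced by the quartet tree ab|cd with all five edges of length 1:
# d(a,b) = d(c,d) = 2 and every "cross" distance equals 3.
D = {p: {} for p in "abcd"}
def put(p, q, v):
    D[p][q] = D[q][p] = v
put("a", "b", 2); put("c", "d", 2)
for p in "ab":
    for q in "cd":
        put(p, q, 3)
```

The paper's result is the k >= 3 analogue: treelikeness of a k-dissimilarity is certified by checking all 2k-element subsets, which is what makes the polynomial-time algorithm possible.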
Shape-based peak identification for ChIP-Seq
We present a new algorithm for the identification of bound regions from
ChIP-seq experiments. Our method for identifying statistically significant
peaks from read coverage is inspired by the notion of persistence in
topological data analysis and provides a non-parametric approach that is robust
to noise in experiments. Specifically, our method reduces the peak calling
problem to the study of tree-based statistics derived from the data. We
demonstrate the accuracy of our method on existing datasets, and we show that
it can discover previously missed regions and can more clearly discriminate
between multiple binding events. The software T-PIC (Tree shape Peak
Identification for ChIP-Seq) is available at
http://math.berkeley.edu/~vhower/tpic.html
Comment: 12 pages, 6 figures
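The persistence idea behind such tree-based statistics can be illustrated on a toy coverage profile (a sketch of zero-dimensional superlevel-set persistence in general, not the T-PIC algorithm itself; names are ours): sweep a threshold downward, let components of the superlevel set grow and merge, and record each peak's birth and death heights. High-persistence peaks are candidate binding events; short-lived ones are noise.

```python
def peak_persistence(signal):
    """Zero-dimensional persistence of the superlevel sets of a 1-D signal.
    Cells are activated from highest to lowest; when two components merge,
    the one with the lower peak dies at the current height (elder rule).
    Returns (birth, death) height pairs; the global maximum never dies.
    Zero-persistence pairs from non-peak cells are discarded."""
    n = len(signal)
    parent = list(range(n))

    def find(a):  # union-find with path compression
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    active = [False] * n
    pairs = []
    for i in sorted(range(n), key=lambda k: signal[k], reverse=True):
        active[i] = True
        for j in (i - 1, i + 1):
            if 0 <= j < n and active[j]:
                ri, rj = find(i), find(j)
                if ri == rj:
                    continue
                # Roots always point at the highest cell of their component.
                lo, hi = (ri, rj) if signal[ri] < signal[rj] else (rj, ri)
                if signal[lo] > signal[i]:
                    pairs.append((signal[lo], signal[i]))
                parent[lo] = hi
    pairs.append((signal[find(0)], float("-inf")))
    return pairs

# Two peaks (heights 3 and 2): the minor peak dies at the saddle height 1.
bars = peak_persistence([0, 3, 1, 2, 0])
```

Ranking peaks by persistence (birth minus death) rather than raw height is what gives this style of peak calling its robustness to noise.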
Fast and accurate single-cell RNA-seq analysis by clustering of transcript-compatibility counts
Current approaches to single-cell transcriptomic analysis are computationally intensive and require assay-specific modeling, which limits their scope and generality. We propose a novel method that compares and clusters cells based on their transcript-compatibility read counts rather than on the transcript or gene quantifications used in standard analysis pipelines. In the reanalysis of two landmark yet disparate single-cell RNA-seq datasets, we show that our method is up to two orders of magnitude faster than previous approaches, provides accurate and in some cases improved results, and is directly applicable to data from a wide variety of assays.
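The core of the approach is that cells are compared through their transcript-compatibility count (TCC) vectors directly, with no transcript-level quantification step. A minimal sketch (our own toy example and function names; the L1 distance and greedy grouping below stand in for the paper's actual metric and clustering method):

```python
def tcc_distance(a, b):
    """L1 distance between two cells' normalized transcript-compatibility
    count (TCC) vectors; each entry counts reads in one equivalence class."""
    sa, sb = sum(a), sum(b)
    return sum(abs(x / sa - y / sb) for x, y in zip(a, b))

def cluster_cells(cells, threshold):
    """Greedy single-linkage grouping: a cell joins the first cluster
    containing some member within `threshold`, else starts a new cluster."""
    clusters = []
    for c in cells:
        for cl in clusters:
            if any(tcc_distance(c, m) < threshold for m in cl):
                cl.append(c)
                break
        else:
            clusters.append([c])
    return clusters

# Four toy cells over three equivalence classes: two distinct cell types.
cells = [[10, 0, 5], [8, 1, 4], [0, 9, 1], [1, 10, 0]]
groups = cluster_cells(cells, threshold=0.5)
```

Because TCC vectors are produced by pseudoalignment alone, the distance computation sidesteps the assay-specific quantification modeling that the abstract identifies as the bottleneck.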