Assessing molecular variability in cancer genomes
The dynamics of tumour evolution are not well understood. In this paper we
provide a statistical framework for evaluating the molecular variation observed
in different parts of a colorectal tumour. A multi-sample version of the Ewens
Sampling Formula forms the basis for our modelling of the data, and we provide
a simulation procedure for use in obtaining reference distributions for the
statistics of interest. We also describe the large-sample asymptotics of the
joint distributions of the variation observed in different parts of the tumour.
While actual data should be evaluated with reference to the simulation
procedure, the asymptotics serve to provide theoretical guidelines, for
instance with reference to the choice of possible statistics.
Comment: 22 pages, 1 figure. Chapter 4 of "Probability and Mathematical
Genetics: Papers in Honour of Sir John Kingman" (Editors N.H. Bingham and
C.M. Goldie), Cambridge University Press, 201
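For readers unfamiliar with the Ewens Sampling Formula named in the abstract above, here is a minimal sketch of the standard single-sample formula, P(a_1, ..., a_n) = n! / (θ(θ+1)...(θ+n-1)) × ∏_j (θ/j)^{a_j} / a_j!. It is not the multi-sample version developed in the chapter, and the function name `esf_probability` and its argument layout are illustrative assumptions.

```python
from fractions import Fraction
from math import factorial


def esf_probability(counts, theta):
    """Single-sample Ewens Sampling Formula (illustrative sketch).

    counts[j-1] is a_j, the number of types seen exactly j times,
    so the sample size is n = sum(j * a_j). theta is the mutation
    parameter (int or Fraction for exact arithmetic).
    """
    n = sum(j * a for j, a in enumerate(counts, start=1))
    # Rising factorial theta * (theta + 1) * ... * (theta + n - 1).
    rising = Fraction(1)
    for i in range(n):
        rising *= theta + i
    prob = Fraction(factorial(n)) / rising
    for j, a in enumerate(counts, start=1):
        prob *= Fraction(theta) ** a / (Fraction(j) ** a * factorial(a))
    return prob
```

With theta = 1 the formula reduces to the distribution of cycle counts in a uniformly random permutation, which gives a convenient sanity check.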
Ancestral inference from haplotypes and mutations
We consider inference about the history of a sample of DNA sequences,
conditional upon the haplotype counts and the number of segregating sites
observed at the present time. After deriving some theoretical results in the
coalescent setting, we implement rejection sampling and importance sampling
schemes to perform the inference. The importance sampling scheme addresses an
extension of the Ewens Sampling Formula for a configuration of haplotypes and
the number of segregating sites in the sample. The implementations include both
constant and variable population size models. The methods are illustrated by
two human Y chromosome data sets.
Exploiting the Feller coupling for the Ewens sampling formula
This is the final version of the article. It first appeared from the Institute of Mathematical Statistics via http://dx.doi.org/10.1214/15-STS53
A Rate for the Erdős-Turán Law
The Erdős-Turán law gives a normal approximation for the order of a randomly chosen permutation of n objects. In this paper, we provide a sharp error estimate for the approximation, showing that, if the mean of the approximating normal distribution is slightly adjusted, the error is of order (log n)^{-1/2}.
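The quantity in this abstract can be explored numerically. The sketch below computes the order of a permutation (the least common multiple of its cycle lengths) and standardises its logarithm with the classical Erdős-Turán centring (log n)^2 / 2 and scaling sqrt((log n)^3 / 3). It does not include the paper's adjusted mean, and all function names are illustrative assumptions.

```python
import math
import random


def cycle_lengths(perm):
    """Cycle lengths of a permutation given as a list mapping i -> perm[i]."""
    seen = [False] * len(perm)
    lengths = []
    for start in range(len(perm)):
        if not seen[start]:
            length = 0
            j = start
            while not seen[j]:
                seen[j] = True
                j = perm[j]
                length += 1
            lengths.append(length)
    return lengths


def order(perm):
    """Order of the permutation: least common multiple of its cycle lengths."""
    result = 1
    for c in cycle_lengths(perm):
        result = result * c // math.gcd(result, c)
    return result


def standardized_log_order(n, rng):
    """Draw a uniform random permutation of n objects and return
    (log order - (log n)^2 / 2) / sqrt((log n)^3 / 3)."""
    perm = list(range(n))
    rng.shuffle(perm)
    ln = math.log(n)
    return (math.log(order(perm)) - 0.5 * ln ** 2) / math.sqrt(ln ** 3 / 3)
```

Calling `standardized_log_order(n, random.Random(seed))` repeatedly and plotting a histogram should look roughly normal for large n, although the logarithmic error rate stated in the abstract means convergence is slow.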
Testing the Mean Matrix in High-Dimensional Transposable Data
The structural information in high-dimensional transposable data allows us to
write the data recorded for each subject in a matrix such that both the rows
and the columns correspond to variables of interest. One important problem is
to test the null hypothesis that the mean matrix has a particular structure
without ignoring the potential dependence structure among and/or between the
row and column variables. To address this, we develop a simple and
computationally efficient nonparametric testing procedure to assess the
hypothesis that, in each predefined subset of columns (rows), the column (row)
mean vector remains constant. In simulation studies, the proposed testing
procedure seems to perform well and, unlike traditional approaches, it
is powerful without leading to inflated nominal sizes. Finally, we illustrate
the use of the proposed methodology via two empirical examples from gene
expression microarrays.
Comment: in Biometrics, 201
multiSNV: a probabilistic approach for improving detection of somatic point mutations from multiple related tumour samples.
Somatic variant analysis of a tumour sample and its matched normal has been widely used in cancer research to distinguish germline polymorphisms from somatic mutations. However, due to the extensive intratumour heterogeneity of cancer, sequencing data from a single tumour sample may greatly underestimate the overall mutational landscape. In recent studies, multiple spatially or temporally separated tumour samples from the same patient were sequenced to identify the regional distribution of somatic mutations and study intratumour heterogeneity. There are a number of tools to perform somatic variant calling from matched tumour-normal next-generation sequencing (NGS) data; however, none of these allow joint analysis of multiple same-patient samples. We discuss the benefits and challenges of multisample somatic variant calling and present multiSNV, a software package for calling single nucleotide variants (SNVs) using NGS data from multiple same-patient samples. Instead of performing multiple pairwise analyses of a single tumour sample and a matched normal, multiSNV jointly considers all available samples under a Bayesian framework to increase sensitivity of calling shared SNVs. By leveraging information from all available samples, multiSNV is able to detect rare mutations with variant allele frequencies down to 3% from whole-exome sequencing experiments.
Funding: Cancer Research UK grant C14303/A17197. Funding for open access charge: University of Cambridge.
This is the final published version. It first appeared at http://nar.oxfordjournals.org/content/early/2015/02/26/nar.gkv135.long
On random polynomials over finite fields
We consider random monic polynomials of degree n over a finite field of q elements, chosen with all q^n possibilities equally likely, factored into monic irreducible factors. More generally, relaxing the restriction that q be a prime power, we consider that multiset construction in which the total number of possibilities of weight n is q^n. We establish various approximations for the joint distribution of factors, by giving upper bounds on the total variation distance to simpler discrete distributions. For example, the counts for particular factors are approximately independent and geometrically distributed, and the counts for all factors of sizes 1, 2, ..., b, where b = O(n/log n), are approximated by independent negative binomial random variables. As another example, the joint distribution of the large factors is close to the joint distribution of the large cycles in a random permutation. We show how these discrete approximations imply a Brownian motion functional central limit theorem and a Poisson-Dirichlet limit theorem, together with appropriate error estimates. We also give Poisson approximations, with error bounds, for the distribution of the total number of factors.
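As background to the setting in this abstract, the number of monic irreducible polynomials of degree d over a field of q elements is given by the standard Möbius-inversion formula N_q(d) = (1/d) Σ_{k | d} μ(k) q^{d/k}, and these counts satisfy Σ_{d | n} d · N_q(d) = q^n. The sketch below checks that identity; it illustrates the counting behind the model rather than the paper's approximation results, and the function names are assumptions.

```python
def mobius(k):
    """Möbius function mu(k), computed by trial division."""
    result = 1
    p = 2
    while p * p <= k:
        if k % p == 0:
            k //= p
            if k % p == 0:
                return 0  # k has a squared prime factor
            result = -result
        p += 1
    if k > 1:
        result = -result  # one remaining prime factor
    return result


def count_irreducible(q, d):
    """Number of monic irreducible polynomials of degree d over F_q."""
    total = sum(mobius(k) * q ** (d // k)
                for k in range(1, d + 1) if d % k == 0)
    return total // d


def total_weight(q, n):
    """Sum of d * N_q(d) over divisors d of n; equals q ** n."""
    return sum(d * count_irreducible(q, d)
               for d in range(1, n + 1) if n % d == 0)
```

For q = 2 this gives 2, 1, 2 and 3 irreducible polynomials of degrees 1 to 4, matching the familiar lists over the binary field.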
BayesPeak: Bayesian analysis of ChIP-seq data.
BACKGROUND: High-throughput sequencing technology has become popular and widely used to study protein and DNA interactions. Chromatin immunoprecipitation, followed by sequencing of the resulting samples, produces large amounts of data that can be used to map genomic features such as transcription factor binding sites and histone modifications. METHODS: Our proposed statistical algorithm, BayesPeak, uses a fully Bayesian hidden Markov model to detect enriched locations in the genome. The structure accommodates the natural features of the Solexa/Illumina sequencing data and allows for overdispersion in the abundance of reads in different regions. Moreover, a control sample can be incorporated in the analysis to account for experimental and sequence biases. Markov chain Monte Carlo algorithms are applied to estimate the posterior distributions of the model parameters, and posterior probabilities are used to detect the sites of interest. CONCLUSION: We have presented a flexible approach for identifying peaks from ChIP-seq reads, suitable for use on both transcription factor binding and histone modification data. Our method estimates probabilities of enrichment that can be used in downstream analysis. The method is assessed using experimentally verified data and is shown to provide high-confidence calls with low false positive rates
