494 research outputs found

    Assessing molecular variability in cancer genomes

    The dynamics of tumour evolution are not well understood. In this paper we provide a statistical framework for evaluating the molecular variation observed in different parts of a colorectal tumour. A multi-sample version of the Ewens Sampling Formula forms the basis for our modelling of the data, and we provide a simulation procedure for use in obtaining reference distributions for the statistics of interest. We also describe the large-sample asymptotics of the joint distributions of the variation observed in different parts of the tumour. While actual data should be evaluated with reference to the simulation procedure, the asymptotics serve to provide theoretical guidelines, for instance with reference to the choice of possible statistics. Comment: 22 pages, 1 figure. Chapter 4 of "Probability and Mathematical Genetics: Papers in Honour of Sir John Kingman" (Editors N.H. Bingham and C.M. Goldie), Cambridge University Press, 201

    Ancestral inference from haplotypes and mutations

    We consider inference about the history of a sample of DNA sequences, conditional upon the haplotype counts and the number of segregating sites observed at the present time. After deriving some theoretical results in the coalescent setting, we implement rejection sampling and importance sampling schemes to perform the inference. The importance sampling scheme addresses an extension of the Ewens Sampling Formula for a configuration of haplotypes and the number of segregating sites in the sample. The implementations include both constant and variable population size models. The methods are illustrated by two human Y chromosome data sets.

    Exploiting the Feller coupling for the Ewens sampling formula

    This is the final version of the article. It first appeared from the Institute of Mathematical Statistics via http://dx.doi.org/10.1214/15-STS53

    A Rate for the Erdős-Turán Law

    The Erdős-Turán law gives a normal approximation for the order of a randomly chosen permutation of n objects. In this paper, we provide a sharp error estimate for the approximation, showing that, if the mean of the approximating normal distribution is slightly adjusted, the error is of order log^{-1/2} n.
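As a rough illustration of the law stated in this abstract (not the paper's method or its adjusted mean), the sketch below simulates random permutations and standardizes the logarithm of their order using the classical centring (1/2) log^2 n and scaling sqrt((1/3) log^3 n). The function names `permutation_order` and `standardized_log_orders` are hypothetical and chosen here only for illustration.

```python
import math
import random


def permutation_order(perm):
    """Order of a permutation given as a list mapping i -> perm[i].

    The order is the least common multiple of the cycle lengths.
    """
    seen = [False] * len(perm)
    order = 1
    for start in range(len(perm)):
        if seen[start]:
            continue
        # Walk the cycle containing `start` and measure its length.
        length = 0
        j = start
        while not seen[j]:
            seen[j] = True
            j = perm[j]
            length += 1
        order = order * length // math.gcd(order, length)
    return order


def standardized_log_orders(n, trials, seed=0):
    """Simulate (log O_n - mean) / sd for uniformly random permutations.

    Uses the classical Erdős-Turán centring and scaling (an assumption of
    this sketch; the abstract's refined mean adjustment is not reproduced).
    """
    rng = random.Random(seed)
    mean = 0.5 * math.log(n) ** 2
    sd = math.sqrt(math.log(n) ** 3 / 3.0)
    values = []
    for _ in range(trials):
        perm = list(range(n))
        rng.shuffle(perm)
        values.append((math.log(permutation_order(perm)) - mean) / sd)
    return values
```

Because the error decays only like a negative power of log n, a histogram of `standardized_log_orders(n, trials)` should be expected to approach the standard normal shape slowly as n grows.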

    Testing the Mean Matrix in High-Dimensional Transposable Data

    The structural information in high-dimensional transposable data allows us to write the data recorded for each subject in a matrix such that both the rows and the columns correspond to variables of interest. One important problem is to test the null hypothesis that the mean matrix has a particular structure without ignoring the potential dependence structure among and/or between the row and column variables. To address this, we develop a simple and computationally efficient nonparametric testing procedure to assess the hypothesis that, in each predefined subset of columns (rows), the column (row) mean vector remains constant. In simulation studies, the proposed testing procedure seems to have good performance and, unlike traditional approaches, it is powerful without leading to inflated nominal sizes. Finally, we illustrate the use of the proposed methodology via two empirical examples from gene expression microarrays. Comment: in Biometrics, 201

    multiSNV: a probabilistic approach for improving detection of somatic point mutations from multiple related tumour samples.

    Somatic variant analysis of a tumour sample and its matched normal has been widely used in cancer research to distinguish germline polymorphisms from somatic mutations. However, due to the extensive intratumour heterogeneity of cancer, sequencing data from a single tumour sample may greatly underestimate the overall mutational landscape. In recent studies, multiple spatially or temporally separated tumour samples from the same patient were sequenced to identify the regional distribution of somatic mutations and study intratumour heterogeneity. There are a number of tools to perform somatic variant calling from matched tumour-normal next-generation sequencing (NGS) data; however, none of these allow joint analysis of multiple same-patient samples. We discuss the benefits and challenges of multisample somatic variant calling and present multiSNV, a software package for calling single nucleotide variants (SNVs) using NGS data from multiple same-patient samples. Instead of performing multiple pairwise analyses of a single tumour sample and a matched normal, multiSNV jointly considers all available samples under a Bayesian framework to increase sensitivity of calling shared SNVs. By leveraging information from all available samples, multiSNV is able to detect rare mutations with variant allele frequencies down to 3% from whole-exome sequencing experiments. Funding: Cancer Research UK grant C14303/A17197. Funding for open access charge: University of Cambridge. This is the final published version. It first appeared at http://nar.oxfordjournals.org/content/early/2015/02/26/nar.gkv135.long

    multiSNV : a probabilistic approach for improving detection of somatic point mutations from multiple related tumour samples

    Somatic variant analysis of a tumour sample and its matched normal has been widely used in cancer research to distinguish germline polymorphisms from somatic mutations. However, due to the extensive intratumour heterogeneity of cancer, sequencing data from a single tumour sample may greatly underestimate the overall mutational landscape. In recent studies, multiple spatially or temporally separated tumour samples from the same patient were sequenced to identify the regional distribution of somatic mutations and study intratumour heterogeneity. There are a number of tools to perform somatic variant calling from matched tumour-normal next-generation sequencing (NGS) data; however, none of these allow joint analysis of multiple same-patient samples. We discuss the benefits and challenges of multisample somatic variant calling and present multiSNV, a software package for calling single nucleotide variants (SNVs) using NGS data from multiple same-patient samples. Instead of performing multiple pairwise analyses of a single tumour sample and a matched normal, multiSNV jointly considers all available samples under a Bayesian framework to increase sensitivity of calling shared SNVs. By leveraging information from all available samples, multiSNV is able to detect rare mutations with variant allele frequencies down to 3% from whole-exome sequencing experiments. Funding: Cancer Research UK grant C14303/A17197. Funding for open access charge: University of Cambridge. Publisher PDF. Peer reviewed.

    On random polynomials over finite fields

    We consider random monic polynomials of degree n over a finite field of q elements, chosen with all q^n possibilities equally likely, factored into monic irreducible factors. More generally, relaxing the restriction that q be a prime power, we consider that multiset construction in which the total number of possibilities of weight n is q^n. We establish various approximations for the joint distribution of factors, by giving upper bounds on the total variation distance to simpler discrete distributions. For example, the counts for particular factors are approximately independent and geometrically distributed, and the counts for all factors of sizes 1, 2, ..., b, where b = O(n/log n), are approximated by independent negative binomial random variables. As another example, the joint distribution of the large factors is close to the joint distribution of the large cycles in a random permutation. We show how these discrete approximations imply a Brownian motion functional central limit theorem and a Poisson-Dirichlet limit theorem, together with appropriate error estimates. We also give Poisson approximations, with error bounds, for the distribution of the total number of factors.

    BayesPeak: Bayesian analysis of ChIP-seq data.

    BACKGROUND: High-throughput sequencing technology has become popular and widely used to study protein and DNA interactions. Chromatin immunoprecipitation, followed by sequencing of the resulting samples, produces large amounts of data that can be used to map genomic features such as transcription factor binding sites and histone modifications. METHODS: Our proposed statistical algorithm, BayesPeak, uses a fully Bayesian hidden Markov model to detect enriched locations in the genome. The structure accommodates the natural features of the Solexa/Illumina sequencing data and allows for overdispersion in the abundance of reads in different regions. Moreover, a control sample can be incorporated in the analysis to account for experimental and sequence biases. Markov chain Monte Carlo algorithms are applied to estimate the posterior distributions of the model parameters, and posterior probabilities are used to detect the sites of interest. CONCLUSION: We have presented a flexible approach for identifying peaks from ChIP-seq reads, suitable for use on both transcription factor binding and histone modification data. Our method estimates probabilities of enrichment that can be used in downstream analysis. The method is assessed using experimentally verified data and is shown to provide high-confidence calls with low false positive rates