494 research outputs found

Assessing molecular variability in cancer genomes

Authors: Barbour A. D.; Tavaré Simon
Publication date: 13/04/2010

The dynamics of tumour evolution are not well understood. In this paper we provide a statistical framework for evaluating the molecular variation observed in different parts of a colorectal tumour. A multi-sample version of the Ewens Sampling Formula forms the basis for our modelling of the data, and we provide a simulation procedure for use in obtaining reference distributions for the statistics of interest. We also describe the large-sample asymptotics of the joint distributions of the variation observed in different parts of the tumour. While actual data should be evaluated with reference to the simulation procedure, the asymptotics serve to provide theoretical guidelines, for instance with reference to the choice of possible statistics.

Comment: 22 pages, 1 figure. Chapter 4 of "Probability and Mathematical Genetics: Papers in Honour of Sir John Kingman" (Editors N.H. Bingham and C.M. Goldie), Cambridge University Press, 201

arXiv.org e-Print Archive

ZORA
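
The abstract of "Assessing molecular variability in cancer genomes" above mentions simulating reference distributions under the Ewens Sampling Formula. As a rough illustration only, the sketch below draws a single-sample Ewens partition with the Chinese restaurant process and compares the simulated mean number of types with its known expectation. It is not the paper's multi-sample procedure; the function names, the parameter values (`n`, `theta`, `reps`) and the choice of statistic are assumptions made for this example.

```python
import random

def sample_ewens_partition(n, theta, rng):
    """Draw block sizes of a random partition of n items under the
    single-sample Ewens distribution, via the Chinese restaurant process."""
    block_sizes = []
    for i in range(n):  # i items have already been placed
        # A new block (type) is opened with probability theta / (theta + i).
        if rng.random() < theta / (theta + i):
            block_sizes.append(1)
        else:
            # Otherwise join an existing block, chosen proportionally to its size.
            k = rng.choices(range(len(block_sizes)), weights=block_sizes)[0]
            block_sizes[k] += 1
    return block_sizes

def expected_num_types(n, theta):
    """Exact expectation of the number of blocks under the Ewens distribution."""
    return sum(theta / (theta + i) for i in range(n))

# Illustrative (assumed) settings: sample size, mutation parameter, replicates.
rng = random.Random(0)
n, theta, reps = 50, 2.0, 2000
num_types = [len(sample_ewens_partition(n, theta, rng)) for _ in range(reps)]
mean_num_types = sum(num_types) / reps
print(mean_num_types, expected_num_types(n, theta))
```

The list `num_types` plays the role of a simulated reference distribution: an observed number of types could be compared against its empirical quantiles.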

Ancestral inference from haplotypes and mutations

Authors: Griffiths Robert C.; Tavaré Simon
Publication date: 28/02/2018

We consider inference about the history of a sample of DNA sequences, conditional upon the haplotype counts and the number of segregating sites observed at the present time. After deriving some theoretical results in the coalescent setting, we implement rejection sampling and importance sampling schemes to perform the inference. The importance sampling scheme addresses an extension of the Ewens Sampling Formula for a configuration of haplotypes and the number of segregating sites in the sample. The implementations include both constant and variable population size models. The methods are illustrated by two human Y chromosome data sets.

arXiv.org e-Print Archive

Crossref

Exploiting the Feller coupling for the Ewens sampling formula

Authors: Arratia Richard; Barbour A. D.; Tavaré Simon
Publication venue: Institute of Mathematical Statistics
Publication date: 01/01/2016

This is the final version of the article. It first appeared from the Institute of Mathematical Statistics via http://dx.doi.org/10.1214/15-STS53

Crossref

ZORA

Apollo (Cambridge)

A Rate for the Erdős-Turán Law

Authors: Barbour A. D.; Tavaré Simon
Publication date: 02/08/2017

The Erdős-Turán law gives a normal approximation for the order of a randomly chosen permutation of n objects. In this paper, we provide a sharp error estimate for the approximation, showing that, if the mean of the approximating normal distribution is slightly adjusted, the error is of order log^{-1/2} n.

RERO DOC Digital Library
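
To make the statement in the Erdős-Turán entry above concrete, here is a small simulation sketch. It uses the classical form of the law, in which the logarithm of the order of a uniform random permutation of n objects is approximately normal with mean (1/2) log^2 n and variance (1/3) log^3 n. The adjusted mean studied in the paper is not given in this listing, so the sketch uses the unadjusted centring; the function names, the value of n and the number of replicates are assumptions made for illustration.

```python
import math
import random

def cycle_lengths(perm):
    """Return the cycle lengths of a permutation given as a list
    mapping index i to perm[i]."""
    seen = [False] * len(perm)
    lengths = []
    for start in range(len(perm)):
        if not seen[start]:
            length, j = 0, start
            while not seen[j]:
                seen[j] = True
                j = perm[j]
                length += 1
            lengths.append(length)
    return lengths

def permutation_order(perm):
    """Order of the permutation: least common multiple of its cycle lengths."""
    order = 1
    for length in cycle_lengths(perm):
        order = order * length // math.gcd(order, length)
    return order

def standardized_log_order(n, rng):
    """Log-order of a uniform random permutation of n objects, centred and
    scaled with the classical (unadjusted) Erdős-Turán constants."""
    perm = list(range(n))
    rng.shuffle(perm)
    log_n = math.log(n)
    mean = 0.5 * log_n ** 2
    std = math.sqrt(log_n ** 3 / 3.0)
    return (math.log(permutation_order(perm)) - mean) / std

rng = random.Random(1)
samples = [standardized_log_order(1000, rng) for _ in range(200)]
print(sum(samples) / len(samples))
```

Because the error decays only like a negative power of log n, the simulated values should not be expected to look closely standard normal at moderate n; the sketch shows the quantities involved rather than verifying the rate.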

Testing the Mean Matrix in High-Dimensional Transposable Data

Authors: Marioni John C.; Tavaré Simon; Touloumis Anestis
Publication venue: Wiley
Publication date: 23/01/2015

The structural information in high-dimensional transposable data allows us to write the data recorded for each subject in a matrix such that both the rows and the columns correspond to variables of interest. One important problem is to test the null hypothesis that the mean matrix has a particular structure without ignoring the potential dependence structure among and/or between the row and column variables. To address this, we develop a simple and computationally efficient nonparametric testing procedure to assess the hypothesis that, in each predefined subset of columns (rows), the column (row) mean vector remains constant. In simulation studies, the proposed testing procedure seems to perform well and, unlike traditional approaches, it is powerful without leading to inflated nominal sizes. Finally, we illustrate the use of the proposed methodology via two empirical examples from gene expression microarrays.

Comment: in Biometrics, 201

arXiv.org e-Print Archive

University of Brighton Research Portal

multiSNV: a probabilistic approach for improving detection of somatic point mutations from multiple related tumour samples.

Authors: Josephidou Malvina; Lynch Andy G; Tavaré Simon
Publication venue: Nucleic Acids Res
Publication date: 26/02/2015

Somatic variant analysis of a tumour sample and its matched normal has been widely used in cancer research to distinguish germline polymorphisms from somatic mutations. However, due to the extensive intratumour heterogeneity of cancer, sequencing data from a single tumour sample may greatly underestimate the overall mutational landscape. In recent studies, multiple spatially or temporally separated tumour samples from the same patient were sequenced to identify the regional distribution of somatic mutations and study intratumour heterogeneity. There are a number of tools to perform somatic variant calling from matched tumour-normal next-generation sequencing (NGS) data; however none of these allow joint analysis of multiple same-patient samples. We discuss the benefits and challenges of multisample somatic variant calling and present multiSNV, a software package for calling single nucleotide variants (SNVs) using NGS data from multiple same-patient samples. Instead of performing multiple pairwise analyses of a single tumour sample and a matched normal, multiSNV jointly considers all available samples under a Bayesian framework to increase sensitivity of calling shared SNVs. By leveraging information from all available samples, multiSNV is able to detect rare mutations with variant allele frequencies down to 3% from whole-exome sequencing experiments.

Funding: Cancer Research UK grant C14303/A17197. Funding for open access charge: University of Cambridge.

This is the final published version. It first appeared at http://nar.oxfordjournals.org/content/early/2015/02/26/nar.gkv135.long

Crossref

PubMed Central

Apollo (Cambridge)

University of St. Andrews - Pure

multiSNV: a probabilistic approach for improving detection of somatic point mutations from multiple related tumour samples

Authors: Josephidou Malvina; Lynch Andy G.; Tavaré Simon
Publication venue: Oxford University Press (OUP)
Publication date: 14/08/2017

Funding: Cancer Research UK grant C14303/A17197. Funding for open access charge: University of Cambridge.

Somatic variant analysis of a tumour sample and its matched normal has been widely used in cancer research to distinguish germline polymorphisms from somatic mutations. However, due to the extensive intratumour heterogeneity of cancer, sequencing data from a single tumour sample may greatly underestimate the overall mutational landscape. In recent studies, multiple spatially or temporally separated tumour samples from the same patient were sequenced to identify the regional distribution of somatic mutations and study intratumour heterogeneity. There are a number of tools to perform somatic variant calling from matched tumour-normal next-generation sequencing (NGS) data; however none of these allow joint analysis of multiple same-patient samples. We discuss the benefits and challenges of multisample somatic variant calling and present multiSNV, a software package for calling single nucleotide variants (SNVs) using NGS data from multiple same-patient samples. Instead of performing multiple pairwise analyses of a single tumour sample and a matched normal, multiSNV jointly considers all available samples under a Bayesian framework to increase sensitivity of calling shared SNVs. By leveraging information from all available samples, multiSNV is able to detect rare mutations with variant allele frequencies down to 3% from whole-exome sequencing experiments.

Publisher PDF. Peer reviewed

St Andrews Research Repository

On random polynomials over finite fields

Authors: Arratia Richard; Barbour A. D.; Tavaré Simon
Publication date: 02/08/2017

We consider random monic polynomials of degree n over a finite field of q elements, chosen with all q^n possibilities equally likely, factored into monic irreducible factors. More generally, relaxing the restriction that q be a prime power, we consider that multiset construction in which the total number of possibilities of weight n is q^n. We establish various approximations for the joint distribution of factors, by giving upper bounds on the total variation distance to simpler discrete distributions. For example, the counts for particular factors are approximately independent and geometrically distributed, and the counts for all factors of sizes 1, 2, ..., b, where b = O(n/log n), are approximated by independent negative binomial random variables. As another example, the joint distribution of the large factors is close to the joint distribution of the large cycles in a random permutation. We show how these discrete approximations imply a Brownian motion functional central limit theorem and a Poisson-Dirichlet limit theorem, together with appropriate error estimates. We also give Poisson approximations, with error bounds, for the distribution of the total number of factors.

RERO DOC Digital Library
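
The setting of the "On random polynomials over finite fields" entry above rests on counting monic irreducible polynomials. The sketch below is an illustration and is not taken from the paper: it evaluates the standard Möbius-inversion count of monic irreducibles of degree d over a field of q elements and checks the identity that the sum of d times that count over the divisors d of n equals q^n. The function names are chosen for this example.

```python
def mobius(m):
    """Möbius function of a positive integer m, by trial division."""
    result = 1
    p = 2
    while p * p <= m:
        if m % p == 0:
            m //= p
            if m % p == 0:
                return 0  # m has a squared prime factor
            result = -result
        p += 1
    if m > 1:  # one remaining prime factor
        result = -result
    return result

def num_monic_irreducibles(q, d):
    """Number of monic irreducible polynomials of degree d over a field
    with q elements: (1/d) * sum over e dividing d of mobius(d/e) * q**e."""
    total = sum(mobius(d // e) * q ** e for e in range(1, d + 1) if d % e == 0)
    return total // d

def degree_weighted_count(q, n):
    """Sum of d * N_q(d) over divisors d of n; this equals q**n."""
    return sum(d * num_monic_irreducibles(q, d)
               for d in range(1, n + 1) if n % d == 0)

print([num_monic_irreducibles(2, d) for d in range(1, 7)])
print(degree_weighted_count(3, 6) == 3 ** 6)
```

For large n the fraction `num_monic_irreducibles(q, n) / q**n` is close to 1/n, which is also the probability that a uniform random permutation of n objects is a single n-cycle. This gives a small instance of the parallel between large factors and large cycles that the abstract describes.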

BayesPeak: Bayesian analysis of ChIP-seq data.

Authors: Lynch Andy G; Spyrou Christiana; Stark Rory; Tavaré Simon
Publication venue: Springer Science and Business Media LLC
Publication date: 01/09/2009

BACKGROUND: High-throughput sequencing technology has become popular and widely used to study protein and DNA interactions. Chromatin immunoprecipitation, followed by sequencing of the resulting samples, produces large amounts of data that can be used to map genomic features such as transcription factor binding sites and histone modifications.

METHODS: Our proposed statistical algorithm, BayesPeak, uses a fully Bayesian hidden Markov model to detect enriched locations in the genome. The structure accommodates the natural features of the Solexa/Illumina sequencing data and allows for overdispersion in the abundance of reads in different regions. Moreover, a control sample can be incorporated in the analysis to account for experimental and sequence biases. Markov chain Monte Carlo algorithms are applied to estimate the posterior distributions of the model parameters, and posterior probabilities are used to detect the sites of interest.

CONCLUSION: We have presented a flexible approach for identifying peaks from ChIP-seq reads, suitable for use on both transcription factor binding and histone modification data. Our method estimates probabilities of enrichment that can be used in downstream analysis. The method is assessed using experimentally verified data and is shown to provide high-confidence calls with low false positive rates.

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Apollo (Cambridge)

University of St. Andrews - Pure