Search CORE

42 research outputs found

Comparing Nonparametric Bayesian Tree Priors for Clonal Reconstruction of Tumors

Author: Deshwar Amit G.
Morris Quaid
Vembu Shankar
Publication venue
Publication date: 11/08/2014
Field of study

Statistical machine learning methods, especially nonparametric Bayesian methods, have become increasingly popular to infer clonal population structure of tumors. Here we describe the treeCRP, an extension of the Chinese restaurant process (CRP), a popular construction used in nonparametric mixture models, to infer the phylogeny and genotype of major subclonal lineages represented in the population of cancer cells. We also propose new split-merge updates tailored to the subclonal reconstruction problem that improve the mixing time of Markov chains. In comparisons with the tree-structured stick breaking prior used in PhyloSub, we demonstrate superior mixing and running time using the treeCRP with our new split-merge procedures. We also show that given the same number of samples, TSSB and treeCRP have similar ability to recover the subclonal structure of a tumor.Comment: Preprint of an article submitted for consideration in the Pacific Symposium on Biocomputing \c{opyright} 2015; World Scientific Publishing Co., Singapore, 2015; http://psb.stanford.edu

arXiv.org e-Print Archive

Crossref

Bayesian nonparametric models for biomedical data analysis

Author: Zhou Tianjian, Ph. D.
Publication venue
Publication date: 26/10/2017
Field of study

In this dissertation, we develop nonparametric Bayesian models for biomedical data analysis. In particular, we focus on inference for tumor heterogeneity and inference for missing data. First, we present a Bayesian feature allocation model for tumor subclone reconstruction using mutation pairs. The key innovation lies in the use of short reads mapped to pairs of proximal single nucleotide variants (SNVs). In contrast, most existing methods use only marginal reads for unpaired SNVs. In the same context of using mutation pairs, in order to recover the phylogenetic relationship of subclones, we then develop a Bayesian treed feature allocation model. In contrast to commonly used feature allocation models, we allow the latent features to be dependent, using a tree structure to introduce dependence. Finally, we propose a nonparametric Bayesian approach to monotone missing data in longitudinal studies with non-ignorable missingness. In contrast to most existing methods, our method allow for incorporating information from auxiliary covariates and is able to capture complex structures among the response, missingness and auxiliary covariates. Our models are validated through simulation studies and are applied to real-world biomedical datasets.Statistic

arXiv.org e-Print Archive

Texas ScholarWorks

Algorithmic methods to infer the evolutionary trajectories in cancer progression

Author: Antoniotti Marco
Caravagna Giulio
De Sano Luca
Graudenzi Alex
Mauri Giancarlo
Mishra Bud
Moreno Aguado Víctor
Ramazzotti Daniele
Sanz Pamplona Rebeca
Publication venue: 'Proceedings of the National Academy of Sciences'
Publication date: 10/02/2023
Field of study

The genomic evolution inherent to cancer relates directly to a renewed focus on the voluminous next-generation sequencing data and machine learning for the inference of explanatory models of how the (epi)genomic events are choreographed in cancer initiation and development. However, despite the increasing availability of multiple additional -omics data, this quest has been frustrated by various theoretical and technical hurdles, mostly stemming from the dramatic heterogeneity of the disease. In this paper, we build on our recent work on the 'selective advantage' relation among driver mutations in cancer progression and investigate its applicability to the modeling problem at the population level. Here, we introduce PiCnIc (Pipeline for Cancer Inference), a versatile, modular, and customizable pipeline to extract ensemble-level progression models from cross-sectional sequenced cancer genomes. The pipeline has many translational implications because it combines state-of-the-art techniques for sample stratification, driver selection, identification of fitness-equivalent exclusive alterations, and progression model inference. We demonstrate PiCnIc's ability to reproduce much of the current knowledge on colorectal cancer progression as well as to suggest novel experimentally verifiable hypotheses

Diposit Digital de la Universitat de Barcelona

Recommended from our members

Bayesian Inference for Genomic Data Analysis

Author: Ogundijo Oyetunji Enoch
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2019
Field of study

High-throughput genomic data contain gazillion of information that are influenced by the complex biological processes in the cell. As such, appropriate mathematical modeling frameworks are required to understand the data and the data generating processes. This dissertation focuses on the formulation of mathematical models and the description of appropriate computational algorithms to obtain insights from genomic data. Specifically, characterization of intra-tumor heterogeneity is studied. Based on the total number of allele copies at the genomic locations in the tumor subclones, the problem is viewed from two perspectives: the presence or absence of copy-neutrality assumption. With the presence of copy-neutrality, it is assumed that the genome contains mutational variability and the three possible genotypes may be present at each genomic location. As such, the genotypes of all the genomic locations in the tumor subclones are modeled by a ternary matrix. In the second case, in addition to mutational variability, it is assumed that the genomic locations may be affected by structural variabilities such as copy number variation (CNV). Thus, the genotypes are modeled with a pair of (Q + 1)-ary matrices. Using the categorical Indian buffet process (cIBP), state-space modeling framework is employed in describing the two processes and the sequential Monte Carlo (SMC) methods for dynamic models are applied to perform inference on important model parameters. Moreover, the problem of estimating gene regulatory network (GRN) from measurement with missing values is presented. Specifically, gene expression time series data may contain missing values for entire expression values of a single point or some set of consecutive time points. However, complete data is often needed to make inference on the underlying GRN. Using the missing measurement, a dynamic stochastic model is used to describe the evolution of gene expression and point-based Gaussian approximation (PBGA) filters with one-step or two-step missing measurements are applied for the inference. Finally, the problem of deconvolving gene expression data from complex heterogeneous biological samples is examined, where the observed data are a mixture of different cell types. A statistical description of the problem is used and the SMC method for static models is applied to estimate the cell-type specific expressions and the cell type proportions in the heterogeneous samples

Columbia University Academic Commons

Recommended from our members

STATISTICAL METHODS TO STUDY TRANSPOSON SEQUENCING DATA: NONPARAMETRIC BAYESIAN MODELS WITH SAMPLING ALGORITHMS

Author: He Shai
Publication venue: ScholarWorks@UMass Amherst
Publication date: 26/10/2022
Field of study

As the development of Next Generation Sequencing(NGS) technology, researchers can easily obtain data from millions of cells( bulk samples) or just collecting data from a single cell. However, while bulk samples can capture broad changes, it may risk providing an average measurement that is not representative of the genetic state of any individual cell. While single-cell experiments can capture the genetic state of the individual cell, a single cell sample can increase uncertainty, sampling enough cells to gain a representative sample of population is expensive. Therefore, there is a need to integrate information from both bulk and single-cell data to obtain a comprehensive understanding of subclonal populations in an individual tumor as well as across individuals. The goal of this work is to jointly infer the underlying genotypes of tumor subpopulations and the distribution of those subpopulations in individual tumors by integrating single-cell and bulk sequencing data. We propose a hierarchical Dirichlet process mixture model that incorporates the correlation structure induced by a structured sampling arrangement and we show that this model improves the quality of inference. We develop a representation of the hierarchical Dirichlet process prior as a Gamma-Poisson hierarchy and we use this representation to derive a fast Gibbs sampling inference algorithm using the augment-and-marginalize method. Experiments shows that our model outperforms state-of-the-art methods Another goal for analyzing genomic data is to understand which genes are essential and under what environmental conditions they are essential. Transposon sequencing method provides a powerful tool for researchers to find conditionally essential genes. However, methods are needed to go beyond a one-at-a-time analysis of conditionally essential genes and learn higher order representations that identify conditionally essential networks of genes. While the aforementioned methods do identify essential genes from transposon sequencing data, they do not provide a representation of the space of essential genes. For example, if two genes share the same pattern of essentiality across all conditions there is a higher level representation that couples those genes into a network. The goal of this work is to build such a higher level representations of the set of essential genes and identify genes that share essentiality patterns across conditions. To address this need, we develop a novel, computationally efficient hierarchical non-parametric Bayesian model: hierarchical Gamma-Poisson Process (hGP)

ScholarWorks@UMass Amherst

Characterizing genetic intra-tumor heterogeneity across 2,658 human cancer genomes

Author: Dentro Stefan C.
Mustonen Ville
PCAWG Consortium
PCAWG Evolution Heterogeneity Work
Publication venue
Publication date: 15/04/2021
Field of study

Intra-tumor heterogeneity (ITH) is a mechanism of therapeutic resistance and therefore an important clinical challenge. However, the extent, origin, and drivers of ITH across cancer types are poorly understood. To address this, we extensively characterize ITH across whole-genome sequences of 2,658 cancer samples spanning 38 cancer types. Nearly all informative samples (95.1 %) contain evidence of distinct subclonal expansions with frequent branching relationships between subclones, We observe positive selection of subclonal driver mutations across most cancer types and identify cancer type-specific subclonal patterns of driver gene mutations, fusions, structural variants, and copy number alterations as well as dynamic changes in mutational processes between subclonal expansions. Our results underline the importance of ITH and its drivers in tumor evolution and provide a pan-cancer resource of comprehensively annotated subclonal events from whole-genome sequencing data.Peer reviewe

Helsingin yliopiston digitaalinen arkisto

Characterizing genetic intra-tumor heterogeneity across 2,658 human cancer genomes.

Intra-tumor heterogeneity (ITH) is a mechanism of therapeutic resistance and therefore an important clinical challenge. However, the extent, origin, and drivers of ITH across cancer types are poorly understood. To address this, we extensively characterize ITH across whole-genome sequences of 2,658 cancer samples spanning 38 cancer types. Nearly all informative samples (95.1%) contain evidence of distinct subclonal expansions with frequent branching relationships between subclones. We observe positive selection of subclonal driver mutations across most cancer types and identify cancer type-specific subclonal patterns of driver gene mutations, fusions, structural variants, and copy number alterations as well as dynamic changes in mutational processes between subclonal expansions. Our results underline the importance of ITH and its drivers in tumor evolution and provide a pan-cancer resource of comprehensively annotated subclonal events from whole-genome sequencing data

UCL Discovery

University of Melbourne Institutional Repository

Statistical Methods For Genomic And Transcriptomic Sequencing

Author: Jiang Yuchao
Publication venue: ScholarlyCommons
Publication date: 01/01/2017
Field of study

Part 1: High-throughput sequencing of DNA coding regions has become a common way of assaying genomic variation in the study of human diseases. Copy number variation (CNV) is an important type of genomic variation, but CNV profiling from whole-exome sequencing (WES) is challenging due to the high level of biases and artifacts. We propose CODEX, a normalization and CNV calling procedure for WES data. CODEX includes a Poisson latent factor model, which includes terms that specifically remove biases due to GC content, exon capture and amplification efficiency, and latent systemic artifacts. CODEX also includes a Poisson likelihood-based segmentation procedure that explicitly models the count-based WES data. CODEX is compared to existing methods on germline CNV detection in HapMap samples using microarray-based gold standard and is further evaluated on 222 neuroblastoma samples with matched normal, with focus on somatic CNVs within the ATRX gene. Part 2: Cancer is a disease driven by evolutionary selection on somatic genetic and epigenetic alterations. We propose Canopy, a method for inferring the evolutionary phylogeny of a tumor using both somatic copy number alterations and single nucleotide alterations from one or more samples derived from a single patient. Canopy is applied to bulk sequencing datasets of both longitudinal and spatial experimental designs and to a transplantable metastasis model derived from human cancer cell line MDA-MB-231. Canopy successfully identifies cell populations and infers phylogenies that are in concordance with existing knowledge and ground truth. Through simulations, we explore the effects of key parameters on deconvolution accuracy, and compare against existing methods. Part 3: Allele-specific expression is traditionally studied by bulk RNA sequencing, which measures average expression across cells. Single-cell RNA sequencing (scRNA-seq) allows the comparison of expression distribution between the two alleles of a diploid organism and thus the characterization of allele-specific bursting. We propose SCALE to analyze genome-wide allele-specific bursting, with adjustment of technical variability. SCALE detects genes exhibiting allelic differences in bursting parameters, and genes whose alleles burst non-independently. We apply SCALE to mouse blastocyst and human fibroblast cells and find that, globally, cis control in gene expression overwhelmingly manifests as differences in burst frequency

ScholarlyCommons@Penn

A phylogenetic latent feature model for clonal deconvolution

Author: Marass F
Markowetz F
Mouliere F
Rosenfeld N
Yuan K
Publication venue: Annals of Applied Statistics
Publication date: 06/04/2016
Field of study

Tumours develop in an evolutionary process, in which the accumulation of mutations produces subpopulations of cells with distinct mutational profiles, called clones. This process leads to the genetic heterogeneity widely observed in tumour sequencing data, but identifying the genotypes and frequencies of the different clones is still a major challenge. Here, we present Cloe, a phylogenetic latent feature model to deconvolute tumour sequencing data into a set of related genotypes. Our approach extends latent feature models by placing the features as nodes in a latent tree. The resulting model can capture both the acquisition and the loss of mutations, as well as episodes of convergent evolution. We establish the validity of Cloe on synthetic data and assess its performance on controlled biological data, comparing our reconstructions to those of several published state-of-the-art methods. We show that our method provides highly accurate reconstructions and identifies the number of clones, their genotypes and frequencies even at a modest sequencing depth. As a proof of concept, we apply our model to clinical data from three cases with chronic lymphocytic leukaemia and one case with acute myeloid leukaemia.CRUK (Core grant C14303/A17197, A20240 (Rosenfeld lab core grant), A19274 (Markowetz lab core grant)), University of Cambridge, Hutchison Whampoa Limite

arXiv.org e-Print Archive

Crossref

Enlighten

Apollo (Cambridge)