
    Bayesian estimation of genomic copy number with single nucleotide polymorphism genotyping arrays

    Background: The identification of copy number aberration in the human genome is an important area in cancer research. We develop a model for determining genomic copy numbers using high-density single nucleotide polymorphism genotyping microarrays. The method is based on a Bayesian spatial normal mixture model with an unknown number of components corresponding to true copy numbers. A reversible jump Markov chain Monte Carlo algorithm is used to implement the model and perform posterior inference. Results: The performance of the algorithm is examined on both simulated and real cancer data, and it is compared with the popular CNAG algorithm for copy number detection. Conclusions: We demonstrate that our Bayesian mixture model performs at least as well as the hidden Markov model based CNAG algorithm and in certain cases does better. One of the added advantages of our method is the flexibility of modeling normal cell contamination in tumor samples.
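The mixture idea in this abstract can be illustrated with a much simpler, fixed-component version. The sketch below is not the authors' model: it ignores the spatial structure, fixes the number of components instead of inferring it by reversible jump MCMC, and assumes (hypothetically) that each copy number c produces a log2 intensity ratio centred at log2(c/2) with a shared standard deviation `sigma`. The function name, states, and default values are all illustrative assumptions.

```python
import math

def copy_number_posteriors(log_ratio, states=(1, 2, 3, 4), sigma=0.2, weights=None):
    """Posterior probability of each copy-number state for one probe.

    Assumptions (illustrative only): a fixed set of states, a shared
    standard deviation, and component means log2(c / 2).
    """
    if weights is None:
        weights = [1.0 / len(states)] * len(states)
    # Expected log2 ratio for copy number c against a diploid reference.
    means = [math.log2(c / 2.0) for c in states]
    # Log of weight times normal density; the shared normalising constant cancels.
    logs = [math.log(w) - 0.5 * ((log_ratio - m) / sigma) ** 2
            for w, m in zip(weights, means)]
    # Subtract the maximum before exponentiating to avoid underflow.
    peak = max(logs)
    unnorm = [math.exp(v - peak) for v in logs]
    total = sum(unnorm)
    return {c: u / total for c, u in zip(states, unnorm)}

# Usage: a probe with log2 ratio near 0 should favour the diploid state.
posterior = copy_number_posteriors(0.05)
best_state = max(posterior, key=posterior.get)
```

A full treatment would also place priors on the means, variances, and number of components, which is where the reversible jump sampler mentioned in the abstract comes in.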

    Statistical Methods For Genomic And Transcriptomic Sequencing

    Part 1: High-throughput sequencing of DNA coding regions has become a common way of assaying genomic variation in the study of human diseases. Copy number variation (CNV) is an important type of genomic variation, but CNV profiling from whole-exome sequencing (WES) is challenging due to the high level of biases and artifacts. We propose CODEX, a normalization and CNV calling procedure for WES data. CODEX includes a Poisson latent factor model, which includes terms that specifically remove biases due to GC content, exon capture and amplification efficiency, and latent systemic artifacts. CODEX also includes a Poisson likelihood-based segmentation procedure that explicitly models the count-based WES data. CODEX is compared to existing methods on germline CNV detection in HapMap samples using microarray-based gold standard and is further evaluated on 222 neuroblastoma samples with matched normal, with focus on somatic CNVs within the ATRX gene. Part 2: Cancer is a disease driven by evolutionary selection on somatic genetic and epigenetic alterations. We propose Canopy, a method for inferring the evolutionary phylogeny of a tumor using both somatic copy number alterations and single nucleotide alterations from one or more samples derived from a single patient. Canopy is applied to bulk sequencing datasets of both longitudinal and spatial experimental designs and to a transplantable metastasis model derived from human cancer cell line MDA-MB-231. Canopy successfully identifies cell populations and infers phylogenies that are in concordance with existing knowledge and ground truth. Through simulations, we explore the effects of key parameters on deconvolution accuracy, and compare against existing methods. Part 3: Allele-specific expression is traditionally studied by bulk RNA sequencing, which measures average expression across cells. Single-cell RNA sequencing (scRNA-seq) allows the comparison of expression distribution between the two alleles of a diploid organism and thus the characterization of allele-specific bursting. We propose SCALE to analyze genome-wide allele-specific bursting, with adjustment of technical variability. SCALE detects genes exhibiting allelic differences in bursting parameters, and genes whose alleles burst non-independently. We apply SCALE to mouse blastocyst and human fibroblast cells and find that, globally, cis control in gene expression overwhelmingly manifests as differences in burst frequency.
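To make the phrase "Poisson likelihood-based segmentation" in Part 1 concrete, here is a generic sketch of a Poisson log-likelihood ratio for one candidate segment. It assumes observed read counts per exon and normalized expected counts (the "null" of no copy number change) are already available. The function name and inputs are hypothetical, and this is a textbook statistic rather than a reproduction of the CODEX code.

```python
import math

def poisson_segment_llr(observed, expected):
    """Return (copy_ratio_mle, log_likelihood_ratio) for one segment.

    Model (an assumption for this sketch): observed[i] ~ Poisson(r * expected[i]),
    with r = 1 under the null. The MLE of r is sum(observed) / sum(expected).
    """
    total_obs = sum(observed)
    total_exp = sum(expected)
    if total_exp <= 0:
        raise ValueError("expected counts must sum to a positive value")
    ratio = total_obs / total_exp
    if total_obs == 0:
        # Limit of the expression below as the observed total goes to zero.
        llr = total_exp
    else:
        llr = total_obs * math.log(ratio) - (total_obs - total_exp)
    return ratio, llr

# Usage: three exons covered at roughly half the expected depth,
# which is what a heterozygous deletion would look like.
ratio, llr = poisson_segment_llr([50, 60, 40], [100, 100, 100])
```

A segmentation procedure would scan candidate segments, keep those with a large statistic, and apply a model selection criterion to decide how many change points to report.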

    Estimation of Parent Specific DNA Copy Number in Tumors using High-Density Genotyping Arrays

    Chromosomal gains and losses comprise an important type of genetic change in tumors, and can now be assayed using microarray hybridization-based experiments. Most current statistical models for DNA copy number estimate only total copy number and do not distinguish between the underlying quantities of the two inherited chromosomes. This latter information, sometimes called parent-specific copy number, is important for identifying allele-specific amplifications and deletions, for quantifying normal cell contamination, and for giving a more complete molecular portrait of the tumor. We propose a stochastic segmentation model for parent-specific DNA copy number in tumor samples, and give an estimation procedure that is computationally efficient and can be applied to data from the current high-density genotyping platforms. The proposed method does not require matched normal samples, and can estimate the unknown genotypes simultaneously with the parent-specific copy number. The new method is used to analyze 223 glioblastoma samples from the Cancer Genome Atlas (TCGA) project, giving a more comprehensive summary of the copy number events in these samples. Detailed case studies on these samples reveal the additional insights that can be gained from an allele-specific copy number analysis, such as the quantification of fractional gains and losses, the identification of copy-neutral loss of heterozygosity, and the characterization of regions of simultaneous changes of both inherited chromosomes.
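The link between parent-specific copy number, normal cell contamination, and "fractional" gains and losses can be shown with a short calculation. The sketch below assumes a sample that mixes tumor cells (all sharing one parent-specific state) with diploid normal cells, and a SNP that is heterozygous in the germline. The function and parameter names are hypothetical, and this is only the standard mixing arithmetic, not the segmentation model of the abstract.

```python
def expected_signals(major, minor, tumor_fraction):
    """Expected total copy number and minor-allele fraction in a mixed sample.

    major, minor: copy numbers of the two inherited chromosomes in tumor cells.
    tumor_fraction: fraction of cells in the sample that are tumor cells.
    Normal cells are assumed diploid with one copy of each chromosome.
    """
    if not 0.0 <= tumor_fraction <= 1.0:
        raise ValueError("tumor_fraction must lie in [0, 1]")
    normal_fraction = 1.0 - tumor_fraction
    total = tumor_fraction * (major + minor) + normal_fraction * 2
    minor_dose = tumor_fraction * minor + normal_fraction * 1
    if total == 0:
        # Pure tumor sample with a homozygous deletion: no allelic signal.
        return total, None
    return total, minor_dose / total

# Usage: copy-neutral loss of heterozygosity (2 + 0) at 50% tumor content.
# Total copy number stays at 2, yet the minor-allele fraction drops to 0.25,
# which is why total copy number alone cannot reveal this event.
total, minor_fraction = expected_signals(major=2, minor=0, tumor_fraction=0.5)
```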

    Statistical Methods for Bioinformatics: Estimation of Copy Number and Detection of Gene Interactions

    Identification of copy number aberrations in the human genome has been an important area in cancer research. In the first part of my thesis, I propose a new model for determining genomic copy numbers using high-density single nucleotide polymorphism genotyping microarrays. The method is based on a Bayesian spatial normal mixture model with an unknown number of components corresponding to true copy numbers. A reversible jump Markov chain Monte Carlo algorithm is used to implement the model and perform posterior inference. The second part of the thesis describes a new method for the detection of gene-gene interactions using gene expression data extracted from microarray experiments. The method is based on a two-step genetic algorithm, with the first step detecting main effects and the second step looking for interacting gene pairs. The performance of both algorithms is examined on simulated data and real cancer data and compared with popular existing algorithms. Conclusions are given and possible extensions are discussed.

    CODEX2: full-spectrum copy number variation detection by high-throughput DNA sequencing

    High-throughput DNA sequencing enables detection of copy number variations (CNVs) on the genome-wide scale with finer resolution compared to array-based methods but suffers from biases and artifacts that lead to false discoveries and low sensitivity. We describe CODEX2, a statistical framework for full-spectrum CNV profiling that is sensitive for variants with both common and rare population frequencies and that is applicable to study designs with and without negative control samples. We demonstrate and evaluate CODEX2 on whole-exome and targeted sequencing data, where biases are the most prominent. CODEX2 outperforms existing methods and, in particular, significantly improves sensitivity for common CNVs.

    Robust unmixing of tumor states in array comparative genomic hybridization data

    Motivation: Tumorigenesis is an evolutionary process by which tumor cells acquire sequences of mutations leading to increased growth, invasiveness and eventually metastasis. It is hoped that by identifying the common patterns of mutations underlying major cancer sub-types, we can better understand the molecular basis of tumor development and identify new diagnostics and therapeutic targets. This goal has motivated several attempts to apply evolutionary tree reconstruction methods to assays of tumor state. Inference of tumor evolution is in principle aided by the fact that tumors are heterogeneous, retaining remnant populations from different stages of their development along with contaminating healthy cell populations. In practice, though, this heterogeneity complicates interpretation of tumor data because distinct cell types are conflated by common methods for assaying the tumor state. We previously proposed a method to computationally infer cell populations from measures of tumor-wide gene expression through a geometric interpretation of mixture type separation, but this approach deals poorly with noisy and outlier data.

    STRONG: metagenomics strain resolution on assembly graphs

    We introduce STrain Resolution ON assembly Graphs (STRONG), which identifies strains de novo from multiple metagenome samples. STRONG performs coassembly and binning into metagenome-assembled genomes (MAGs), and stores the coassembly graph prior to variant simplification. This enables the subgraphs and their unitig per-sample coverages, for individual single-copy core genes (SCGs) in each MAG, to be extracted. A Bayesian algorithm, BayesPaths, determines the number of strains present, their haplotypes or sequences on the SCGs, and abundances. STRONG is validated using synthetic communities and, for a real anaerobic digestor time series, generates haplotypes that match those observed from long Nanopore reads.

    Utilizing Haplotypes for Sensitive SNP Array-based Discovery of Somatic Chromosomal Mutations

    Somatic copy-number (CN) gains and losses and copy-neutral loss of heterozygosity (CNLOH) frequently occur in tumors and play a major role in the progression of disease by altering gene dosage and unmasking deleterious recessive variants. Characterizing these mutations in an individual tumor sample is therefore critical for research on the relationship of specific mutations to disease outcome and for clinical decision-making based on mutations with known impact. A pervasive hindrance to sensitive detection of these mutations is genetic heterogeneity and high levels of contaminating normal cells in tumor samples, which limit the fraction of cells carrying informative mutations. The method presented here is the first to utilize population-based haplotype estimates to discover low-frequency somatic kilobase- to megabase-size CN alterations and CNLOH mutations using DNA microarrays. The major innovation of the method is the use of phase concordance as a robust metric to measure evidence of allelic imbalance in the face of sporadic phasing errors in the statistical haplotype estimates and stochastic variation in the microarray data. In addition to presenting a hidden Markov model that uses the phase concordance data to perform agnostic whole-genome discovery of imbalanced regions, we also describe how to test candidate regions and how to infer the haplotype of the major chromosome. We demonstrate through controlled experiments using lab-created tumor-normal mixture samples and in silico simulated data that the sensitivity is higher than that of existing methods, detecting specific imbalance events in samples with 7% tumor or less, while maintaining specificity. We also demonstrate the potential of the method via a real-data analysis of genomic mosaicism in the general population using over 30,000 samples that were previously analyzed using another method. We made nearly three times as many calls in these samples as the previous analysis (1,119 vs. 379), most of which appear to exist at low frequencies. These findings validate recent hypotheses that somatic variation in healthy tissues is more prevalent than had previously been reported, and provide valuable observations of in vivo mutations that can be studied to make inference on genetic robustness and how these mutations impact cell fitness.