1,278 research outputs found

    Implications of non-uniqueness in phylogenetic deconvolution of bulk DNA samples of tumors

    Background: Tumors exhibit extensive intra-tumor heterogeneity, the presence of groups of cellular populations with distinct sets of somatic mutations. This heterogeneity is the result of an evolutionary process, described by a phylogenetic tree. In addition to enabling clinicians to devise patient-specific treatment plans, phylogenetic trees of tumors enable researchers to decipher the mechanisms of tumorigenesis and metastasis. However, the problem of reconstructing a phylogenetic tree T given bulk sequencing data from a tumor is more complicated than the classic phylogeny inference problem. Rather than observing the leaves of T directly, we are given mutation frequencies that are the result of mixtures of the leaves of T. The majority of current tumor phylogeny inference methods employ the perfect phylogeny evolutionary model. The underlying PERFECT PHYLOGENY MIXTURE (PPM) combinatorial problem typically has multiple solutions. Results: We prove that determining the exact number of solutions to the PPM problem is #P-complete and hard to approximate within a constant factor. Moreover, we show that sampling solutions uniformly at random is hard as well. On the positive side, we provide a polynomial-time computable upper bound on the number of solutions and introduce a simple rejection-sampling based scheme that works well for small instances. Using simulated and real data, we identify factors that contribute to and counteract non-uniqueness of solutions. In addition, we study the sampling performance of current methods, identifying significant biases. Conclusions: Awareness of non-uniqueness of solutions to the PPM problem is key to drawing accurate conclusions in downstream analyses based on tumor phylogenies. This work provides the theoretical foundations for non-uniqueness of solutions in tumor phylogeny inference from bulk DNA samples.
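
    To make the non-uniqueness of PPM solutions concrete, the following is a minimal sketch that enumerates every candidate tree for a tiny instance by brute force. It is not the rejection-sampling scheme from the paper. It assumes the sum-condition characterization commonly used in the tumor phylogeny literature (in every sample, a mutation's frequency must be at least the sum of its children's frequencies), which the abstract itself does not state. The function names, the choice of mutation 0 as root, and the toy frequencies are all hypothetical.

```python
from itertools import product


def satisfies_sum_condition(parent, F, eps=1e-9):
    """Assumed PPM criterion: in each sample (row of F), the frequency of a
    mutation is at least the summed frequency of its children in the tree."""
    n = len(parent)
    for row in F:
        child_sum = [0.0] * n
        for child, par in enumerate(parent):
            if par is not None:
                child_sum[par] += row[child]
        if any(row[i] + eps < child_sum[i] for i in range(n)):
            return False
    return True


def is_tree(parent, root):
    """True if every node reaches the root by following parent pointers."""
    for start in range(len(parent)):
        seen = set()
        node = start
        while node != root:
            if node in seen:
                return False  # cycle that never reaches the root
            seen.add(node)
            node = parent[node]
    return True


def enumerate_solutions(F, root=0):
    """Brute-force all parent assignments; feasible only for tiny inputs."""
    n = len(F[0])
    others = [i for i in range(n) if i != root]
    solutions = []
    for choice in product(range(n), repeat=len(others)):
        if any(node == par for node, par in zip(others, choice)):
            continue  # a mutation cannot be its own parent
        parent = [None] * n
        for node, par in zip(others, choice):
            parent[node] = par
        if is_tree(parent, root) and satisfies_sum_condition(parent, F):
            solutions.append(tuple(parent))
    return solutions


# One sample: two trees explain the same frequencies.
print(enumerate_solutions([[1.0, 0.5, 0.3]]))
# Two samples: in this toy instance only one tree remains.
print(enumerate_solutions([[1.0, 0.5, 0.3], [1.0, 0.2, 0.7]]))
```

    In this toy instance a single sample admits both the star tree and the chain, while the second sample rules out the chain. The enumeration grows exponentially with the number of mutations, so it only illustrates the problem and is not a practical method.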

    Parsimonious Clone Tree Reconciliation in Cancer

    Every tumor is composed of heterogeneous clones, each corresponding to a distinct subpopulation of cells that accumulated different types of somatic mutations, ranging from single-nucleotide variants (SNVs) to copy-number aberrations (CNAs). As the analysis of this intra-tumor heterogeneity has important clinical applications, several computational methods have been introduced to identify clones from DNA sequencing data. However, due to technological and methodological limitations, current analyses are restricted to identifying tumor clones based on either SNVs or CNAs alone, preventing a comprehensive characterization of a tumor's clonal composition. To overcome these challenges, we formulate the identification of clones in terms of both SNVs and CNAs as a reconciliation problem while accounting for uncertainty in the input SNV and CNA proportions. We characterize the computational complexity of this problem and introduce a mixed integer linear programming formulation to solve it exactly. On simulated data, we show that tumor clones can be identified reliably, especially when further taking into account the ancestral relationships that can be inferred from the input SNVs and CNAs. On 49 tumor samples from 10 prostate cancer patients, our reconciliation approach provides a higher resolution view of tumor evolution than previous studies.

    Graph algorithms for the haplotyping problem

    Evidence from investigations of genetic differences among human beings shows that genetic diseases are often the result of genetic mutations. The most common form of these mutations is the single nucleotide polymorphism (SNP). A complete map of all SNPs in the human genome will be extremely valuable for studying the relationships between specific haplotypes and specific genetic diseases. Some recent discoveries show that the DNA sequence of human beings can be partitioned into long blocks where genetic recombination has been rare. Inferring both haplotypes from chromosome sequences is therefore a biologically meaningful research topic, one that poses compound mathematical and computational problems. We are interested in the algorithmic implications of inferring haplotypes from long blocks of DNA that have not undergone recombination in populations. This assumption justifies a model of haplotype evolution: haplotypes in a population evolve along a coalescent which, under the standard population-genetic assumption of infinite sites, is a perfect phylogeny when viewed as a rooted tree. The Perfect Phylogeny Haplotyping (PPH) Problem was introduced by Daniel Gusfield in 2002. A nearly linear-time solution to the PPH problem, running in O(nm α(nm)) time, where α is the extremely slowly growing inverse Ackermann function, has been provided. However, it is very complex and difficult to implement. So far, even the best practical solution to the PPH problem has a worst-case running time of O(nm^2). D. Gusfield conjectured that a linear-time (O(nm)) solution to the PPH problem should be possible. We settle the conjecture of Gusfield by introducing a linear-time algorithm for the PPH problem. Different kinds of posets for haplotype matrices and genotype matrices are designed and the relationships between them are studied. Since redundant calculations can be avoided by the transitivity of partial ordering in posets, we design a linear-time (O(nm)) algorithm for the PPH problem that provides all the possible solutions from an input. The algorithm is fully implemented and the simulation shows that it is much faster than previous methods.
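
    As background for the perfect phylogeny model mentioned above, here is a minimal sketch of the classical four-gamete test on a binary haplotype matrix. It is not the linear-time PPH algorithm of the thesis, which works on genotype matrices. The test relies on the standard result that a binary matrix admits an (unrooted) perfect phylogeny exactly when no pair of columns shows all four combinations 00, 01, 10 and 11. The function name and the toy matrices are hypothetical.

```python
from itertools import combinations


def admits_perfect_phylogeny(haplotypes):
    """Four-gamete test: rows are haplotypes, columns are binary SNP sites.

    Returns False as soon as some pair of sites exhibits all four
    gametes (0,0), (0,1), (1,0), (1,1); otherwise returns True.
    This naive check costs O(n * m^2) for n rows and m sites.
    """
    if not haplotypes:
        return True
    num_sites = len(haplotypes[0])
    for a, b in combinations(range(num_sites), 2):
        gametes = {(row[a], row[b]) for row in haplotypes}
        if len(gametes) == 4:
            return False
    return True


compatible = [[0, 0], [0, 1], [1, 0]]
conflicting = compatible + [[1, 1]]
print(admits_perfect_phylogeny(compatible))   # True
print(admits_perfect_phylogeny(conflicting))  # False
```

    In the PPH setting the haplotypes are not observed directly, so a check of this kind would only validate a candidate pair of haplotype rows proposed for each genotype.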

    Predicting Horizontal Gene Transfers with Perfect Transfer Networks

    Horizontal gene transfer inference approaches are usually based on gene sequences: parametric methods search for patterns that deviate from a particular genomic signature, while phylogenetic methods use sequences to reconstruct the gene and species trees. However, it is well known that sequence-based methods have difficulty identifying ancient transfers, since mutations have enough time to erase all evidence of such events. In this work, we ask whether character-based methods can predict gene transfers. Their advantage over sequences is that homologous genes can have low DNA similarity, but still have retained enough important common motifs that allow them to have common character traits, for instance the same functional or expression profile. A phylogeny that has two separate clades that acquired the same character independently might indicate the presence of a transfer even in the absence of sequence similarity. We introduce perfect transfer networks, which are phylogenetic networks that can explain the character diversity of a set of taxa. This problem has been studied extensively in the form of ancestral recombination networks, but these only model hybridization events and do not differentiate between direct parents and lateral donors. We focus on tree-based networks, in which edges representing vertical descent are clearly distinguished from those that represent horizontal transmission. Our model is a direct generalization of perfect phylogeny models to such networks. Our goal is to initiate a study on the structural and algorithmic properties of perfect transfer networks. We then show that in polynomial time, one can decide whether a given network is a valid explanation for a set of taxa, and show how, for a given tree, one can add transfer edges to it so that it explains a set of taxa.

    Towards characterizing the solution space of the 1-Dollo Phylogeny problem

    Cancer cells may mutate multiple times, from a normal state to a mutated state and vice versa. Given sequenced data, we can model the mutation process with a phylogenetic tree. One representative model is k-Dollo parsimony, where all observed mutations descend from a single normal cell and each character of a cell is gained at most once and lost at most k times. We examine the 1-Dollo Phylogeny problem: does a 1-Dollo phylogeny, a tree that follows the 1-Dollo parsimony model, exist for a given set of observations? Current algorithms for the 1-Dollo Phylogeny problem only tell us whether or not a set of observations has a 1-Dollo phylogeny by outputting a single solution. We explore the structure of 1-Dollo phylogenies and use our idea of a skeleton to develop an algorithm that enumerates all 1-Dollo phylogenies for any set of observations. This algorithm runs much faster than the naive brute force enumeration algorithm for random input. The implementation is here: https://github.com/sxie12/skeleton_solver

    Reconstructing cancer genomes from paired-end sequencing data

    Background: A cancer genome is derived from the germline genome through a series of somatic mutations. Somatic structural variants - including duplications, deletions, inversions, translocations, and other rearrangements - result in a cancer genome that is a scrambling of intervals, or "blocks" of the germline genome sequence. We present an efficient algorithm for reconstructing the block organization of a cancer genome from paired-end DNA sequencing data. Results: By aligning paired reads from a cancer genome - and a matched germline genome, if available - to the human reference genome, we derive: (i) a partition of the reference genome into intervals; (ii) adjacencies between these intervals in the cancer genome; (iii) an estimated copy number for each interval. We formulate the Copy Number and Adjacency Genome Reconstruction Problem of determining the cancer genome as a sequence of the derived intervals that is consistent with the measured adjacencies and copy numbers. We design an efficient algorithm, called Paired-end Reconstruction of Genome Organization (PREGO), to solve this problem by reducing it to an optimization problem on an interval-adjacency graph constructed from the data. The solution to the optimization problem results in an Eulerian graph, containing an alternating Eulerian tour that corresponds to a cancer genome that is consistent with the sequencing data. We apply our algorithm to five ovarian cancer genomes that were sequenced as part of The Cancer Genome Atlas. We identify numerous rearrangements, or structural variants, in these genomes, analyze reciprocal vs. non-reciprocal rearrangements, and identify rearrangements consistent with known mechanisms of duplication such as tandem duplications and breakage/fusion/bridge (B/F/B) cycles. Conclusions: We demonstrate that PREGO efficiently identifies complex and biologically relevant rearrangements in cancer genome sequencing data. An implementation of the PREGO algorithm is available at http://compbio.cs.brown.edu/software/.

    Statistical Methods For Genomic And Transcriptomic Sequencing

    Part 1: High-throughput sequencing of DNA coding regions has become a common way of assaying genomic variation in the study of human diseases. Copy number variation (CNV) is an important type of genomic variation, but CNV profiling from whole-exome sequencing (WES) is challenging due to the high level of biases and artifacts. We propose CODEX, a normalization and CNV calling procedure for WES data. CODEX includes a Poisson latent factor model, which includes terms that specifically remove biases due to GC content, exon capture and amplification efficiency, and latent systemic artifacts. CODEX also includes a Poisson likelihood-based segmentation procedure that explicitly models the count-based WES data. CODEX is compared to existing methods on germline CNV detection in HapMap samples using microarray-based gold standard and is further evaluated on 222 neuroblastoma samples with matched normal, with focus on somatic CNVs within the ATRX gene. Part 2: Cancer is a disease driven by evolutionary selection on somatic genetic and epigenetic alterations. We propose Canopy, a method for inferring the evolutionary phylogeny of a tumor using both somatic copy number alterations and single nucleotide alterations from one or more samples derived from a single patient. Canopy is applied to bulk sequencing datasets of both longitudinal and spatial experimental designs and to a transplantable metastasis model derived from human cancer cell line MDA-MB-231. Canopy successfully identifies cell populations and infers phylogenies that are in concordance with existing knowledge and ground truth. Through simulations, we explore the effects of key parameters on deconvolution accuracy, and compare against existing methods. Part 3: Allele-specific expression is traditionally studied by bulk RNA sequencing, which measures average expression across cells. 
Single-cell RNA sequencing (scRNA-seq) allows the comparison of expression distribution between the two alleles of a diploid organism and thus the characterization of allele-specific bursting. We propose SCALE to analyze genome-wide allele-specific bursting, with adjustment of technical variability. SCALE detects genes exhibiting allelic differences in bursting parameters, and genes whose alleles burst non-independently. We apply SCALE to mouse blastocyst and human fibroblast cells and find that, globally, cis control in gene expression overwhelmingly manifests as differences in burst frequency

    Bayesian mixture modelling of migration by founder analysis

    In this thesis a new method is proposed to estimate major periods of migration from one region into another using phased, non-recombined sequence data from the present. The assumption is made that migration occurs in multiple waves and that during each migration period, a number of sequences, called 'founder sequences', migrate into the new region. It is first shown through appropriate simulations based on the structured coalescent that previous inferences based on the idea of founder sequences suffer from a fundamental problem: they assume that migration events coincide with the nodes (coalescent events) of the reconstructed tree. It is shown that such an assumption leads to contradictions with the assumed underlying migration process, and that inferences based on such a method have the potential for bias in the date estimates obtained. An improved method is proposed which involves 'connected star trees', a tree structure that allows the uncertainty in the time of the migration event to be modelled in a probabilistic manner. Useful theoretical results under this assumption are derived. To model the uncertainty of which founder sequence belongs to which migration period, a Bayesian mixture modelling approach is taken, with inferences made by Markov Chain Monte Carlo techniques. Using the developed model, a reanalysis of a dataset that pertains to the settlement of Europe is undertaken. It is shown that sensible inferences can be made under certain conditions using the new model. However, it is also shown that questions of major interest cannot be answered, and certain inferences cannot be made due to an inherent lack of information in any dataset composed of sequences from the present day. It is argued that many of the major questions of interest regarding the migration of modern day humans into Europe cannot be answered without strong prior assumptions being made by the investigator. It is further argued that the same reasons that prohibit certain inferences from being made under the proposed model would remain in any method which has similar assumptions.

    A prickly puzzle: phylogeny and evolution of the Carduus-Cirsium group (Cardueae: Compositae), and untangling the taxonomy of Cirsium in North America

    2020 Fall. Includes bibliographical references. Generic delimitations within the cosmopolitan Carduus-Cirsium group (i.e., "thistles") have a long history of taxonomic confusion and debate. We present the most comprehensive molecular phylogeny of the group to date to test generic limits, reconstruct the evolution of pappus type, and elucidate the role of chromosomal evolution. We offer two solutions for the recognition of monophyletic genera: (1) consolidate all taxa into one large genus (Carduus or Cirsium), or (2) recognize each major clade as a genus (Carduus, Cirsium, Eriolepis, Notobasis, Picnomon, Silybum, and Tyrimnus). Under the second proposal, the cryptic genus Eriolepis is segregated from Cirsium, and the African Carduus are included within Cirsium. The best diagnosable morphological character to delimit the genera is pollen type, which is not practical in field-based application. We caution that prior to implementing either solution, a thorough, comprehensive morphological analysis of all current members of Cirsium sect. Epitrachys (= genus Eriolepis) be completed. Future morphological studies may find additional achene or leaf surface characters that could be used for practical field identification of the segregate genera. The data show that the plumose pappus state is symplesiomorphic for the group, with one transition to barbellate pappus, likely followed by a reversal to its ancestral state as the group colonized Eurasia. The data are consistent with a North African origin in the region of the Mediterranean and a single colonization event to North America. An ancestral chromosome state of n = 17 is hypothesized for the group, and a descending dysploidy series in Carduus is hypothesized to correspond with the aridification of the Mediterranean region. The Carduus-Cirsium group highlights the difficulty of delimiting morphologically similar, cryptic genera. Cirsium is one of the most taxonomically challenging groups of Compositae in North America.
This study represents the first attempt to infer a broadly sampled phylogeny of Cirsium in North America. The two main objectives are to: (1) test whether currently hypothesized species variety complexes (C. arizonicum, C. clavatum, C. eatonii, and C. scariosum) constitute monophyletic lineages, and (2) recircumscribe any taxa that are identified as problematic. Phylogeny reconstructions based on DNA sequence data from two nuclear ribosomal regions and four plastid markers were used to infer evolutionary lineages and test species' delimitations. Eight species varietal complexes were resolved as polyphyletic. We recircumscribed these complexes and in doing so found evidence to support the recognition of six new taxa. We hypothesize that the extensive taxonomic difficulty within Cirsium is the result of several factors: 1) previously undescribed taxa, 2) inadequate representation of taxa from herbarium specimens, 3) phenotypic convergence, 4) hybridization, and 5) incipient speciation. While we can provide evidence to support the recircumscription of some taxa, others remain unresolved