33 research outputs found

    An Integer Programming Formulation of the Parsimonious Loss of Heterozygosity Problem


    Parsimonious Clone Tree Reconciliation in Cancer

    Every tumor is composed of heterogeneous clones, each corresponding to a distinct subpopulation of cells that accumulated different types of somatic mutations, ranging from single-nucleotide variants (SNVs) to copy-number aberrations (CNAs). As the analysis of this intra-tumor heterogeneity has important clinical applications, several computational methods have been introduced to identify clones from DNA sequencing data. However, due to technological and methodological limitations, current analyses are restricted to identifying tumor clones based on either SNVs or CNAs alone, preventing a comprehensive characterization of a tumor's clonal composition. To overcome these challenges, we formulate the identification of clones in terms of both SNVs and CNAs as a reconciliation problem while accounting for uncertainty in the input SNV and CNA proportions. We characterize the computational complexity of this problem and introduce a mixed integer linear programming formulation to solve it exactly. On simulated data, we show that tumor clones can be identified reliably, especially when further taking into account the ancestral relationships that can be inferred from the input SNVs and CNAs. On 49 tumor samples from 10 prostate cancer patients, our reconciliation approach provides a higher-resolution view of tumor evolution than previous studies.

    Parsimonious Clone Tree Integration in cancer

    BACKGROUND: Every tumor is composed of heterogeneous clones, each corresponding to a distinct subpopulation of cells that accumulated different types of somatic mutations, ranging from single-nucleotide variants (SNVs) to copy-number aberrations (CNAs). As the analysis of this intra-tumor heterogeneity has important clinical applications, several computational methods have been introduced to identify clones from DNA sequencing data. However, due to technological and methodological limitations, current analyses are restricted to identifying tumor clones based on either SNVs or CNAs alone, preventing a comprehensive characterization of a tumor's clonal composition. RESULTS: To overcome these challenges, we formulate the identification of clones in terms of both SNVs and CNAs as an integration problem while accounting for uncertainty in the input SNV and CNA proportions. We characterize the computational complexity of this problem and introduce PACTION (PArsimonious Clone Tree integratION), an algorithm that solves the problem using a mixed integer linear programming formulation. On simulated data, we show that tumor clones can be identified reliably, especially when further taking into account the ancestral relationships that can be inferred from the input SNVs and CNAs. On 49 tumor samples from 10 prostate cancer patients, our integration approach provides a higher-resolution view of tumor evolution than previous studies. CONCLUSION: PACTION is an accurate and fast method that reconstructs the clonal architecture of tumors by integrating SNV and CNA clones inferred using existing methods.

    Computational haplotyping : theory and practice

    Genomics has paved a new way to comprehend life and its evolution, and also to investigate causes of diseases and their treatment. One of the important problems in genomic analyses is haplotype assembly. Constructing complete and accurate haplotypes plays an essential role in understanding population genetics and how species evolve. In this thesis, we focus on computational approaches to haplotype assembly from third-generation sequencing technologies. This involves huge amounts of sequencing data, and such data contain errors due to the single-molecule sequencing protocols employed. Taking advantage of combinatorial formulations helps to correct for these errors to solve the haplotyping problem. Various computational techniques such as dynamic programming, parameterized algorithms, and graph algorithms are used to solve this problem. This thesis presents several contributions concerning the area of haplotyping. First, a novel algorithm based on dynamic programming is proposed to provide approximation guarantees for phasing a single individual. Second, an integrative approach is introduced for combining multiple sequencing datasets to generate complete and accurate haplotypes. The effectiveness of this integrative approach is demonstrated on a real human genome. Third, we provide a novel efficient approach to phasing pedigrees and demonstrate its advantages in comparison to phasing a single individual. Fourth, we present a generalized graph-based framework for performing haplotype-aware de novo assembly. Specifically, this generalized framework consists of a hybrid pipeline for generating accurate and complete haplotypes from data stemming from multiple sequencing technologies, one that provides accurate reads and another that provides long reads.
Genomics has opened up new ways to understand the evolution of living organisms, to investigate the causes of numerous diseases, and to develop new therapies. An important problem is the assembly of an individual's haplotypes. This reconstruction of haplotypes plays a central role in understanding population genetics and the evolution of a species. This thesis presents algorithms for haplotype assembly based on third-generation sequencing data. This requires large amounts of data, which in turn contain errors produced by the underlying sequencing protocols. Combinatorial formulations of the problem nevertheless make haplotype reconstruction possible, because errors can be corrected successfully. Various computational methods, such as dynamic programming, parameterized algorithms, and graph algorithms, can be used to solve this problem. This thesis presents several approaches to haplotype reconstruction. First, a novel algorithm is presented that, based on the principle of dynamic programming, provides approximation guarantees for haplotyping a single individual. Second, an integrative approach is presented for combining multiple sequencing datasets and thereby generating accurate haplotypes. The effectiveness of this method is demonstrated on a real human dataset. Third, a new, efficient algorithm is described for simultaneously constructing the haplotypes of related individuals, and its advantages over considering single individuals are shown. Fourth, we present a graph-based method for performing de novo assembly using haplotype information. This method combines data from different sequencing technologies that deliver either accurate or long sequencing reads.

    Investigating the Epidemiology of bovine Tuberculosis in the European Badger

    Global health is becoming increasingly reliant on our understanding and management of wildlife disease. An estimated 60% of emerging infectious diseases in humans are zoonotic, and with human-wildlife interactions set to increase as populations rise and we expand further into wild habitats, there is pressure to seek modelling frameworks that enable a deeper understanding of natural systems. Survival and mortality are fundamental parameters of interest when investigating the impact of disease, with far-reaching implications for species conservation, management and control. Survival analysis has traditionally been dominated by non- and semi-parametric methods, but these can sometimes miss subtle yet important dynamics. Survival and mortality trajectory analysis can alleviate some of these problems by fitting fully parametric functions that describe lifespan patterns of mortality and survival. In the first part of this thesis we investigate the use of survival and mortality trajectories in epidemiology and uncover novel patterns of age-, sex- and infection-specific mortality in a wild population of European badgers (Meles meles) naturally infected with Mycobacterium bovis, the causative agent of bovine tuberculosis (bTB). Limitations of dedicated software packages for conducting such analyses led us to investigate alternative methods to build models from first principles, and we found the NIMBLE package to offer an attractive blend of flexibility and speed. We create a novel parameterisation of the Siler model to enable more flexible model specification but encounter the common problem of competing models having comparable fits to the data. Multi-model inference approaches can alleviate some of these issues but require efficient methods to carry out model comparisons; we present an approach based on the estimation of the marginal likelihood through importance sampling and demonstrate its application through a series of simulation and case studies. 
The approach works well for both census and capture-mark-recapture (CMR) data, both of which are common within ecological research, but we uncover challenges in recording and modelling early-life mortality dynamics that occur as a result of the CMR sampling process. The final part of the thesis looks at an alternative approach to model comparison that does not require direct estimation of the marginal likelihood, Reversible Jump Markov Chain Monte Carlo (RJMCMC), which is particularly efficient when the models to be compared are nested and the problem can be reduced to one of variable selection. In the final chapter we carry out an investigation of age-, sex-, infection- and inbreeding-specific variation in survival and mortality in a wild population of European badgers naturally infected with bovine tuberculosis. Using the methods and knowledge presented in the earlier chapters of this thesis, we uncover patterns of mortality consistent with both the mutation accumulation and antagonistic pleiotropy theories of senescence and, most interestingly, uncover antagonistic pleiotropic effects of inbreeding on age-specific mortality in a wild population for the first time. This thesis provides a number of straightforward approaches to Bayesian survival analysis that are widely applicable to ecological research and can offer greater insight and uncover subtle patterns of survival and mortality that traditional methods can overlook. Our investigation into the epidemiology of bovine tuberculosis, and in particular into the effects of inbreeding, has far-reaching implications for the control of this disease. This research can also inform future conservation efforts and management strategies, as all species are likely to be at increasing risk of inbreeding in an age of dramatic global change, rapid habitat loss and isolation.

    Development and Application of Novel Methods to Study Tumor Heterogeneity and Cancer Genome Evolution.

    Cancer is one of the leading causes of death worldwide. In recent years, with the aid of high-throughput genomic technologies, large cohorts of tumor samples have been analyzed to characterize molecular aberrations in many cancer types. These studies have generated an enormous amount of cancer genomics data, providing not only new opportunities to understand tumor evolution and cancer progression mechanisms but also new challenges in efficiently and rigorously analyzing the data. Heterogeneity is an important feature of cancer and has a significant impact on the diagnosis and treatment of the disease. My dissertation focuses on developing new bioinformatics and biostatistical approaches to study the heterogeneity and evolutionary history of cancer genomes. Under this theme, my thesis consists of four main chapters. First, I have developed an algorithm to infer aneuploid and euploid cell mixing ratios using allele-specific DNA copy number alteration (CNA) data, and made a striking discovery that gene expression patterns in brain and ovarian tumors are strongly influenced by aneuploid content. The ability to infer mixing ratios allowed me to revise the current classification system for glioblastoma, with better predictive power for clinical outcome than previous results. Second, I developed a Clonal Heterogeneity Analysis Tool (CHAT) that estimates cellular fractions for individual CNAs and individual somatic mutations, allowing us to use the distribution of these fractions to inform the macroscopic clonal architecture and the relative order of occurrence of somatic changes. For example, a CNA with a higher frequency in the cell population may have occurred earlier in tumor development or conferred a greater growth rate, and is therefore more likely to contain driver genes. Third, I developed a method to detect short tandem repeat (STR) variation using paired-end short-read next-generation DNA sequencing data. 
Unlike previous methods, which are limited to finding short STR alleles, my method is capable of finding both STR alleles shorter than a read and those longer than the read or the read pair. This capability addresses the need to reliably detect expanded STR alleles in germline DNA that underlie many rare inherited diseases as well as somatic aberrations characterized by microsatellite instability. PhD, Bioinformatics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/110386/1/libo_1.pd

    Seventh Biennial Report : June 2003 - March 2005
