35 research outputs found
Fast and scalable inference of multi-sample cancer lineages.
Somatic variants can be used as lineage markers for the phylogenetic reconstruction of cancer evolution. Since somatic phylogenetics is complicated by sample heterogeneity, novel specialized tree-building methods are required for cancer phylogeny reconstruction. We present LICHeE (Lineage Inference for Cancer Heterogeneity and Evolution), a novel method that automates the phylogenetic inference of cancer progression from multiple somatic samples. LICHeE uses variant allele frequencies of somatic single nucleotide variants obtained by deep sequencing to reconstruct multi-sample cell lineage trees and infer the subclonal composition of the samples. LICHeE is open source and available at http://viq854.github.io/lichee
Triplet-based similarity score for fully multilabeled trees with poly-occurring labels
Motivation: The latest advances in cancer sequencing, and the availability of a wide range of methods to infer the evolutionary history of tumors, have made it important to evaluate, reconcile and cluster different tumor phylogenies. Recently, several notions of distance or similarity have been proposed in the literature, but none of them has emerged as the gold standard. Moreover, none of the known similarity measures is able to manage mutations occurring multiple times in the tree, a circumstance often occurring in real cases.
Results: To overcome these limitations, in this article, we propose MP3, the first similarity measure for tumor phylogenies able to effectively manage cases where multiple mutations can occur at the same time and mutations can occur multiple times. Moreover, a comparison of MP3 with other measures shows that it is able to classify correctly similar and dissimilar trees, both on simulated and on real data
Bayesian Inference for Genomic Data Analysis
High-throughput genomic data contain a vast amount of information that is shaped by the complex biological processes in the cell. As such, appropriate mathematical modeling frameworks are required to understand the data and the data-generating processes. This dissertation focuses on the formulation of mathematical models and the description of appropriate computational algorithms to obtain insights from genomic data.
Specifically, the characterization of intra-tumor heterogeneity is studied. Based on the total number of allele copies at the genomic locations in the tumor subclones, the problem is viewed from two perspectives: with and without the copy-neutrality assumption. Under copy-neutrality, it is assumed that the genome contains mutational variability and that any of the three possible genotypes may be present at each genomic location. As such, the genotypes of all the genomic locations in the tumor subclones are modeled by a ternary matrix. In the second case, in addition to mutational variability, it is assumed that the genomic locations may be affected by structural variability such as copy number variation (CNV). Thus, the genotypes are modeled with a pair of (Q + 1)-ary matrices. Using the categorical Indian buffet process (cIBP), a state-space modeling framework is employed to describe the two processes, and sequential Monte Carlo (SMC) methods for dynamic models are applied to perform inference on important model parameters.
Moreover, the problem of estimating a gene regulatory network (GRN) from measurements with missing values is presented. Specifically, gene expression time series data may be missing all expression values at a single time point or at a set of consecutive time points. However, complete data is often needed to make inferences about the underlying GRN. Given the missing measurements, a dynamic stochastic model is used to describe the evolution of gene expression, and point-based Gaussian approximation (PBGA) filters with one-step or two-step missing measurements are applied for the inference. Finally, the problem of deconvolving gene expression data from complex heterogeneous biological samples is examined, where the observed data are a mixture of different cell types. A statistical description of the problem is used and the SMC method for static models is applied to estimate the cell-type specific expressions and the cell type proportions in the heterogeneous samples
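The mixture view of deconvolution described above can be illustrated with a small sketch. This is not the SMC method of the dissertation: it assumes, for illustration only, that there are exactly two cell types, that their expression signatures are already known, and that the bulk profile is a noise-free linear mixture. Under those assumptions the mixing proportion has a closed-form least-squares estimate. All names and numbers below are hypothetical.

```python
from typing import List


def mix(signature_a: List[float], signature_b: List[float], p: float) -> List[float]:
    """Forward model: bulk expression as a linear mixture of two cell types."""
    return [p * a + (1.0 - p) * b for a, b in zip(signature_a, signature_b)]


def estimate_proportion(bulk: List[float],
                        signature_a: List[float],
                        signature_b: List[float]) -> float:
    """Least-squares estimate of the proportion of cell type A.

    Minimises sum_g (bulk_g - p * a_g - (1 - p) * b_g)^2 over p,
    then clamps the result to the interval [0, 1].
    """
    diff = [a - b for a, b in zip(signature_a, signature_b)]
    denom = sum(d * d for d in diff)
    if denom == 0.0:
        raise ValueError("signatures are identical; proportion is not identifiable")
    numer = sum((y - b) * d for y, b, d in zip(bulk, signature_b, diff))
    return min(1.0, max(0.0, numer / denom))


# Hypothetical signatures over three genes (illustrative values only).
sig_a = [10.0, 0.0, 5.0]
sig_b = [0.0, 8.0, 5.0]

bulk = mix(sig_a, sig_b, 0.25)   # simulate a sample with 25% cell type A
p_hat = estimate_proportion(bulk, sig_a, sig_b)
print(round(p_hat, 6))
```

In the setting of the abstract, both the cell-type specific expressions and the proportions are unknown and are estimated jointly, which is what makes a sampling-based method such as SMC attractive.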
Algorithms for phylogenetic tree correction in species and cancer evolution
Reconstructing evolutionary trees, also known as phylogenies, from molecular sequence data is a fundamental problem in computational biology. Classically, evolutionary trees have been estimated over a set of species, where leaves correspond to extant species and internal nodes correspond to ancestral species. This type of phylogeny is colloquially thought of as the "Tree of Life" and assembling it has been designated as a Grand Challenge by the National Science Foundation Advisory Committee for Cyberinfrastructure. However, processes other than speciation are also shaped by evolution. One notable example is in the development of a malignant tumor; tumor cells rapidly grow and divide, acquiring new mutations with each subsequent generation. Tumor cells then compete for resources, often resulting in selection for more aggressive cell types. Recent advancements in sequencing technology rapidly increased the amount of sequencing data taken from tumor biopsies. This development has allowed researchers to attempt reconstructing evolutionary histories for individual patient tumors, improving our understanding of cancer and laying the groundwork for precision therapy.
Despite algorithmic improvements in the estimation of both species and tumor phylogenies from molecular sequence data, current approaches still suffer a number of limitations. Incomplete sampling and estimation error can lead to missing leaves and low-support branches in the estimated phylogenies. Moreover, commonly posed optimization problems are often under-determined given the limited amounts and low quality of input data, leading to large solution spaces of equally plausible phylogenies. In this dissertation, we explore current limitations in both species and tumor phylogeny estimation, connecting similarities and highlighting key differences. We then put forward four new methods that improve phylogeny estimation methods by incorporating auxiliary information: OCTAL, TRACTION, PhySigs, and RECAP. For each method, we present theoretical results (e.g., optimization problem complexity, algorithmic correctness, running time analysis) as well as empirical results on simulated and real datasets. Collectively, these methods show we can significantly improve the accuracy of leading phylogeny estimation methods by leveraging additional signal in distinct, but related datasets
Network and Algebraic Topology of Influenza Evolution
Evolution is a force that has molded human existence since its divergence from chimpanzees about 5.4 million years ago. In that same amount of time, an influenza virus, which replicates every six hours, would have undergone an equivalent number of generations over only a hundred years. The fast replication times of influenza, coupled with its high mutation rate, make the virus a perfect model to study real-time evolution at a mega-Darwin scale, more than a million times faster than human evolution. While recent developments in high-throughput sequencing provide an optimal opportunity to dissect their genetic evolution, a concurrent growth in computational tools is necessary to analyze the large influx of complex genomic data. In my thesis, I present novel computational methods to examine different aspects of influenza evolution.
I first focus on seasonal influenza, particularly the problems that hamper public health initiatives to combat the virus. I introduce two new approaches: 1. The q2-coefficient, a method of quantifying pathogen surveillance, and 2. FluGraph, a technique that employs network topology to track the spread of seasonal influenza around the world.
The second chapter of my thesis examines how mutations and reassortment combine to alter the course of influenza evolution towards pandemic formation. I highlight inherent deficiencies in the current phylogenetic paradigm for analyzing evolution and offer a novel methodology based on algebraic topology that comprehensively reconstructs both vertical and horizontal evolutionary events. I apply this method to viruses, with emphasis on influenza, but foresee broader application to cancer cells, bacteria, eukaryotes, and other taxa
Inferring tumour evolution from single-cell and multi-sample data
Tumour development has long been recognised as an evolutionary process during which cells accumulate mutations and evolve into a mix of genetically distinct cell subpopulations. The resulting genetic intra-tumour heterogeneity poses a major challenge to cancer therapy, as it increases the chance of drug resistance. To study tumour evolution in more detail, reliable approaches to infer the life histories of tumours are needed. This dissertation focuses on computational methods for inferring trees of tumour evolution from single-cell and multi-sample sequencing data.
Recent advances in single-cell sequencing technologies have promised to reveal tumour heterogeneity at a much higher resolution, but single-cell sequencing data is inherently noisy, making it unsuitable for analysis with classic phylogenetic methods. The first part of the dissertation describes OncoNEM, a novel probabilistic method to infer clonal lineage trees from noisy single nucleotide variants of single cells. Simulation studies are used to validate the method and to compare its performance to that of other methods. Finally, OncoNEM is applied in two case studies.
In the second part of the dissertation, a comprehensive collection of existing multi-sample approaches is used to infer the phylogenies of metastatic breast cancers from ten patients. In particular, shallow whole-genome, whole exome and targeted deep sequencing data are analysed. The inference methods comprise copy number and point mutation based approaches, as well as a method that utilises a combination of the two. To improve the copy number based inference, a novel allele-specific multi-sample segmentation algorithm is presented. The results are compared across methods and data types to assess the reliability of the different methods.
In summary, this thesis presents substantial methodological advances to understand tumour evolution from genomic profiles of single cells or related bulk samples
Lineage-Based Subclonal Reconstruction of Cancer Samples
Sundermann LK. Lineage-Based Subclonal Reconstruction of Cancer Samples. Bielefeld: Universität Bielefeld; 2019.
Cancer is caused by the accumulation of mutations, leading to genetically heterogeneous cell populations. The characterization of a cancer sample in terms of a subclonal reconstruction is essential. The subclonal reconstruction informs about the co-occurrence of mutations per population, as well as the proportion of cells belonging to each population, and the ancestral relationships among populations. Typical mutations used to infer a subclonal reconstruction are simple somatic mutations (SSMs) and copy number aberrations (CNAs).
Methods building subclonal reconstructions only with SSMs use the concept of lineages instead of populations. In contrast to a population, which comprises only cells with the same genotype, a lineage comprises all cells that are descendant from the same founder cell. In a lineage-based subclonal reconstruction, mutations are assigned to the lineage in which they arose. The lineage frequency indicates the proportion of cells in which mutations assigned to this lineage can be found.
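The difference between population and lineage frequencies can be made concrete with a short sketch. It assumes a known tree of populations and known population frequencies, neither of which is given in practice, and simply sums each founder population with all of its descendants. The tree, the names and the numbers are hypothetical and are not taken from the thesis.

```python
from typing import Dict, List


def lineage_frequencies(children: Dict[str, List[str]],
                        population_freq: Dict[str, float]) -> Dict[str, float]:
    """Return the lineage frequency of every node in a population tree.

    A lineage contains its founder population and every descendant
    population, so its frequency is the sum over that whole subtree.
    """
    result: Dict[str, float] = {}

    def visit(node: str) -> float:
        total = population_freq[node]
        for child in children.get(node, []):
            total += visit(child)
        result[node] = total
        return total

    # Roots are the populations that never appear as someone's child.
    all_children = {c for kids in children.values() for c in kids}
    for root in sorted(set(population_freq) - all_children):
        visit(root)
    return result


# Hypothetical toy tree: normal -> A -> {B, C}
children = {"normal": ["A"], "A": ["B", "C"]}
population_freq = {"normal": 0.2, "A": 0.3, "B": 0.4, "C": 0.1}

freqs = lineage_frequencies(children, population_freq)
for name in ["normal", "A", "B", "C"]:
    print(name, round(freqs[name], 3))
```

In this toy example a mutation that arose in A is carried by the cells of A, B and C, so it is assigned once to lineage A, whose frequency is 0.8, rather than to three separate populations.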
Methods building subclonal reconstructions with CNAs are population-based. In contrast to the lineage-based approach, mutations are assigned to all populations in which they occur, not just to the one in which they arose. In order to calculate the mutation frequencies, the ancestor-descendant relationships between all populations have to be inferred. Hence, multiple subclonal reconstructions are needed to model ambiguous population relationships.
Two population-based subclonal reconstruction methods working with SSMs and CNAs are PhyloWGS and Canopy. In contrast to Canopy, PhyloWGS does not infer CNAs but needs them as input.
In this thesis, we present the first lineage-based model that builds subclonal reconstructions from SSMs and CNAs of bulk-sequenced tumor samples. Modeling CNAs as relative copy numbers, that is, as copy number changes, instead of absolute copy numbers allows us to assign them to lineages. Another special feature of our method is that we infer present or absent ancestor-descendant relationships between lineages only if they can be observed in the data, modeling them as ambiguous relationships otherwise. This enables us to combine multiple ambiguous subclonal reconstructions within a single subclonal reconstruction.
As input, our method uses the variant allele frequencies of SSMs, as well as the average allele-specific major and minor copy numbers of genome segments where the genome is segmented in a way that consecutive regions with the same copy number profile belong to the same segment. Furthermore, the number of lineages needs to be given as input. We present a joint likelihood function for SSMs and CNAs and show a linear relaxation of our model as a mixed integer linear program that can be solved with state-of-the-art solvers. Given subclonal reconstructions of the same dataset inferred with different lineage numbers, we use the minimum description length principle to choose the subclonal reconstruction with the best lineage number. An extensive analysis of the chosen subclonal reconstruction allows us to classify the ancestor-descendant relationships between each pair of lineages as either present, absent or ambiguous.
We implemented our method in a software tool called Onctopus. We evaluate Onctopus extensively on simulated data, analyzing its run time and memory usage as well as its performance when the mathematically optimal solution cannot be proved in the given time and space. We present different approaches to improve Onctopus' performance, such as by clustering mutations, fixing CNAs or fixing lineage frequencies.
Finally, we compare the performance of Onctopus against the performance of PhyloWGS and Canopy on simulated datasets and a deep sequenced breast cancer dataset. On the simulated datasets, we evaluate different aspects of the inferred subclonal reconstructions and show that Onctopus is superior in inferring the number of lineages and the lineage relationships. For the breast cancer dataset, we follow an analysis by Deshwar et al., comparing the inferred mutation assignment to a gold standard assignment. Here, Onctopus and PhyloWGS reach a comparable performance
Exploring the Intersection of Multi-Omics and Machine Learning in Cancer Research
Cancer biology and machine learning represent two seemingly disparate yet intrinsically linked fields of study. Cancer biology, with its complexities at the cellular and molecular levels, brings up a myriad of challenges. Of particular concern are the deviations in cell behaviour and rearrangements of genetic material that fuel transformation, growth, and spread of cancerous cells. Contemporary studies of cancer biology often utilise wide arrays of genomic data to pinpoint and exploit these abnormalities with an end-goal of translating them into functional therapies.
Machine learning allows machines to make predictions based on learnt data without explicit programming. It leverages patterns and inferences from large datasets, making it an invaluable tool in the modern era of large scale genomics. To this end, this doctoral thesis is underpinned by three themes: the application of machine learning, multi-omics, and cancer biology. It focuses on applying machine learning algorithms to two tasks: cell annotation in single-cell RNA-seq datasets and drug response prediction in pre-clinical cancer models.
In the first study, the author and colleagues developed a pipeline named Ikarus to differentiate between neoplastic and healthy cells within single-cell datasets, a task crucial for understanding the cellular landscape of tumours. Ikarus is designed to construct cancer cell-specific gene signatures from expert-annotated scRNA-seq datasets, score these genes, and distribute the scores to neighbouring cells via network propagation. This method successfully circumvents two common challenges in single-cell annotation: batch effects and unstable clustering. Furthermore, Ikarus utilises a multi-omic approach by incorporating CNVs inferred from scRNA-seq to enhance classification accuracy.
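The two steps named above, scoring cells against a gene signature and spreading the scores to neighbouring cells, can be sketched roughly as follows. This is not the Ikarus implementation or its API: the scoring rule (mean expression of the signature genes), the propagation rule (a damped average over a fixed neighbour list), the parameter `alpha` and all data are assumptions chosen only to illustrate the idea.

```python
from typing import Dict, List


def signature_score(expression: Dict[str, float], signature: List[str]) -> float:
    """Mean expression of the signature genes in one cell (missing genes count as 0)."""
    values = [expression.get(gene, 0.0) for gene in signature]
    return sum(values) / len(values) if values else 0.0


def propagate(scores: List[float], neighbours: List[List[int]],
              alpha: float = 0.5, n_iter: int = 10) -> List[float]:
    """Smooth scores over a cell-cell neighbour graph.

    Each round mixes the mean score of a cell's neighbours (weight alpha)
    with the cell's own initial score (weight 1 - alpha).
    """
    initial = list(scores)
    current = list(scores)
    for _ in range(n_iter):
        updated = []
        for i, nbrs in enumerate(neighbours):
            nbr_mean = sum(current[j] for j in nbrs) / len(nbrs) if nbrs else current[i]
            updated.append(alpha * nbr_mean + (1.0 - alpha) * initial[i])
        current = updated
    return current


# Hypothetical data: cells 0 and 1 express the signature, cell 2 shows
# dropout but neighbours cells 0 and 1, cell 3 is an isolated healthy cell.
signature = ["G1", "G2"]
cells = [
    {"G1": 4.0, "G2": 6.0},
    {"G1": 5.0, "G2": 5.0},
    {"G1": 0.0, "G2": 0.0},
    {"G1": 0.0, "G2": 0.0},
]
neighbours = [[1, 2], [0, 2], [0, 1], []]

raw = [signature_score(cell, signature) for cell in cells]
smoothed = propagate(raw, neighbours)
print([round(s, 3) for s in smoothed])
```

In this toy setup the dropout cell inherits a positive score from its tumour-like neighbours, while the isolated healthy cell stays at zero, which is the kind of behaviour that makes propagation useful when clustering is unstable.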
The second study investigated how multi-omic analysis could enhance drug response prediction in pre-clinical cancer models. The research suggests that the typical practice of panel sequencing (a deep profiling of select, validated genomic features) is limited in its predictive power. However, incorporating transcriptomic features into the model significantly improves predictive ability across a variety of cancer models and is especially effective for drugs with collateral effects. This implies that the combined use of genomic and transcriptomic data has potential advantages in the pharmacogenomic arena.
This dissertation recapitulates the findings of the two aforementioned studies, which were published in the journals Genome Biology and Cancers, respectively. The two studies illustrate the application of machine learning techniques and multi-omic approaches to address conceptually distinct problems within the realm of cancer biology.