35 research outputs found

    Fast and scalable inference of multi-sample cancer lineages.

    Somatic variants can be used as lineage markers for the phylogenetic reconstruction of cancer evolution. Since somatic phylogenetics is complicated by sample heterogeneity, novel specialized tree-building methods are required for cancer phylogeny reconstruction. We present LICHeE (Lineage Inference for Cancer Heterogeneity and Evolution), a novel method that automates the phylogenetic inference of cancer progression from multiple somatic samples. LICHeE uses variant allele frequencies of somatic single nucleotide variants obtained by deep sequencing to reconstruct multi-sample cell lineage trees and infer the subclonal composition of the samples. LICHeE is open source and available at http://viq854.github.io/lichee


    Triplet-based similarity score for fully multilabeled trees with poly-occurring labels

    Motivation: The latest advances in cancer sequencing, and the availability of a wide range of methods to infer the evolutionary history of tumors, have made it important to evaluate, reconcile and cluster different tumor phylogenies. Recently, several notions of distance or similarity have been proposed in the literature, but none of them has emerged as the gold standard. Moreover, none of the known similarity measures is able to handle mutations that occur multiple times in the tree, a circumstance that often arises in real cases. Results: To overcome these limitations, in this article, we propose MP3, the first similarity measure for tumor phylogenies able to effectively handle cases where multiple mutations can occur at the same time and mutations can occur multiple times. Moreover, a comparison of MP3 with other measures shows that it is able to correctly classify similar and dissimilar trees, both on simulated and on real data.

    Algorithms for phylogenetic tree correction in species and cancer evolution

    Reconstructing evolutionary trees, also known as phylogenies, from molecular sequence data is a fundamental problem in computational biology. Classically, evolutionary trees have been estimated over a set of species, where leaves correspond to extant species and internal nodes correspond to ancestral species. This type of phylogeny is colloquially thought of as the “Tree of Life” and assembling it has been designated as a Grand Challenge by the National Science Foundation Advisory Committee for Cyberinfrastructure. However, processes other than speciation are also shaped by evolution. One notable example is in the development of a malignant tumor; tumor cells rapidly grow and divide, acquiring new mutations with each subsequent generation. Tumor cells then compete for resources, often resulting in selection for more aggressive cell types. Recent advancements in sequencing technology rapidly increased the amount of sequencing data taken from tumor biopsies. This development has allowed researchers to attempt reconstructing evolutionary histories for individual patient tumors, improving our understanding of cancer and laying the groundwork for precision therapy. Despite algorithmic improvements in the estimation of both species and tumor phylogenies from molecular sequence data, current approaches still suffer a number of limitations. Incomplete sampling and estimation error can lead to missing leaves and low-support branches in the estimated phylogenies. Moreover, commonly posed optimization problems are often under-determined given the limited amounts and low quality of input data, leading to large solution spaces of equally plausible phylogenies. In this dissertation, we explore current limitations in both species and tumor phylogeny estimation, connecting similarities and highlighting key differences. We then put forward four new methods that improve phylogeny estimation methods by incorporating auxiliary information: OCTAL, TRACTION, PhySigs, and RECAP. 
For each method, we present theoretical results (e.g., optimization problem complexity, algorithmic correctness, running time analysis) as well as empirical results on simulated and real datasets. Collectively, these methods show we can significantly improve the accuracy of leading phylogeny estimation methods by leveraging additional signal in distinct, but related datasets

    Lineage-Based Subclonal Reconstruction of Cancer Samples

    Sundermann LK. Lineage-Based Subclonal Reconstruction of Cancer Samples. Bielefeld: Universität Bielefeld; 2019. Cancer is caused by the accumulation of mutations, leading to genetically heterogeneous cell populations. The characterization of a cancer sample in terms of a subclonal reconstruction is essential. The subclonal reconstruction informs about the co-occurrence of mutations per population, as well as the proportion of cells belonging to each population, and the ancestral relationships among populations. Typical mutations used to infer a subclonal reconstruction are simple somatic mutations (SSMs) and copy number aberrations (CNAs). Methods building subclonal reconstructions only with SSMs use the concept of lineages instead of populations. In contrast to a population, which comprises only cells with the same genotype, a lineage comprises all cells that are descendant from the same founder cell. In a lineage-based subclonal reconstruction, mutations are assigned to the lineage in which they arose. The lineage frequency indicates the proportion of cells in which mutations assigned to this lineage can be found. Methods building subclonal reconstructions with CNAs are population-based. In contrast to the lineage-based approach, mutations are assigned to all populations in which they occur, not just to the one in which they arose. In order to calculate the mutation frequencies, the ancestor-descendant relationships between all populations have to be inferred. Hence, multiple subclonal reconstructions are needed to model ambiguous population relationships. Two population-based subclonal reconstruction methods working with SSMs and CNAs are PhyloWGS and Canopy. In contrast to Canopy, PhyloWGS does not infer CNAs but needs them as input. In this thesis, we present the first lineage-based model that builds subclonal reconstructions from SSMs and CNAs of bulk-sequenced tumor samples. 
Modeling CNAs as relative copy numbers (that is, as copy number changes) instead of absolute copy numbers allows us to assign them to lineages. Another special feature of our method is that we infer present or absent ancestor-descendant relationships between lineages only if they can be observed in the data, modeling them as ambiguous relationships otherwise. This enables us to combine multiple ambiguous subclonal reconstructions within a single subclonal reconstruction. As input, our method uses the variant allele frequencies of SSMs, as well as the average allele-specific major and minor copy numbers of genome segments, where the genome is segmented such that consecutive regions with the same copy number profile belong to the same segment. Furthermore, the number of lineages needs to be given as input. We present a joint likelihood function for SSMs and CNAs and show a linear relaxation of our model as a mixed integer linear program that can be solved with state-of-the-art solvers. Given subclonal reconstructions of the same dataset inferred with different lineage numbers, we use the minimum description length principle to choose the subclonal reconstruction with the best lineage number. An extensive analysis of the chosen subclonal reconstruction allows us to classify the ancestor-descendant relationships between each pair of lineages as either present, absent or ambiguous. We implemented our method in a software tool called Onctopus. We evaluate Onctopus extensively on simulated data, analyzing its run time and memory usage as well as its performance when the mathematically optimal solution cannot be proved in the given time and space. We present different approaches to improve Onctopus’ performance, such as clustering mutations, fixing CNAs or fixing lineage frequencies. Finally, we compare the performance of Onctopus against the performance of PhyloWGS and Canopy on simulated datasets and a deep sequenced breast cancer dataset. 
On the simulated datasets, we evaluate different aspects of the inferred subclonal reconstructions and show that Onctopus is superior in inferring the number of lineages and the lineage relationships. For the breast cancer dataset, we follow an analysis by Deshwar et al., comparing the inferred mutation assignment to a gold standard assignment. Here, Onctopus and PhyloWGS reach a comparable performance
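
The distinction between populations and lineages drawn in this abstract can be illustrated with a minimal sketch. It assumes a single-rooted tree given as a parent map and a table of population frequencies; the function name `lineage_frequencies` and both input structures are hypothetical and are not taken from Onctopus. Under that assumption, the lineage frequency of a node is its own population frequency plus those of all its descendants.

```python
from collections import defaultdict


def lineage_frequencies(parent, population_freq):
    """Return the lineage frequency of every node in a lineage tree.

    parent: dict mapping each lineage to its parent (the root maps to None).
    population_freq: dict mapping each lineage to the fraction of cells whose
        genotype ends exactly at that lineage (a "population" in the abstract).
    Assumes exactly one root and a small tree (plain recursion is used).
    """
    children = defaultdict(list)
    root = None
    for node, par in parent.items():
        if par is None:
            root = node
        else:
            children[par].append(node)

    freq = {}

    def visit(node):
        # A lineage contains its own population and every descendant population.
        total = population_freq[node]
        for child in children[node]:
            total += visit(child)
        freq[node] = total
        return total

    visit(root)
    return freq


# Hypothetical four-lineage example: A descends from the germline, B and C from A.
example_parent = {"germline": None, "A": "germline", "B": "A", "C": "A"}
example_population = {"germline": 0.2, "A": 0.3, "B": 0.4, "C": 0.1}
example_lineage = lineage_frequencies(example_parent, example_population)
```

In this toy input, mutations assigned to lineage A are found in the cells of populations A, B and C, so its lineage frequency is the sum of those three population frequencies, while the germline lineage covers all cells.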

    Exploring the Intersection of Multi-Omics and Machine Learning in Cancer Research

    Cancer biology and machine learning represent two seemingly disparate yet intrinsically linked fields of study. Cancer biology, with its complexities at the cellular and molecular levels, brings up a myriad of challenges. Of particular concern are the deviations in cell behaviour and rearrangements of genetic material that fuel transformation, growth, and spread of cancerous cells. Contemporary studies of cancer biology often utilise wide arrays of genomic data to pinpoint and exploit these abnormalities with an end-goal of translating them into functional therapies. Machine learning allows machines to make predictions based on the learnt data without explicit programming. It leverages patterns and inferences from large datasets, making it an invaluable tool in the modern era of large scale genomics. To this end, this doctoral thesis is underpinned by three themes: the application of machine learning, multi-omics, and cancer biology. It focuses on employment of machine learning algorithms to the tasks of cell annotation in single-cell RNA-seq datasets and drug response prediction in pre-clinical cancer models. In the first study, the author and colleagues developed a pipeline named Ikarus to differentiate between neoplastic and healthy cells within single-cell datasets, a task crucial for understanding the cellular landscape of tumours. Ikarus is designed to construct cancer cell-specific gene signatures from expert-annotated scRNA-seq datasets, score these genes, and distribute the scores to neighbouring cells via network propagation. This method successfully circumvents two common challenges in single-cell annotation: batch effects and unstable clustering. Furthermore, Ikarus utilises a multi-omic approach by incorporating CNVs inferred from scRNA-seq to enhance classification accuracy. The second study investigated how multi-omic analysis could enhance drug response prediction in pre-clinical cancer models. 
The research suggests that the typical practice of panel sequencing (deep profiling of select, validated genomic features) is limited in its predictive power. However, incorporating transcriptomic features into the model significantly improves predictive ability across a variety of cancer models and is especially effective for drugs with collateral effects. This implies that the combined use of genomic and transcriptomic data has potential advantages in the pharmacogenomic arena. This dissertation recapitulates the findings of the two aforementioned studies, which were published in the journals Genome Biology and Cancers, respectively. The two studies illustrate the application of machine learning techniques and multi-omic approaches to address conceptually distinct problems within the realm of cancer biology.
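
The idea of distributing gene-signature scores to neighbouring cells via network propagation can be sketched generically. The abstract does not describe Ikarus's actual propagation scheme, so the following is only an assumed neighbour-averaging update with a restart term; the function name `propagate_scores`, the parameters `alpha` and `iterations`, and the dictionary-based graph are all hypothetical.

```python
def propagate_scores(adjacency, initial, alpha=0.5, iterations=50):
    """Smooth per-cell scores over a cell-cell neighbourhood graph.

    adjacency: dict mapping each cell to a list of its neighbouring cells.
    initial: dict mapping each cell to its starting score.
    alpha: weight given to the neighbours' mean score (assumed value);
        the remaining weight pulls each cell back to its initial score.
    """
    scores = dict(initial)
    for _ in range(iterations):
        updated = {}
        for cell, neighbours in adjacency.items():
            if neighbours:
                neighbour_mean = sum(scores[n] for n in neighbours) / len(neighbours)
            else:
                # An isolated cell has no neighbours to borrow signal from.
                neighbour_mean = scores[cell]
            updated[cell] = alpha * neighbour_mean + (1 - alpha) * initial[cell]
        scores = updated
    return scores


# Hypothetical chain of three cells; only cell "a" starts with a signature score.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
smoothed = propagate_scores(chain, {"a": 1.0, "b": 0.0, "c": 0.0})
```

In this toy chain, the score decays with graph distance from the cell that carried the original signal, which is the qualitative behaviour one would expect from any propagation of this kind.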