42 research outputs found

    Aneuploidy prediction and tumor classification with heterogeneous hidden conditional random fields

    Motivation: The heterogeneity of cancer cannot always be recognized by tumor morphology, but may be reflected by the underlying genetic aberrations. Array comparative genome hybridization (array-CGH) methods provide high-throughput data on genetic copy numbers, but determining the clinically relevant copy number changes remains a challenge. Conventional classification methods for linking recurrent alterations to clinical outcome ignore sequential correlations in selecting relevant features. Conversely, existing sequence classification methods can only model overall copy number instability, without regard to any particular position in the genome

    Determining Frequent Patterns of Copy Number Alterations in Cancer

    Cancer progression is often driven by an accumulation of genetic changes but also accompanied by increasing genomic instability. These processes lead to a complicated landscape of copy number alterations (CNAs) within individual tumors and great diversity across tumor samples. High resolution array-based comparative genomic hybridization (aCGH) is being used to profile CNAs of ever larger tumor collections, and better computational methods for processing these data sets and identifying potential driver CNAs are needed. Typical studies of aCGH data sets take a pipeline approach, starting with segmentation of profiles, calls of gains and losses, and finally determination of frequent CNAs across samples. A drawback of pipelines is that choices at each step may produce different results, and biases are propagated forward. We present a mathematically robust new method that exploits probe-level correlations in aCGH data to discover subsets of samples that display common CNAs. Our algorithm is related to recent work on maximum-margin clustering. It does not require pre-segmentation of the data and also provides grouping of recurrent CNAs into clusters. We tested our approach on a large cohort of glioblastoma aCGH samples from The Cancer Genome Atlas and recovered almost all CNAs reported in the initial study. We also found additional significant CNAs missed by the original analysis but supported by earlier studies, and we identified significant correlations between CNAs
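The last two stages of the conventional pipeline described in this abstract (calling gains and losses, then counting how often each probe is altered across samples) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' method: segmentation is omitted, and the function names and the 0.3 log-ratio threshold are assumptions made for the example.

```python
def call_alterations(log_ratios, threshold=0.3):
    """Call each probe as gain (+1), loss (-1), or neutral (0).

    The threshold is a hypothetical cutoff on log2 ratios, chosen
    only for illustration.
    """
    calls = []
    for value in log_ratios:
        if value > threshold:
            calls.append(1)
        elif value < -threshold:
            calls.append(-1)
        else:
            calls.append(0)
    return calls


def alteration_frequencies(samples, threshold=0.3):
    """Return per-probe gain and loss frequencies across samples.

    `samples` is a list of equal-length log-ratio profiles, one per tumor.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    calls = [call_alterations(profile, threshold) for profile in samples]
    n_samples = len(calls)
    n_probes = len(calls[0])
    gain = [sum(c[j] == 1 for c in calls) / n_samples for j in range(n_probes)]
    loss = [sum(c[j] == -1 for c in calls) / n_samples for j in range(n_probes)]
    return gain, loss
```

Because every downstream frequency depends on the threshold chosen at the calling step, the sketch also illustrates the drawback the abstract mentions: choices made early in a pipeline propagate forward.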

    Detecting Biomarkers among Subgroups with Structured Latent Features and Multitask Learning Methods

    University of Minnesota Ph.D. dissertation. May 2017. Major: Computer Science. Advisor: Rui Kuang. 1 computer file (PDF); viii, 89 pages. Because of disease progression and heterogeneity in samples and single cells, biomarker detection among subgroups is important, as it provides a better understanding of population genetics and cancer causation. In this thesis, we proposed several methods based on structured latent features and multitask learning for biomarker detection on DNA copy-number variation (CNV) data and single-cell RNA sequencing (scRNA-seq) data. By incorporating prior known group information or taking domain heterogeneity into consideration, our models are able to achieve meaningful biomarker detection and accurate sample classification. 1. By incorporating population relationships from the human phylogenetic tree, we introduced a latent feature model to detect population-differentiation CNV markers. The algorithm, named tree-guided sparse group selection (treeSGS), detects sample subgroups organized by a population phylogenetic tree such that the evolutionary relations among the populations are incorporated for more accurate detection of population-differentiation CNVs. 2. We applied transfer learning techniques to cross-cancer-type CNV studies. We proposed the Transfer Learning with Fused LASSO (TLFL) algorithm, which detects latent CNV components from multiple CNV datasets of different tumor types and distinguishes the CNVs that are common across the datasets from those that are specific to each dataset. Both the common and type-specific CNVs are detected as latent components in matrix factorization coupled with fused LASSO on adjacent CNV probe features. 3. We further applied the multitask learning idea to scRNA-seq data. We introduced variance-driven multitask clustering on single-cell RNA-seq data (scVDMC), which utilizes multiple cell populations from biological replicates or related samples with significant biological variances. 
scVDMC clusters single cells of similar cell types and markers but varying expression patterns across different domains, such that the scRNA-seq data are adjusted for better integration. We used both simulations and several publicly available CNV and scRNA-seq datasets, including one in-house scRNA-seq dataset, to evaluate the performance of our models. The promising results show that we achieve better biomarker prediction among subgroups
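To make the phrase "fused LASSO on adjacent CNV probe features" concrete, the standard fused LASSO penalty combines a sparsity term with a term on differences between neighboring coefficients. The sketch below computes that generic penalty for one coefficient vector; it is not the TLFL objective itself, and the function and parameter names (`fused_lasso_penalty`, `lam_sparse`, `lam_fuse`) are assumptions made for illustration.

```python
def fused_lasso_penalty(beta, lam_sparse=1.0, lam_fuse=1.0):
    """Generic fused LASSO penalty for coefficients ordered along the genome.

    lam_sparse weights the L1 norm of the coefficients (few nonzero probes);
    lam_fuse weights the L1 norm of differences between adjacent probes
    (piecewise-constant profiles). Both weights are hypothetical defaults.
    """
    sparsity = sum(abs(b) for b in beta)
    smoothness = sum(abs(b - a) for a, b in zip(beta, beta[1:]))
    return lam_sparse * sparsity + lam_fuse * smoothness
```

With equal weights, a contiguous block such as `[0, 1, 1, 0]` is penalized less than an alternating pattern such as `[1, 0, 1, 0]` with the same number of nonzero entries, which is why this kind of penalty favors contiguous altered regions.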

    Finding regions of aberrant DNA copy number associated with tumor phenotype

    DNA copy number alterations are a hallmark of cancer. Understanding their role in tumor progression can help improve diagnosis, prognosis and therapy selection for cancer patients. High-resolution, genome-wide measurements of DNA copy number changes for large cohorts of tumors are currently available, owing to technologies like microarray-based array comparative hybridization (arrayCGH). In this thesis, we present a computational pipeline for statistical analysis of tumor cohorts, which can help extract relevant patterns of copy number aberrations and infer their association with various phenotypical indicators. The main challenges are the instability of classification models due to the high dimensionality of the arrays compared to the small number of tumor samples, as well as the large correlations between copy number estimates measured at neighboring loci. We show that the feature ranking given by several widely-used methods for feature selection is biased due to the large correlations between features. In order to correct for the bias and instability of the feature ranking, we introduce methods for consensus segmentation of the set of arrays. We present three algorithms for consensus segmentation, which are based on identifying recurrent DNA breakpoints or DNA regions of constant copy number profile. The segmentation constitutes the basis for computing a set of super-features, corresponding to the regions. We use the super-features for supervised classification and we compare the models to baseline models trained on probe data. We validated the methods by training models for prediction of the phenotype of breast cancers and neuroblastoma tumors. We show that the multivariate segmentation affords higher model stability, in general improves prediction accuracy and facilitates model interpretation. One of our most important biological results refers to the classification of neuroblastoma tumors. 
We show that patients belonging to different age subgroups are characterized by distinct copy number patterns, with the largest discrepancy when the subgroups are defined as older or younger than 16-18 months. We thereby confirm the recommendation for a higher age cutoff than 12 months (current clinical practice) for differential diagnosis of neuroblastoma. German abstract (translated): The abnormal multiplicity of certain DNA segments (copy number aberrations) is one of the salient hallmarks of cancer. Understanding the role of this feature in tumor growth could contribute substantially to improving cancer diagnosis, prognosis and therapy, and thereby help in selecting individualized therapies. Microarray-based technologies such as 'Array Comparative Hybridization' (array-CGH) make it possible to produce high-resolution, genome-wide copy number maps of tumor tissue. The subject of this thesis is the development of a software pipeline for the statistical analysis of tumor cohorts that makes it possible to derive relevant patterns of abnormal copy numbers and to associate them with various phenotypic features. This is done using machine learning methods for classification and feature selection, with a focus on the interpretability of the learned models (regularized linear methods and decision-tree-based models). The main challenges for the methods lie in the high dimensionality of the data, which is matched by only a comparatively small number of measured tumor samples, and in the high correlation between copy numbers measured in neighboring genomic regions. As a result, the outcome of feature selection depends strongly on the choice of training data set, which severely limits reproducibility across different clinical data sets. 
This thesis shows that the feature ranking produced by several common methods is strongly distorted by high correlation coefficients among individual predictors. To correct this bias and the instability of the feature ranking, we introduce a dimension-reducing step into our pipeline, which consists of jointly segmenting the arrays in a multivariate fashion. We present three algorithms for this multivariate segmentation, which are based on identifying recurrent DNA breakpoints or genomic regions with constant copy number profiles. By summarizing the DNA copy number values within a region, the multivariate segmentation forms the basis for computing a smaller set of 'super-features'. Compared with classification methods that operate at the level of individual array probes, supervised classification based on the super-features improves the interpretability and the stability of the models. We validate the methods in this thesis by training prediction models on breast cancer and neuroblastoma data sets. Here we show that the multivariate segmentation step achieves increased model stability without a loss in prediction quality. The dimension of the problem is reduced considerably (up to 200-fold fewer features), which makes multivariate segmentation more than just an effective means of predicting phenotypes. The method is also suitable as a preprocessing step for later integrative analyses with other data types. The interpretability of the models is improved as well. This enables the identification of important relations between copy number changes and phenotype. 
For example, we show that a co-amplification in the immediate neighborhood of the ERBB2 gene locus is a highly informative predictor for distinguishing inflammatory from non-inflammatory breast cancers. We thereby confirm the hypothesis, common in the literature, that the size of an amplicon is related to the cancer subtype. In the case of neuroblastoma tumors, we show that subgroups defined by patient age can be characterized by copy number patterns. In particular, this is possible when an age threshold of 16 to 18 months is used to define the groups, which is also where the highest prediction accuracy is obtained. Consequently, we provide further evidence for the recommendation to use a threshold higher than twelve months for the differential diagnosis of neuroblastoma
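The idea of super-features computed from a segmentation can be illustrated with a small sketch: given breakpoints shared across arrays, each region's probe values are summarized into one feature per sample. This is an assumption-laden illustration, not one of the thesis's three consensus segmentation algorithms; the breakpoints are taken as given, the mean is an assumed summary statistic, and the function name `super_features` is hypothetical.

```python
def super_features(profile, breakpoints):
    """Summarize one copy number profile into per-region means.

    `profile` is a list of probe-level copy number estimates in genomic
    order. `breakpoints` are probe indices where a new region starts;
    they are assumed to come from some consensus segmentation step.
    """
    bounds = [0] + sorted(breakpoints) + [len(profile)]
    features = []
    for start, end in zip(bounds, bounds[1:]):
        if end > start:  # skip empty regions from duplicate breakpoints
            region = profile[start:end]
            features.append(sum(region) / len(region))
    return features
```

Applying the same breakpoints to every array yields a feature matrix with one column per region instead of one per probe, which is the kind of dimension reduction the abstract describes before supervised classification.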

    Collaborative Filtering via Group-Structured Dictionary Learning

    Structured sparse coding and the related structured dictionary learning problems are novel research areas in machine learning. In this paper we present a new application of structured dictionary learning for collaborative filtering based recommender systems. Our extensive numerical experiments demonstrate that the presented technique outperforms its state-of-the-art competitors and has several advantages over approaches that do not put structured constraints on the dictionary elements. Comment: A compressed version of the paper has been accepted for publication at the 10th International Conference on Latent Variable Analysis and Source Separation (LVA/ICA 2012)