195 research outputs found

    Unsupervised multiple kernel learning approaches for integrating molecular cancer patient data

    Cancer is the second leading cause of death worldwide. A characteristic of this disease is its complexity, which leads to a wide variety of genetic and molecular aberrations in the tumors. This heterogeneity necessitates personalized therapies for the patients. However, the cancer subtypes currently used in clinical practice for treatment decision-making are based on relatively few selected markers and thus provide only a coarse classification of tumors. The increasing availability of multi-omics data measured for cancer patients now offers the possibility of defining more informed cancer subtypes. Such a fine-grained characterization of cancer subtypes has the potential to substantially expand treatment options in personalized cancer therapy. In this thesis, we identify comprehensive cancer subtypes using multidimensional data. For this purpose, we apply and extend unsupervised multiple kernel learning methods. Three challenges of unsupervised multiple kernel learning are addressed: robustness, applicability, and interpretability. First, we show that regularization of the multiple kernel graph embedding framework, which enables the implementation of dimensionality reduction techniques, can increase the stability of the resulting patient subgroups. This improvement is especially beneficial for data sets with a small number of samples. Second, we adapt the objective function of kernel principal component analysis to enable the application of multiple kernel learning in combination with this widely used dimensionality reduction technique. Third, we improve the interpretability of kernel learning procedures by performing feature clustering prior to integrating the data via multiple kernel learning. On the basis of these clusters, we derive a score indicating the impact of a feature cluster on a patient cluster, thereby facilitating further analysis of the cluster-specific biological properties.
All three procedures are successfully tested on real-world cancer data. Comparing our newly derived methodologies to established methods provides evidence that our work offers novel and beneficial ways of identifying patient subgroups and gaining insights into medically relevant characteristics of cancer subtypes.

Cancer is one of the most common causes of death worldwide. It is characterized by its complexity, which leads to many different genetic and molecular aberrations in the tumor. The differences between tumors require personalized therapies for individual patients. The cancer subtypes currently used for treatment planning in clinical practice are based on relatively few genetic or molecular markers and can therefore provide only a coarse division of tumors. The increasing availability of multi-omics data for cancer patients makes it possible to redefine better-informed cancer subtypes, which in turn could lead to more specific treatments for cancer patients. In this dissertation, we identify new potential cancer subtypes based on multi-omics data. For this purpose, we use unsupervised multiple kernel learning, which is able to combine several data types. Three challenges of unsupervised multiple kernel learning are addressed: robustness, applicability, and interpretability. First, we show that additional regularization of the multiple kernel learning framework for implementing various dimensionality reduction techniques increases the stability of the identified patient groups. This robustness is particularly advantageous for data sets with a small number of samples. Second, we adapt the objective function of kernel principal component analysis to enable an integrative version of this widely used dimensionality reduction technique. Third, we improve the interpretability of kernel-based learning procedures by dividing the features used into homogeneous groups before integrating the data. Using these groups, we define a scoring function that facilitates further evaluation of the biological properties of patient groups. All three methods are tested on real cancer data. The comparison of our methodology with established methods demonstrates that our work offers new and useful ways of identifying integrative patient groups and gaining insights into medically relevant properties of cancer subtypes.
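
The kernel-integration idea in this abstract can be illustrated with a small sketch. It builds one Gaussian kernel per data type, combines them with a weighted sum, and extracts the leading kernel principal component by power iteration. Everything here is an assumption made for illustration: the toy `expression` and `methylation` values, the `gamma` setting, and above all the fixed `weights`, which a multiple kernel learning method would learn rather than set by hand. It is not the thesis's algorithm.

```python
import math

def rbf_kernel(rows, gamma):
    """Gaussian (RBF) kernel matrix between all pairs of rows."""
    return [[math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))
             for y in rows] for x in rows]

def combine_kernels(kernels, weights):
    """Weighted sum of equally sized kernel matrices."""
    n = len(kernels[0])
    return [[sum(w * K[i][j] for K, w in zip(kernels, weights))
             for j in range(n)] for i in range(n)]

def center_kernel(K):
    """Double-center a symmetric kernel (centering in feature space)."""
    n = len(K)
    row_means = [sum(row) / n for row in K]
    total_mean = sum(row_means) / n
    return [[K[i][j] - row_means[i] - row_means[j] + total_mean
             for j in range(n)] for i in range(n)]

def leading_component(K, iterations=500):
    """Power iteration for the leading eigenpair of a symmetric PSD matrix."""
    n = len(K)
    v = [float(i + 1) for i in range(n)]  # deterministic start vector
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(K[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
        eigenvalue = norm  # converges to the leading eigenvalue
    return eigenvalue, v

# Hypothetical toy data: six patients, two data types, two obvious groups.
expression = [[0.1, 0.2], [0.0, 0.3], [0.2, 0.1],
              [2.0, 2.1], [2.2, 1.9], [1.9, 2.0]]
methylation = [[1.0], [1.1], [0.9], [3.0], [3.2], [2.9]]

kernels = [rbf_kernel(expression, 0.5), rbf_kernel(methylation, 0.5)]
weights = [0.5, 0.5]  # assumption: fixed weights; MKL would learn these
centered = center_kernel(combine_kernels(kernels, weights))

eigenvalue, vector = leading_component(centered)
# Projection of each patient onto the first kernel principal component.
scores = [math.sqrt(eigenvalue) * x for x in vector]
print([round(s, 2) for s in scores])
```

On this toy input the sign of the first component separates the two patient groups, which is the kind of low-dimensional embedding a subsequent clustering step would work on.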

    Dissection of Complex Genetic Correlations into Interaction Effects

    Living systems are overwhelmingly complex and consist of many interacting parts. Even the quantitative characterization of a single human cell type at the genetic level requires measuring the expression of at least 20,000 genes. It remains a big challenge for theoretical approaches to discover patterns in these signals that represent specific interactions in such systems. A major problem is that available standard procedures summarize gene expressions in a hard-to-interpret way. For example, principal components represent axes of maximal variance in the gene vector space and thus often correspond to a superposition of multiple different gene regulation effects (e.g. I.1.4). Here, a novel approach to analyze and interpret such complex data is developed (Chapter II). It is based on an extremum principle that identifies an axis in the gene vector space to which as many samples as possible are correlated as highly as possible (II.3). This axis is maximally specific and thus most probably corresponds to exactly one gene regulation effect, making it considerably easier to interpret than principal components. To stabilize and optimize effect discovery, axes in the sample vector space are identified simultaneously. Genes and samples are always handled symmetrically by the algorithm. While sufficient for effect discovery, effect axes can only linearly approximate regulation laws. To represent a broader class of nonlinear regulations, including saturation effects or activity thresholds (e.g. II.1.1.2), a bimonotonic effect model is defined (II.2.1.2). A corresponding regression is realized that is monotonic over projections of samples (or genes) onto discovered gene (or sample) axes. The resulting effect curves can approximate regulation laws precisely (II.4.1). This makes it possible to dissect exclusively the discovered effect from the signal (II.4.2). Signal parts from other potentially overlapping effects remain untouched. This continues iteratively.
In this way, the high-dimensional initial signal (II.2.1.1) can be dissected into highly specific effects. Method validation demonstrates that superposed effects of various sizes, shapes and signal strengths can be dissected reliably (II.6.2). Simulated laws of regulation are reconstructed with high correlation. Detection limits, e.g. for signal strength or for missing values, lie above practical requirements (II.6.4). The novel approach is systematically compared with standard procedures such as principal component analysis. Signal dissection is shown to have clear advantages, especially for many overlapping effects of comparable size (II.6.3). Cancer cells are an ideal test field for such approaches, as they may be driven by multiple overlapping gene regulation networks that are largely unknown. Additionally, quantification and classification of cancer cells by their particular set of driving gene regulations is a prerequisite for precision medicine. To validate the novel method against real biological data, it is applied to gene expressions of over 1000 tumor samples from Diffuse Large B-Cell Lymphoma (DLBCL) patients (Chapter III). Two already known subtypes of this disease (cf. I.1.2.1) with significantly different survival following the same chemotherapy were originally also discovered as a gene expression effect. These subtypes can only be precisely determined by this effect at the molecular level. Such previous results offer a possibility for method validation and indeed, this effect has been rediscovered in an unsupervised manner (III.3.2.2). Several additional biologically relevant effects have been discovered and validated across four patient cohorts. Multivariate analyses (III.2) identify combinations of validated effects that can predict significant differences in patient survival. One novel effect possesses an even higher predictive value (cf. III.2.5.1) than the rediscovered subtype effect and is genetically more specific (cf. III.3.3.1).
A trained and validated Cox survival model (III.2.5) can predict significant survival differences within known DLBCL subtypes (III.2.5.6), demonstrating that they are genetically heterogeneous as well. Detailed biostatistical evaluations of all survival effects (III.3.3) may help to clarify the molecular pathogenesis of DLBCL. Furthermore, the applicability of signal dissection is not limited to biological data. For instance, dissecting spectral energy distributions of stars observed in astrophysics might be useful to discover laws of light emission.
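
The step of fitting a monotonic effect curve over sample projections and then removing only that effect from the signal can be sketched in one dimension. The sketch below uses isotonic regression via the pool-adjacent-violators algorithm, a standard technique substituted here for illustration; the bimonotonic regression described in the abstract is monotonic over both genes and samples and may differ substantially. The `projections` and `expression` values are hypothetical.

```python
def isotonic_fit(values):
    """Least-squares non-decreasing fit (pool-adjacent-violators)."""
    blocks = []  # each block holds [sum of values, number of values]
    for v in values:
        blocks.append([v, 1])
        # Merge backwards while the previous block mean exceeds the last one.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted

# Hypothetical toy data: projections of six samples onto a discovered axis
# and the expression of one gene that saturates along that axis.
projections = [0.1, 0.4, 0.2, 0.9, 0.7, 0.5]
expression = [1.0, 2.0, 1.5, 3.0, 3.1, 2.9]

# Fit the monotonic curve over the samples ordered by their projection.
order = sorted(range(len(projections)), key=lambda i: projections[i])
fitted_sorted = isotonic_fit([expression[i] for i in order])

# Map the fitted effect back to the original sample order and subtract it.
effect = [0.0] * len(expression)
for rank, i in enumerate(order):
    effect[i] = fitted_sorted[rank]
residual = [y - f for y, f in zip(expression, effect)]
print([round(r, 3) for r in residual])
```

The residual is what would be passed on to a next round of effect discovery in an iterative scheme of this kind.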

    Survival-Related Clustering of Cancer Patients by Integrating Clinical and Biological Datasets

    Subtype-based treatments and drug therapies are essential considerations in clinical trials for cancer patients, since they enable appropriate personalized therapies. With the advancement of next-generation sequencing technology, several computational models have emerged that integrate genomic and transcriptomic datasets (i.e., multi-omics) to predict subtype-based classifications of cancer patients. However, integrating prognostic features from clinical data, which relate to survival risk, with multi-omics datasets for subtype prediction has received limited attention and is an important research area to explore. In this study, we proposed a data integration pipeline that combines prognostic features from clinical data with multi-omics datasets to predict survival-risk-based subtypes in Kidney Renal Clear Cell Carcinoma (KIRC) patients from The Cancer Genome Atlas (TCGA) database. First, we applied an unsupervised clustering algorithm to KIRC patients and clustered them into two survival-risk-based subgroups, i.e., subtypes. Then, using the clustering-based subtype labels as class labels for cancer patients, we trained a supervised classification model to determine the class label of unlabeled patients. In our clustering step, we applied a multivariate Cox Proportional Hazards (Cox-PH) model to select the survival-related, prognostically significant features (p-value < 0.05) from the patients' multivariate clinical data. Then, we used the Silhouette Coefficient to determine the optimal number (k) of clusters. In our classification step, we integrated high-dimensional multi-omics datasets with three different data modalities (gene expression, microRNA expression, and DNA methylation). We utilized a dimension-reduction approach, followed by a univariate Cox-PH model for each reduced data modality with patients' survival status. Then, we selected the survival-related reduced omics features for our classification model.
In this step, we applied a supervised classification method with 10-fold cross-validation to assess the accuracy of our survival-based subtype prediction. We tested multiple machine learning and deep learning algorithms in different steps of the pipeline for clustering (K-means, K-modes, and Gaussian mixture model), dimension reduction (Denoising Autoencoder and Principal Component Analysis), and classification (Support Vector Machine and Random Forest). We proposed the optimized model with the highest survival-specific subtype classification accuracy as the final model.
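
Choosing the number of clusters with the Silhouette Coefficient, as mentioned in the clustering step, might look roughly like the sketch below. It is a minimal illustration, not the study's pipeline: the two-dimensional `points` are hypothetical stand-ins for selected clinical features, the `kmeans` helper is a plain Lloyd's algorithm with a deterministic farthest-point start, and the Cox-PH feature selection that precedes clustering in the abstract is not shown.

```python
import math

def dist(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans(points, k, iterations=50):
    """Lloyd's algorithm with deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: dist(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster runs empty
                centers[c] = tuple(sum(x) / len(members)
                                   for x in zip(*members))
    return labels

def silhouette(points, labels):
    """Mean silhouette coefficient; singleton clusters score 0."""
    clusters = sorted(set(labels))
    scores = []
    for i, p in enumerate(points):
        own = [q for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)
            continue
        a = sum(dist(p, q) for q in own) / len(own)  # mean intra-cluster distance
        b = min(sum(dist(p, q) for q, l in zip(points, labels) if l == c)
                / labels.count(c)
                for c in clusters if c != labels[i])  # nearest other cluster
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical toy data: eight patients described by two selected features.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2),
          (3.0, 3.1), (3.2, 2.9), (2.9, 3.3), (3.1, 3.0)]

labels_by_k = {k: kmeans(points, k) for k in range(2, 5)}
scores_by_k = {k: silhouette(points, labels)
               for k, labels in labels_by_k.items()}
best_k = max(scores_by_k, key=scores_by_k.get)
print(best_k, {k: round(s, 3) for k, s in scores_by_k.items()})
```

The labels obtained for `best_k` would then serve as class labels for the supervised classification step.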