9 research outputs found

    Co-regulatory expression quantitative trait loci mapping: method and application to endometrial cancer

    Background: Expression quantitative trait loci (eQTL) studies have helped identify the genetic determinants of gene expression. Understanding the potential interacting mechanisms underlying such findings, however, is challenging. Methods: We describe a method to identify the trans-acting drivers of multiple gene co-expression, which reflects the action of regulatory molecules. This method, termed co-regulatory expression quantitative trait locus (creQTL) mapping, allows for evaluation of a more focused set of phenotypes within a clear biological context than conventional eQTL mapping. Results: Applying this method to a study of endometrial cancer revealed regulatory mechanisms supported by the literature: a creQTL between a locus upstream of STARD13/DLC2 and a group of seven IFNβ-induced genes. This suggests that the Rho-GTPase encoded by STARD13 regulates IFNβ-induced genes and the DNA damage response. Conclusions: Because of the importance of IFNβ in cancer, our results suggest that creQTL may provide a finer picture of gene regulation and may reveal additional molecular targets for intervention. An open source R implementation of the method is available at http://sites.google.com/site/kenkompass/.

    Unsupervised multiple kernel learning approaches for integrating molecular cancer patient data

    Cancer is the second leading cause of death worldwide. A characteristic of this disease is its complexity, which leads to a wide variety of genetic and molecular aberrations in the tumors. This heterogeneity necessitates personalized therapies for the patients. However, the cancer subtypes currently used in clinical practice for treatment decision-making are based on relatively few selected markers and thus provide only a coarse classification of tumors. The increased availability of multi-omics data measured for cancer patients now offers the possibility of defining more informed cancer subtypes. Such a fine-grained characterization of cancer subtypes has the potential to substantially expand treatment options in personalized cancer therapy. In this thesis, we identify comprehensive cancer subtypes using multidimensional data. For this purpose, we apply and extend unsupervised multiple kernel learning methods. Three challenges of unsupervised multiple kernel learning are addressed: robustness, applicability, and interpretability. First, we show that regularization of the multiple kernel graph embedding framework, which enables the implementation of dimensionality reduction techniques, can increase the stability of the resulting patient subgroups. This improvement is especially beneficial for data sets with a small number of samples. Second, we adapt the objective function of kernel principal component analysis to enable the application of multiple kernel learning in combination with this widely used dimensionality reduction technique. Third, we improve the interpretability of kernel learning procedures by performing feature clustering prior to integrating the data via multiple kernel learning. On the basis of these clusters, we derive a score indicating the impact of a feature cluster on a patient cluster, thereby facilitating further analysis of the cluster-specific biological properties. 
All three procedures are successfully tested on real-world cancer data. Comparing our newly derived methodologies to established methods provides evidence that our work offers novel and beneficial ways of identifying patient subgroups and gaining insights into medically relevant characteristics of cancer subtypes.
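The data integration step that multiple kernel learning relies on can be illustrated with a minimal sketch. It assumes one Gram matrix per data type and combines them as a convex combination (non-negative weights that sum to one). The function names, the toy data, and the fixed weights are illustrative assumptions; in the thesis the weights are learned as part of the dimensionality reduction objective, which is not reproduced here.

```python
def linear_kernel(samples):
    """Gram matrix of pairwise dot products between patient feature vectors."""
    return [[sum(a * b for a, b in zip(x, y)) for y in samples] for x in samples]


def combine_kernels(kernels, weights):
    """Weighted sum of kernel matrices; weights must be >= 0 and sum to 1."""
    if len(kernels) != len(weights):
        raise ValueError("need exactly one weight per kernel")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    n = len(kernels[0])
    return [[sum(w * k[i][j] for k, w in zip(kernels, weights))
             for j in range(n)]
            for i in range(n)]


# Toy data (assumed): three patients measured on two data types.
expression = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
methylation = [[2.0], [0.0], [2.0]]

# Fixed equal weights stand in for the learned kernel weights.
combined = combine_kernels(
    [linear_kernel(expression), linear_kernel(methylation)],
    [0.5, 0.5],
)
```

The combined matrix is again symmetric, so it can be passed to any kernel-based method such as kernel PCA in place of a single-data-type kernel.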

    Machine Learning Approaches for Cancer Analysis

    In addition, we propose several machine learning models, each contributing to the solution of a biological problem. First, we present Zseq, a linear-time method that identifies the most informative genomic sequences and reduces the number of biased sequences, sequence duplications, and ambiguous nucleotides. Zseq scores the complexity of each sequence by counting its unique k-mers and also takes into account other factors, such as ambiguous nucleotides or a high GC-content percentage in k-mers. Zseq then sweeps through the sequences again and filters out those whose z-score falls below a user-defined threshold. Zseq provides a better mapping rate and significantly reduces the number of ambiguous bases in comparison with other methods. The filtered reads were evaluated by aligning them and assembling the transcripts, using both a reference genome and de novo assembly. The assembled transcripts separate cancer and normal samples better than those obtained with another state-of-the-art method. Studying the abundance of select mRNA species throughout prostate cancer progression may provide some insight into the molecular mechanisms that advance the disease. In the second contribution of this dissertation, we show that a suitable combination of clustering method, distance function, and cluster validity index can identify outlier transcripts, that is, transcripts whose trend (abundance across the stages of prostate cancer) differs from that of the majority of transcripts. We compare this model with a standard hierarchical time-series clustering method based on Euclidean distance. 
Using time-series profile hierarchical clustering methods, we identified stage-specific mRNA species, termed outlier transcripts, that exhibit unique trending patterns compared with most other transcripts during disease progression. Unlike hierarchical clustering based on Euclidean distance, which finds patterns among the trending transcripts, this method identifies the outliers themselves. A wet-lab experiment on a biomarker (CAM2G gene) confirmed the result of the computational model. Genes related to these outlier transcripts were found to be strongly associated with cancer, and in particular with prostate cancer. Further investigation of these outlier transcripts in prostate cancer may identify them as potential stage-specific biomarkers that can predict the progression of the disease. Breast cancer, on the other hand, is widespread among women and accounts for a large share of cancer cases and deaths worldwide. Identifying the subtype of breast cancer plays a crucial role in selecting the best treatment. In the third contribution, we propose an optimized hierarchical classification model that predicts the breast cancer subtype. Suitable filter feature selection methods and new hybrid feature selection methods are used to find discriminative genes. Our proposed model achieves 100% accuracy in predicting breast cancer subtypes using the same number of genes or fewer. Studying breast cancer survivability among patients who received various treatments may help clarify the relationship between survivability and treatment, based on gene expression. In the fourth contribution, we have built a classifier system that predicts whether a given breast cancer patient who underwent hormone therapy, radiotherapy, or surgery will survive beyond five years after treatment. 
Our classifier is a tree-based hierarchical approach that partitions breast cancer patients based on survivability classes; each node in the tree is associated with a treatment and finds a subset of genes that best predicts whether a given patient will survive after that particular treatment. We applied our tree-based method to a gene expression dataset of 347 treated breast cancer patients and identified potential biomarker subsets with prediction accuracies ranging from 80.9% to 100%. We have further investigated the roles of many of these biomarkers in the literature. Studying gene expression across time intervals of breast cancer survival may provide insights into the recovery of the patients. Discovering gene indicators can be a crucial step in predicting survivability and in the management of breast cancer patients. In the fifth contribution, we propose a hierarchical clustering method that separates dissimilar groups of genes in time-series data as outliers. These isolated outliers, genes that trend differently from other genes, can serve as potential biomarkers of breast cancer survivability. In the last contribution, we introduce a method that uses machine learning techniques to identify transcripts that correlate with prostate cancer development and progression. We have isolated transcripts that have the potential to serve as prognostic indicators and may have significant value in guiding treatment decisions. Our study also supports PTGFR, NREP, scaRNA22, DOCK9, FLVCR2, IK2F3, USP13, and CLASP1 as potential biomarkers to predict prostate cancer progression, especially between stage II and subsequent stages of the disease.
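The two-pass filtering idea described for Zseq (score each read by its unique k-mers, then drop reads whose z-score falls below a threshold) can be sketched as follows. This is a simplified illustration and not the Zseq implementation: the function names, the default `k`, and the rule that k-mers containing ambiguous bases do not count toward the score are assumptions, and the GC-content adjustment mentioned in the abstract is omitted.

```python
from statistics import mean, pstdev


def unique_kmer_score(seq, k=4):
    """Number of distinct k-mers in seq that contain only A, C, G, or T."""
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return sum(1 for kmer in kmers if set(kmer) <= set("ACGT"))


def filter_reads(reads, k=4, z_threshold=0.0):
    """Keep reads whose complexity z-score is at or above z_threshold."""
    # First pass: one complexity score per read.
    scores = [unique_kmer_score(read, k) for read in reads]
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        # All reads are equally complex, so there is nothing to filter.
        return list(reads)
    # Second pass: drop reads whose z-score is below the threshold.
    return [read for read, score in zip(reads, scores)
            if (score - mu) / sigma >= z_threshold]


reads = [
    "ACGTTGCAAGCT",  # nine distinct 4-mers
    "AAAAAAAAAAAA",  # low complexity: a single distinct 4-mer
    "ACGTNNNNACGT",  # mostly ambiguous: only ACGT is counted
    "GATTACAGGCTA",  # nine distinct 4-mers
]
kept = filter_reads(reads, k=4, z_threshold=0.0)
```

Both passes visit each read once, which is consistent with the linear-time behaviour the abstract claims for the method.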

    Identifying disease-associated genes based on artificial intelligence

    Identifying disease-gene associations can help improve the understanding of disease mechanisms, which has a variety of applications, such as early diagnosis and drug development. Although experimental techniques, such as linkage analysis and genome-wide association studies (GWAS), have identified a large number of associations, identifying disease genes is still challenging because experimental methods are usually time-consuming and expensive. To address these issues, computational methods have been proposed to predict disease-gene associations. Based on their characteristics, the existing computational algorithms in the literature can be roughly divided into three categories: network-based methods, machine learning-based methods, and other methods. Whatever model is used to predict disease genes, the proper integration of multi-level biological data is the key to improving prediction accuracy. This thesis addresses some limitations of the existing computational algorithms and integrates multi-level data via artificial intelligence techniques. The thesis starts with a comprehensive review of computational methods, databases, and evaluation methods used in predicting disease-gene associations, followed by one network-based method and four machine learning-based methods. The first chapter introduces the background information, the objectives of the studies, and the structure of the thesis. The second chapter then provides a comprehensive review of the existing algorithms as well as the databases and evaluation methods used in existing studies. With the objectives and future directions established, the thesis then presents five computational methods for predicting disease-gene associations. The first method, proposed in Chapter 3, considers the issue of non-disease gene selection. A shortest path-based strategy is used to select reliable non-disease genes from a disease gene network and a differential network. 
The selected genes are then used by a network-energy model to improve its performance. The second method, proposed in Chapter 4, constructs sample-based networks for case samples and uses them to predict disease genes. This strategy improves the quality of protein-protein interaction (PPI) networks, which further improves the prediction accuracy. Chapter 5 presents a generic model that applies multimodal deep belief nets (DBN) to fuse different types of data. Network embeddings extracted from PPI networks and gene ontology (GO) data are fused with the multimodal DBN to obtain cross-modality representations. Chapter 6 presents another deep learning model, which uses a convolutional neural network (CNN) to integrate gene similarities with other types of data. Finally, the fifth method, proposed in Chapter 7, is based on nonnegative matrix factorization (NMF). This method maps diseases and genes onto a lower-dimensional manifold, and the geodesic distance between diseases and genes is used to predict their associations. The method can predict disease genes even if the disease under consideration has no known associated genes. In summary, this thesis proposes several artificial intelligence-based computational algorithms to address typical issues in existing computational algorithms. Experimental results show that the proposed methods can improve the accuracy of disease-gene prediction.
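The abstract does not spell out the shortest path-based strategy of Chapter 3, so the following is only one plausible reading, stated as an assumption: genes that lie far (in shortest-path terms) from every known disease gene are treated as reliable non-disease genes. The network, the gene labels, and the `min_distance` cutoff are all hypothetical, and the differential network mentioned in the abstract is not modelled.

```python
from collections import deque


def distances_from_disease_genes(network, disease_genes):
    """Multi-source BFS: hop distance from each gene to its nearest disease gene."""
    dist = {gene: 0 for gene in disease_genes if gene in network}
    queue = deque(dist)
    while queue:
        gene = queue.popleft()
        for neighbour in network.get(gene, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[gene] + 1
                queue.append(neighbour)
    return dist


def select_non_disease_genes(network, disease_genes, min_distance=2):
    """Genes at least min_distance hops from every known disease gene.

    Genes that cannot be reached from any disease gene count as infinitely far.
    """
    dist = distances_from_disease_genes(network, disease_genes)
    return sorted(gene for gene in network
                  if dist.get(gene, float("inf")) >= min_distance)


# Hypothetical undirected gene network as an adjacency list.
network = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C"],
    "E": [],
}
negatives = select_non_disease_genes(network, {"A"}, min_distance=2)
```

Running the search from all disease genes at once keeps the cost at a single breadth-first traversal of the network, regardless of how many disease genes are known.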

    BAYESIAN FRAMEWORKS FOR PARSIMONIOUS MODELING OF MOLECULAR CANCER DATA

    In this era of precision medicine, clinicians and researchers critically need the assistance of computational models that can accurately predict various clinical events and outcomes (e.g., diagnosis of disease, determining the stage of the disease, or molecular subtyping). Typically, statistics and machine learning are applied to ‘omic’ datasets, yielding computational models that can be used for prediction. In cancer research there is still a critical need for computational models that have high classification performance but are also parsimonious in the number of variables they use. Some models perform their intended classification task very well but use so many variables that they are too complex for human researchers and clinicians to understand. In contrast, some models are specifically built with a small number of variables but may lack excellent predictive performance. This dissertation proposes a novel framework, called Junction to Knowledge (J2K), for the construction of parsimonious computational models. The J2K framework consists of four steps: filtering (discretization and variable selection), Bayesian network generation, junction tree generation, and clique evaluation. The outcome of applying J2K to a particular dataset is a parsimonious Bayesian network model that combines high predictive performance with a small number of variables. J2K not only finds parsimonious gene cliques but also makes it possible to create multi-omic models that can further improve classification performance. These multi-omic models have the potential to accelerate biomedical discovery, followed by translation of their results into clinical practice.

    Hypergraph Models for Discovering Co-Regulatory Genetic Interactions

    Thesis (Ph.D.), Seoul National University Graduate School, Interdisciplinary Program in Bioinformatics, February 2014. 장병탁. A comprehensive understanding of biological systems requires the analysis of higher-order interactions among many genomic factors. Various genomic factors cooperate to affect biological processes including cancer occurrence, progression, and metastasis. However, the complexity of genomic interactions presents a major barrier to identifying their co-regulatory roles and functional effects. Thus, this dissertation addresses the problem of analyzing complex relationships among many genomic factors in biological processes including cancers. We propose a hypergraph approach for modeling, learning, and extracting: explicitly modeling higher-order genomic interactions, efficiently learning based on evolutionary methods, and effectively extracting biological knowledge from the model. A hypergraph model is a higher-order graphical model that explicitly represents complex relationships among many variables from high-dimensional data. This property makes the proposed model suitable for analyzing biological and medical phenomena characterized by higher-order interactions between various genomic factors. This dissertation proposes advanced hypergraph-based models, in terms of both learning methods and model structures, to analyze large-scale biological data with a focus on identifying co-regulatory genomic interactions at a genome-wide level. We introduce an evolutionary approach based on information-theoretic criteria into the learning mechanism to efficiently search the huge problem space that reflects higher-order interactions between factors. This evolutionary learning is explained from the perspective of a sequential Bayesian sampling framework. Also, a hierarchy is introduced into the hypergraph model for modeling hierarchical genomic relationships. 
This hierarchical structure allows the hypergraph model to explicitly represent gene regulatory circuits as functional blocks or groups across the levels of epigenetic, transcriptional, and post-transcriptional regulation. Moreover, the proposed graph-analyzing method is able to grasp the global structures of biological systems, such as genomic modules and regulatory networks, by analyzing the learned model structures. The proposed model is applied to cancer genomics, a major topic in current biology and medicine. We show that our model matches or outperforms state-of-the-art models on multiple cancer genomic data sets. Furthermore, the proposed model is capable of discovering new or hidden patterns as candidates for potential gene regulatory circuits, such as gene modules, miRNA-mRNA networks, and multiple genomic interactions, associated with a specific cancer. The results of these analyses can provide crucial evidence that paves the way for identifying unknown functions in the cancer system. The proposed hypergraph model will contribute to elucidating core regulatory mechanisms and to a comprehensive understanding of biological processes including cancers.

    Contents:
    1 Introduction
      1.1 Background and Motivation
      1.2 Problems to be Addressed
      1.3 The Proposed Approach and its Contribution
      1.4 Organization of the Dissertation
    2 Related Work
      2.1 Analysis of Co-Regulatory Genomic Interactions from Omics Data
      2.2 Probabilistic Graphical Models for Biological Problems
        2.2.1 Bayesian Networks
        2.2.2 Markov Random Fields
        2.2.3 Hidden Markov Models
      2.3 Higher-order Graphical Models for Biological Problems
        2.3.1 Higher-Order Models
        2.3.2 Hypergraphs
    3 Hypergraph Classifiers for Identifying Prognostic Modules in Cancer
      3.1 Overview
      3.2 Analyzing Gene Modules for Cancer Prognosis Prediction
      3.3 Hypergraph Classifiers for Identifying Cancer Gene Modules
        3.3.1 Hypergraph Classifiers
        3.3.2 Bayesian Evolutionary Algorithm
        3.3.3 Bayesian Evolutionary Learning for Hypergraph Classifiers
      3.4 Predicting Cancer Clinical Outcomes Based on Gene Modules
        3.4.1 Data and Experimental Settings
        3.4.2 Prediction Performance
        3.4.3 Model Analysis
        3.4.4 Identification of Prognostic Gene Modules
      3.5 Summary
    4 Hypergraph-based Models for Constructing Higher-Order miRNA-mRNA Interaction Networks in Cancer
      4.1 Overview
      4.2 Analyzing Relationships between miRNAs and mRNAs from Heterogeneous Data
      4.3 Hypergraph-based Models for Identifying miRNA-mRNA Interactions
        4.3.1 Hypergraph-based Models
        4.3.2 Learning Hypergraph-based Models
        4.3.3 Building Interaction Networks from Hypergraphs
      4.4 Constructing miRNA-mRNA Interaction Networks Based on Higher-Order Relationships
        4.4.1 Data and Experimental Settings
        4.4.2 Classification Performance
        4.4.3 Model Evaluation
        4.4.4 Constructed Higher-Order miRNA-mRNA Interaction Networks in Prostate Cancer
        4.4.5 Functional Analysis of the Constructed Interaction Networks
      4.5 Summary
    5 Hierarchical Hypergraphs for Identifying Higher-Order Genomic Interactions in Multilevel Regulation
      5.1 Overview
      5.2 Analyzing Epigenetic and Genetic Interactions from Multiple Genomic Data
      5.3 Hierarchical Hypergraphs for Identifying Epigenetic and Genetic Interactions
        5.3.1 Hierarchical Hypergraphs
        5.3.2 Learning Hierarchical Hypergraphs
      5.4 Identifying Higher-Order Genomic Interactions in Multilevel Regulation
        5.4.1 Data and Experimental Settings
        5.4.2 Identified Higher-Order miRNA-mRNA Interactions Induced by DNA Methylation in Ovarian Cancer
      5.5 Summary
    6 Concluding Remarks
      6.1 Summary of the Dissertation
      6.2 Directions for Further Research
    Bibliography
    Abstract (in Korean)

    Biclustering strategies for genetic marker selection in gynecologic tumor cell lines

    Summary: Over the past few decades, great interest has focused on cell lines derived from tumors because of their usefulness as models for understanding the biology of cancer. At the same time, advanced technologies such as DNA microarrays have been broadly used to study the expression level of thousands of genes in primary tumors or cancer cell lines in a single experiment. Results from microarray analysis approaches have provided valuable insights into the underlying biology and have proven useful for tumor classification, prognostication, and prediction. Our approach uses biclustering methods to discover genes with coherent expression across a subset of conditions (cell lines of a tumor type). More specifically, we present a novel modification of Cheng & Church's algorithm that searches for differences across the studied conditions but also enforces consistent intensity characteristics of each cluster within each condition. Applied to a gynecologic panel of cell lines, this approach succeeds in deriving discriminant groups of compact biclusters across four types of tumor cell lines. In this form, the proposed approach proves efficient for the derivation of tumor-specific markers. Presented at: 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society
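For context, the score at the core of the original Cheng & Church algorithm is the mean squared residue of a submatrix, where a value near zero indicates coherent (additive) expression across the selected genes and conditions. The sketch below computes only that standard score; the modification proposed in this abstract is not described in enough detail to reproduce, and the function name and toy matrix are illustrative assumptions.

```python
def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue H(I, J) of the bicluster given by rows I and columns J.

    The residue of a cell is a_ij - a_iJ - a_Ij + a_IJ, where a_iJ is the row mean,
    a_Ij the column mean, and a_IJ the overall mean within the bicluster.
    """
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(rows), len(cols)
    row_means = [sum(row) / n_cols for row in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = sum((sub[i][j] - row_means[i] - col_means[j] + overall) ** 2
                for i in range(n_rows)
                for j in range(n_cols))
    return total / (n_rows * n_cols)


# Toy expression matrix (genes x cell lines): the first two genes differ only by a
# constant shift, while the third gene follows an unrelated pattern.
expression = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [9.0, 0.0, 5.0],
]
coherent = mean_squared_residue(expression, rows=[0, 1], cols=[0, 1, 2])
with_outlier = mean_squared_residue(expression, rows=[0, 1, 2], cols=[0, 1, 2])
```

Here the two-gene bicluster scores zero, and adding the third gene raises the score, which is the signal the algorithm uses when deciding which rows or columns to delete.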