153 research outputs found
Hybrid Approach of Relation Network and Localized Graph Convolutional Filtering for Breast Cancer Subtype Classification
Network biology has been used successfully to help reveal complex mechanisms
of disease, especially cancer. However, network biology requires
in-depth knowledge to construct disease-specific networks, and our current
knowledge is very limited even with recent advances in human cancer
biology. Deep learning has shown great potential to address difficult
situations like this. However, deep learning technologies conventionally use
grid-like structured data, so their application to
the classification of human disease subtypes is yet to be explored. Recently,
graph-based deep learning techniques have emerged, which offer an opportunity
to leverage analyses in network biology. In this paper, we propose a hybrid
model that integrates two key components: 1) a graph convolutional neural network
(graph CNN) and 2) a relation network (RN). We use the graph CNN as a component
to learn expression patterns of cooperative gene communities, and the RN as a
component to learn associations between the learned patterns. The proposed model is
applied to the PAM50 breast cancer subtype classification task, the standard
breast cancer subtype classification of clinical utility. In experiments on
both subtype classification and patient survival analysis, our proposed method
achieved significantly better performance than existing methods. We believe
that this work is an important starting point for realizing the upcoming
personalized medicine.
Comment: 8 pages, To be published in proceeding of IJCAI 201
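The two components named in this abstract can be illustrated with a minimal sketch. It is not the authors' model: plain neighbourhood averaging stands in for the localized graph convolutional filter, the relation network follows the general form f(sum of g(o_i, o_j)) over ordered pairs of distinct objects, and the names `graph_filter`, `relation_network`, `g`, and `f` are hypothetical.

```python
from itertools import permutations
from typing import Callable, List, Sequence

Vector = List[float]


def graph_filter(adjacency: Sequence[Sequence[int]],
                 features: Sequence[Vector]) -> List[Vector]:
    """Average each node's features with those of its neighbours (self included).

    Assumption: this simple averaging is a stand-in for a learned,
    localized graph convolution; it has no trainable weights.
    """
    smoothed = []
    for i, row in enumerate(adjacency):
        neighbours = [j for j, connected in enumerate(row) if connected or j == i]
        dim = len(features[i])
        smoothed.append([
            sum(features[j][d] for j in neighbours) / len(neighbours)
            for d in range(dim)
        ])
    return smoothed


def relation_network(objects: Sequence[Vector],
                     g: Callable[[Vector, Vector], Vector],
                     f: Callable[[Vector], float]) -> float:
    """Apply g to every ordered pair of distinct objects, sum, then apply f."""
    pair_outputs = [g(a, b) for a, b in permutations(objects, 2)]
    dim = len(pair_outputs[0])
    pooled = [sum(p[d] for p in pair_outputs) for d in range(dim)]
    return f(pooled)


# Toy example: three genes on a path graph 0 - 1 - 2, one expression value each.
adjacency = [[0, 1, 0],
             [1, 0, 1],
             [0, 1, 0]]
expression = [[1.0], [2.0], [6.0]]

patterns = graph_filter(adjacency, expression)
score = relation_network(
    patterns,
    g=lambda a, b: [a[0] * b[0]],  # toy pairwise relation
    f=lambda pooled: pooled[0],    # toy readout
)
```

In a trained model `g` and `f` would be small neural networks and the graph filter would carry learned weights; the sketch only shows how pairwise relations are computed on top of graph-smoothed features.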
Machine Learning Models for Deciphering Regulatory Mechanisms and Morphological Variations in Cancer
The exponential growth of multi-omics biological datasets is driving a paradigm shift in fundamental biological research. In recent years, imaging and transcriptomics datasets have been increasingly incorporated into biological studies, pushing biology further into the domain of data-intensive sciences. New approaches and tools from statistics, computer science, and data engineering are profoundly influencing biological research. Harnessing this ever-growing deluge of multi-omics biological data requires the development of novel and creative computational approaches. In parallel, fundamental research in data science and Artificial Intelligence (AI) has advanced tremendously, allowing the scientific community to generate a massive amount of knowledge from data. Advances in Deep Learning (DL), in particular, are transforming many branches of engineering, science, and technology. Several of these methodologies have already been adapted for harnessing biological datasets; however, there is still a need to further adapt and tailor these techniques to new and emerging technologies.
In this dissertation, we present computational algorithms and tools that we have developed to study gene-regulation and cellular morphology in cancer. The models and platforms that we have developed are general and widely applicable to several problems relating to dysregulation of gene expression in diseases. Our pipelines and software packages are disseminated in public repositories for larger scientific community use.
This dissertation is organized in three main projects. In the first project, we present Causal Inference Engine (CIE), an integrated platform for the identification and interpretation of active regulators of transcriptional response. The platform offers visualization tools and pathway enrichment analysis to map predicted regulators to Reactome pathways. We provide a parallelized R-package for fast and flexible directional enrichment analysis to run the inference on custom regulatory networks. Next, we designed and developed MODEX, a fully automated text-mining system to extract and annotate causal regulatory interaction between Transcription Factors (TFs) and genes from the biomedical literature. MODEX uses putative TF-gene interactions derived from high-throughput ChIP-Seq or other experiments and seeks to collect evidence and meta-data in the biomedical literature to validate and annotate the interactions. MODEX is a complementary platform to CIE that provides auxiliary information on CIE inferred interactions by mining the literature.
In the second project, we present a Convolutional Neural Network (CNN) classifier to perform a pan-cancer analysis of tumor morphology, and predict mutations in key genes. The main challenges were to determine morphological features underlying a genetic status and assess whether these features were common in other cancer types. We trained an Inception-v3 based model to predict TP53 mutation in five cancer types with the highest rate of TP53 mutations. We also performed a cross-classification analysis to assess shared morphological features across multiple cancer types. Further, we applied a similar methodology to classify HER2 status in breast cancer and predict response to treatment in HER2 positive samples. For this study, our training slides were manually annotated by expert pathologists to highlight Regions of Interest (ROIs) associated with HER2+/- tumor microenvironment. Our results indicated that there are strong morphological features associated with each tumor type. Moreover, our predictions highly agree with manual annotations in the test set, indicating the feasibility of our approach in devising an image-based diagnostic tool for HER2 status and treatment response prediction. We have validated our model using samples from an independent cohort, which demonstrates the generalizability of our approach.
Finally, in the third project, we present an approach that uses spatial transcriptomics data to predict spatially resolved active gene regulatory mechanisms in tissues. Using spatial transcriptomics, we identified tissue regions with differentially expressed genes and applied our CIE methodology to predict active TFs that can potentially regulate the marker genes in the region. This project bridged the gap between inference of active regulators using molecular data and morphological studies using images. The results demonstrate a significant local pattern in TF activity across the tissue, indicating differential spatial regulation in tissues. The results suggest that the integrative analysis of spatial transcriptomics data with CIE can capture discriminant features and identify localized TF-target links in the tissue.
Machine Learning Methods for Decoding Information in RNA Interactions and DNA Sequences
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, 2020. 2. Kim Sun.
Phenotypic differences among organisms are mainly due to differences in genetic information. As a result of changes in genetic information, an organism may evolve into a different species, and patients with the same disease may have different prognoses. This important biological information can be observed in the form of various omics data using high-throughput instrument technologies such as sequencing instruments. However, interpreting such omics data is challenging because omics data have very high dimensionality but a relatively small number of samples. Typically, the number of dimensions is higher than the number of samples, which makes the interpretation of omics data one of the most challenging machine learning problems.
My doctoral study aims to develop new bioinformatics methods for decoding information in these high dimensional data by utilizing machine learning algorithms.
The first study analyzes the difference in the amount of information between different regions of the DNA sequence. To achieve this goal, a rank-based k-spectrum string kernel, the RKSS kernel, is developed for comparative and evolutionary comparison of various genomic region sequences among multiple species. The RKSS kernel extends the existing k-spectrum string kernel by utilizing rank information of k-mers and landmarks of k-mers that represent a species. By using a landmark as a reference point for comparison, the number of k-mers needed to calculate sequence similarities is dramatically reduced. In the experiments on three different genomic regions, the RKSS kernel captured more reliable distances between species according to the genetic information content of the target region. The RKSS kernel was also able to order the regions in a way that matches common biological insight.
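As a rough illustration of the baseline that the RKSS kernel extends, the standard k-spectrum string kernel can be sketched as the inner product of k-mer count vectors. The rank information and landmarks described above are not implemented here, and the function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count every overlapping substring of length k in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def k_spectrum_kernel(seq_a: str, seq_b: str, k: int) -> int:
    """Inner product of the two k-mer count vectors."""
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # Only k-mers present in both sequences contribute to the sum.
    return sum(count * counts_b[kmer] for kmer, count in counts_a.items())


def normalized_kernel(seq_a: str, seq_b: str, k: int) -> float:
    """Cosine-style normalization so that a sequence scores 1.0 with itself."""
    denom = sqrt(k_spectrum_kernel(seq_a, seq_a, k) * k_spectrum_kernel(seq_b, seq_b, k))
    return k_spectrum_kernel(seq_a, seq_b, k) / denom if denom else 0.0


similarity = k_spectrum_kernel("ACGTACG", "ACGACG", 3)  # shared 3-mer: ACG
```

Because the number of possible k-mers grows as 4^k for DNA, the full spectrum becomes expensive for large k, which is the cost that a landmark-based comparison is meant to reduce.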
The second study aims to efficiently decode complex genetic interactions using biological networks and then to classify cancer subtypes by interpreting biological functions. To achieve this goal, a pathway-based deep learning model for cancer subtype classification is developed, using a graph convolutional network and a multi-attention based ensemble (GCN+MAE). To efficiently reduce the relationships between genes using pathway information, GCN+MAE is designed as an explainable deep learning structure that combines a graph convolutional network with an attention mechanism. The extracted pathway-level information on cancer subtypes is then propagated back to the gene level by network propagation. In experiments on five cancer data sets, GCN+MAE showed better cancer subtype classification performance and captured subtype-specific pathways and their biological functions.
The third study identifies sub-networks of a biological pathway. The goal is to dissect a biological pathway into multiple sub-networks, each of which forms a single functional unit. To achieve this goal, MIDAS (MIning Differentially Activated Subpaths), a condition-specific sub-module detection method for biological networks, is developed. Edge activities in the pathway are measured from gene expression and network topology. Using these activities, differentially activated subpaths are identified by a statistical approach. In addition, the idea is extended to a graph convolutional network, in which different sub-networks are highlighted by attention mechanisms. In the experiment with breast cancer data, MIDAS and the deep learning model successfully decomposed gene-level features into sub-modules of single functions.
In summary, my doctoral study proposes new computational methods to compare genomic DNA sequences as information contents, to model pathway-based cancer subtype classifications and regulations, and to identify condition-specific sub-modules among multiple cancer subtypes.
Chapter 1 Introduction
1.1 Biological questions with genetic information
1.1.1 Biological Sequences
1.1.2 Gene expression
1.2 Formulating computational problems for the biological questions
1.2.1 Decoding biological sequences by k-mer vectors
1.2.2 Interpretation of complex relationships between genes
1.3 Three computational problems for the biological questions
1.4 Outline of the thesis
Chapter 2 Ranked k-spectrum kernel for comparative and evolutionary comparison of DNA sequences
2.1 Motivation
2.1.1 String kernel for sequence comparison
2.1.2 Approach: RKSS kernel
2.2 Methods
2.2.1 Mapping biological sequences to k-mer space: the k-spectrum string kernel
2.2.2 The ranked k-spectrum string kernel with a landmark
2.2.3 Single landmark-based reconstruction of phylogenetic tree
2.2.4 Multiple landmark-based distance comparison of exons, introns, CpG islands
2.2.5 Sequence Data for analysis
2.3 Results
2.3.1 Reconstruction of phylogenetic tree on the exons, introns, and CpG islands
2.3.2 Landmark space captures the characteristics of three genomic regions
2.3.3 Cross-evaluation of the landmark-based feature space
Chapter 3 Pathway-based cancer subtype classification and interpretation by attention mechanism and network propagation
3.1 Motivation
3.2 Methods
3.2.1 Encoding biological prior knowledge using Graph Convolutional Network
3.2.2 Re-producing comprehensive biological process by Multi-Attention based Ensemble
3.2.3 Linking pathways and transcription factors by network propagation with permutation-based normalization
3.3 Results
3.3.1 Pathway database and cancer data set
3.3.2 Evaluation of individual GCN pathway models
3.3.3 Performance of ensemble of GCN pathway models with multi-attention
3.3.4 Identification of TFs as regulator of pathways and GO term analysis of TF target genes
Chapter 4 Detecting sub-modules in biological networks with gene expression by statistical approach and graph convolutional network
4.1 Motivation
4.1.1 Pathway based analysis of transcriptome data
4.1.2 Challenges and Summary of Approach
4.2 Methods
4.2.1 Convert single KEGG pathway to directed graph
4.2.2 Calculate edge activity for each sample
4.2.3 Mining differentially activated subpath among classes
4.2.4 Prioritizing subpaths by the permutation test
4.2.5 Extension: graph convolutional network and class activation map
4.3 Results
4.3.1 Identifying 36 subtype specific subpaths in breast cancer
4.3.2 Subpath activities have a good discrimination power for cancer subtype classification
4.3.3 Subpath activities have a good prognostic power for survival outcomes
4.3.4 Comparison with an existing tool, PATHOME
4.3.5 Extension: detection of subnetwork on PPI network
Chapter 5 Conclusions
Abstract in Korean
Docto
Implementing graph neural networks with TensorFlow-Keras
Graph neural networks are a versatile machine learning architecture that
has received a lot of attention recently. In this technical report, we present an
implementation of convolution and pooling layers for TensorFlow-Keras models,
which allows a seamless and flexible integration into standard Keras layers to
set up graph models in a functional way. This implies the usage of mini-batches
as the first tensor dimension, which can be realized via the new RaggedTensor
class of TensorFlow, which is best suited for graphs. We developed the Keras Graph
Convolutional Neural Network Python package kgcnn based on TensorFlow-Keras,
which provides a set of Keras layers for graph networks that focus on a
transparent tensor structure passed between layers and an ease-of-use mindset.
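To make the idea of a mini-batch as the first (ragged) dimension concrete, the following sketch uses plain Python lists in place of TensorFlow's RaggedTensor. It makes no kgcnn or TensorFlow calls; the helper names are hypothetical, and the flat-values-plus-row-splits layout is only meant to mirror the row-partition idea behind ragged tensors.

```python
from typing import List, Tuple

NodeFeatures = List[float]


def to_flat_with_row_splits(
    batch: List[List[NodeFeatures]],
) -> Tuple[List[NodeFeatures], List[int]]:
    """Flatten a batch of graphs with different node counts.

    Returns all node feature rows concatenated, plus the offsets at which
    each graph starts and ends in that flat list.
    """
    values: List[NodeFeatures] = []
    row_splits = [0]
    for graph_nodes in batch:
        values.extend(graph_nodes)
        row_splits.append(len(values))
    return values, row_splits


def from_flat_with_row_splits(
    values: List[NodeFeatures], row_splits: List[int]
) -> List[List[NodeFeatures]]:
    """Rebuild the per-graph (ragged) view from the flat representation."""
    return [values[start:end] for start, end in zip(row_splits[:-1], row_splits[1:])]


# A mini-batch of three graphs with 3, 1 and 2 nodes; one feature per node.
batch = [
    [[0.1], [0.2], [0.3]],
    [[1.0]],
    [[2.0], [2.5]],
]
values, row_splits = to_flat_with_row_splits(batch)
```

Keeping the batch as the first dimension lets each graph keep its own node count without padding, while the flat view is convenient for operations that treat all nodes alike.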
Integrated Multi-omics Analysis Using Variational Autoencoders: Application to Pan-cancer Classification
Different aspects of a clinical sample can be revealed by multiple types of
omics data. Integrated analysis of multi-omics data provides a comprehensive
view of patients, which has the potential to facilitate more accurate clinical
decision making. However, omics data are normally high dimensional, with a large
number of molecular features and a relatively small number of available samples
with clinical labels. The "dimensionality curse" makes it challenging to train
a machine learning model using high dimensional omics data like DNA methylation
and gene expression profiles. Here we propose an end-to-end deep learning model
called OmiVAE to extract low dimensional features and classify samples from
multi-omics data. OmiVAE combines the basic structure of variational
autoencoders with a classification network to achieve task-oriented feature
extraction and multi-class classification. The training procedure of OmiVAE is
comprised of an unsupervised phase without the classifier and a supervised
phase with the classifier. During the unsupervised phase, a hierarchical
cluster structure of samples can be automatically formed without the need for
labels. In the supervised phase, OmiVAE achieved an average classification
accuracy of 97.49% after 10-fold cross-validation among 33 tumour types and
normal samples, which is better than other existing methods. The
OmiVAE model learned from multi-omics data outperformed the model that used only one
type of omics data, which indicates that the complementary information from
different omics data types provides useful insights for biomedical tasks like
cancer classification.
Comment: 7 pages, 4 figures
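The two training phases can be sketched with the standard ingredients of a variational autoencoder. This is a generic illustration, not OmiVAE's actual loss: the closed-form KL term assumes a diagonal Gaussian posterior and a standard normal prior, and the way the classification term is added (including the `weight` parameter) is an assumption.

```python
import math
import random
from typing import List, Optional


def reparameterize(mean: List[float], log_var: List[float],
                   rng: random.Random) -> List[float]:
    """Sample z = mean + std * eps with eps ~ N(0, 1), where std = exp(log_var / 2)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


def kl_to_standard_normal(mean: List[float], log_var: List[float]) -> float:
    """KL divergence between N(mean, diag(exp(log_var))) and N(0, I)."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mean, log_var))


def cross_entropy(probabilities: List[float], label: int) -> float:
    """Negative log-probability assigned to the true class."""
    return -math.log(probabilities[label])


def phase_loss(reconstruction_error: float, kl: float,
               classification_error: Optional[float] = None,
               weight: float = 1.0) -> float:
    """Unsupervised phase: reconstruction + KL.
    Supervised phase: the same plus a weighted classification term (assumed form)."""
    loss = reconstruction_error + kl
    if classification_error is not None:
        loss += weight * classification_error
    return loss


rng = random.Random(0)
z = reparameterize([0.0, 1.0], [0.0, -2.0], rng)
unsupervised = phase_loss(2.0, kl_to_standard_normal([1.0], [0.0]))
supervised = phase_loss(2.0, 0.5, cross_entropy([0.25, 0.75], 1), weight=1.0)
```

Passing `classification_error=None` reproduces the unsupervised phase, so the same function covers both stages of training in this sketch.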