Response projected clustering for direct association with physiological and clinical response data
Abstract
Background: Microarray gene expression data are often analyzed together with corresponding physiological response and clinical metadata of biological subjects, e.g. patients' residual tumor sizes after chemotherapy or glucose levels at various stages of diabetic patients. Current clustering analysis cannot directly incorporate such quantitative metadata into the clustering heatmap of gene expression. It will be quite useful if these clinical response data can be effectively summarized in the high-dimensional clustering display so that important groups of genes can be intuitively discovered with different degrees of relevance to target disease phenotypes.
Results: We introduced a novel clustering analysis approach, response projected clustering (RPC), which uses a high-dimensional geometrical projection of response data to the gene expression space. The projected response vector, which becomes the origin in the projected space, is then clustered together with the projected gene vectors based on their different degrees of association with the response vector. A bootstrap-counting based RPC analysis is also performed to evaluate statistical tightness of identified gene clusters. Our RPC analysis was applied to the in vitro growth-inhibition and microarray profiling data on the NCI-60 cancer cell lines and the microarray gene expression study of macrophage differentiation in atherogenesis. These RPC applications enabled us to identify many known and novel gene factors and their potential pathway associations which are highly relevant to the drug's chemosensitivity activities and atherogenesis.
Conclusion: We have shown that RPC can effectively discover gene networks with different degrees of association with clinical metadata. Performed on each gene's response projected vector based on its degree of association with the response data, RPC effectively summarizes individual genes' association with metadata as well as their own expression patterns. Thus, RPC greatly enhances the utility of clustering analysis on investigating high-dimensional microarray gene expression data with quantitative metadata.
Cancer gene prioritization by integrative analysis of mRNA expression and DNA copy number data: a comparative review
A variety of genome-wide profiling techniques are available to probe
complementary aspects of genome structure and function. Integrative analysis of
heterogeneous data sources can reveal higher-level interactions that cannot be
detected based on individual observations. A standard integration task in
cancer studies is to identify altered genomic regions that induce changes in
the expression of the associated genes based on joint analysis of genome-wide
gene expression and copy number profiling measurements. In this review, we
provide a comparison among various modeling procedures for integrating
genome-wide profiling data of gene copy number and transcriptional alterations
and highlight common approaches to genomic data integration. A transparent
benchmarking procedure is introduced to quantitatively compare the cancer gene
prioritization performance of the alternative methods. The benchmarking
algorithms and data sets are available at http://intcomp.r-forge.r-project.org
Comment: PDF file including supplementary material. 9 pages. Preprin
Statistical Methods for Integrated Cancer Genomic Data Using a Joint Latent Variable Model
Inspired by The Cancer Genome Atlas (TCGA), we explore multimodal genomic datasets with integrative methods using a joint latent variable approach. We use iCluster+, an existing clustering method for integrative data, to identify potential subtypes within TCGA sarcoma and mesothelioma tumors, and across a large cohort of 33 different TCGA cancer datasets. For classification, motivated to improve the prediction of platinum resistance in high grade serous ovarian cancer (HGSOC) treatment, we propose a novel integrative method, iClassify, to perform classification using a joint latent variable model. iClassify provides effective data integration and classification while handling heterogeneous data types, and offers a natural framework to incorporate covariate risk factors and examine interactions between genomic drivers and covariate risk factors. Feature selection is performed through a thresholding parameter that combines both latent variable and feature coefficients. We demonstrate increased classification accuracy over methods that assume a homogeneous data type, such as linear discriminant analysis and penalized logistic regression, as well as improved feature selection. We apply iClassify to a TCGA cohort of HGSOC patients with three types of genomic data and platinum response data. This methodology has broad applications beyond predicting treatment outcomes and disease progression in cancer, including predicting prognosis and diagnosis in other diseases with major public health implications
Identification of activation of transcription factors from microarray data
Signaling pathways play a critical role in cell survival and development by regulation of transcription factor activity causing necessary gene products to be produced in response to different stimuli. Although the task of detecting activities of signaling pathways is extremely difficult, recent advances in microarray technology promise progress in the field. There are many clustering and pattern recognition algorithms that have been applied to analysis of microarray data. However, these methods lack an ability to address the biological nature of the data and force assignment of one gene to a single co-expression group, while ignoring the fact that many individual genes are regulated by different signaling pathways in response to different stimuli, and therefore the genes should be assigned to multiple groups of coexpression. Another issue in microarray analysis is a low signal-to-noise ratio provided by the technology, yet most of the clustering methods do not even take errors of the measurements into consideration.
Bayesian Decomposition is an algorithm that decomposes microarray data into a set of biologically meaningful expression patterns that could be linked to certain signaling pathways and groups of genes that contain these patterns, allowing assignment of one gene to multiple patterns of expression. To address the problem of low signal-to-noise we modified the Bayesian Decomposition algorithm to allow inclusion of prior gene coregulation information to improve statistical power. We also created the Automated Sequence Annotation Pipeline to provide microarray data mining processes with annotation information at all steps and particularly to deduce the coregulation information for a given set of genes from transcription factor database TRANSFAC.
We validated enhancements done to Bayesian Decomposition on simulated and real biological data and showed that using coregulation information can improve ability of the method to recover correct results. The designed data mining process that uses the Automated Sequence Annotation Pipeline and the modified Bayesian Decomposition was applied to determine transcription factor activities linked to patient outcome in gastrointestinal stromal tumor (GIST) patients undergoing treatment with imatinib mesylate (IM, Gleevec). The study demonstrates genes that can be potentially used as biomarkers to predict GIST patient response to Gleevec treatment and activity of transcription factors that can contribute to difference in the response.
Ph.D., Biomedical Science -- Drexel University, 200
Uncovering Intratumoral And Intertumoral Heterogeneity Among Single-Cell Cancer Specimens
While several tools have been developed to map axes of variation among individual cells, no analogous approaches exist for identifying axes of variation among multicellular biospecimens profiled at single-cell resolution. Developing such an approach is of great translational relevance and interest, as single-cell expression data are now often collected across numerous experimental conditions (e.g., representing different drug perturbation conditions, CRISPR knockdowns, or patients undergoing clinical trials) that need to be compared. In this work, "Phenotypic Earth Mover's Distance" (PhEMD) is presented as a solution to this problem. PhEMD is a general method for embedding a "manifold of manifolds," in which each datapoint in the higher-level manifold (of biospecimens) represents a collection of points that span a lower-level manifold (of cells).
PhEMD is applied to a newly-generated, 300-biospecimen mass cytometry drug screen experiment to map small-molecule inhibitors based on their differing effects on breast cancer cells undergoing epithelial-mesenchymal transition (EMT). These experiments highlight EGFR and MEK1/2 inhibitors as strongly halting EMT at an early stage and PI3K/mTOR/Akt inhibitors as enriching for a drug-resistant mesenchymal cell subtype characterized by high expression of phospho-S6. More generally, these experiments reveal that the final mapping of perturbation conditions has low intrinsic dimension and that the network of drugs demonstrates manifold structure, providing insight into how these single-cell experiments should be computationally modeled and visualized. In the presented drug-screen experiment, the full spectrum of perturbation effects could be learned by profiling just a small fraction (11%) of drugs. Moreover, PhEMD could be integrated with complementary datasets to infer the phenotypes of biospecimens not directly profiled with single-cell profiling. Together, these findings have major implications for conducting future drug-screen experiments, as they suggest that large-scale drug screens can be conducted by measuring only a small fraction of the drugs using the most expensive high-throughput single-cell technologies; the effects of other drugs may be inferred by mapping and extending the perturbation space.
PhEMD is also applied to patient tumor biopsies to assess intertumoral heterogeneity. Applied to a melanoma dataset and a clear-cell renal cell carcinoma dataset (ccRCC), PhEMD maps tumors similarly to how it maps perturbation conditions as above in order to learn key axes along which tumors vary with respect to their tumor-infiltrating immune cells. In both of these datasets, PhEMD highlights a subset of tumors demonstrating a marked enrichment of exhausted CD8+ T-cells. The wide variability in tumor-infiltrating immune cell abundance and particularly prominent exhausted CD8+ T-cell subpopulation highlights the importance of careful patient stratification when assessing clinical response to T cell-directed immunotherapies.
Altogether, this work highlights PhEMD's potential to facilitate drug discovery and patient stratification efforts by uncovering the network geometry of a large collection of single-cell biospecimens. Our varied experiments demonstrate that PhEMD is highly scalable, compatible with leading batch effect correction techniques, and generalizable to multiple experimental designs, with clear applicability to modern precision oncology efforts
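To make the idea of comparing whole biospecimens concrete, here is a minimal Python sketch of an Earth Mover's Distance between specimens, each summarized as a histogram of cell counts over cell subtypes. It is an illustration under stated assumptions, not the PhEMD implementation: the abstract does not specify the ground distance between cell subtypes, so this sketch assumes the subtypes are ordered bins along a single axis with unit spacing. The function names, the specimen names, and the counts are all hypothetical.

```python
from itertools import accumulate


def emd_1d(p, q):
    """Earth Mover's Distance between two histograms over the same ordered bins.

    Assumption (not from the source): bins lie on one axis with unit spacing,
    so the EMD reduces to the summed absolute difference of the two CDFs.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    total_p, total_q = sum(p), sum(q)
    if total_p <= 0 or total_q <= 0:
        raise ValueError("each histogram must contain at least one cell")
    # Normalise counts so each specimen is a distribution over cell subtypes.
    p = [x / total_p for x in p]
    q = [x / total_q for x in q]
    return sum(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))


def pairwise_emd(specimens):
    """Return {(name_a, name_b): distance} for every unordered pair of specimens."""
    names = sorted(specimens)
    return {
        (a, b): emd_1d(specimens[a], specimens[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


if __name__ == "__main__":
    # Hypothetical cell counts per subtype: [epithelial, intermediate, mesenchymal].
    specimens = {
        "control": [80, 15, 5],
        "drug_A": [5, 15, 80],
        "drug_B": [75, 20, 5],
    }
    for pair, distance in pairwise_emd(specimens).items():
        print(pair, round(distance, 3))
```

The resulting pairwise distance table is the kind of input that a downstream embedding step would consume; which embedding to use is outside the scope of this sketch.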
A study on the quantification of pathway activity using RNA-seq data
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Natural Sciences, Interdisciplinary Program in Bioinformatics, August 2019. Kim Sun.
Measuring the dynamics of RNA transcripts using RNA-seq data has become routine in bioinformatics analyses. However, RNA-seq produces high-dimensional transcriptome data on more than 20,000 genes in humans. This makes the interpretation of the data extremely difficult given a relatively small set of samples. Therefore, it is desirable to use well-summarized and widely-used information such as biological pathways for better biological comprehension. However, summarizing transcriptome data in terms of biological pathways is a very challenging task for several reasons. First, there is a huge information loss when transforming transcriptome data to pathway space. For example, in humans, only one third of the entire set of genes being analyzed are present in KEGG pathways. Second, each pathway consists of many genes; thus, measuring pathway activity requires a strategy to summarize expression profiles of component genes into a single value, while considering relationships among the constituent genes.
My doctoral study aimed to develop a new method for pathway activity measurement, and to perform extensive evaluation experiments on existing pathway measurement tools in terms of multiple evaluation criteria. In addition, a cloud-based system was constructed to deploy such tools, making it easy for users to analyze their own data.
The first study developed a new method to summarize transcriptome data in terms of pathways by using explicit transcript quantity information and considering relationships among genes in terms of their interactions. In this study, I propose a novel concept of decomposing biological pathways into subsystems by utilizing a protein interaction network, pathway information, and RNA-seq data. A subsystem activation score (SAS) was designed to measure the degree of activation for each subsystem and each patient. This method revealed distinctive genome-wide activation patterns or landscapes of subsystems that are differentially activated among samples as well as among breast cancer subtypes. Next, we used SAS information for prognostic modeling by classification and regression tree (CART) analysis. Eleven subgroups of patients, defined by the 10 most significant subsystems, were identified with maximal discrepancy in survival outcome. Our model not only defined patient subgroups with similar survival outcomes, but also provided patient-specific decision paths determined by SAS status, suggesting functionally informative gene sets in breast cancer.
The second study aimed to systematically compare and evaluate thirteen different pathway activity inference tools based on five comparison criteria using a pan-cancer data set. Although many pathway activity tools are available, there is no comparative study on how effective these tools are in producing useful information at the cohort level, enabling comparison of many samples. This study has two major contributions. First, this study provides a comprehensive survey on computational techniques used by existing pathway activity inference tools. Existing tools use different strategies and assume different requirements on data: input transformation, use of labels, necessity of cohort-level input data, use of gene relations and scoring metrics. Second, extensive evaluations were conducted using five comparison criteria concerning the performance of these tools. Starting from measuring how well a tool maintains the characteristics of an original gene expression profile, robustness was also investigated by introducing noise into gene expression data. Classification tasks on three clinical variables were performed to evaluate the utility of tools.
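As an illustration of the first evaluation criterion (how well a tool maintains the characteristics of the original gene expression profile), the following Python sketch compares pairwise sample distances in gene space with those in pathway space. Everything specific here is an assumption for illustration: the abstract does not state the exact metric, so the sketch uses the Pearson correlation between the two sets of Euclidean distances, and it stands in a simple mean-of-member-genes score for a pathway activity tool, which is not claimed to be one of the thirteen tools evaluated. All function names and data are hypothetical.

```python
import math
from itertools import combinations


def pathway_activity(sample, pathways):
    """Toy pathway score (assumption): mean expression of each pathway's genes."""
    return [sum(sample[g] for g in genes) / len(genes) for genes in pathways.values()]


def euclidean(u, v):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def distance_preservation(samples, pathways):
    """Correlate sample-to-sample distances in gene space and in pathway space."""
    genes = sorted(next(iter(samples.values())))
    gene_dists, pathway_dists = [], []
    for a, b in combinations(sorted(samples), 2):
        gene_dists.append(
            euclidean([samples[a][g] for g in genes], [samples[b][g] for g in genes])
        )
        pathway_dists.append(
            euclidean(pathway_activity(samples[a], pathways),
                      pathway_activity(samples[b], pathways))
        )
    return pearson(gene_dists, pathway_dists)


if __name__ == "__main__":
    # Hypothetical expression values for three samples and four genes.
    samples = {
        "s1": {"g1": 1.0, "g2": 2.0, "g3": 0.5, "g4": 3.0},
        "s2": {"g1": 1.5, "g2": 2.5, "g3": 4.0, "g4": 0.5},
        "s3": {"g1": 6.0, "g2": 5.0, "g3": 0.0, "g4": 3.5},
    }
    pathways = {"pathway_A": ["g1", "g2"], "pathway_B": ["g3", "g4"]}
    print(distance_preservation(samples, pathways))
```

A value near 1 would suggest that the pathway-level summary keeps samples roughly as far apart as they were in gene space. The same function could be rerun after adding noise to the expression values as a rough robustness check.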
The third study built a cloud-based system where a user provides transcriptome data and measures pathway activities using the tools that were used for the comparative study. When a user uploads input data to the system and selects which analysis tools are to be run, the system automatically generates pathway activity values for each tool as well as a summary of performance comparison for the selected tools. Users can also investigate which pathways are significant in terms of the given sample information and visually inspect which genes within a pathway change significantly, via the KEGG REST API.
In conclusion, in my thesis, I sought to develop an analysis method regarding biological pathways using high-throughput gene expression data, to compare different types of tools with comprehensive criteria, and to arrange the tools in a cloud-based system that is easily accessible. As pathways aggregate various molecular events among genes into a single entity, the set of suggested approaches will aid interpretation of high-throughput data as well as facilitate integration of diverse data layers such as miRNA or DNA methylation profiles.
Chapter 1 Introduction
1.1 Biological background
1.1.1 Biological pathways
1.1.2 Gene expression
1.1.3 Pathway-based analysis
1.1.4 Pathway activity measurement
1.2 Challenges in pathway activity measurement
1.2.1 Calculating effective pathway activity values from RNA-seq data
1.2.2 Lack of comparative criteria to evaluate pathway activity tools
1.2.3 Absence of a user-friendly environment of pathway activity inference tools
1.3 Outline of the thesis
Chapter 2 Measuring pathway activity from RNA-seq data to identify breast cancer subsystems using protein-protein interaction network
2.1 Related works
2.2 Motivation
2.3 Methods
2.3.1 Breast cancer subsystems
2.3.2 Subsystem Activation Score
2.3.3 Prognostic modeling
2.3.4 Hierarchical clustering of patients and subsystems
2.3.5 Tools used in this study
2.4 Results
2.4.1 Pathways were decomposed into coherent functional units - subsystems
2.4.2 Landscape of subsystems reflect the breast cancer biology
2.4.3 SAS revealed patient clusters associated with PAM50 subtypes
2.4.4 Prognostic modeling by subsystems showed 11 patient subgroups with distinct survival outcome
2.4.5 Relapse rate and CNVs were enriched to worse prognostic subgroups
2.5 Discussion
Chapter 3 Comprehensive evaluation of pathway activity measurement tools on pan-cancer data
3.1 Related works
3.2 Motivation
3.3 Materials and methods
3.3.1 Pathway activity inference Tools
3.3.2 Data sets
3.3.3 Pathway database
3.3.4 Notations
3.4 Comparative approach
3.4.1 Radar chart criteria
3.4.2 Similarity among the tools
3.5 Results
3.5.1 Distance preservation
3.5.2 Robustness against noise
3.5.3 Classification: Tumor vs Normal
3.5.4 Classification: survival information
3.5.5 Classification: cancer subtypes
3.5.6 Similarity among the tools
3.6 Discussion
Chapter 4 A cloud-based system of pathway activity inference tools using high-throughput gene expression data
4.1 Related works
4.2 Motivation
4.3 Implementation
4.4 Results
4.4.1 Calculating pathway activity values
4.4.2 Identification of significant pathways
4.4.3 Visualization in KEGG pathways
4.4.4 Comparison of the tools
4.5 Discussion
Chapter 5 Conclusion
Abstract (in Korean)
Applying Machine Learning Algorithms for the Analysis of Biological Sequences and Medical Records
Modern sequencing technology has revolutionized genomic research and triggered explosive growth in DNA, RNA, and protein sequence data. Inferring structure and function from biological sequences is a fundamentally important task in genomics and proteomics. With the development of statistical and machine learning methods, an integrated and user-friendly tool containing state-of-the-art data mining methods is needed. Here, we propose SeqFea-Learn, a comprehensive Python pipeline that integrates multiple steps for sequence analysis: feature extraction, dimensionality reduction, feature selection, and construction of predictive models based on machine learning and deep learning approaches. We used enhancer, RNA N6-methyladenosine site, and protein-protein interaction datasets to validate the tool. The results show that the tool can effectively perform biological sequence analysis and classification tasks. This dissertation also applies machine learning algorithms to electronic medical record (EMR) data. Chronic kidney disease (CKD) is prevalent across the world and is well defined by the estimated glomerular filtration rate (eGFR). The progression of kidney disease can be predicted if future eGFR can be accurately estimated using predictive analytics. Thus, I present a prediction model of eGFR that was built using Random Forest regression. The dataset includes demographic, clinical and laboratory information from a regional primary health care clinic. The final model included eGFR, age, gender, body mass index (BMI), obesity, hypertension, and diabetes, and achieved a mean coefficient of determination of 0.95. The estimated eGFRs were used to classify patients into CKD stages with high macro-averaged and micro-averaged metrics
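The last step described above, turning predicted eGFR values into CKD stages and scoring them with macro- and micro-averaged metrics, can be sketched in a few lines of Python. This is an illustration under assumptions rather than the dissertation's code: the abstract does not say which staging scheme was used, so the sketch assumes the widely used KDIGO GFR categories (G1 to G5), and it shows recall as one example of an averaged metric. The function names and the eGFR values are hypothetical.

```python
def ckd_stage(egfr):
    """Map an eGFR value (mL/min/1.73 m^2) to a KDIGO GFR category.

    Assumption: KDIGO cut-offs; the source does not name its staging scheme.
    """
    if egfr < 0:
        raise ValueError("eGFR cannot be negative")
    if egfr >= 90:
        return "G1"
    if egfr >= 60:
        return "G2"
    if egfr >= 45:
        return "G3a"
    if egfr >= 30:
        return "G3b"
    if egfr >= 15:
        return "G4"
    return "G5"


def macro_micro_recall(true_stages, predicted_stages):
    """Return (macro-averaged recall, micro-averaged recall) over the true labels."""
    pairs = list(zip(true_stages, predicted_stages))
    per_class = []
    for label in sorted(set(true_stages)):
        total = sum(1 for t, _ in pairs if t == label)
        hits = sum(1 for t, p in pairs if t == label and p == label)
        per_class.append(hits / total)
    macro = sum(per_class) / len(per_class)  # every stage weighted equally
    micro = sum(1 for t, p in pairs if t == p) / len(pairs)  # every patient weighted equally
    return macro, micro


if __name__ == "__main__":
    # Hypothetical observed and model-predicted eGFR values for five patients.
    observed = [95, 92, 91, 70, 40]
    predicted = [93, 96, 88, 65, 47]
    true_stages = [ckd_stage(v) for v in observed]
    predicted_stages = [ckd_stage(v) for v in predicted]
    print(true_stages, predicted_stages)
    print(macro_micro_recall(true_stages, predicted_stages))
```

The two averages can diverge when stages are imbalanced: macro-averaging gives rare stages the same weight as common ones, while micro-averaging is dominated by the most frequent stages.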