203 research outputs found

    A critical evaluation of network and pathway based classifiers for outcome prediction in breast cancer

    Recently, several classifiers that combine primary tumor data, like gene expression data, and secondary data sources, such as protein-protein interaction networks, have been proposed for predicting outcome in breast cancer. In these approaches, new composite features are typically constructed by aggregating the expression levels of several genes. The secondary data sources are employed to guide this aggregation. Although many studies claim that these approaches improve classification performance over single gene classifiers, the gain in performance is difficult to assess. This stems mainly from the fact that different breast cancer data sets and validation procedures are employed to assess the performance. Here we address these issues by employing a large cohort of six breast cancer data sets as benchmark set and by performing an unbiased evaluation of the classification accuracies of the different approaches. Contrary to previous claims, we find that composite feature classifiers do not outperform simple single gene classifiers. We investigate the effect of (1) the number of selected features; (2) the specific gene set from which features are selected; (3) the size of the training set and (4) the heterogeneity of the data set on the performance of composite feature and single gene classifiers. Strikingly, we find that randomization of secondary data sources, which destroys all biological information in these sources, does not result in a deterioration in performance of composite feature classifiers. Finally, we show that when a proper correction for gene set size is performed, the stability of single gene sets is similar to the stability of composite feature sets. Based on these results there is currently no reason to prefer prognostic classifiers based on composite features over single gene classifiers for predicting outcome in breast cancer
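As a rough illustration of the composite-feature construction and the randomization control described in this abstract, the sketch below averages the expression of the genes in each gene set and then builds size-matched random gene sets. The function names, the use of a plain mean as the aggregation step, and the toy data are all assumptions made for illustration; the abstract does not say how the evaluated methods aggregate genes.

```python
import random
from statistics import mean


def composite_features(expression, gene_sets):
    """Average the expression of each gene set's members, per sample.

    expression: {sample: {gene: value}}
    gene_sets:  {set_name: [gene, ...]}
    Genes missing from a sample are skipped; a set with no measured
    genes is omitted for that sample.
    """
    features = {}
    for sample, profile in expression.items():
        row = {}
        for name, genes in gene_sets.items():
            values = [profile[g] for g in genes if g in profile]
            if values:
                row[name] = mean(values)
        features[sample] = row
    return features


def randomize_gene_sets(gene_sets, all_genes, seed=0):
    """Replace each gene set with randomly drawn genes, keeping only its size."""
    rng = random.Random(seed)
    pool = sorted(all_genes)
    return {name: rng.sample(pool, len(genes)) for name, genes in gene_sets.items()}


# Toy data (hypothetical gene names and values).
expression = {
    "patient_1": {"G1": 2.0, "G2": 4.0, "G3": 1.0, "G4": 3.0},
    "patient_2": {"G1": 0.0, "G2": 2.0, "G3": 5.0, "G4": 1.0},
}
gene_sets = {"module_A": ["G1", "G2"], "module_B": ["G3", "G4"]}

real = composite_features(expression, gene_sets)
shuffled_sets = randomize_gene_sets(gene_sets, expression["patient_1"].keys())
control = composite_features(expression, shuffled_sets)
```

Feeding `real` and `control` into the same classifier and validation procedure would mimic, in miniature, the comparison between biologically informed and randomized secondary data.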

    Molecular Inverse Comorbidity between Alzheimer's Disease and Lung Cancer: New Insights from Matrix Factorization

    Matrix factorization (MF) is an established paradigm for large-scale biological data analysis with tremendous potential in computational biology. Here, we challenge MF in depicting the molecular bases of epidemiologically described disease-disease (DD) relationships. As a use case, we focus on the inverse comorbidity association between Alzheimer's disease (AD) and lung cancer (LC), described as a lower than expected probability of developing LC in AD patients. To this day, the molecular mechanisms underlying DD relationships remain poorly explained and their better characterization might offer unprecedented clinical opportunities. To this end, we extend our previously designed MF-based framework for the molecular characterization of DD relationships. Considering AD-LC inverse comorbidity as a case study, we highlight multiple molecular mechanisms, among which we confirm the involvement of processes related to the immune system and mitochondrial metabolism. We then distinguish mechanisms specific to LC from those shared with other cancers through a pan-cancer analysis. Additionally, new candidate molecular players, such as estrogen receptor (ER), cadherin 1 (CDH1) and histone deacetylase (HDAC), are pinpointed as factors that might underlie the inverse relationship, opening the way to new investigations. Finally, some lung cancer subtype-specific factors are also detected, suggesting the existence of heterogeneity across patients in the context of inverse comorbidity

    NetCore: a network propagation approach using node coreness

    We present NetCore, a novel network propagation approach based on node coreness, for phenotype–genotype associations and module identification. NetCore addresses the node degree bias in PPI networks by using node coreness in the random walk with restart procedure, and achieves improved re-ranking of genes after propagation. Furthermore, NetCore implements a semi-supervised approach to identify phenotype-associated network modules, which anchors the identification of novel candidate genes at known genes associated with the phenotype. We evaluated NetCore on gene sets from 11 different GWAS traits and showed improved performance compared to the standard degree-based network propagation using cross-validation. Furthermore, we applied NetCore to identify disease genes and modules for Schizophrenia GWAS data and pan-cancer mutation data. We compared the novel approach to existing network propagation approaches and showed the benefits of using NetCore in comparison to those. We provide an easy-to-use implementation, together with a high confidence PPI network extracted from ConsensusPathDB, which can be applied to various types of genomics data in order to obtain a re-ranking of genes and functionally relevant network modules
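The idea of a random walk with restart that is steered by node coreness rather than node degree can be sketched as follows. This is not the NetCore implementation: the abstract does not give the normalization, so the sketch assumes that a node passes its score to each neighbour in proportion to that neighbour's core number. The function names, the restart probability of 0.5 and the toy graph are likewise illustrative assumptions.

```python
def core_numbers(adj):
    """Core number of every node, found by repeatedly peeling the node
    with the smallest remaining degree (simple quadratic-time version)."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    core, k = {}, 0
    while remaining:
        v = min(remaining, key=lambda n: (degree[n], n))
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return core


def transition_weights(adj, node_weight):
    """Node i sends its score to neighbour j in proportion to node_weight[j]
    (assumed normalization; isolated nodes send nothing)."""
    weights = {}
    for i, nbrs in adj.items():
        total = sum(node_weight[j] for j in nbrs)
        weights[i] = {j: node_weight[j] / total for j in nbrs} if total else {}
    return weights


def random_walk_with_restart(adj, seeds, node_weight, restart=0.5,
                             tol=1e-10, max_iter=1000):
    """Iterate p <- restart * p0 + (1 - restart) * W p until convergence."""
    trans = transition_weights(adj, node_weight)
    p0 = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in adj}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {v: restart * p0[v] for v in adj}
        for i, out in trans.items():
            for j, w in out.items():
                nxt[j] += (1.0 - restart) * p[i] * w
        delta = sum(abs(nxt[v] - p[v]) for v in adj)
        p = nxt
        if delta < tol:
            break
    return p


# Toy undirected graph: a triangle (a, b, c) with a pendant node d on c.
adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
core = core_numbers(adj)
core_scores = random_walk_with_restart(adj, {"a"}, core)
uniform_scores = random_walk_with_restart(adj, {"a"}, {v: 1 for v in adj})
```

In this toy graph the pendant node `d` has a lower core number than the triangle nodes, so it receives less score under the coreness weighting than under uniform weighting of neighbours.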

    From Correlation to Causality: Does Network Information improve Cancer Outcome Prediction?

    Motivation: Disease progression in cancer can vary substantially between patients. Yet, patients often receive the same treatment. Recently, there has been much work on predicting disease progression and patient outcome variables from gene expression in order to personalize treatment options. A widely used approach is high-throughput experiments that aim to find predictive signature genes that identify the clinical outcome of diseases. Microarray data analysis helps to reveal underlying biological mechanisms of tumor progression, metastasis, and drug resistance in cancer studies. Although the first diagnostic kits are on the market, open problems remain, such as the choice of random gene signatures or noisy expression data. Experimental or computational noise in the data and the limited tissue samples collected from patients might further reduce the predictive power and biological interpretability of such signature genes. Moreover, signature genes predicted by different studies generally show poor similarity, even for the same type of cancer. Integration of network information with gene expression data could provide more efficient signatures for outcome prediction in cancer studies. One approach to deal with these problems employs gene-gene relationships and ranks genes using the random surfer model of Google's PageRank algorithm. Unfortunately, the majority of published network-based approaches tested their methods on only a small number of datasets, which calls into question the general applicability of network-based methods for outcome prediction. Methods: In this thesis, I provide a comprehensive and systematic evaluation of a network-based outcome prediction approach, NetRank (a PageRank derivative), applied to several types of gene expression cancer data and four different types of networks. The algorithm identifies a signature gene set for a specific cancer type by incorporating gene network information with given expression data.
To assess the performance of NetRank, I created a benchmark dataset collection comprising 25 cancer outcome prediction datasets from literature and one in-house dataset. Results: NetRank performs significantly better than classical methods such as fold change or t-test, improving prediction performance by 7% on average. In addition, the approach comes close to the accuracy level of the authors' signatures while applying a relatively unbiased but fully automated process for biomarker discovery. Despite an order of magnitude difference in network size, a regulatory, a protein-protein interaction and two predicted networks perform equally well. Signatures as published by the authors and the signatures generated with classical methods do not overlap, not even for the same cancer type, whereas the network-based signatures strongly overlap. I analyze and discuss these overlapping genes in terms of the Hallmarks of cancer and in particular single out six transcription factors and seven proteins and discuss their specific role in cancer progression. Furthermore, several tests are conducted for the identification of a Universal Cancer Signature. No Universal Cancer Signature could be identified so far, but a cancer-specific combination of general master regulators with specific cancer genes could be discovered that achieves the best results for all cancer types. As NetRank offers great value for cancer outcome prediction, first steps for a secure usage of NetRank in a public cloud are described. Conclusion: Experimental evaluation of network-based methods on a gene expression benchmark dataset suggests that these methods are especially suited for outcome prediction as they overcome the problems of random gene signatures and noisy expression data.
Through the combination of network information with gene expression data, network-based methods identify highly similar signatures over all cancer types, in contrast to classical methods that fail to identify highly common gene sets across the same cancer types. In general, the integration of additional information in gene expression analysis allows the identification of more reliable, accurate and reproducible biomarkers and provides a deeper understanding of processes occurring in cancer development and progression.

Contents:
1 Definition of Open Problems
2 Introduction
2.1 Problems in cancer outcome prediction
2.2 Network-based cancer outcome prediction
2.3 Universal Cancer Signature
3 Methods
3.1 NetRank algorithm
3.2 Preprocessing and filtering of the microarray data
3.3 Accuracy
3.4 Signature similarity
3.5 Classical approaches
3.6 Random signatures
3.7 Networks
3.8 Direct neighbor method
3.9 Dataset extraction
4 Performance of NetRank
4.1 Benchmark dataset for evaluation
4.2 The influence of NetRank parameters
4.3 Evaluation of NetRank
4.4 General findings
4.5 Computational complexity of NetRank
4.6 Discussion
5 Universal Cancer Signature
5.1 Signature overlap – a sign for Universal Cancer Signature
5.2 NetRank genes are highly connected and confirmed in literature
5.3 Hallmarks of Cancer
5.4 Testing possible Universal Cancer Signatures
5.5 Conclusion
6 Cloud-based Biomarker Discovery
6.1 Introduction to secure Cloud computing
6.2 Cancer outcome prediction
6.3 Security analysis
6.4 Conclusion
7 Contributions and Conclusion

    Proteomic analyses reveal distinct chromatin-associated and soluble transcription factor complexes.

    The current knowledge on how transcription factors (TFs), the ultimate targets and executors of cellular signalling pathways, are regulated by protein-protein interactions remains limited. Here, we performed proteomics analyses of soluble and chromatin-associated complexes of 56 TFs, including the targets of many signalling pathways involved in development and cancer, and 37 members of the Forkhead box (FOX) TF family. Using tandem affinity purification followed by mass spectrometry (TAP/MS), we performed 214 purifications and identified 2,156 high-confidence protein-protein interactions. We found that most TFs form very distinct protein complexes on and off chromatin. Using this data set, we categorized the transcription-related or unrelated regulators for general or specific TFs. Our study offers a valuable resource of protein-protein interaction networks for a large number of TFs and underscores the general principle that TFs form distinct location-specific protein complexes that are associated with the different regulation and diverse functions of these TFs

    Falsifiable Network Models. A Network-based Approach to Predict Treatment Efficacy in Ulcerative Colitis

    This work is focused on understanding the treatment efficacy of patients with ulcerative colitis (UC) using a network-based approach. UC is one of two forms of inflammatory bowel disease (IBD) along with Crohn's disease. UC is a debilitating condition characterized by chronic inflammation and ulceration of the colon and rectum. UC symptoms occur gradually rather than abruptly, and the degree of symptoms differs across UC patients. Only around 20% of all UC cases can be explained by known genetic variations, implying a more ambiguous aetiology that is not yet fully understood but is thought to involve a complex interplay between genetic and environmental factors. The available therapy for UC substantially reduces symptoms and achieves long-term remission. However, about one-third of UC patients fail to respond to anti-TNFα therapy and consequently develop long-term side effects due to medication. Non-response to existing antibody-based therapies in subgroups of UC patients is a major challenge and incurs a healthcare burden. Therefore, disease markers that predict therapy response are needed to assist individualized therapy decisions. To date, no quantitative computational framework is available to predict treatment response in UC. We developed a quantitative framework that uses gene expression data and existing biological background information on signalling pathways to quantify network connectivity from receptors to transcription factors (TF) that are involved in UC pathogenesis. Variations in network connectivity in UC patients can be used to identify responders and non-responders to anti-TNFα and anti-Integrin treatment. Our findings allow us to summarize the effect of small gene expression changes on the overall connectivity of a signalling network and estimate the effect this will have on the individual patients' responses.
Estimating the network connectivity associated with varied drug responses may provide an understanding of individualized treatment outcomes. Our model could be used to generate testable hypotheses about how individual genes act together in networks to cause inflammation in UC as well as other immune-inflammatory diseases such as psoriasis, asthma, and rheumatoid arthritis

    Machine learning techniques for decoding and utilizing RNA sequencing data

    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Natural Sciences, Interdisciplinary Program in Bioinformatics, 2019. 8. Sun Kim.
In eukaryotic cells, mRNA molecules undergo several post-transcriptional modification steps, such as RNA editing and alternative splicing, before they are fully matured and translated into proteins. Thus, the transcriptome is a complex mixture of various intermediates that are processed in multiple steps. This complex regulatory structure makes it difficult to fully understand the landscape of the transcriptome. My doctoral study consists of three studies that enable RNA-seq to be decoded and utilized in terms of RNA editing, alternative splicing, and gene expression. RNA editing is a post-transcriptional RNA sequence modification performed by two catalytic enzymes, ADAR (A-to-I) and APOBEC (C-to-U). RNA editing is considered an important regulatory system that controls a variety of cellular functions such as protein activation, alternative splicing, and miRNA targeting. Therefore, detecting RNA editing events in RNA-seq data is important for understanding its biological functions.
However, a significant number of false positives are known to occur when detecting RNA editing in RNA-seq. Since it is not possible to experimentally validate all RNA editing residues extracted from RNA-seq, a computational model is needed to filter potential false-positive RNA editing calls. RDDpred, an RNA editing predictor based on machine learning techniques, was developed to filter out false-positive RNA editing calls in RNA-seq. It uses prior knowledge bases to collect training instances directly from the input data, and then trains random forest (RF) predictors that are specific to the input data. RDDpred was tested using two publicly available datasets of RNA editing studies and has shown good performance. Another complex problem in RNA-seq decoding is spliceomic intratumor heterogeneity (i.e., sITH). Intratumor heterogeneity (ITH) represents the diversity of cell populations that make up the cancer tissue. Recent studies have identified ITH at the transcriptome level and suggested that ITH at gene expression levels is useful for predicting prognosis. Measuring ITH levels at the spliceome level is a natural extension. There are serious technical challenges in measuring sITH from bulk tumor RNA-seq, such as complex splicing patterns, widespread intron retentions, and short sequencing read lengths. SpliceHetero, an information-theoretic method for measuring the sITH of a tumor, was developed to address these technical problems. SpliceHetero was extensively tested in experiments using synthetic data, xenograft tumor data and TCGA pan-cancer data and measured sITH successfully. Also, sITH was shown to be closely related to cancer progression and clonal heterogeneity, along with clinically significant features such as cancer stage, survival outcome, and PAM50 subtype. The last research topic is to develop a machine learning algorithm that defines patient subspaces specific to particular cancer phenotypes based on gene expression data.
Since RNA-seq data is high-dimensional, generally comprising 20,000 or more genes, it is not easy to apply a machine learning algorithm to it. A network that collects experimentally verified protein interactions is called a Protein Interaction Network (PIN). Tumor2Vec defines the patient subspace by applying a graph embedding technique to the PIN to define subnetwork communities that interact with each other. Tumor2Vec proposed a clinical model by defining a subspace for patients with different lymph node metastases in early oral cancer and found biologically significant features in the PIN subnetwork unit in the process.

Contents:
Chapter 1 Introduction
1.1 Biological background
1.2 Challenges in decoding and utilizing RNA-seq data
1.2.1 false-positives in RNA editing calls
1.2.2 Absence of a model for measuring spliceomic intratumor heterogeneity considering complex cancer spliceome
1.2.3 Lack of biological interpretation of dimension reduction techniques using gene expression
1.3 Machine learning techniques to solve difficulties in using RNA-seq
1.4 Outline of thesis
Chapter 2 RDDpred: A condition specific machine learning model for filtering false-positive RNA editing calls in RNAseq data
2.1 Related works
2.2 Motivation
2.3 A preliminary study
2.4 Methods
2.5 Results
2.5.1 Design of experiments for evaluation
2.5.2 Evaluation using data from Bahn et al.
2.5.3 Evaluation using data from Peng et al.
2.6 Discussion
2.7 Conclusion
Chapter 3 SpliceHetero: An information-theoretic approach for measuring spliceomic intratumor heterogeneity from bulk tumor RNA-seq data
3.1 Related works
3.2 Motivation
3.3 A preliminary study
3.4 Methods
3.5 Results & Discussion
3.5.1 Synthetic data
3.5.2 Xenograft tumor data
3.5.3 TCGA pan-cancer data
3.6 Conclusion
Chapter 4 Tumor2Vec: A supervised learning algorithm for extracting subnetwork representations of cancer RNAseq data using protein interaction networks
4.1 Related works
4.2 Motivation
4.3 Methods
4.4 Results & Discussion
4.4.1 Lymph node metastasis in early oral cancer
4.5 Conclusion
Chapter 5 Conclusion
Abstract (in Korean)
Docto