
    Response projected clustering for direct association with physiological and clinical response data

    Background: Microarray gene expression data are often analyzed together with corresponding physiological response and clinical metadata of biological subjects, e.g. patients' residual tumor sizes after chemotherapy or glucose levels at various stages of diabetic patients. Current clustering analysis cannot directly incorporate such quantitative metadata into the clustering heatmap of gene expression. It will be quite useful if these clinical response data can be effectively summarized in the high-dimensional clustering display so that important groups of genes can be intuitively discovered with different degrees of relevance to target disease phenotypes.
    Results: We introduced a novel clustering analysis approach, response projected clustering (RPC), which uses a high-dimensional geometrical projection of response data to the gene expression space. The projected response vector, which becomes the origin in the projected space, is then clustered together with the projected gene vectors based on their different degrees of association with the response vector. A bootstrap-counting based RPC analysis is also performed to evaluate statistical tightness of identified gene clusters. Our RPC analysis was applied to the in vitro growth-inhibition and microarray profiling data on the NCI-60 cancer cell lines and the microarray gene expression study of macrophage differentiation in atherogenesis. These RPC applications enabled us to identify many known and novel gene factors and their potential pathway associations which are highly relevant to the drug's chemosensitivity activities and atherogenesis.
    Conclusion: We have shown that RPC can effectively discover gene networks with different degrees of association with clinical metadata. Performed on each gene's response projected vector based on its degree of association with the response data, RPC effectively summarizes individual genes' association with metadata as well as their own expression patterns. Thus, RPC greatly enhances the utility of clustering analysis on investigating high-dimensional microarray gene expression data with quantitative metadata.

    Cancer gene prioritization by integrative analysis of mRNA expression and DNA copy number data: a comparative review

    A variety of genome-wide profiling techniques are available to probe complementary aspects of genome structure and function. Integrative analysis of heterogeneous data sources can reveal higher-level interactions that cannot be detected based on individual observations. A standard integration task in cancer studies is to identify altered genomic regions that induce changes in the expression of the associated genes based on joint analysis of genome-wide gene expression and copy number profiling measurements. In this review, we provide a comparison among various modeling procedures for integrating genome-wide profiling data of gene copy number and transcriptional alterations and highlight common approaches to genomic data integration. A transparent benchmarking procedure is introduced to quantitatively compare the cancer gene prioritization performance of the alternative methods. The benchmarking algorithms and data sets are available at http://intcomp.r-forge.r-project.org. Comment: PDF file including supplementary material. 9 pages. Preprin
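The integration task described in this abstract (finding genes whose expression tracks their copy number) can be illustrated with a very simple baseline. The sketch below ranks genes by the Pearson correlation between copy number and expression across matched samples. It is not one of the methods compared in the review; the function names, gene names, and toy values are hypothetical and chosen only for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no association signal
    return cov / sqrt(var_x * var_y)


def prioritize_genes(copy_number, expression):
    """Rank genes by copy-number/expression correlation, highest first."""
    scores = {
        gene: pearson(copy_number[gene], expression[gene])
        for gene in copy_number
        if gene in expression
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: one value per gene for the same five samples.
copy_number = {
    "GENE_A": [2, 2, 3, 4, 4],
    "GENE_B": [2, 3, 2, 3, 2],
    "GENE_C": [2, 2, 2, 2, 2],
}
expression = {
    "GENE_A": [1.0, 1.1, 1.9, 3.2, 3.0],
    "GENE_B": [2.0, 1.0, 2.1, 0.9, 2.2],
    "GENE_C": [1.5, 1.4, 1.6, 1.5, 1.5],
}

ranking = prioritize_genes(copy_number, expression)
for gene, score in ranking:
    print(f"{gene}: {score:+.3f}")
```

In this toy example GENE_A ranks first because its expression rises with its copy number, while GENE_C scores zero because its copy number never changes. A realistic analysis would also need segmentation of copy number data and multiple-testing control, which this sketch omits.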

    Identification of activation of transcription factors from microarray data

    Signaling pathways play a critical role in cell survival and development by regulating transcription factor activity, causing necessary gene products to be produced in response to different stimuli. Although the task of detecting activities of signaling pathways is extremely difficult, recent advances in microarray technology promise progress in the field. Many clustering and pattern recognition algorithms have been applied to the analysis of microarray data. However, these methods do not address the biological nature of the data: they force assignment of each gene to a single co-expression group, ignoring the fact that many individual genes are regulated by different signaling pathways in response to different stimuli and should therefore be assigned to multiple co-expression groups. Another issue in microarray analysis is the low signal-to-noise ratio of the technology, yet most clustering methods do not take measurement errors into consideration. Bayesian Decomposition is an algorithm that decomposes microarray data into a set of biologically meaningful expression patterns that could be linked to certain signaling pathways, and into groups of genes that contain these patterns, allowing assignment of one gene to multiple patterns of expression. To address the problem of low signal-to-noise we modified the Bayesian Decomposition algorithm to allow inclusion of prior gene coregulation information to improve statistical power. We also created the Automated Sequence Annotation Pipeline to provide microarray data mining processes with annotation information at all steps and, in particular, to deduce the coregulation information for a given set of genes from the transcription factor database TRANSFAC. We validated the enhancements made to Bayesian Decomposition on simulated and real biological data and showed that using coregulation information can improve the ability of the method to recover correct results.
The designed data mining process, which uses the Automated Sequence Annotation Pipeline and the modified Bayesian Decomposition, was applied to determine transcription factor activities linked to patient outcome in gastrointestinal stromal tumor (GIST) patients undergoing treatment with imatinib mesylate (IM, Gleevec). The study identifies genes that could potentially be used as biomarkers to predict GIST patient response to Gleevec treatment, and transcription factor activities that may contribute to differences in response. Ph.D., Biomedical Science -- Drexel University, 200

    Uncovering Intratumoral And Intertumoral Heterogeneity Among Single-Cell Cancer Specimens

    While several tools have been developed to map axes of variation among individual cells, no analogous approaches exist for identifying axes of variation among multicellular biospecimens profiled at single-cell resolution. Developing such an approach is of great translational relevance and interest, as single-cell expression data are now often collected across numerous experimental conditions (e.g., representing different drug perturbation conditions, CRISPR knockdowns, or patients undergoing clinical trials) that need to be compared. In this work, “Phenotypic Earth Mover's Distance” (PhEMD) is presented as a solution to this problem. PhEMD is a general method for embedding a “manifold of manifolds,” in which each datapoint in the higher-level manifold (of biospecimens) represents a collection of points that span a lower-level manifold (of cells). PhEMD is applied to a newly generated, 300-biospecimen mass cytometry drug screen experiment to map small-molecule inhibitors based on their differing effects on breast cancer cells undergoing epithelial–mesenchymal transition (EMT). These experiments highlight EGFR and MEK1/2 inhibitors as strongly halting EMT at an early stage and PI3K/mTOR/Akt inhibitors as enriching for a drug-resistant mesenchymal cell subtype characterized by high expression of phospho-S6. More generally, these experiments reveal that the final mapping of perturbation conditions has low intrinsic dimension and that the network of drugs demonstrates manifold structure, providing insight into how these single-cell experiments should be computationally modeled and visualized. In the presented drug-screen experiment, the full spectrum of perturbation effects could be learned by profiling just a small fraction (11%) of drugs. Moreover, PhEMD could be integrated with complementary datasets to infer the phenotypes of biospecimens not directly profiled with single-cell profiling.
Together, these findings have major implications for conducting future drug-screen experiments, as they suggest that large-scale drug screens can be conducted by measuring only a small fraction of the drugs using the most expensive high-throughput single-cell technologies; the effects of other drugs may be inferred by mapping and extending the perturbation space. PhEMD is also applied to patient tumor biopsies to assess intertumoral heterogeneity. Applied to a melanoma dataset and a clear-cell renal cell carcinoma (ccRCC) dataset, PhEMD maps tumors in the same way it maps perturbation conditions, in order to learn key axes along which tumors vary with respect to their tumor-infiltrating immune cells. In both of these datasets, PhEMD highlights a subset of tumors demonstrating a marked enrichment of exhausted CD8+ T-cells. The wide variability in tumor-infiltrating immune cell abundance, and the particularly prominent exhausted CD8+ T-cell subpopulation, highlight the importance of careful patient stratification when assessing clinical response to T cell-directed immunotherapies. Altogether, this work highlights PhEMD's potential to facilitate drug discovery and patient stratification efforts by uncovering the network geometry of a large collection of single-cell biospecimens. Our varied experiments demonstrate that PhEMD is highly scalable, compatible with leading batch effect correction techniques, and generalizable to multiple experimental designs, with clear applicability to modern precision oncology efforts
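To make the "Earth Mover's Distance between biospecimens" idea concrete, the sketch below compares samples that have each been summarized as a histogram of cell-subtype proportions. It is a deliberately simplified illustration and not the PhEMD implementation: it assumes the subtypes lie in a fixed order along a single axis with unit spacing between neighbors, which is an assumption made here so that the distance has a closed form. The sample names and proportions are hypothetical.

```python
def emd_1d(p, q):
    """Earth mover's distance between two histograms over the same ordered bins.

    Assumes adjacent bins are one unit apart and that both histograms
    carry the same total mass (for example, proportions summing to 1).
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    total_work = 0.0
    carried = 0.0
    for mass_p, mass_q in zip(p, q):
        carried += mass_p - mass_q  # surplus mass that must move past this bin
        total_work += abs(carried)
    return total_work


# Hypothetical cell-subtype proportions, ordered along one assumed axis
# (for example epithelial -> intermediate -> mesenchymal).
samples = {
    "control": [0.7, 0.2, 0.1],
    "drug_a": [0.6, 0.3, 0.1],
    "drug_b": [0.1, 0.2, 0.7],
}

# Pairwise sample-to-sample distances; an embedding method could take
# such a matrix as input.
names = sorted(samples)
distances = {
    (a, b): emd_1d(samples[a], samples[b]) for a in names for b in names
}

for (a, b), d in distances.items():
    if a < b:
        print(f"{a} vs {b}: {d:.2f}")
```

Under these assumptions, drug_a stays close to the control while drug_b is far from it, because most of its cells have shifted to the last subtype. A tree-structured or higher-dimensional arrangement of subtypes would need a general optimal-transport solver in place of this cumulative-sum shortcut.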

    A Study on the Quantification of Pathway Activity Using RNA-seq Data

    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Natural Sciences, Interdisciplinary Program in Bioinformatics, 2019. 8. 김선 (Kim Sun). Measuring the dynamics of RNA transcripts using RNA-seq data has become routine in bioinformatics analyses. However, RNA-seq produces high-dimensional transcriptome data on more than 20,000 genes in humans. This makes the interpretation of the data extremely difficult given a relatively small set of samples. Therefore, it is desirable to use well-summarized and widely used information such as biological pathways for better biological comprehension. However, summarizing transcriptome data in terms of biological pathways is a very challenging task for several reasons. First, there is a huge information loss when transforming transcriptome data to pathway space. For example, in humans, only one third of the entire set of genes being analyzed are present in KEGG pathways.
Second, each pathway consists of many genes; thus, measuring pathway activity requires a strategy to summarize the expression profiles of component genes into a single value while considering the relationships among the constituent genes. My doctoral study aimed to develop a new method for pathway activity measurement and to perform extensive evaluation experiments on existing pathway measurement tools in terms of multiple evaluation criteria. In addition, a cloud-based system was constructed to deploy such tools so that users can easily analyze their own data. The first study develops a new method to summarize transcriptome data in terms of pathways by using explicit transcript quantity information and considering the relationships among genes in terms of their interactions. In this study, I propose a novel concept of decomposing biological pathways into subsystems by utilizing a protein interaction network, pathway information, and RNA-seq data. A subsystem activation score (SAS) was designed to measure the degree of activation for each subsystem and each patient. This method revealed distinctive genome-wide activation patterns or landscapes of subsystems that are differentially activated among samples as well as among breast cancer subtypes. Next, we used SAS information for prognostic modeling by classification and regression tree (CART) analysis. Eleven subgroups of patients, defined by the 10 most significant subsystems, were identified with maximal discrepancy in survival outcome. Our model not only defined patient subgroups with similar survival outcomes, but also provided patient-specific decision paths determined by SAS status, suggesting functionally informative gene sets in breast cancer. The second study aimed to systematically compare and evaluate thirteen different pathway activity inference tools based on five comparison criteria using a pan-cancer data set.
Although many pathway activity tools are available, no comparative study has examined how effectively these tools produce useful information at the cohort level, where many samples are compared. This study makes two major contributions. First, it provides a comprehensive survey of the computational techniques used by existing pathway activity inference tools. Existing tools use different strategies and place different requirements on data: input transformation, use of labels, necessity of cohort-level input data, use of gene relations, and scoring metrics. Second, extensive evaluations were conducted using five comparison criteria concerning the performance of these tools. The evaluations measured how well a tool preserves the characteristics of the original gene expression profile and investigated robustness by introducing noise into the gene expression data. Classification tasks on three clinical variables (tumor vs. normal, survival, and cancer subtype) were performed to evaluate the utility of the tools. The third study builds a cloud-based system in which a user provides transcriptome data and measures pathway activities using the tools from the comparative study. When a user uploads input data to the system and selects which analysis tools to run, the system automatically generates pathway activity values for each tool as well as a summary of the performance comparison for the selected tools. Users can also investigate which pathways are significant in terms of the given sample information and visually inspect genes within a pathway through the KEGG REST API. In conclusion, in my thesis, I sought to develop an analysis method for biological pathways using high-throughput gene expression data, to compare different types of tools with comprehensive criteria, and to arrange the tools in a cloud-based system that is easily accessible.
As pathways aggregate various molecular events among genes into a single entity, the set of suggested approaches will aid interpretation of high-throughput data and facilitate integration of diverse data layers such as miRNA or DNA methylation profiles.
    Chapter 1 Introduction
        1.1 Biological background
            1.1.1 Biological pathways
            1.1.2 Gene expression
            1.1.3 Pathway-based analysis
            1.1.4 Pathway activity measurement
        1.2 Challenges in pathway activity measurement
            1.2.1 Calculating effective pathway activity values from RNA-seq data
            1.2.2 Lack of comparative criteria to evaluate pathway activity tools
            1.2.3 Absence of a user-friendly environment of pathway activity inference tools
        1.3 Outline of the thesis
    Chapter 2 Measuring pathway activity from RNA-seq data to identify breast cancer subsystems using protein-protein interaction network
        2.1 Related works
        2.2 Motivation
        2.3 Methods
            2.3.1 Breast cancer subsystems
            2.3.2 Subsystem Activation Score
            2.3.3 Prognostic modeling
            2.3.4 Hierarchical clustering of patients and subsystems
            2.3.5 Tools used in this study
        2.4 Results
            2.4.1 Pathways were decomposed into coherent functional units - subsystems
            2.4.2 Landscape of subsystems reflect the breast cancer biology
            2.4.3 SAS revealed patient clusters associated with PAM50 subtypes
            2.4.4 Prognostic modeling by subsystems showed 11 patient subgroups with distinct survival outcome
            2.4.5 Relapse rate and CNVs were enriched to worse prognostic subgroups
        2.5 Discussion
    Chapter 3 Comprehensive evaluation of pathway activity measurement tools on pan-cancer data
        3.1 Related works
        3.2 Motivation
        3.3 Materials and methods
            3.3.1 Pathway activity inference Tools
            3.3.2 Data sets
            3.3.3 Pathway database
            3.3.4 Notations
        3.4 Comparative approach
            3.4.1 Radar chart criteria
            3.4.2 Similarity among the tools
        3.5 Results
            3.5.1 Distance preservation
            3.5.2 Robustness against noise
            3.5.3 Classification: Tumor vs Normal
            3.5.4 Classification: survival information
            3.5.5 Classification: cancer subtypes
            3.5.6 Similarity among the tools
        3.6 Discussion
    Chapter 4 A cloud-based system of pathway activity inference tools using high-throughput gene expression data
        4.1 Related works
        4.2 Motivation
        4.3 Implementation
        4.4 Results
            4.4.1 Calculating pathway activity values
            4.4.2 Identification of significant pathways
            4.4.3 Visualization in KEGG pathways
            4.4.4 Comparison of the tools
        4.5 Discussion
    Chapter 5 Conclusion
    Abstract (in Korean)
    Docto
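The abstract above notes that measuring pathway activity means summarizing the expression of a pathway's member genes into one value per sample. The sketch below shows one generic baseline for that step: standardize each gene across samples, then average the z-scores of the pathway's genes. This is not the thesis's Subsystem Activation Score and ignores gene-gene interactions entirely; the gene names, the pathway membership list, and the expression values are hypothetical.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize one gene's expression values across samples."""
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return [0.0] * len(values)  # a flat gene contributes nothing
    return [(v - mu) / sd for v in values]


def pathway_activity(expression, pathway_genes):
    """Per-sample mean z-score over the pathway genes found in the table."""
    members = [gene for gene in pathway_genes if gene in expression]
    if not members:
        raise ValueError("no pathway gene is present in the expression table")
    standardized = [zscores(expression[gene]) for gene in members]
    n_samples = len(standardized[0])
    return [mean(row[i] for row in standardized) for i in range(n_samples)]


# Hypothetical expression table: gene -> values for three samples.
expression = {
    "G1": [1.0, 2.0, 3.0],
    "G2": [2.0, 4.0, 6.0],
    "G3": [5.0, 5.0, 5.0],
}

# Hypothetical pathway; G_MISSING is absent from the table and is skipped,
# which mirrors the information loss the abstract describes.
activity = pathway_activity(expression, ["G1", "G2", "G_MISSING"])
print([round(a, 3) for a in activity])
```

Averaging z-scores treats every member gene as equally important and equally signed, which is the main limitation that interaction-aware scores try to address.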

    Applying Machine Learning Algorithms for the Analysis of Biological Sequences and Medical Records

    Modern sequencing technology has revolutionized genomic research and triggered explosive growth in the number of DNA, RNA, and protein sequences. Inferring structure and function from biological sequences is a fundamentally important task in genomics and proteomics. With the development of statistical and machine learning methods, an integrated and user-friendly tool containing state-of-the-art data mining methods is needed. Here, we propose SeqFea-Learn, a comprehensive Python pipeline that integrates multiple steps for sequence analysis: feature extraction, dimensionality reduction, feature selection, and construction of predictive models based on machine learning and deep learning approaches. We used enhancer, RNA N6-methyladenosine site, and protein-protein interaction datasets to validate the tool. The results show that the tool can effectively perform biological sequence analysis and classification tasks. This dissertation also applies machine learning algorithms to the analysis of electronic medical record (EMR) data. Chronic kidney disease (CKD) is prevalent across the world and is well defined by the estimated glomerular filtration rate (eGFR). The progression of kidney disease can be predicted if future eGFR can be accurately estimated using predictive analytics. Thus, I present a prediction model of eGFR that was built using Random Forest regression. The dataset includes demographic, clinical, and laboratory information from a regional primary health care clinic. The final model included eGFR, age, gender, body mass index (BMI), obesity, hypertension, and diabetes, and achieved a mean coefficient of determination of 0.95. The estimated eGFRs were used to classify patients into CKD stages with high macro-averaged and micro-averaged metrics
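The last step of this abstract, turning estimated eGFR values into CKD stages and scoring the result with macro- and micro-averaged metrics, can be sketched as follows. The thresholds used here are the KDIGO GFR categories (G1 to G5, in mL/min/1.73 m²); the abstract does not state which staging scheme the dissertation used, so this choice is an assumption. Recall is used as the example metric, and the eGFR values are hypothetical.

```python
def ckd_stage(egfr):
    """Map an eGFR value (mL/min/1.73 m^2) to a KDIGO GFR category."""
    if egfr >= 90:
        return "G1"
    if egfr >= 60:
        return "G2"
    if egfr >= 45:
        return "G3a"
    if egfr >= 30:
        return "G3b"
    if egfr >= 15:
        return "G4"
    return "G5"


def macro_micro_recall(true_labels, predicted_labels):
    """Return (macro-averaged, micro-averaged) recall for single-label data."""
    pairs = list(zip(true_labels, predicted_labels))
    per_class = []
    for cls in sorted(set(true_labels)):
        support = sum(1 for t, _ in pairs if t == cls)
        hits = sum(1 for t, p in pairs if t == cls and p == cls)
        per_class.append(hits / support)
    macro = sum(per_class) / len(per_class)     # every class weighs the same
    micro = sum(1 for t, p in pairs if t == p) / len(pairs)  # every patient weighs the same
    return macro, micro


# Hypothetical observed and model-estimated eGFR values for eight patients.
observed = [95, 72, 50, 38, 20, 10, 65, 61]
estimated = [92, 70, 47, 41, 22, 12, 58, 63]

true_stages = [ckd_stage(v) for v in observed]
predicted_stages = [ckd_stage(v) for v in estimated]
macro, micro = macro_micro_recall(true_stages, predicted_stages)
print(f"macro recall = {macro:.3f}, micro recall = {micro:.3f}")
```

The two averages differ when classes are imbalanced: here the single misclassified patient belongs to the largest class, so the micro average drops more than the macro average. Note that an eGFR category alone does not establish a CKD diagnosis in the G1 and G2 range, where other markers of kidney damage are also required.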