
371 research outputs found

Response projected clustering for direct association with physiological and clinical response data

Author: Lee Jae K
Park Taesung
Yi Sung-Gon
Publication venue: BioMed Central
Publication date: 01/01/2008

Abstract. Background: Microarray gene expression data are often analyzed together with corresponding physiological response and clinical metadata of biological subjects, e.g. patients' residual tumor sizes after chemotherapy or glucose levels at various stages of diabetic patients. Current clustering analysis cannot directly incorporate such quantitative metadata into the clustering heatmap of gene expression. It would be quite useful if these clinical response data could be effectively summarized in the high-dimensional clustering display so that important groups of genes can be intuitively discovered with different degrees of relevance to target disease phenotypes. Results: We introduced a novel clustering analysis approach, response projected clustering (RPC), which uses a high-dimensional geometrical projection of response data to the gene expression space. The projected response vector, which becomes the origin in the projected space, is then clustered together with the projected gene vectors based on their different degrees of association with the response vector. A bootstrap-counting based RPC analysis is also performed to evaluate the statistical tightness of identified gene clusters. Our RPC analysis was applied to the in vitro growth-inhibition and microarray profiling data on the NCI-60 cancer cell lines and to the microarray gene expression study of macrophage differentiation in atherogenesis. These RPC applications enabled us to identify many known and novel gene factors and their potential pathway associations which are highly relevant to the drug's chemosensitivity activities and atherogenesis. Conclusion: We have shown that RPC can effectively discover gene networks with different degrees of association with clinical metadata. Performed on each gene's response projected vector based on its degree of association with the response data, RPC effectively summarizes individual genes' association with metadata as well as their own expression patterns. Thus, RPC greatly enhances the utility of clustering analysis for investigating high-dimensional microarray gene expression data with quantitative metadata.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Cancer gene prioritization by integrative analysis of mRNA expression and DNA copy number data: a comparative review

Author: Akavia
Andrews
Baasiri
Chin
Dai
De Bie
Futreal
H.-U. Klein
Haverty
Hawkins
Hyman
Johnson
Kao
L. Lahti
M. Dugas
M. Schafer
McLendon
Menezes
Mullighan
Myllykangas
Olshen
Ortiz-Estevez
Phillips
Qin
S. Bicciato
Solvang
Soneson
Stranger
van Wieringen
Publication venue: Oxford University Press (OUP)
Publication date: 20/11/2011

A variety of genome-wide profiling techniques are available to probe complementary aspects of genome structure and function. Integrative analysis of heterogeneous data sources can reveal higher-level interactions that cannot be detected based on individual observations. A standard integration task in cancer studies is to identify altered genomic regions that induce changes in the expression of the associated genes based on joint analysis of genome-wide gene expression and copy number profiling measurements. In this review, we provide a comparison among various modeling procedures for integrating genome-wide profiling data of gene copy number and transcriptional alterations and highlight common approaches to genomic data integration. A transparent benchmarking procedure is introduced to quantitatively compare the cancer gene prioritization performance of the alternative methods. The benchmarking algorithms and data sets are available at http://intcomp.r-forge.r-project.org
Comment: PDF file including supplementary material. 9 pages. Preprin

arXiv.org e-Print Archive

Crossref

PubMed Central

Wageningen University & Research Publications

Archivio istituzionale della ricerca - Università di Modena e Reggio Emilia


Statistical Methods for Integrated Cancer Genomic Data Using a Joint Latent Variable Model

Author: Drill Esther
Publication venue: Columbia University Libraries/Information Services
Publication date: 01/01/2018

Inspired by the TCGA (The Cancer Genome Atlas), we explore multimodal genomic datasets with integrative methods using a joint latent variable approach. We use iCluster+, an existing clustering method for integrative data, to identify potential subtypes within TCGA sarcoma and mesothelioma tumors, and across a large cohort of 33 different TCGA cancer datasets. For classification, motivated to improve the prediction of platinum resistance in high grade serous ovarian cancer (HGSOC) treatment, we propose a novel integrative method, iClassify, to perform classification using a joint latent variable model. iClassify provides effective data integration and classification while handling heterogeneous data types, and offers a natural framework to incorporate covariate risk factors and examine genomic driver by covariate risk factor interaction. Feature selection is performed through a thresholding parameter that combines both latent variable and feature coefficients. We demonstrate increased accuracy in classification over methods that assume a homogeneous data type, such as linear discriminant analysis and penalized logistic regression, and improved feature selection. We apply iClassify to a TCGA cohort of HGSOC patients with three types of genomic data and platinum response data. This methodology has broad applications beyond predicting treatment outcomes and disease progression in cancer, including predicting prognosis and diagnosis in other diseases with major public health implications.

Columbia University Academic Commons

Identification of activation of transcription factors from microarray data

Author: Kossenkov Andrei
Publication venue: Drexel University
Publication date: 01/03/2007

Signaling pathways play a critical role in cell survival and development by regulating transcription factor activity, causing necessary gene products to be produced in response to different stimuli. Although the task of detecting activities of signaling pathways is extremely difficult, recent advances in microarray technology promise progress in the field. Many clustering and pattern recognition algorithms have been applied to the analysis of microarray data. However, these methods lack the ability to address the biological nature of the data and force assignment of one gene to a single co-expression group, ignoring the fact that many individual genes are regulated by different signaling pathways in response to different stimuli and should therefore be assigned to multiple co-expression groups. Another issue in microarray analysis is the low signal-to-noise ratio provided by the technology, yet most clustering methods do not even take measurement errors into consideration.
Bayesian Decomposition is an algorithm that decomposes microarray data into a set of biologically meaningful expression patterns that could be linked to certain signaling pathways and groups of genes that contain these patterns, allowing assignment of one gene to multiple patterns of expression. To address the problem of low signal-to-noise we modified the Bayesian Decomposition algorithm to allow inclusion of prior gene coregulation information to improve statistical power. We also created the Automated Sequence Annotation Pipeline to provide microarray data mining processes with annotation information at all steps and particularly to deduce the coregulation information for a given set of genes from the transcription factor database TRANSFAC.
We validated the enhancements made to Bayesian Decomposition on simulated and real biological data and showed that using coregulation information can improve the ability of the method to recover correct results. The designed data mining process that uses the Automated Sequence Annotation Pipeline and the modified Bayesian Decomposition was applied to determine transcription factor activities linked to patient outcome in gastrointestinal stromal tumor (GIST) patients undergoing treatment with imatinib mesylate (IM, Gleevec). The study identifies genes that can potentially be used as biomarkers to predict GIST patient response to Gleevec treatment, and transcription factor activities that may contribute to differences in the response.
Ph.D., Biomedical Science -- Drexel University, 200

Drexel Libraries E-Repository and Archives

Uncovering Intratumoral And Intertumoral Heterogeneity Among Single-Cell Cancer Specimens

Author: Chen William Shelton
Publication venue: EliScholar – A Digital Platform for Scholarly Publishing at Yale
Publication date: 01/01/2020

While several tools have been developed to map axes of variation among individual cells, no analogous approaches exist for identifying axes of variation among multicellular biospecimens profiled at single-cell resolution. Developing such an approach is of great translational relevance and interest, as single-cell expression data are now often collected across numerous experimental conditions (e.g., representing different drug perturbation conditions, CRISPR knockdowns, or patients undergoing clinical trials) that need to be compared. In this work, “Phenotypic Earth Mover's Distance” (PhEMD) is presented as a solution to this problem. PhEMD is a general method for embedding a “manifold of manifolds,” in which each datapoint in the higher-level manifold (of biospecimens) represents a collection of points that span a lower-level manifold (of cells). PhEMD is applied to a newly-generated, 300-biospecimen mass cytometry drug screen experiment to map small-molecule inhibitors based on their differing effects on breast cancer cells undergoing epithelial–mesenchymal transition (EMT). These experiments highlight EGFR and MEK1/2 inhibitors as strongly halting EMT at an early stage and PI3K/mTOR/Akt inhibitors as enriching for a drug-resistant mesenchymal cell subtype characterized by high expression of phospho-S6. More generally, these experiments reveal that the final mapping of perturbation conditions has low intrinsic dimension and that the network of drugs demonstrates manifold structure, providing insight into how these single-cell experiments should be computationally modeled and visualized. In the presented drug-screen experiment, the full spectrum of perturbation effects could be learned by profiling just a small fraction (11%) of drugs. Moreover, PhEMD could be integrated with complementary datasets to infer the phenotypes of biospecimens not directly profiled with single-cell profiling. 
Together, these findings have major implications for conducting future drug-screen experiments, as they suggest that large-scale drug screens can be conducted by measuring only a small fraction of the drugs using the most expensive high-throughput single-cell technologies—the effects of other drugs may be inferred by mapping and extending the perturbation space. PhEMD is also applied to patient tumor biopsies to assess intertumoral heterogeneity. Applied to a melanoma dataset and a clear-cell renal cell carcinoma dataset (ccRCC), PhEMD maps tumors similarly to how it maps perturbation conditions as above in order to learn key axes along which tumors vary with respect to their tumor-infiltrating immune cells. In both of these datasets, PhEMD highlights a subset of tumors demonstrating a marked enrichment of exhausted CD8+ T-cells. The wide variability in tumor-infiltrating immune cell abundance and particularly prominent exhausted CD8+ T-cell subpopulation highlights the importance of careful patient stratification when assessing clinical response to T cell-directed immunotherapies. Altogether, this work highlights PhEMD’s potential to facilitate drug discovery and patient stratification efforts by uncovering the network geometry of a large collection of single-cell biospecimens. Our varied experiments demonstrate that PhEMD is highly scalable, compatible with leading batch effect correction techniques, and generalizable to multiple experimental designs, with clear applicability to modern precision oncology efforts
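To make the idea of comparing biospecimens as distributions of cells concrete, here is a minimal sketch of an Earth Mover's Distance between two specimens, each summarized as a histogram over cell subtypes. It assumes the subtypes are ordered along a single axis with unit spacing, which is a simplification chosen for illustration; the abstract does not describe PhEMD's actual ground distance, and the function name `emd_1d` and the example histograms are hypothetical.

```python
from itertools import accumulate


def emd_1d(p, q):
    """Earth Mover's Distance between two histograms over the same ordered bins.

    Assumption (not from the abstract): bins are cell subtypes laid out on a
    line with unit spacing, so the EMD reduces to the summed absolute
    difference of the two cumulative distributions.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    total_p, total_q = sum(p), sum(q)
    if total_p <= 0 or total_q <= 0:
        raise ValueError("histograms must have positive total mass")
    # Normalize to relative frequencies, then accumulate into CDFs.
    cdf_p = accumulate(x / total_p for x in p)
    cdf_q = accumulate(x / total_q for x in q)
    return sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))


# Hypothetical cell counts per subtype for two biospecimens.
specimen_a = [80, 15, 5]   # mostly the first subtype
specimen_b = [10, 20, 70]  # mostly the last subtype
print(emd_1d(specimen_a, specimen_b))
```

Computing this distance for every pair of specimens would give a specimen-by-specimen distance matrix, which is the kind of input an embedding method can then operate on.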

Yale University

A study on the quantification of pathway activity using RNA-seq data

Author: 임상수
Publication venue: Seoul National University Graduate School
Publication date: 01/08/2019

Thesis (Ph.D.) -- Seoul National University Graduate School: College of Natural Sciences, Interdisciplinary Program in Bioinformatics, August 2019. 김선.
Measuring the dynamics of RNA transcripts using RNA-seq data has become routine in bioinformatics analyses. However, RNA-seq produces high-dimensional transcriptome data on more than 20,000 genes in humans. This makes the interpretation of the data extremely difficult given a relatively small set of samples. Therefore, it is desirable to use well-summarized and widely-used information such as biological pathways for better biological comprehension. However, summarizing transcriptome data in terms of biological pathways is a very challenging task for several reasons. First, there is a huge information loss when transforming transcriptome data to pathway space. For example, in humans, only one third of the entire set of genes being analyzed are present in KEGG pathways. Second, each pathway consists of many genes; thus, measuring pathway activity requires a strategy to summarize expression profiles of component genes into a single value, while considering the relationships among the constituent genes. My doctoral study aimed to develop a new method for pathway activity measurement, and to perform extensive evaluation experiments on existing pathway measurement tools in terms of multiple evaluation criteria. In addition, a cloud-based system was constructed to deploy such tools, making it easy for users to analyze their own data. 
The first study is to develop a new method to summarize transcriptome data in terms of pathways by using explicit transcript quantity information and considering relationship among genes in terms of their interactions. In this study, I propose a novel concept of decomposing biological pathways into subsystems by utilizing protein interaction network, pathway information, and RNA-seq data. A subsystem activation score (SAS) was designed to measure the degree of activation for each subsystem and each patient. This method revealed distinctive genome-wide activation patterns or landscapes of subsystems that are differentially activated among samples as well as among breast cancer subtypes. Next, we used SAS information for prognostic modeling by classification and regression tree (CART) analysis. Eleven subgroups of patients, defined by the 10 most significant subsystems, were identified with maximal discrepancy in survival outcome. Our model not only defined patient subgroups with similar survival outcomes, but also provided patient-specific decision paths determined by SAS status, suggesting functionally informative gene sets in breast cancer. The second study aimed to systematically compare and evaluate thirteen different pathway activity inference tools based on five comparison criteria using a pan-cancer data set. Although many pathway activity tools are available, there is no comparative study on how effective these tools are in producing useful information at the cohort level, enabling comparison of many samples. This study has two major contributions. First, this study provides a comprehensive survey on computational techniques used by existing pathway activity inference tools. Existing tools use different strategies and assume different requirements on data: input transformation, use of labels, necessity of cohort-level input data, use of gene relations and scoring metrics. 
Second, extensive evaluations were conducted using five comparison criteria concerning the performance of these tools. Starting from measuring how well a tool maintains the characteristics of an original gene expression profile, robustness was also investigated by introducing noise into gene expression data. Classification tasks on three clinical variables were performed to evaluate the utility of tools. The third study is to build a cloud-based system where a user provides transcriptome data and measures pathway activities using the tools that were used for the comparative study. When a user uploads input data to the system and selects which analysis tools are to be run, the system automatically generates pathway activity values for each tool as well as a summary of performance comparison for the selected tools. Users can also investigate which pathways are significant in terms of the given sample information and visually inspect genes within a pathway through the KEGG REST API. In conclusion, in my thesis, I sought to develop an analysis method regarding biological pathways using high-throughput gene expression data, to compare different types of tools with comprehensive criteria, and to arrange the tools in a cloud-based system that is easily accessible. As pathways aggregate various molecular events among genes into a single entity, the set of suggested approaches will aid interpretation of high-throughput data as well as facilitate integration of diverse data layers such as miRNA or DNA methylation profiles.
Table of contents:
Chapter 1 Introduction
1.1 Biological background
1.1.1 Biological pathways
1.1.2 Gene expression
1.1.3 Pathway-based analysis
1.1.4 Pathway activity measurement
1.2 Challenges in pathway activity measurement
1.2.1 Calculating effective pathway activity values from RNA-seq data
1.2.2 Lack of comparative criteria to evaluate pathway activity tools
1.2.3 Absence of a user-friendly environment of pathway activity inference tools
1.3 Outline of the thesis
Chapter 2 Measuring pathway activity from RNA-seq data to identify breast cancer subsystems using protein-protein interaction network
2.1 Related works
2.2 Motivation
2.3 Methods
2.3.1 Breast cancer subsystems
2.3.2 Subsystem Activation Score
2.3.3 Prognostic modeling
2.3.4 Hierarchical clustering of patients and subsystems
2.3.5 Tools used in this study
2.4 Results
2.4.1 Pathways were decomposed into coherent functional units - subsystems
2.4.2 Landscape of subsystems reflect the breast cancer biology
2.4.3 SAS revealed patient clusters associated with PAM50 subtypes
2.4.4 Prognostic modeling by subsystems showed 11 patient subgroups with distinct survival outcome
2.4.5 Relapse rate and CNVs were enriched to worse prognostic subgroups
2.5 Discussion
Chapter 3 Comprehensive evaluation of pathway activity measurement tools on pan-cancer data
3.1 Related works
3.2 Motivation
3.3 Materials and methods
3.3.1 Pathway activity inference Tools
3.3.2 Data sets
3.3.3 Pathway database
3.3.4 Notations
3.4 Comparative approach
3.4.1 Radar chart criteria
3.4.2 Similarity among the tools
3.5 Results
3.5.1 Distance preservation
3.5.2 Robustness against noise
3.5.3 Classification: Tumor vs Normal
3.5.4 Classification: survival information
3.5.5 Classification: cancer subtypes
3.5.6 Similarity among the tools
3.6 Discussion
Chapter 4 A cloud-based system of pathway activity inference tools using high-throughput gene expression data
4.1 Related works
4.2 Motivation
4.3 Implementation
4.4 Results
4.4.1 Calculating pathway activity values
4.4.2 Identification of significant pathways
4.4.3 Visualization in KEGG pathways
4.4.4 Comparison of the tools
4.5 Discussion
Chapter 5 Conclusion
Abstract (in Korean)
Docto

SNU Open Repository and Archive

Integrative enrichment analysis: a new computational method to detect dysregulated pathways in heterogeneous samples

Author: A Alexeyenko
A Calon
A Keller
A Ochab-Marcinek
A Subramanian
AL Tarca
B Efron
C Wu
CC Wu
D Merico
D Szklarczyk
DN Simon
DR Laybutt
E Edelman
E Glaab
E Lee
Guojun Li
H Kojima
H Maciejewski
H Ogata
H Sun
HP Hammes
IL Aye
J Michaud
J Taneera
J Tomfohr
JJ Goeman
JR Schoenborn
K Tobler
K Yu
KA Steer
M Kanehisa
M Natarajan
M Palfy
M Rebhan
M Yi
MA Ibrahim
NN Ulusu
PS Wong
RK Curtis
S Aerts
S Durinck
S Hanzelmann
S Tavazoie
SA Gupte
SE Kahn
SL Beau
SY Kim
T Barrett
T Mahdi
T Morikawa
T Tuomi
T Zeng
Tao Zeng
TS Keshava Prasad
V Saxena
X Yu
Xiangtian Yu
Y Drier
Y Li
Z Jiang
Publication venue: Springer Science and Business Media LLC

Crossref

Applying Machine Learning Algorithms for the Analysis of Biological Sequences and Medical Records

Author: Gu Shaopeng
Publication venue: Open PRAIRIE: Open Public Research Access Institutional Repository and Information Exchange
Publication date: 01/01/2019

Modern sequencing technology has revolutionized genomic research and triggered explosive growth in DNA, RNA, and protein sequence data. Inferring structure and function from biological sequences is a fundamentally important task in genomics and proteomics. With the development of statistical and machine learning methods, an integrated and user-friendly tool containing state-of-the-art data mining methods is needed. Here, we propose SeqFea-Learn, a comprehensive Python pipeline that integrates multiple steps for sequence analysis: feature extraction, dimensionality reduction, feature selection, and predictive model construction based on machine learning and deep learning approaches. We used enhancer, RNA N6-methyladenosine site, and protein-protein interaction datasets to validate the tool. The results show that the tool can effectively perform biological sequence analysis and classification tasks. This dissertation also applies machine learning algorithms to electronic medical record (EMR) data analysis. Chronic kidney disease (CKD) is prevalent across the world and is well defined by the estimated glomerular filtration rate (eGFR). The progression of kidney disease can be predicted if future eGFR can be accurately estimated using predictive analytics. Thus, I present a prediction model of eGFR that was built using Random Forest regression. The dataset includes demographic, clinical and laboratory information from a regional primary health care clinic. The final model included eGFR, age, gender, body mass index (BMI), obesity, hypertension, and diabetes, and achieved a mean coefficient of determination of 0.95. The estimated eGFRs were used to classify patients into CKD stages with high macro-averaged and micro-averaged metrics.
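The final step described above, turning a predicted eGFR into a CKD stage, can be sketched as a simple threshold lookup. This example assumes the standard KDIGO GFR categories (G1 to G5, in mL/min/1.73 m^2); the abstract does not say which staging scheme the dissertation used, and the function name `ckd_stage` and the sample values are hypothetical.

```python
def ckd_stage(egfr: float) -> str:
    """Map an eGFR value (mL/min/1.73 m^2) to a KDIGO GFR category.

    Assumption: KDIGO cut-offs (>=90 G1, 60-89 G2, 45-59 G3a, 30-44 G3b,
    15-29 G4, <15 G5). The dissertation may have used a different scheme.
    """
    if egfr < 0:
        raise ValueError("eGFR must be non-negative")
    # (lower bound, label) pairs, checked from the highest threshold down.
    thresholds = [(90, "G1"), (60, "G2"), (45, "G3a"), (30, "G3b"), (15, "G4")]
    for lower, label in thresholds:
        if egfr >= lower:
            return label
    return "G5"


# Hypothetical eGFR predictions, e.g. the output of a regression model.
predicted = [95.0, 72.5, 44.9, 12.0]
stages = [ckd_stage(value) for value in predicted]
print(stages)  # ['G1', 'G2', 'G3b', 'G5']
```

Comparing stages derived from predicted eGFR with stages derived from measured eGFR is what would allow classification metrics such as macro- and micro-averaged scores to be computed.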

Public Research Access Institutional Repository and Information Exchange