8 research outputs found

    Pathway analysis of mRNA expression data using a hierarchical structural model

    Thesis (Master's) -- Seoul National University Graduate School: College of Natural Sciences, Interdisciplinary Program in Bioinformatics, August 2019. 박태성.

    Although several analyses have sought to identify cancer-associated pathways from gene expression data, most are single-pathway analyses and therefore do not consider correlations between pathways. In this paper, we propose a hierarchical structural component model of pathway analysis for gene expression data (HisCoM-PAGE), which accounts for the hierarchical structure of genes and pathways, as well as the correlations among pathways. Specifically, HisCoM-PAGE focuses on the survival phenotype and identifies its associated pathways. Its application to real pancreatic cancer data demonstrated that HisCoM-PAGE could successfully identify pathways associated with pancreatic cancer prognosis. Simulation studies comparing HisCoM-PAGE with competing methods such as Gene Set Enrichment Analysis (GSEA), Global Test, and Wald-type Test showed HisCoM-PAGE to have the highest power to detect causal pathways.

    Several analyses have sought to identify the biological mechanisms, that is, pathways, associated with cancer, but most analyses based on gene expression data have relied on single-pathway analysis. Such methods do not consider correlations among pathways. In this thesis we propose HisCoM-PAGE, a pathway analysis model for gene expression data using a hierarchical structural model, which reflects the biological hierarchy of genes and the pathways above them. In particular, HisCoM-PAGE focuses on the survival phenotype and on identifying statistically significant pathways associated with prognosis. Pancreatic cancer data were used as the real-data application because pancreatic cancer has a poor prognosis even among cancer types, which makes research on prognosis important. Applying HisCoM-PAGE to real pancreatic cancer gene expression data confirmed that it can effectively identify pathways related to pancreatic cancer prognosis. To assess the statistical power of the proposed method, a simulation study compared it with previously proposed pathway methods such as Gene Set Enrichment Analysis (GSEA), Global Test (GT), and the Adewale Test. The comparison confirmed that HisCoM-PAGE has high power to detect statistically significant pathways associated with disease.

    Contents: 1 Introduction; 2 Materials; 3 Methodology; 4 Results; 5 Discussions; Bibliography; Abstract in Korean

    ROAST: rotation gene set tests for complex microarray experiments

    Motivation: A gene set test is a differential expression analysis in which a P-value is assigned to a set of genes as a unit. Gene set tests are valuable for increasing statistical power, organizing and interpreting results and for relating expression patterns across different experiments. Existing methods are based on permutation. Methods that rely on permutation of probes unrealistically assume independence of genes, while those that rely on permutation of samples are suitable only for two-group comparisons with a good number of replicates in each group.

    Self-Contained Gene-Set Analysis of Expression Data: An Evaluation of Existing and Novel Methods

    Gene set methods aim to assess the overall evidence of association of a set of genes with a phenotype, such as disease or a quantitative trait. Multiple approaches for gene set analysis of expression data have been proposed. They can be divided into two types: competitive and self-contained. Self-contained methods can be used for genome-wide, candidate gene, or pathway studies, and have been reported to be more powerful than competitive methods. We therefore investigated ten self-contained methods that can be used for continuous, discrete and time-to-event phenotypes. To assess the power and type I error rate of the previously proposed and novel approaches, an extensive simulation study was completed in which the scenarios varied according to: number of genes in a gene set, number of genes associated with the phenotype, effect sizes, correlation between expression of genes within a gene set, and the sample size. In addition to the simulated data, the methods were applied to a pharmacogenomic study of the drug gemcitabine. Simulation results demonstrated that, overall, Fisher's method and the global model with random effects have the highest power for a wide range of scenarios, while the analysis based on the first principal component and the Kolmogorov-Smirnov test tended to have the lowest power. The methods investigated here are likely to play an important role in identifying pathways that contribute to complex traits.
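
Fisher's method, named in the abstract above, combines per-gene p-values into one gene-set statistic. The sketch below is a minimal illustration, not the evaluated implementation: the function name `fisher_combined_pvalue` is hypothetical, and the chi-square reference distribution assumes independent p-values. Expression within a gene set is usually correlated, so a study like the one summarized here would more plausibly calibrate the statistic by permutation.

```python
import math


def fisher_combined_pvalue(p_values):
    """Combine p-values with Fisher's method (assumes independence)."""
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(p <= 0.0 or p > 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    # Fisher's statistic is X = -2 * sum(ln p_i); under the null it follows
    # a chi-square distribution with 2k degrees of freedom.
    half_stat = -sum(math.log(p) for p in p_values)  # this is X / 2
    # Closed-form chi-square survival function for even degrees of freedom:
    # P(X > x) = exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    term = 1.0
    total = 1.0
    for j in range(1, k):
        term *= half_stat / j
        total += term
    return math.exp(-half_stat) * total
```

With a single p-value the function returns that p-value unchanged, which is a quick sanity check on the closed form.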

    A comparative study on gene-set analysis methods for assessing differential expression associated with the survival phenotype

    Background: Many gene-set analysis methods have been previously proposed and compared through simulation studies and analysis of real datasets for binary phenotypes. We focused on the survival phenotype and compared the performances of Gene Set Enrichment Analysis (GSEA), Global Test (GT), Wald-type Test (WT) and Global Boost Test (GBST) methods in a simulation study and on two ovarian cancer data sets. We considered two versions of GSEA by allowing different weights: GSEA1 uses equal weights, yielding results similar to the Kolmogorov-Smirnov test; while GSEA2's weights are based on the correlation between genes and the phenotype.

    Results: We compared GSEA1, GSEA2, GT, WT and GBST in a simulation study with various settings for the correlation structure of the genes and the association parameter between the survival outcome and the genes. Simulation results indicated that GT, WT and GBST consistently have higher power than GSEA1 and GSEA2 across all scenarios. However, the power of the five tests depends on the combination of correlation structure and association parameter. For the ovarian cancer data set, using the FDR threshold of q ...

    Conclusion: Simulation studies and a real data example indicate that GT, WT and GBST tend to have high power, whereas GSEA1 and GSEA2 have lower power. We also found that the power of the five tests is much higher when genes are correlated than when genes are independent, when survival is positively associated with genes. It seems that there is a synergistic effect in detecting significant gene sets when significant genes have within-class correlation and the association between survival and genes is positive or negative (i.e., one-direction correlation).
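
The equal-weight variant described above (GSEA1) reduces to a Kolmogorov-Smirnov-style running sum over a ranked gene list. The following is a rough sketch of that running sum only. The function name `enrichment_score` is hypothetical, the ranking is assumed to have been computed elsewhere (for a survival phenotype, for example, from a per-gene Cox score), and the permutation step used to obtain p-values is not shown.

```python
def enrichment_score(ranked_genes, gene_set):
    """Equal-weight enrichment score: signed maximum deviation of the running sum."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    running = 0.0
    best = 0.0
    for is_hit in hits:
        # Step up by 1/n_hit at a set member, down by 1/n_miss otherwise,
        # so the running sum always returns to zero at the end of the list.
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best
```

A set concentrated at the top of the ranking scores near +1, and one concentrated at the bottom scores near -1.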

    A general modular framework for gene set enrichment analysis

    Background: Analysis of microarray and other high-throughput data on the basis of gene sets, rather than individual genes, is becoming more important in genomic studies. Correspondingly, a large number of statistical approaches for detecting gene set enrichment have been proposed, but both the interrelations and the relative performance of the various methods are still very much unclear.

    Results: We conduct an extensive survey of statistical approaches for gene set analysis and identify a common modular structure underlying most published methods. Based on this finding we propose a general framework for detecting gene set enrichment. This framework provides a meta-theory of gene set analysis that not only helps to gain a better understanding of the relative merits of each embedded approach but also facilitates a principled comparison and offers insights into the relative interplay of the methods.

    Conclusion: We use this framework to conduct a computer simulation comparing 261 different variants of gene set enrichment procedures and to analyze two experimental data sets. Based on the results we offer recommendations for best practices regarding the choice of effective procedures for gene set enrichment analysis.

    Using pathway correlation profiles for understanding pathway perturbation

    Title from PDF of title page (University of Missouri--Columbia, viewed on March 5, 2013). Dissertation advisor: Dr. Dong Xu. Includes bibliographical references. Vita. Ph.D., University of Missouri-Columbia, December 2012.

    Identifying perturbed or dysregulated pathways is critical to understanding the biological processes that change within an experiment. Previous methods identified important pathways that are significantly enriched among differentially expressed genes; however, these methods cannot account for small, coordinated changes in gene expression that amass across a whole pathway. To overcome this limitation, we developed a novel computational approach to identify pathway perturbation based on pathway correlation profiles. In this approach, we can rank the pathways based on the significance of their dysregulation considering all gene-gene pairs. We have shown this successfully for differences between two experimental conditions in Escherichia coli and changes within time series data in Saccharomyces cerevisiae, as well as two estrogen receptor response classes of breast cancer. Overall, our method made significant predictions as to the pathway perturbations that are involved in the experimental conditions. Further, I can use these pathway correlation profiles to better understand pathway dynamics and modules of regulation. I have applied this method to the Ribosome pathway for several model organisms and various tissue types, where I was able to isolate alternative regulation patterns for each species and tissue.
In addition, I have applied these pathway correlation profiles to the MAPK pathway to help characterize the disease progression of colon cancer from normal tissue, through all four stages, culminating in final metastasis. The pathway correlation profile method allows for more meaningful and biologically significant interpretation of the currently available data. In short, we developed a novel computational method for identifying pathway perturbation. This method is a powerful tool that better utilizes gene expression data when studying pathway dynamics with regard to biological processes. Moreover, this method provides hypotheses for understanding the mechanisms within meaningful pathways, and where the pathway dynamics change across conditions.
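
One way to picture a pathway correlation profile is as the vector of pairwise correlations among a pathway's genes, computed separately in each condition. The sketch below is illustrative only: the names `correlation_profile` and `profile_shift` are hypothetical, and summarizing the change as a mean absolute difference is an assumption, since the abstract does not say how the significance of dysregulation is scored.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (neither may be constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_profile(expr, genes):
    """Map each gene pair in the pathway to its correlation across samples.

    expr: dict mapping gene name -> list of expression values, one per sample.
    """
    return {
        (g, h): pearson(expr[g], expr[h])
        for g, h in combinations(sorted(genes), 2)
    }


def profile_shift(expr_a, expr_b, genes):
    """Mean absolute change in pairwise correlation between two conditions.

    This summary statistic is an assumption made for illustration.
    """
    profile_a = correlation_profile(expr_a, genes)
    profile_b = correlation_profile(expr_b, genes)
    return sum(abs(profile_a[k] - profile_b[k]) for k in profile_a) / len(profile_a)
```

Pathways could then be ranked by `profile_shift`, with a permutation of condition labels as one possible way to attach significance.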

    Genetic association tests for clustered survival times

    Honour Roll of the Faculty of Graduate and Postdoctoral Studies, 2015-2016.

    The statistical tools developed in this manuscript-based thesis aim at detecting new associations between genetic variants and clustered survival data. Methodological development in lifetime data analysis continues today with the proliferation of genetic association testing and, ultimately, personalized medicine, which focuses on preventing disease and prolonging life. In the first paper, the following problem is considered: testing the equality of survival functions in the presence of selection bias and intracluster correlation when the assumption of proportional hazards does not hold. The proposed test is based on a Cramér-von Mises type statistic. The p-value is approximated using an innovative semiparametric bootstrap procedure that involves generating correlated observations according to a non-random design. For simulation scenarios of departures from the null hypothesis with crossing survival curves, the Cramér-von Mises statistic clearly outperformed the Wald statistic from the weighted Cox proportional hazards model.
The new test was used to analyse the association between a candidate single nucleotide polymorphism (SNP) and breast cancer risk in women carrying a mutation in the BRCA2 tumor suppressor gene. A sequence kernel association test (SKAT) to detect the association between a set of genetic variants and clustered survival outcomes from family studies is developed in the second manuscript. The proposed statistic uses the kinship matrix of the sample to model the residual intra-family correlation between survival outcomes via a Gaussian copula. The test procedure relies on multiple imputation to estimate the contribution of the censored survival outcomes to the score statistic, which is a mixture of chi-square distributions. Simulation results show that the new kinship-adjusted kernel score test adequately controls the type I error rate. The new test was applied to a set of SNPs from the TERT locus. The third manuscript presents the R package gyriq, which implements an enhanced version of the genetic association test developed in the second manuscript. The weighted identical-by-state (IBS) kernel matrix is added, genetic association tests and accompanying software currently available for age-at-onset outcomes are briefly reviewed, and the implementation of the package is described and illustrated through examples.
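
The identical-by-state (IBS) kernel mentioned in this abstract measures how many alleles two subjects share across a SNP set. The sketch below shows the plain, unweighted form under stated assumptions: genotypes are coded as minor-allele counts 0, 1 or 2, and the function name `ibs_kernel` is hypothetical. It does not reproduce the weighted variant or any function of the gyriq package.

```python
def ibs_kernel(genotypes):
    """Unweighted IBS kernel matrix for a SNP set.

    genotypes: list of per-subject lists of minor-allele counts (0, 1 or 2),
    all of the same length. Entry K[i][j] is the fraction of alleles shared
    identical-by-state between subjects i and j, averaged over the SNPs.
    """
    n_subjects = len(genotypes)
    n_snps = len(genotypes[0])
    kernel = [[0.0] * n_subjects for _ in range(n_subjects)]
    for i in range(n_subjects):
        for j in range(i, n_subjects):
            # At each SNP the pair shares 2 - |g_i - g_j| alleles out of 2.
            shared = sum(2 - abs(a - b) for a, b in zip(genotypes[i], genotypes[j]))
            kernel[i][j] = kernel[j][i] = shared / (2.0 * n_snps)
    return kernel
```

Each diagonal entry equals 1, and off-diagonal entries fall toward 0 as two subjects share fewer alleles.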