11 research outputs found

    Gene set analysis for longitudinal gene expression data

    Background: Gene set analysis (GSA) has become a successful tool to interpret gene expression profiles in terms of biological functions, molecular pathways, or genomic locations. GSA performs statistical tests for independent microarray samples at the level of gene sets rather than individual genes. An increasing number of microarray studies are conducted to explore the dynamic changes of gene expression in a variety of species and biological scenarios. In these longitudinal studies, gene expression is repeatedly measured over time, so a GSA needs to take into account the within-gene correlations in addition to possible between-gene correlations.
    Results: We provide a robust nonparametric approach to compare the expressions of longitudinally measured sets of genes under multiple treatments or experimental conditions. The limiting distributions of our statistics are derived when the number of genes goes to infinity while the number of replications can be small. When the number of genes in a gene set is small, we recommend permutation tests based on our nonparametric test statistics to achieve reliable type I error and better power while incorporating unknown correlations between and within genes. Simulation results demonstrate that the proposed method has greater power than other methods for various data distributions and heteroscedastic correlation structures. This method was used for an IL-2 stimulation study and significantly altered gene sets were identified.
    Conclusions: The simulation study and the real data application showed that the proposed gene set analysis provides a promising tool for longitudinal microarray analysis. R scripts for simulating longitudinal data and calculating the nonparametric statistics are posted on the North Dakota INBRE website http://ndinbre.org/programs/bioinformatics.php. Raw microarray data are available in Gene Expression Omnibus (National Center for Biotechnology Information) with accession number GSE6085.
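    The permutation idea in the abstract above can be sketched as follows. This is a minimal illustration and not the authors' method: the statistic used here (a sum of squared differences in group means, in the hypothetical function `set_statistic`) is a stand-in for their nonparametric statistic, and the data layout is an assumption. The point it shows is that whole replicates are shuffled between treatment groups, so each replicate's correlations between genes and across time points stay intact.

```python
import random


def set_statistic(group_a, group_b):
    """Stand-in gene-set statistic (NOT the paper's statistic).

    Each group is a list of replicates; each replicate is a list of genes;
    each gene is a list of expression values over time. Returns the sum,
    over genes and time points, of squared differences in group means.
    """
    n_genes = len(group_a[0])
    n_times = len(group_a[0][0])
    total = 0.0
    for g in range(n_genes):
        for t in range(n_times):
            mean_a = sum(rep[g][t] for rep in group_a) / len(group_a)
            mean_b = sum(rep[g][t] for rep in group_b) / len(group_b)
            total += (mean_a - mean_b) ** 2
    return total


def permutation_pvalue(group_a, group_b, n_perm=2000, seed=0):
    """Monte Carlo permutation p-value for the statistic above.

    Whole replicates are reassigned to groups, so the gene-by-time block
    of each replicate (and hence its correlation structure) is untouched.
    """
    rng = random.Random(seed)
    observed = set_statistic(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if set_statistic(pooled[:n_a], pooled[n_a:]) >= observed:
            hits += 1
    # The +1 terms count the observed labelling as one permutation.
    return (hits + 1) / (n_perm + 1)
```

    With very few replicates the number of distinct relabellings is small, which bounds how small the p-value can get.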

    Variance component score test for time-course gene set analysis of longitudinal RNA-seq data

    As gene expression measurement technology shifts from microarrays to sequencing, the statistical tools available for analysis must be adapted, since RNA-seq data are measured as counts. Recently, it has been proposed to tackle the count nature of these data by modeling log-counts per million reads as continuous variables, using nonparametric regression to account for their inherent heteroscedasticity. Adopting such a framework, we propose tcgsaseq, a principled, model-free and efficient top-down method for detecting longitudinal changes in RNA-seq gene sets. Considering gene sets defined a priori, tcgsaseq identifies those whose expression varies over time, based on an original variance component score test accounting for both covariates and heteroscedasticity without assuming any specific parametric distribution for the transformed counts. We demonstrate that despite the presence of a nonparametric component, our test statistic has a simple form and limiting distribution, and both may be computed quickly. A permutation version of the test is additionally proposed for very small sample sizes. Applied to both simulated data and two real datasets, the proposed method is shown to exhibit very good statistical properties, with an increase in stability and power when compared to the state-of-the-art methods ROAST, edgeR and DESeq2, which can fail to control the type I error under certain realistic settings. We have made the method available for the community in the R package tcgsaseq. Comment: 23 pages, 6 figures, typo corrections & acceptance acknowledgement.
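    The log-counts-per-million transformation mentioned in the abstract above can be sketched as below. This illustrates only the transformation step, not tcgsaseq itself. The offsets (0.5 added to each count, 1 added to each library size) follow the convention used by limma's voom, and it is an assumption that the same convention is intended here. The function name `log2_cpm` and the gene-by-sample layout are hypothetical.

```python
import math


def log2_cpm(counts):
    """Transform raw counts to log2 counts per million.

    counts[g][s] is the raw read count for gene g in sample s.
    The 0.5 and 1.0 offsets avoid log(0); they follow the voom
    convention and are an assumption, not taken from the abstract.
    """
    n_samples = len(counts[0])
    # Library size: total reads per sample, summed over genes.
    lib_sizes = [sum(row[s] for row in counts) for s in range(n_samples)]
    return [
        [
            math.log2((row[s] + 0.5) / (lib_sizes[s] + 1.0) * 1e6)
            for s in range(n_samples)
        ]
        for row in counts
    ]
```

    The transformed values are continuous but still heteroscedastic, which is why the abstract pairs the transformation with a nonparametric estimate of the variance.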

    Discovery of relevant response in infected potato plants from time series of gene expression data

    The paper presents a methodology for analyzing time series of gene expression data collected from the leaves of potato virus Y (PVY) infected and non-infected potato plants, with the aim of identifying significant differences between the two sets of plants that are characteristic of various time points. We aim to identify differentially expressed genes whose expression values are statistically significantly different in PVY-infected potato plants compared to non-infected plants, and which also show statistically significant changes in expression over time in the PVY-infected plants. The novelty of the approach includes stratified data randomization, used in estimating the statistical properties of gene expression in the control set of non-infected potato plants. A novel estimate that computes the relative minimal distance between samples is defined; it enables reliable identification of the differences between the target and control datasets when these sets are small. The relevance of the outcomes is demonstrated by visualizing the relative minimal distance of gene expression changes over time, for three different types of potato leaves, for the genes identified as relevant by the proposed methodology.

    Gene Set Testing by Distance Correlation

    Pathways are the functional building blocks of complex diseases such as cancers. Pathway-level studies may provide insights into some important biological processes. Gene set testing is an important tool for studying the differential expression of a gene set between two groups, e.g., cancer vs. normal. The differential expression of a gene set could be due to a difference in mean, variability, or both. However, most existing gene set tests target only the mean difference and overlook other types of differential expression. In this thesis, we propose to use the recently developed distance correlation for gene set testing. To assess the distance correlation test, simulation studies under different settings are conducted for a comprehensive comparison with the popular Hotelling’s T^2 test and the rotation gene set test (ROAST). The three gene set tests are also applied to two real datasets for further comparison. Based on our simulation studies and real data applications, the distance correlation test has overall better statistical performance than Hotelling’s T^2 test and ROAST, especially for detecting differences in variability. This thesis begins with an introduction to the problem of gene set testing, and then introduces the prevailing Hotelling’s T^2 test and ROAST. Chapter 2 is a detailed review of the concepts and properties of distance correlation. The results from simulation studies and real data applications are summarized in Chapters 3 and 4 respectively. In Chapter 5, we conclude the thesis with some discussion and future perspectives.
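    The sample distance correlation that the thesis above builds on can be sketched in a few lines. This is an illustration of the standard definition (pairwise Euclidean distances, double centering, normalized distance covariance), not code from the thesis. The function names and the input layout, one tuple of values per sample, are assumptions.

```python
import math


def _centered_distances(points):
    """Pairwise Euclidean distance matrix, double-centered."""
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]
    # The matrix is symmetric, so row means equal column means.
    row_mean = [sum(r) / n for r in d]
    grand_mean = sum(row_mean) / n
    return [
        [d[i][j] - row_mean[i] - row_mean[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def distance_correlation(x, y):
    """Sample distance correlation between paired observations x and y.

    x and y are equal-length lists of tuples (one tuple per sample).
    Returns a value in [0, 1]; 0.0 is returned if either input has
    zero distance variance (for example, a constant group label).
    """
    n = len(x)
    a = _centered_distances(x)
    b = _centered_distances(y)
    pairs = [(i, j) for i in range(n) for j in range(n)]
    dcov2 = sum(a[i][j] * b[i][j] for i, j in pairs) / n ** 2
    dvar_x = sum(a[i][j] ** 2 for i, j in pairs) / n ** 2
    dvar_y = sum(b[i][j] ** 2 for i, j in pairs) / n ** 2
    denom = math.sqrt(dvar_x * dvar_y)
    if denom == 0.0:
        return 0.0
    return math.sqrt(max(dcov2, 0.0) / denom)
```

    For a gene set test, `x` would hold each sample's expression values for the genes in the set and `y` the group label, e.g. `(0.0,)` or `(1.0,)`. A p-value could then be obtained by permuting the labels; that choice is an assumption here, since the abstract does not say how significance is assessed.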

    Genomic analysis of macrophage gene signatures during idiopathic pulmonary fibrosis development

    Idiopathic Pulmonary Fibrosis (IPF) is a chronic, progressive, irreversible lung disease. After diagnosis, life expectancy with this interstitial condition is commonly 3-5 years if untreated. Despite their limited capacity to recapitulate IPF, animal models have been useful for identifying related pathways relevant to drug discovery and the development of diagnostic tools. Using these techniques, several immune-related mechanisms have been implicated in IPF. For instance, subpopulations of macrophages and monocyte-derived cells are recognized as centrally active in pulmonary immunological processes. One of the most used technologies is high-throughput gene expression analysis, which has been available for almost two decades now. The “omics” revolution has had a major impact on macrophage and pulmonary fibrosis research. The present study aims to investigate macrophage dynamics within the context of IPF at the transcriptomic level. Using publicly available gene-expression data, we applied modern data science approaches to (1) understand longitudinal profiles within IPF models; (2) investigate correlation between macrophage genomic dynamics and IPF development; and (3) apply longitudinal profiles uncovered through multivariate data analysis to the development of new sets of predictors able to classify IPF and control samples accordingly. Principal Component Analysis and Hierarchical Clustering showed that our pipeline was able to construct a complex set of biomarker candidates that together outperformed gene expression alone in separating treatment groups in an IPF animal model dataset. We further assessed the predictive performance of our candidates on publicly available gene expression data from IPF patients. Once again, the constructed biomarker candidates were significantly differentiated between IPF and control samples. The data presented in this work strongly suggest that longitudinal data analysis holds major unappreciated potential for translational medicine research.

    Topics in the analysis of longitudinal experiments for applications in studies of biopotential signals

    Dissertation (master's), Universidade de Brasília, Instituto de Ciências Exatas, Departamento de Estatística, 2103. This work reviews and numerically compares different methodologies for analyzing longitudinal data with asymptotic structure. Specifically, it studies tests of no simple effect based on the works of Wang (2004), von Borries (2008) and Zhang (2008). Power curves for these tests are built for different simulation scenarios, and from these curves it can be seen that Zhang's (2008) test gives superior results, even in cases where the tests of von Borries (2008) and Wang (2004) were considered adequate. Consequently, these tests are adapted to the clustering algorithm PPCLUSTEL and used in the analysis of microarray, electroencephalography and electromyography data. SAS and Gnuplot software were used to obtain the results.

    DNA methylation as a biomarker for age-related cognitive impairment

    PhD Thesis. Due to the ageing population, the number of patients diagnosed with age-related diseases such as stroke and Parkinson’s disease is on the rise. In both post-stroke dementia (PSD) and mild cognitive impairment in Parkinson’s disease (PD-MCI), the mechanisms resulting in cognitive decline are unknown. This project aims to identify a biomarker which could predict those patients most at risk of developing cognitive decline, which would subsequently assist healthcare professionals in recommending early treatment and care. Epigenetics is an emerging field in which biomarkers have previously been useful in prognostication of cancers and prediction of cardiovascular disease. In this study, 30 patients from a PSD cohort (COGFAST) and 48 patients from a PD-MCI cohort (ICICLE) were analysed using the Illumina HumanMethylation450 BeadChip to identify differentially methylated positions which could predict patients who would later develop cognitive decline. Top hits were validated using Pyrosequencing to confirm DNA methylation differences in a replication cohort. Individual CpG sites within APOB and NGF were identified as potential blood-based biomarkers for PSD, and one CpG site within CHCHD5 was highlighted as a potential blood-based biomarker for PD-MCI. In addition, methylation at one CpG site within NGF and at a CpG site (cg18837178) within a non-coding RNA was found to be associated with Braak staging (degree of brain pathology) using DNA from two brain regions. NGF deregulation has previously been associated with Alzheimer’s disease, and this finding indicates it may also have a role in the development of PSD. These novel findings represent the first steps towards the identification of blood-based biomarkers to assist with diagnosis of PSD and PD-MCI, but require further validation in a larger independent cohort. The differentially methylated genes identified may also give insight into some of the mechanisms involved in these complex diseases, potentially leading to the future development of targeted preventative treatments. Medical Research Council and Newcastle University.

    Time-Course Gene Set Analysis for Longitudinal Gene Expression Data

    International audience. Gene set analysis methods, which consider predefined groups of genes in the analysis of genomic data, have been successfully applied for analyzing gene expression data in cross-sectional studies. The time-course gene set analysis (TcGSA) introduced here is an extension of gene set analysis to longitudinal data. The proposed method relies on random effects modeling with maximum likelihood estimates. It allows the use of all available repeated measurements while dealing with unbalanced data due to missing at random (MAR) measurements. TcGSA is a hypothesis-driven method that identifies a priori defined gene sets with significant expression variations over time, taking into account the potential heterogeneity of expression within gene sets. When biological conditions are compared, the method indicates whether the time patterns of gene sets differ significantly according to these conditions. The interest of the method is illustrated by its application to two real-life datasets: an HIV therapeutic vaccine trial (DALIA-1 trial), and data from a recent study on influenza and pneumococcal vaccines. In the DALIA-1 trial, TcGSA revealed a significant change in gene expression over time within 69 gene sets during vaccination, while a standard univariate individual gene analysis corrected for multiple testing and a standard Gene Set Enrichment Analysis (GSEA) for time series both failed to detect any significant pattern change over time. When applied to the second illustrative dataset, TcGSA identified 4 gene sets that were found to be linked with the influenza vaccine as well, although previous analyses had associated them with the pneumococcal vaccine only. In our simulation study, TcGSA exhibits good statistical properties and increased power compared to other approaches for analyzing time-course expression patterns of gene sets. The method is made available for the community through an R package.