
    A processual model for functional analyses of carcinogenesis in the prospective cohort design

    Published version also available at http://dx.doi.org/10.1016/j.mehy.2015.07.006
    Traditionally, the prospective design has been chosen for risk factor analyses of lifestyle and cancer, mainly using estimation by survival analysis methods. With new technologies, epidemiologists can expand their prospective studies to include functional genomics, either as transcriptomics (mRNA and microRNA) or as epigenetics, in blood or other biological materials. The novel functional analyses should not be assessed using classical survival analyses, since the main goal is not risk estimation but the analysis of functional genomics as part of the dynamic carcinogenic process over time, i.e., a "processual" approach. In the risk factor model, time to event is analysed as a function of exposure variables known at the start of follow-up (fixed covariates) or changing over the follow-up period (time-dependent covariates). In the processual model, transcriptomics or epigenetics are considered as functions of time and exposures. The success of this novel approach depends on the development of new statistical methods capable of describing and analysing the time-dependent curves or trajectories for tens of thousands of genes simultaneously. This approach also focuses on multilevel or integrative analyses, which introduces novel statistical methods into epidemiology. The processual approach, as part of systems epidemiology, might in the near future represent an alternative to in vitro studies using human biological material for understanding the mechanisms and pathways involved in carcinogenesis.

    A new statistical method for curve group analysis of longitudinal gene expression data illustrated for breast cancer in the NOWAC postgenome cohort as a proof of principle

    Background: The understanding of changes in temporal processes related to human carcinogenesis is limited. One approach for prospective functional genomic studies is to compile trajectories of differential expression of genes, based on measurements from many case-control pairs. We propose a new statistical method that does not assume any parametric shape for the gene trajectories. Methods: The trajectory of a gene is defined as the curve representing the changes in gene expression levels in the blood as a function of time to cancer diagnosis. In a nested case–control design it consists of differences in gene expression levels between cases and controls. Genes can be grouped into curve groups, each curve group corresponding to genes with a similar development over time. The proposed new statistical approach is based on a set of hypothesis tests that can determine whether or not there is development in gene expression levels over time, and whether this development varies among different strata. Curve group analysis may reveal significant differences in gene expression levels over time among the different strata considered. This new method was applied as a "proof of concept" to breast cancer in the Norwegian Women and Cancer (NOWAC) postgenome cohort, using blood samples collected prospectively that were specifically preserved for transcriptomic analyses (PAX tube). Cohort members diagnosed with invasive breast cancer through 2009 were identified through linkage to the Cancer Registry of Norway, and for each case a random control from the postgenome cohort was also selected, matched by birth year and time of blood sampling, to create a case-control pair. 
After exclusions, 441 case-control pairs were available for analyses, in which we considered strata of lymph node status at time of diagnosis and time of diagnosis with respect to breast cancer screening visits. Results: The development of gene expression levels in the NOWAC postgenome cohort varied in the last years before breast cancer diagnosis, and this development differed by lymph node status and participation in the Norwegian Breast Cancer Screening Program. The differences among the investigated strata appeared larger in the year before breast cancer diagnosis than in earlier years. Conclusions: This approach shows good properties in terms of statistical power and type 1 error under minimal assumptions. When applied to a real data set, it was able to discriminate between groups of genes with similar non-linear patterns before diagnosis.
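
    The trajectory definition in the abstract above (case-minus-control differences as a function of time to diagnosis) can be illustrated with a minimal sketch. This is not the authors' hypothesis-testing procedure: the function name `difference_trajectory`, the one-year binning, and the toy expression values are all assumptions made for illustration.

```python
from statistics import mean


def difference_trajectory(pairs, bin_width=1.0):
    """Average case-minus-control expression per time-to-diagnosis bin.

    pairs: iterable of (years_to_diagnosis, case_expr, control_expr).
    Returns a dict mapping each lower bin edge to the mean difference.
    The binning is an illustrative choice, not part of the published method.
    """
    bins = {}
    for years, case_expr, control_expr in pairs:
        edge = (years // bin_width) * bin_width
        bins.setdefault(edge, []).append(case_expr - control_expr)
    return {edge: mean(diffs) for edge, diffs in sorted(bins.items())}


# Hypothetical log2 expression values for one gene in five matched pairs.
toy_pairs = [
    (0.4, 8.9, 8.1),
    (0.8, 8.7, 8.1),
    (1.5, 8.4, 8.2),
    (2.2, 8.2, 8.2),
    (2.9, 8.3, 8.1),
]
trajectory = difference_trajectory(toy_pairs)
print(trajectory)
```

    In this toy data the mean difference is largest in the bin closest to diagnosis. Genes whose binned curves have a similar shape would be candidates for the same curve group.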

    Survival models with preclustered gene groups as covariates

    Background: An important application of high-dimensional gene expression measurements is risk prediction and the interpretation of the variables in the resulting survival models. A major problem in this context is the typically large number of genes compared to the number of observations (individuals). Feature selection procedures can generate predictive models with high prediction accuracy and, at the same time, low model complexity. However, interpretability of the resulting models is still limited because little is known about many of the selected genes. Thus, we summarize genes as gene groups defined by the hierarchically structured Gene Ontology (GO) and include these gene groups as covariates in the hazard regression models. Since expression profiles within GO groups are often heterogeneous, we present a new method to obtain subgroups with coherent patterns. We apply preclustering to genes within GO groups according to the correlation of their gene expression measurements. Results: We compare Cox models for modeling disease-free survival times of breast cancer patients. Besides classical clinical covariates, we consider genes, GO groups and preclustered GO groups as additional genomic covariates. Survival models with preclustered gene groups as covariates have prediction accuracy similar to that of models built only with single genes or GO groups. Conclusions: The preclustering information enables a more detailed analysis of the biological meaning of covariates selected in the final models. Compared to models built only with single genes, there is additional functional information contained in the GO annotation, and compared to models using GO groups as covariates, the preclustering yields coherent representative gene expression profiles.
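
    The preclustering idea above (splitting a heterogeneous GO group into subgroups of correlated genes, then summarizing each subgroup by one profile) can be sketched as follows. The abstract does not specify the clustering algorithm, so the greedy seed-based rule, the 0.8 correlation threshold, and the names `precluster` and `cluster_profile` are assumptions chosen only to keep the example short.

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation of two equally long profiles (non-constant inputs)."""
    mx, my = mean(x), mean(y)
    cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (pstdev(x) * pstdev(y))


def precluster(profiles, min_corr=0.8):
    """Greedy grouping: a gene joins the first cluster whose seed gene
    it correlates with at min_corr or above, else it seeds a new cluster.
    This rule is an illustrative stand-in, not the published method."""
    clusters = []
    for gene, values in profiles.items():
        for cluster in clusters:
            if pearson(values, profiles[cluster[0]]) >= min_corr:
                cluster.append(gene)
                break
        else:
            clusters.append([gene])
    return clusters


def cluster_profile(profiles, cluster):
    """Mean expression per sample across the genes of one cluster."""
    return [mean(column) for column in zip(*(profiles[g] for g in cluster))]


# Hypothetical expression values for four genes of one GO group, four samples.
go_group = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [1.1, 2.1, 2.9, 4.2],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [3.9, 3.2, 1.8, 1.1],
}
clusters = precluster(go_group)
profile = cluster_profile(go_group, clusters[0])
print(clusters, profile)
```

    Each cluster profile could then enter a hazard regression model as a single covariate in place of the whole GO group.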

    Survival prediction from clinico-genomic models - a comparative study

    Background: Survival prediction from high-dimensional genomic data is an active field in today's medical research. Most of the proposed prediction methods make use of genomic data alone without considering established clinical covariates that often are available and known to have predictive value. Recent studies suggest that combining clinical and genomic information may improve predictions, but there is a lack of systematic studies on the topic. Also, for the widely used Cox regression model, it is not obvious how to handle such combined models. Results: We propose a way to combine classical clinical covariates with genomic data in a clinico-genomic prediction model based on the Cox regression model. The prediction model is obtained by a simultaneous use of both types of covariates, but applying dimension reduction only to the high-dimensional genomic variables. We describe how this can be done for seven well-known prediction methods: variable selection, unsupervised and supervised principal components regression and partial least squares regression, ridge regression, and the lasso. We further perform a systematic comparison of the performance of prediction models using clinical covariates only, genomic data only, or a combination of the two. The comparison is done using three survival data sets containing both clinical information and microarray gene expression data. Matlab code for the clinico-genomic prediction methods is available at http://www.med.uio.no/imb/stat/bmms/software/clinico-genomic/. Conclusions: Based on our three data sets, the comparison shows that established clinical covariates will often lead to better predictions than what can be obtained from genomic data alone. In the cases where the genomic models are better than the clinical, ridge regression is used for dimension reduction. 
We also find that the clinico-genomic models tend to outperform the models based on only genomic data. Further, clinico-genomic models combined with ridge regression give better predictions for all three data sets than models based on the clinical covariates alone.

    Investigating the prediction ability of survival models based on both clinical and omics data: two case studies

    In the biomedical literature, numerous prediction models for clinical outcomes have been developed based either on clinical data or, more recently, on high-throughput molecular data (omics data). Prediction models based on both types of data, however, are less common, although some recent studies suggest that a suitable combination of clinical and molecular information may lead to models with better predictive abilities. This is probably because it is not straightforward to combine data with different characteristics and dimensions (poorly characterized high-dimensional omics data, well-investigated low-dimensional clinical data). In this paper we analyze two publicly available datasets, related to breast cancer and neuroblastoma respectively, in order to show some possible ways to combine clinical and omics data into a prediction model for a time-to-event outcome. Different strategies and statistical methods are exploited. The results are compared and discussed according to different criteria, including the discriminative ability of the models, computed on a validation dataset.

    A selection operator for summary association statistics reveals allelic heterogeneity of complex traits

    A general objective of genetic studies is to understand the genetic basis of complex traits such as height, body mass index (BMI), and disease endpoints. Such research has been facilitated by the completion of the Human Genome Project and the development of high-throughput technologies. With the help of high-throughput genotyping and sequencing technologies, millions of genetic markers can be measured for each individual. The most widely used strategy to detect associations between genetic variants and a complex trait is the genome-wide association study (GWAS). Because the genetic architecture of most complex traits is highly polygenic, the signal-to-noise ratio is usually tiny. Thus, especially in human populations, GWAS often requires large samples to obtain sufficient power. Unfortunately, given the restrictions on sharing individual-level data, it is often not feasible to pool data from different cohorts. Despite that, in each cohort it is possible to report and share GWAS summary statistics, such as sample sizes, allele frequencies, estimates of genetic effect sizes, and their standard errors for the genetic markers across the genome. Therefore, one recent focus in statistical methodology development for genetic studies has been on meta-analysis techniques using summary-level data. The objective of this thesis is to develop novel statistical genetics methods based on GWAS summary statistics and to apply these methods to better understand the genetic architecture underlying complex traits. In Study I, we developed a Selection Operator for JOint analyzing multiple SNPs (SOJO). We mathematically proved and empirically showed that the least absolute shrinkage and selection operator (LASSO) could be achieved using GWAS summary-level data. Compared to the stepwise selection procedures, SOJO performs better in variable selection. 
SOJO is useful for detecting additional variants with independent effects and assessing the magnitude of allelic heterogeneity within loci. In Study II, we developed a High-Definition Likelihood (HDL) method to improve the accuracy of genetic correlation estimation using GWAS summary statistics. Compared to the state-of-the-art method LD Score regression (LDSC), HDL achieves higher statistical power to detect genetic correlations between phenotypes by fully accounting for linkage disequilibrium (LD) information across the genome. In Study III, we introduced a four-level strategy for replication of loci detected by multi-trait GWAS methods. The four methods provide different degrees of replication strength, useful for providing additional evidence when a locus has been discovered and replicated by multivariate analysis of variance (MANOVA) or other multi-trait methods. The replication methods only require summary association statistics and are straightforward to apply to multi-trait GWAS analyses. In Study IV, using GWAS summary statistics, we developed a method named Genetic Correlation Contrast for Causality (G3C) as a more robust test for the existence and direction of causal relationships between phenotypes. In contrast to Mendelian Randomization (MR), G3C does not rely on the assumption of no horizontal pleiotropy. G3C takes full advantage of genome-wide genetic association data and accounts for underlying genetic correlations between complex traits.
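
    The Study I claim, that a LASSO fit can be obtained from summary-level data, rests on a general identity that a short sketch can make concrete. Assuming standardized genotypes and phenotype, the LASSO objective depends on the data only through the LD (correlation) matrix R = X'X/n and the marginal effects b = X'y/n, so coordinate descent needs nothing else. The code below illustrates that identity only; it is not the SOJO implementation, and the function names and toy numbers are hypothetical.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_from_summary(ld, marginal, lam, n_iter=200, tol=1e-12):
    """Coordinate descent for 0.5*b'Rb - m'b + lam*|b|_1.

    ld: LD (correlation) matrix R as a list of lists.
    marginal: marginal effect estimates m, assumed to be on the
    standardized scale so that m = X'y / n.
    """
    p = len(marginal)
    beta = [0.0] * p
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            # Marginal effect of SNP j minus what correlated SNPs already explain.
            partial = marginal[j] - sum(
                ld[j][k] * beta[k] for k in range(p) if k != j
            )
            updated = soft_threshold(partial, lam) / ld[j][j]
            max_change = max(max_change, abs(updated - beta[j]))
            beta[j] = updated
        if max_change < tol:
            break
    return beta


# Toy locus: two SNPs in strong LD (r = 0.9); only the first is causal,
# so the second SNP's marginal effect (0.45) is pure LD spillover.
ld_matrix = [[1.0, 0.9], [0.9, 1.0]]
marginal_effects = [0.5, 0.45]
beta = lasso_from_summary(ld_matrix, marginal_effects, lam=0.05)
print(beta)
```

    In this toy locus the second SNP is dropped even though its marginal effect is sizeable, which is the kind of joint selection that separates independent signals within a locus from signals explained by LD.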

    SignS: a parallelized, open-source, freely available, web-based tool for gene selection and molecular signatures for survival and censored data

    Background: Censored data are increasingly common in many microarray studies that attempt to relate gene expression to patient survival. Several new methods have been proposed in the last two years. Most of these methods, however, are not available to biomedical researchers, leading to many re-implementations from scratch of ad hoc and suboptimal approaches to survival data. Results: We have developed SignS (Signatures for Survival data), an open-source, freely available, web-based tool and R package for gene selection, building molecular signatures, and prediction with survival data. SignS implements four methods which, according to existing reviews, perform well and, by being of a very different nature, offer complementary approaches. We use parallel computing via MPI, leading to large decreases in user waiting time. Cross-validation is used to assess predictive performance and stability of solutions, the latter an issue of increasing concern given that there are often several solutions with similar predictive performance. Biological interpretation of results is enhanced because genes and signatures in models can be sent to other freely available online tools for examination of PubMed references, GO terms, and KEGG and Reactome pathways of selected genes. Conclusion: SignS is the first web-based tool for survival analysis of expression data, and one of the very few with biomedical researchers as target users. SignS is also one of the few bioinformatics web-based applications to extensively use parallelization, including fault tolerance and crash recovery. Because of its combination of implemented methods, use of parallel computing, code availability, and links to additional databases, SignS is a unique tool and will be of immediate relevance to biomedical researchers, biostatisticians and bioinformaticians.

    Transcriptomic signals in blood prior to lung cancer focusing on time to diagnosis and metastasis

    Recent studies have indicated that there are functional genomic signals that can be detected in blood years before cancer diagnosis. This study aimed to assess gene expression in prospective blood samples from the Norwegian Women and Cancer cohort, focusing on time to lung cancer diagnosis and metastatic cancer using a nested case–control design. We employed several approaches to statistically analyze the data, and the methods indicated that the case–control differences were subtle but most distinguishable in metastatic case–control pairs in the period 0–3 years prior to diagnosis. The genes of interest, along with estimated blood cell populations, could indicate disruption of immunological processes in blood. The genes identified from approaches focusing on alterations with time to diagnosis were distinct from those focusing on the case–control differences. Our results support the view that explorative analyses of prospective blood samples could indicate circulating signals of disease-related processes.