3,750 research outputs found

    A robust mean and variance test with application to high-dimensional phenotypes

    Most studies of continuous health-related outcomes examine differences in mean levels (location) of the outcome by exposure. However, identifying effects on the variability (scale) of an outcome, and combining tests of mean and variability (location-and-scale), could provide additional insights into biological mechanisms. A joint test could improve power for studies of high-dimensional phenotypes, such as epigenome-wide association studies of DNA methylation at CpG sites. One possible cause of heterogeneity of variance is a variable interacting with exposure in its effect on outcome, so a joint test of mean and variability could help in the identification of effect modifiers. Here, we review a scale test, based on the Brown-Forsythe test, for analysing variability of a continuous outcome with respect to both categorical and continuous exposures, and develop a novel joint location-and-scale score (JLSsc) test. These tests were compared to alternatives in simulations and used to test associations of mean and variability of DNA methylation with gender and gestational age using data from the Accessible Resource for Integrated Epigenomics Studies (ARIES). In simulations, the Brown-Forsythe and JLSsc tests retained correct type I error rates when the outcome was not normally distributed, in contrast to the other approaches tested, which all had inflated type I error rates. These tests also identified > 7500 CpG sites for which either mean or variability in cord blood methylation differed according to gender or gestational age. The Brown-Forsythe test and JLSsc are robust tests that can be used to detect associations not solely driven by a mean effect.
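
    The scale test named in the abstract above can be illustrated for the simplest case of a categorical exposure. The following is a minimal sketch of the classical Brown-Forsythe statistic (a one-way ANOVA F statistic computed on absolute deviations from each group's median), not the authors' implementation; it does not cover continuous exposures or the JLSsc test. The function name and the toy data are hypothetical.

```python
from statistics import mean, median


def brown_forsythe_statistic(groups):
    """Return (F, df_between, df_within) for the classical Brown-Forsythe test.

    `groups` is a list of lists, one list of outcome values per exposure group.
    """
    # Step 1: absolute deviation of each observation from its own group median.
    z = [[abs(y - median(g)) for y in g] for g in groups]

    k = len(z)                              # number of groups
    n_total = sum(len(zi) for zi in z)      # total number of observations
    group_means = [mean(zi) for zi in z]
    grand_mean = sum(sum(zi) for zi in z) / n_total

    # Step 2: one-way ANOVA sums of squares on the deviations.
    between = sum(len(zi) * (m - grand_mean) ** 2
                  for zi, m in zip(z, group_means))
    within = sum((v - m) ** 2
                 for zi, m in zip(z, group_means) for v in zi)

    f_stat = (n_total - k) / (k - 1) * between / within
    return f_stat, k - 1, n_total - k


# Hypothetical toy data: same median in both groups, different spread.
unexposed = [1, 2, 3, 4, 5]
exposed = [-10, 0, 3, 6, 16]
f_stat, df1, df2 = brown_forsythe_statistic([unexposed, exposed])
print(f_stat, df1, df2)
```

    Under the null hypothesis of equal variability, the statistic is referred to an F distribution with the returned degrees of freedom. Using the median rather than the mean as the centre is what gives the test its robustness to non-normal outcomes.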

    Analysing multiple types of molecular profiles simultaneously: Connecting the needles in the haystack

    Background: It has been shown that a random-effects framework can be used to test the association between a gene's expression level and the number of DNA copies of a set of genes. This gene-set modelling framework was later applied to find associations between mRNA expression and microRNA expression, by defining the gene sets using target prediction information. Methods and results: Here, we extend the model introduced by Menezes et al. (2009) to consider the effect of not just copy number, but also of other molecular profiles, such as methylation changes and loss-of-heterozygosity (LOH), on gene expression levels. We again consider sets of measurements, to improve the robustness of results and increase the power to find associations. Our approach can be used genome-wide to find associations and yields a test to help separate true associations from noise. We apply our method to colon and to breast cancer samples, for which genome-wide copy number, methylation and gene expression profiles are available. Our findings include interesting gene expression-regulating mechanisms, which may involve only one of copy number or methylation, or both for the same samples. We are even able to find effects due to different molecular mechanisms in different samples. Conclusions: Our method can equally well be applied to cases where other types of molecular (high-dimensional) data are collected, such as LOH, SNP genotype and microRNA expression data. Computationally efficient, it represents a flexible and powerful tool to study associations between high-dimensional datasets. The method is freely available via the SIM Bioconductor package.

    Statistical Inference for High-Dimensional Genetic Data

    This dissertation focuses on three types of high-dimensional genetic data: protein sequences, DNA methylation data, and microRNA expression data. The four major parts are presented in Chapters 2-5, respectively. In Chapter 2, we develop a new clustering method for protein sequences. First, we reduce the dimensionality based on entropy. Second, the sequences are clustered using the Hamming distance vectors of chosen sites. We apply this new method to an influenza A H3N2 HA data set, which consists of 1960 viral sequences. Our method aggregates these sequences into 23 clusters. Based on the temporal evolution pattern of these clusters, we find that the dominant clusters change from time to time and are often different from the clusters housing vaccine strains. In Chapter 3, we conduct systematic simulation studies and real data analysis to compare the performance of seven statistical tests of the equal-variance hypothesis. Our results show that the Brown-Forsythe test and the trimmed-mean-based Levene's test perform better on DNA methylation data than the other tests. Detection of differential DNA methylation and differential variability has received a lot of attention in the literature. In Chapter 4, we derive the asymptotic distribution of a joint score test (AW), proposed by Anh and Wang (2013). Furthermore, we propose three improved joint score tests, namely iAW.Lev, iAW.BF, and iAW.TM. Systematic simulation studies show that at least one of the proposed tests performs better than the existing tests for data with outliers or from non-normal distributions. The real data analyses demonstrate that the three proposed tests have higher true validation rates than the existing tests. Besides DNA methylation, microRNA regulation is another important epigenetic mechanism. In Chapter 5, we propose a novel model-based clustering method to detect differentially variable (DV) miRNAs. We impose biologically meaningful structures on covariance matrices for each cluster of miRNAs. Simulation studies show that the proposed method performs better than other model-based methods when miRNA expression levels are from a multivariate normal distribution. In real data analysis, the proposed method has a higher validation rate than other methods.
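
    The two preprocessing steps described for Chapter 2 (entropy-based dimension reduction, then Hamming distances on the chosen sites) can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not give the entropy threshold, the exact form of the "Hamming distance vectors", or the clustering algorithm, so the Shannon-entropy cut-off, the pairwise distance function, the function names, and the toy sequences below are all hypothetical.

```python
from collections import Counter
from math import log2


def site_entropy(column):
    """Shannon entropy (in bits) of the residues observed at one aligned site."""
    n = len(column)
    counts = Counter(column)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def informative_sites(sequences, min_entropy):
    """Indices of aligned sites whose entropy is at least `min_entropy`.

    `min_entropy` is an assumed tuning parameter, not a value from the source.
    """
    return [i for i, column in enumerate(zip(*sequences))
            if site_entropy(column) >= min_entropy]


def hamming_on_sites(seq_a, seq_b, sites):
    """Number of chosen sites at which two aligned sequences differ."""
    return sum(seq_a[i] != seq_b[i] for i in sites)


# Hypothetical toy alignment (equal-length sequences).
sequences = ["AAGT", "AAGC", "ATGT", "ATGC"]
sites = informative_sites(sequences, min_entropy=0.5)
print(sites)
print(hamming_on_sites(sequences[0], sequences[3], sites))
```

    Conserved sites have zero entropy and carry no information for separating sequences, so dropping them shrinks the problem before any distance-based clustering is applied.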

    Genomic, Pathway Network, and Immunologic Features Distinguishing Squamous Carcinomas

    This integrated, multiplatform PanCancer Atlas study co-mapped and identified distinguishing molecular features of squamous cell carcinomas (SCCs) from five sites associated with smokin

    Association Analysis Using Set-Based Approaches in the Post-GWAS Era

    Genotyping arrays have greatly facilitated genetic epidemiological studies into genetic risk factors for numerous complex diseases such as psychiatric disorders. The use of genome-wide association analysis (GWAS) is unequivocally established. More recently, DNA methylation arrays have enabled genome-wide profiling of the methylome, in addition to contemporary genetic epidemiology study design. An example of one such study is the Genetics of Lipid Lowering Drugs and Diet Network (GOLDN) Lipidomics Study, which identified methylation markers (CpG markers) and single nucleotide polymorphisms (SNPs), associated with the change in triglyceride levels after drug intervention. Genotyping and methylation arrays assay several hundred thousand markers; however, single-marker association analysis suffers greatly from the burden of multiple testing. Set-based (SNP or CpG set) association approaches offer great flexibility, thus allowing the joint testing of a set of variants. For instance, a polygenic risk score (PRS) is a set-based approach, which, in addition to the strongly associated SNPs identified by large-scale GWAS, recruits SNPs with moderate to weak effects. The genotype information of the SNP set in the PRS is taken from an independent sample (target sample) and is then weighted by individual SNP effects derived from a relevant GWAS performed on a separate sample (discovery sample) into a cumulative score for each individual in the target sample. The resulting score, based on a SNP set or the PRS, is then regressed on the target phenotype. Such a regression model is evaluated by the amount of variance explained (R2) by the PRS in the target phenotype. Another strategy of set-based association analysis is kernel machine regression (KMR): a semi-parametric regression approach, in which the effects of markers within a set (CpG set or SNP set) are modelled via a kernel function and thus evaluated by a single-component variance test. 
A kernel function computes pairwise genomic similarity between the individuals, that is, the inner product of a set of variants under analysis, which may comprise a gene or a biological pathway. For my first article, I performed a simulation study to evaluate the performance of PRS in correlated discovery and target traits by considering various sample sizes of the target sample, namely n=200, 500, and 1000. The PRS for correlated traits can be viewed as a situation of calculating schizophrenia-PRS for psychosocial endophenotypes such as the global assessment of functioning (GAF) score or the positive and negative syndrome scale (PANSS) score. Considering such a situation, I simulated four correlated target traits that had varying degrees of correlation (r2) with the discovery trait, i.e., r2 = 1.0, 0.8, 0.6, and 0.4. The results demonstrated that the average R2 estimates by the PRS decreased roughly by the square of the correlation between the target traits. In addition, the range of estimated R2 was most inflated for the target sample size n=200. Thus, the simulation findings alert researchers conducting clinical studies with endophenotypes to the fact that they need to pay attention to two important factors: first, the sample size of the target trait and, second, the shared amount of genetic correlation between the target and discovery traits. In my second article, I implemented a KMR approach for set-based association testing of a CpG set. KMR has been successfully employed on SNP sets. In preparing the second article, I used real and simulated datasets (based on a real dataset) provided by the Genetic Analysis Workshop 20 (GAW20) from the GOLDN study. GOLDN is a longitudinal study with individuals recruited from pedigrees. In my analysis, I only used independent individuals, which restricted the sample size in the real and simulated datasets to n<200. CpG sets were devised using the evidence of association reported by the GOLDN study in the real data set. 
For simulated datasets, true causal CpGs were provided by GAW20. Thus, I formulated candidate genomic regions of varying lengths while keeping the associated CpG(s) inside the region. The results replicated the evidence of association reported by GOLDN in the real data and, albeit nominally, in the simulated datasets. Moreover, in the simulated data, causal SNPs exert their full effect on the phenotypes when the causal CpG loci have no methylation (B-value=0). Thus, I also considered modelling an interaction term along with the main effects. The results yielded significant association. As part of the discussion, simulation results on the performance of the linear kernel for a CpG set with original (B-values) and logit-transformed methylation values (M-values) indicated that logit transformation results in a loss of power. There, I also considered analysing an additive kernel that combines the genotype kernel and the methylation kernel and then tests for association with the phenotype. The initial simulations suggest that an additive kernel with a CpG set including hypo-, semi-, and hypermethylated sites simultaneously might not improve the model over only including a SNP set. However, it appears fruitful to investigate further the situation in which only one type of methylation state is present in a CpG set.
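
    The polygenic risk score construction described in the abstract above (target-sample genotypes weighted by discovery-sample SNP effects, then evaluated by the variance explained, R2) can be sketched as below. This is a minimal illustration, not the author's pipeline: it assumes genotypes coded as allele counts (0, 1, 2), a simple linear regression with a single predictor, and no covariates, SNP pruning, or p-value thresholding. All function names and numbers are hypothetical.

```python
from statistics import mean


def polygenic_risk_score(genotypes, weights):
    """Cumulative score per individual: sum over SNPs of allele count x effect.

    `genotypes` holds one list of allele counts (0, 1, or 2) per target-sample
    individual; `weights` holds the per-SNP effects from the discovery GWAS.
    """
    return [sum(g * w for g, w in zip(person, weights)) for person in genotypes]


def r_squared(score, phenotype):
    """Variance explained (R2) in a simple linear regression with one predictor.

    With a single predictor, R2 equals the squared Pearson correlation.
    """
    ms, mp = mean(score), mean(phenotype)
    sxy = sum((s - ms) * (p - mp) for s, p in zip(score, phenotype))
    sxx = sum((s - ms) ** 2 for s in score)
    syy = sum((p - mp) ** 2 for p in phenotype)
    return sxy ** 2 / (sxx * syy)


# Hypothetical toy target sample: 4 individuals, 3 SNPs.
genotypes = [[0, 1, 2], [1, 1, 0], [2, 0, 1], [0, 0, 0]]
weights = [0.5, -0.2, 0.1]          # assumed discovery-sample effect sizes
phenotype = [1.2, 1.9, 3.4, 0.8]    # assumed target trait values

scores = polygenic_risk_score(genotypes, weights)
print(scores)
print(r_squared(scores, phenotype))
```

    With a target sample this small, the R2 estimate is very unstable, which is in line with the abstract's remark that the range of estimated R2 is widest for the smallest target sample size.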