2,154 research outputs found

    Mining Pure, Strict Epistatic Interactions from High-Dimensional Datasets: Ameliorating the Curse of Dimensionality

    Background: The interaction between loci to affect phenotype is called epistasis. It is strict epistasis if no proper subset of the interacting loci exhibits a marginal effect. For many diseases, it is likely that unknown epistatic interactions affect disease susceptibility. A difficulty when mining epistatic interactions from high-dimensional datasets concerns the curse of dimensionality. There are too many combinations of SNPs to perform an exhaustive search. A method that could locate strict epistasis without an exhaustive search can be considered the brass ring of methods for analyzing high-dimensional datasets. Methodology/Findings: A SNP pattern is a Bayesian network representing SNP-disease relationships. The Bayesian score for a SNP pattern is the probability of the data given the pattern, and has been used to learn SNP patterns. We identified a bound for the score of a SNP pattern. The bound provides an upper limit on the Bayesian score of any pattern that could be obtained by expanding a given pattern. We felt that the bound might enable the data to say something about the promise of expanding a 1-SNP pattern even when there are no marginal effects. We tested the bound using simulated datasets and semi-synthetic high-dimensional datasets obtained from GWAS datasets. We found that the bound was able to dramatically reduce the search time for strict epistasis. Using an Alzheimer's dataset, we showed that it is possible to discover an interaction involving the APOE gene based on its score because of its large marginal effect, but that the bound is most effective at discovering interactions without marginal effects. Conclusions/Significance: We conclude that the bound appears to ameliorate the curse of dimensionality in high-dimensional datasets. This is a very consequential result and could be pivotal in our efforts to reveal the dark matter of genetic disease risk from high-dimensional datasets. © 2012 Jiang, Neapolitan

    An associative classification based approach for detecting SNP-SNP interactions in high dimensional genome

    There have been many studies that depict genotype-phenotype relationships by identifying genetic variants associated with a specific disease. Researchers are focusing more attention on interactions between SNPs that are strongly associated with disease in the absence of main effects. In this context, a number of machine learning and data mining tools are applied to identify combinations of multi-locus SNPs in higher order data. However, none of the current models can identify useful SNP-SNP interactions for high dimensional genome data. Detecting these interactions is challenging due to bio-molecular complexities and computational limitations. The goal of this research was to implement associative classification and study its effectiveness for detecting epistasis in balanced and imbalanced datasets. The proposed approach was evaluated for two-locus epistatic interactions using simulated data. The datasets were generated for 5 different penetrance functions by varying heritability, minor allele frequency and sample size. In total, 23,400 datasets were generated and several experiments were conducted to identify the disease-causal SNP interactions. The classification accuracy of the proposed approach was compared with that of previous approaches. Though associative classification showed only a relatively small improvement in accuracy for balanced datasets, it outperformed existing approaches in higher order multi-locus interactions in imbalanced datasets

    Statistical advances and challenges for analyzing correlated high dimensional SNP data in genomic study for complex diseases

    Recent advances of information technology in biomedical sciences and other applied areas have created numerous large, diverse data sets with a high dimensional feature space, which provide a tremendous amount of information and new opportunities for improving the quality of human life. Meanwhile, the continuous arrival of new data also creates great challenges, because researchers must convert these raw data into scientific knowledge in order to benefit from them. Association studies of complex diseases using SNP data have become increasingly popular in biomedical research in recent years. In this paper, we present a review of recent statistical advances and challenges for analyzing correlated high dimensional SNP data in genomic association studies for complex diseases. The review includes both general feature reduction approaches for high dimensional correlated data and more specific approaches for SNP data, which include unsupervised haplotype mapping, tag SNP selection, and supervised SNP selection using statistical testing/scoring, statistical modeling and machine learning methods, with an emphasis on how to identify interacting loci. Comment: Published at http://dx.doi.org/10.1214/07-SS026 in Statistics Surveys (http://www.i-journals.org/ss/) by the Institute of Mathematical Statistics (http://www.imstat.org)

    Methods in statistical genomics

    The objective of this book is to describe procedures for analyzing genome-wide association studies (GWAS). Some of the material contains commentary and unpublished research; other chapters (Chapters 4 through 7) have been published in other journals. Each previously published chapter investigates a different genomics model, but all focus on identifying the strengths and limitations of various statistical procedures that have been applied to different GWAS scenarios.

    Discovering Higher-order SNP Interactions in High-dimensional Genomic Data

    In this thesis, a multifactor dimensionality reduction based method built on associative classification is employed to identify higher-order SNP interactions and so enhance the understanding of the genetic architecture of complex diseases. Further, this thesis explores the application of deep learning techniques to provide new clues for interaction analysis. The performance of the deep learning method is maximized by unifying deep neural networks with a random forest to achieve reliable interactions in the presence of noise

    Evaluation and extension of a kernel-based method for gene-gene interaction tests of common variants

    Interaction is likely to play a significant role in complex diseases, and various methods are available for identifying interactions between variants in genome-wide association studies (GWAS). Kernel-based variance component methods such as SKAT are flexible and computationally efficient methods for identifying marginal associations. A kernel-based variance component method, called the Gene-centric Gene-Gene Interaction with Smoothing-sPline ANOVA model (SPA3G) was proposed to identify gene-gene interactions for a quantitative trait. For interaction testing, the SPA3G method performs better than some SNP-based approaches under many scenarios. In this thesis, we evaluate the properties of the SPA3G method and extend SPA3G using alternative p-value approximations and interaction kernels. This thesis focuses on common variants only. Our simulation results show that the allele matching interaction kernel, combined with the method of moments p-value approximation, leads to inflated type I error in small samples. For small samples, we propose a Principal Component (PC)-based interaction kernel and computing p-values with a 3-moment adjustment that yield more appropriate type I error. We also propose a weighted PC kernel that has higher power than competing approaches when interaction effects are sparse. By combining the two proposed kernels, we develop omnibus methods that obtain near-optimal power in most settings. Finally, we illustrate how to analyze the interaction between selected gene pairs on the age at natural menopause (ANM) from the Framingham Heart Study
