    Detecting high-order interactions of single nucleotide polymorphisms using genetic programming

    Motivation: Not individual single nucleotide polymorphisms (SNPs), but high-order interactions of SNPs are assumed to be responsible for complex diseases such as cancer. Therefore, one of the major goals of genetic association studies concerned with such genotype data is the identification of these high-order interactions. This search is further impeded by the fact that these interactions are often explanatory for only a relatively small subgroup of patients. Most of the feature selection methods proposed in the literature, unfortunately, fail at this task, since they can either only identify individual variables or interactions of a low order, or try to find rules that are explanatory for a high percentage of the observations. In this paper, we present a procedure based on genetic programming and multi-valued logic that enables the identification of high-order interactions of categorical variables such as SNPs. This method, called GPAS (Genetic Programming for Association Studies), can not only be used for feature selection but can also be employed for discrimination. Results: In an application to the genotype data from the GENICA study, an association study concerned with sporadic breast cancer, GPAS is able to identify high-order interactions of SNPs leading to a considerably increased breast cancer risk for different subsets of patients that are not found by other feature selection methods. As an application to a subset of the HapMap data shows, GPAS is not restricted to association studies comprising a few dozen SNPs, but can also be employed to analyze whole-genome data.

    Statistical advances and challenges for analyzing correlated high dimensional SNP data in genomic study for complex diseases

    Recent advances in information technology in the biomedical sciences and other applied areas have created numerous large, diverse data sets with a high dimensional feature space, which provide a tremendous amount of information and new opportunities for improving the quality of human life. Meanwhile, the continuous arrival of new data also creates great challenges, since researchers must convert these raw data into scientific knowledge in order to benefit from them. Association studies of complex diseases using SNP data have become more and more popular in biomedical research in recent years. In this paper, we present a review of recent statistical advances and challenges for analyzing correlated high dimensional SNP data in genomic association studies for complex diseases. The review includes both general feature reduction approaches for high dimensional correlated data and more specific approaches for SNP data, which include unsupervised haplotype mapping, tag SNP selection, and supervised SNP selection using statistical testing/scoring, statistical modeling and machine learning methods, with an emphasis on how to identify interacting loci. Comment: Published at http://dx.doi.org/10.1214/07-SS026 in Statistics Surveys (http://www.i-journals.org/ss/) by the Institute of Mathematical Statistics (http://www.imstat.org)

    Bioinformatics challenges for genome-wide association studies

    Motivation: The sequencing of the human genome has made it possible to identify an informative set of >1 million single nucleotide polymorphisms (SNPs) across the genome that can be used to carry out genome-wide association studies (GWASs). The availability of massive amounts of GWAS data has necessitated the development of new biostatistical methods for quality control, imputation and analysis issues including multiple testing. This work has been successful and has enabled the discovery of new associations that have been replicated in multiple studies. However, it is now recognized that most SNPs discovered via GWAS have small effects on disease susceptibility and thus may not be suitable for improving health care through genetic testing. One likely explanation for the mixed results of GWAS is that the current biostatistical analysis paradigm is by design agnostic or unbiased in that it ignores all prior knowledge about disease pathobiology. Further, the linear modeling framework that is employed in GWAS often considers only one SNP at a time thus ignoring their genomic and environmental context. There is now a shift away from the biostatistical approach toward a more holistic approach that recognizes the complexity of the genotype–phenotype relationship that is characterized by significant heterogeneity and gene–gene and gene–environment interaction. We argue here that bioinformatics has an important role to play in addressing the complexity of the underlying genetic basis of common human diseases. The goal of this review is to identify and discuss those GWAS challenges that will require computational methods

    RFreak - An R-package for evolutionary computation

    RFreak is an R package providing a framework for evolutionary computation. By wrapping the functionality of an evolutionary algorithm kit written in Java, it offers an easy way to do evolutionary computation in R. In addition, application examples where an evolutionary approach is promising in computational statistics are included and described in this paper. The package thus further supports the use of evolutionary computation in computational statistics. Keywords: R, evolutionary algorithms, evolutionary computation, association study, robust regression

    Variants within the MMP3 gene are associated with achilles tendinopathy: possible interaction with the COL5A1 gene

    Objectives: Sequence variation within the COL5A1 and TNC genes is known to associate with Achilles tendinopathy. The primary aim of this case-control genetic association study was to investigate whether variants within the matrix metalloproteinase 3 (MMP3) gene also contribute to both Achilles tendinopathy and Achilles tendon rupture in a Caucasian population. A secondary aim was to establish whether variants within the MMP3 gene interact with the COL5A1 rs12722 variant to raise the risk of these pathologies. Methods: 114 subjects with symptoms of Achilles tendon pathology and 98 healthy controls were genotyped for MMP3 variants rs679620, rs591058 and rs650108. Results: As single markers, significant associations were found between the GG genotype of rs679620 (OR = 2.5, 95% CI 1.2 to 4.90, p = 0.010), the CC genotype of rs591058 (OR = 2.3, 95% CI 1.1 to 4.50, p = 0.023) and the AA genotype of rs650108 (OR = 4.9, 95% CI 1.0 to 24.1, p = 0.043) and risk of Achilles tendinopathy. The ATG haplotype (rs679620, rs591058, and rs650108) was under-represented in the tendinopathy group when compared to the control group (41% vs 53%, p = 0.038). Finally, the G allele of rs679620 and the T allele of COL5A1 rs12722 significantly interacted to raise risk of AT (p = 0.006). No associations were found between any of the MMP3 markers and Achilles tendon rupture. Conclusion: Variants within the MMP3 gene are associated with Achilles tendinopathy. Furthermore, the MMP3 gene variant rs679620 and the COL5A1 marker rs12722 interact to modify the risk of tendinopathy. These data further support a genetic contribution to a common sports-related injury.

    GPNN: Power Studies and Applications of a Neural Network Method for Detecting Gene-Gene Interactions in Studies of Human Disease

    The identification and characterization of genes that influence the risk of common, complex multifactorial disease primarily through interactions with other genes and environmental factors remains a statistical and computational challenge in genetic epidemiology. We have previously introduced a genetic programming optimized neural network (GPNN) as a method for optimizing the architecture of a neural network to improve the identification of gene combinations associated with disease risk. The goal of this study was to evaluate the power of GPNN for identifying high-order gene-gene interactions. We were also interested in applying GPNN to a real data analysis in Parkinson's disease.

    Statistical methods of SNP data analysis with applications

    Various statistical methods important for genetic analysis are considered and developed. In particular, we concentrate on multifactor dimensionality reduction (MDR), logic regression, random forests and stochastic gradient boosting. These methods and their new modifications, e.g., the MDR method with "independent rule", are used to study the risk of complex diseases such as cardiovascular diseases. The roles of certain combinations of single nucleotide polymorphisms and external risk factors are examined. The data analysis concerning ischemic heart disease and myocardial infarction was performed on the supercomputer SKIF "Chebyshev" of the Lomonosov Moscow State University.