Detecting high-order interactions of single nucleotide polymorphisms using genetic programming
Motivation: Not individual single nucleotide polymorphisms (SNPs), but high-order interactions of SNPs are assumed to be responsible for complex diseases such as cancer. Therefore, one of the major goals of genetic association studies concerned with such genotype data is the identification of these high-order interactions. This search is additionally impeded by the fact that these interactions are often explanatory for only a relatively small subgroup of patients. Unfortunately, most of the feature selection methods proposed in the literature fail at this task, since they either can identify only individual variables or low-order interactions, or try to find rules that are explanatory for a high percentage of the observations. In this paper, we present a procedure based on genetic programming and multi-valued logic that enables the identification of high-order interactions of categorical variables such as SNPs. This method, called GPAS (Genetic Programming for Association Studies), can be used not only for feature selection but also for discrimination. Results: In an application to the genotype data from the GENICA study, an association study concerned with sporadic breast cancer, GPAS is able to identify high-order interactions of SNPs leading to a considerably increased breast cancer risk for different subsets of patients that are not found by other feature selection methods. As an application to a subset of the HapMap data shows, GPAS is not restricted to association studies comprising several tens of SNPs, but can also be employed to analyze whole-genome data.
Statistical advances and challenges for analyzing correlated high dimensional SNP data in genomic study for complex diseases
Recent advances of information technology in biomedical sciences and other
applied areas have created numerous large diverse data sets with a high
dimensional feature space, which provide us a tremendous amount of information
and new opportunities for improving the quality of human life. Meanwhile, great
challenges are also created driven by the continuous arrival of new data that
requires researchers to convert these raw data into scientific knowledge in
order to benefit from it. Association studies of complex diseases using SNP
data have become more and more popular in biomedical research in recent years.
In this paper, we present a review of recent statistical advances and
challenges for analyzing correlated high dimensional SNP data in genomic
association studies for complex diseases. The review includes both general
feature reduction approaches for high dimensional correlated data and more
specific approaches for SNPs data, which include unsupervised haplotype
mapping, tag SNP selection, and supervised SNPs selection using statistical
testing/scoring, statistical modeling and machine learning methods with an
emphasis on how to identify interacting loci. Comment: Published at http://dx.doi.org/10.1214/07-SS026 in the Statistics
Surveys (http://www.i-journals.org/ss/) by the Institute of Mathematical
Statistics (http://www.imstat.org
Bioinformatics challenges for genome-wide association studies
Motivation: The sequencing of the human genome has made it possible to identify an informative set of >1 million single nucleotide polymorphisms (SNPs) across the genome that can be used to carry out genome-wide association studies (GWASs). The availability of massive amounts of GWAS data has necessitated the development of new biostatistical methods for quality control, imputation and analysis issues including multiple testing. This work has been successful and has enabled the discovery of new associations that have been replicated in multiple studies. However, it is now recognized that most SNPs discovered via GWAS have small effects on disease susceptibility and thus may not be suitable for improving health care through genetic testing. One likely explanation for the mixed results of GWAS is that the current biostatistical analysis paradigm is by design agnostic or unbiased in that it ignores all prior knowledge about disease pathobiology. Further, the linear modeling framework that is employed in GWAS often considers only one SNP at a time thus ignoring their genomic and environmental context. There is now a shift away from the biostatistical approach toward a more holistic approach that recognizes the complexity of the genotype–phenotype relationship that is characterized by significant heterogeneity and gene–gene and gene–environment interaction. We argue here that bioinformatics has an important role to play in addressing the complexity of the underlying genetic basis of common human diseases. The goal of this review is to identify and discuss those GWAS challenges that will require computational methods
RFreak-An R-package for evolutionary computation
RFreak is an R package providing a framework for evolutionary computation. By wrapping the functionality of an evolutionary algorithm kit written in Java, it offers an easy way to do evolutionary computation in R. In addition, this paper includes and describes application examples from computational statistics in which an evolutionary approach is promising. The package thus further supports the use of evolutionary computation in computational statistics. Keywords: R, evolutionary algorithms, evolutionary computation, association study, robust regression
Variants within the MMP3 gene are associated with Achilles tendinopathy: possible interaction with the COL5A1 gene
Objectives: Sequence variants within the COL5A1 and TNC genes are known to be associated with Achilles tendinopathy. The primary aim of this case-control genetic association study was to investigate whether variants within the matrix metalloproteinase 3 (MMP3) gene also contributed to both Achilles tendinopathy and Achilles tendon rupture in a Caucasian population. A secondary aim was to establish whether variants within the MMP3 gene interacted with the COL5A1 rs12722 variant to raise the risk of these pathologies.
Methods: 114 subjects with symptoms of Achilles tendon pathology and 98 healthy controls were genotyped for MMP3 variants rs679620, rs591058 and rs650108.
Results: As single markers, significant associations were found between the GG genotype of rs679620 (OR = 2.5, 95% CI 1.2 to 4.90, p = 0.010), the CC genotype of rs591058 (OR = 2.3, 95% CI 1.1 to 4.50, p = 0.023) and the AA genotype of rs650108 (OR = 4.9, 95% CI 1.0 to 24.1, p = 0.043) and risk of Achilles tendinopathy. The ATG haplotype (rs679620, rs591058, and rs650108) was under-represented in the tendinopathy group when compared to the control group (41% vs 53%, p = 0.038). Finally, the G allele of rs679620 and the T allele of COL5A1 rs12722 significantly interacted to raise the risk of Achilles tendinopathy (p = 0.006). No associations were found between any of the MMP3 markers and Achilles tendon rupture.
Conclusion: Variants within the MMP3 gene are associated with Achilles tendinopathy. Furthermore, the MMP3 gene variant rs679620 and the COL5A1 marker rs12722 interact to modify the risk of tendinopathy. These data further support a genetic contribution to a common sports related injur
GPNN: Power Studies and Applications of a Neural Network Method for Detecting Gene-Gene Interactions in Studies of Human Disease
The identification and characterization of genes that influence the risk of common, complex multifactorial disease primarily through interactions with other genes and environmental factors remains a statistical and computational challenge in genetic epidemiology. We have previously introduced a genetic programming optimized neural network (GPNN) as a method for optimizing the architecture of a neural network to improve the identification of gene combinations associated with disease risk. The goal of this study was to evaluate the power of GPNN for identifying high-order gene-gene interactions. We were also interested in applying GPNN to a real data analysis in Parkinson's disease
Statistical methods of SNP data analysis with applications
Various statistical methods important for genetic analysis are considered and
developed. Namely, we concentrate on the multifactor dimensionality reduction,
logic regression, random forests and stochastic gradient boosting. These
methods and their new modifications, e.g., the MDR method with "independent
rule", are used to study the risk of complex diseases such as cardiovascular
ones. The roles of certain combinations of single nucleotide polymorphisms and
external risk factors are examined. To perform the data analysis concerning the
ischemic heart disease and myocardial infarction the supercomputer SKIF
"Chebyshev" of the Lomonosov Moscow State University was employed
GenEpi: gene-based epistasis discovery using machine learning.
Background: Genome-wide association studies (GWAS) provide a powerful means to identify associations between genetic variants and phenotypes. However, GWAS techniques for detecting epistasis, the interactions between genetic variants associated with phenotypes, are still limited. We believe that developing an efficient and effective GWAS method to detect epistasis will be key to discovering sophisticated pathogenesis, which is especially important for complex diseases such as Alzheimer's disease (AD). Results: In this regard, this study presents GenEpi, a computational package that uncovers epistasis associated with phenotypes using the proposed machine learning approach. GenEpi identifies both within-gene and cross-gene epistasis through a two-stage modeling workflow. In both stages, GenEpi adopts two-element combinatorial encoding when producing features and constructs the prediction models by L1-regularized regression with stability selection. The simulated data showed that GenEpi outperforms other widely used methods at detecting the ground-truth epistasis. For real data, this study uses AD as an example to reveal the capability of GenEpi in finding disease-related variants and variant interactions that show both biological meaning and predictive power. Conclusions: The results on simulation data and AD demonstrated that GenEpi can detect the epistasis associated with phenotypes effectively and efficiently. The released package can be generalized to facilitate studies of many complex diseases in the near future