3,481 research outputs found

    Statistical Modeling of Epistasis and Linkage Decay using Logic Regression

    Get PDF
    Logic regression has been recognized as a tool that can identify and model non-additive genetic interactions using Boolean logic groups. Logic regression, TASSEL-GLM and SAS-GLM were compared for analytical precision using a previously characterized model system to identify the best genetic model explaining epistatic interaction of vernalization-sensitivity in barley. A genetic model containing two molecular markers identified in vernalization response in barley was selected using logic regression while both TASSEL-GLM and SAS-GLM included spurious associations in their models. The results also suggest the logic regression can be used to identify dominant/recessive relationships between epistatic alleles through its use of conjugate
operators

    Bayesian model search and multilevel inference for SNP association studies

    Full text link
    Technological advances in genotyping have given rise to hypothesis-based association studies of increasing scope. As a result, the scientific hypotheses addressed by these studies have become more complex and more difficult to address using existing analytic methodologies. Obstacles to analysis include inference in the face of multiple comparisons, complications arising from correlations among the SNPs (single nucleotide polymorphisms), choice of their genetic parametrization and missing data. In this paper we present an efficient Bayesian model search strategy that searches over the space of genetic markers and their genetic parametrization. The resulting method for Multilevel Inference of SNP Associations, MISA, allows computation of multilevel posterior probabilities and Bayes factors at the global, gene and SNP level, with the prior distribution on SNP inclusion in the model providing an intrinsic multiplicity correction. We use simulated data sets to characterize MISA's statistical power, and show that MISA has higher power to detect association than standard procedures. Using data from the North Carolina Ovarian Cancer Study (NCOCS), MISA identifies variants that were not identified by standard methods and have been externally ``validated'' in independent studies. We examine sensitivity of the NCOCS results to prior choice and method for imputing missing data. MISA is available in an R package on CRAN.Comment: Published in at http://dx.doi.org/10.1214/09-AOAS322 the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org

    Statistical methods of SNP data analysis with applications

    Get PDF
    Various statistical methods important for genetic analysis are considered and developed. Namely, we concentrate on the multifactor dimensionality reduction, logic regression, random forests and stochastic gradient boosting. These methods and their new modifications, e.g., the MDR method with "independent rule", are used to study the risk of complex diseases such as cardiovascular ones. The roles of certain combinations of single nucleotide polymorphisms and external risk factors are examined. To perform the data analysis concerning the ischemic heart disease and myocardial infarction the supercomputer SKIF "Chebyshev" of the Lomonosov Moscow State University was employed

    Imputing missing genotypes with weighted k nearest neighbors

    Get PDF
    Motivation: Missing values are a common problem in genetic association studies concerned with single nucleotide polymorphisms (SNPs). Since most statistical methods cannot handle missing values, they have to be removed prior to the actual analysis. Considering only complete observations, however, often leads to an immense loss of information. Therefore, procedures are needed that can be used to replace such missing values. In this article, we propose a method based on weighted k nearest neighbors that can be employed for imputing such missing genotypes. Results: In a comparison to other imputation approaches, our procedure called KNNcatImpute shows the lowest rates of falsely imputed genotypes when applied to the SNP data from the GENICA study, a study dedicated to the identification of genetic and gene-environment interactions associated with sporadic breast cancer. Moreover, in contrast to other imputation methods that take all variables into account when replacing missing values of a particular variable, KNNcatImpute is not restricted to association studies comprising several ten to a few hundred SNPs, but can also be applied to data from whole-genome studies, as an application to a subset of the HapMap data shows. --

    Multiple Quantitative Trait Analysis Using Bayesian Networks

    Full text link
    Models for genome-wide prediction and association studies usually target a single phenotypic trait. However, in animal and plant genetics it is common to record information on multiple phenotypes for each individual that will be genotyped. Modeling traits individually disregards the fact that they are most likely associated due to pleiotropy and shared biological basis, thus providing only a partial, confounded view of genetic effects and phenotypic interactions. In this paper we use data from a Multiparent Advanced Generation Inter-Cross (MAGIC) winter wheat population to explore Bayesian networks as a convenient and interpretable framework for the simultaneous modeling of multiple quantitative traits. We show that they are equivalent to multivariate genetic best linear unbiased prediction (GBLUP), and that they are competitive with single-trait elastic net and single-trait GBLUP in predictive performance. Finally, we discuss their relationship with other additive-effects models and their advantages in inference and interpretation. MAGIC populations provide an ideal setting for this kind of investigation because the very low population structure and large sample size result in predictive models with good power and limited confounding due to relatedness.Comment: 28 pages, 1 figure, code at http://www.bnlearn.com/research/genetics1

    Comparison of information-theoretic to statistical methods for gene-gene interactions in the presence of genetic heterogeneity

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Multifactorial diseases such as cancer and cardiovascular diseases are caused by the complex interplay between genes and environment. The detection of these interactions remains challenging due to computational limitations. Information theoretic approaches use computationally efficient directed search strategies and thus provide a feasible solution to this problem. However, the power of information theoretic methods for interaction analysis has not been systematically evaluated. In this work, we compare power and Type I error of an information-theoretic approach to existing interaction analysis methods.</p> <p>Methods</p> <p>The <it>k-</it>way interaction information (KWII) metric for identifying variable combinations involved in gene-gene interactions (GGI) was assessed using several simulated data sets under models of genetic heterogeneity driven by susceptibility increasing loci with varying allele frequency, penetrance values and heritability. The power and proportion of false positives of the KWII was compared to multifactor dimensionality reduction (MDR), restricted partitioning method (RPM) and logistic regression.</p> <p>Results</p> <p>The power of the KWII was considerably greater than MDR on all six simulation models examined. For a given disease prevalence at high values of heritability, the power of both RPM and KWII was greater than 95%. For models with low heritability and/or genetic heterogeneity, the power of the KWII was consistently greater than RPM; the improvements in power for the KWII over RPM ranged from 4.7% to 14.2% at for α = 0.001 in the three models at the lowest heritability values examined. KWII performed similar to logistic regression.</p> <p>Conclusions</p> <p>Information theoretic models are flexible and have excellent power to detect GGI under a variety of conditions that characterize complex diseases.</p

    Identifying interacting genetic variations by fish-swarm logic regression

    Get PDF
    Understanding associations between genotypes and complex traits is a fundamental problem in human genetics. A major open problem in mapping phenotypes is that of identifying a set of interacting genetic variants, which might contribute to complex traits. Logic regression (LR) is a powerful multivariant association tool. Several LR-based approaches have been successfully applied to different datasets. However, these approaches are not adequate with regard to accuracy and efficiency. In this paper, we propose a new LR-based approach, called fish-swarm logic regression (FSLR), which improves the logic regression process by incorporating swarm optimization. In our approach, a school of fish agents are conducted in parallel. Each fish agent holds a regression model, while the school searches for better models through various preset behaviors. A swarm algorithm improves the accuracy and the efficiency by speeding up the convergence and preventing it from dropping into local optimums. We apply our approach on a real screening dataset and a series of simulation scenarios. Compared to three existing LR-based approaches, our approach outperforms them by having lower type I and type II error rates, being able to identify more preset causal sites, and performing at faster speeds

    An associative classification based approach for detecting SNP-SNP interactions in high dimensional genome

    Get PDF
    There have been many studies that depict genotype phenotype relationships by identifying genetic variants associated with a specific disease. Researchers focus more attention on interactions between SNPs that are strongly associated with disease in the absence of main effect. In this context, a number of machine learning and data mining tools are applied to identify the combinations of multi-locus SNPs in higher order data.However, none of the current models can identify useful SNPSNP interactions for high dimensional genome data. Detecting these interactions is challenging due to bio-molecular complexities and computational limitations. The goal of this research was to implement associative classification and study its effectiveness for detecting the epistasis in balanced and imbalanced datasets. The proposed approach was evaluated for two locus epistasis interactions using simulated data. The datasets were generated for 5 different penetrance functions by varying heritability, minor allele frequency and sample size. In total, 23,400 datasets were generated and several experiments are conducted to identify the disease causal SNP interactions. The accuracy of classification by the proposed approach wascompared with the previous approaches. Though associative classification showed only relatively small improvement in accuracy for balanced datasets, it outperformed existing approaches in higher order multi-locus interactions in imbalanced datasets
    corecore