265 research outputs found

    A Classification and Characterization of Two-Locus, Pure, Strict, Epistatic Models for Simulation and Detection

    Background: The statistical genetics phenomenon of epistasis is widely acknowledged to confound disease etiology. In order to evaluate strategies for detecting these complex multi-locus disease associations, simulation studies are required. The development of the GAMETES software for the generation of complex genetic models has provided the means to randomly generate an architecturally diverse population of epistatic models that are both pure and strict, i.e. all n loci, but no fewer, are predictive of phenotype. Previous theoretical work characterizing complex genetic models has yet to examine pure, strict epistasis, which should be the most challenging to detect. This study addresses three goals: (1) Classify and characterize pure, strict, two-locus epistatic models, (2) Investigate the effect of model ‘architecture’ on detection difficulty, and (3) Explore how adjusting GAMETES constraints influences diversity in the generated models.
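As a rough illustration of what "pure" means for a two-locus model, the sketch below checks that every single-locus marginal penetrance equals the overall prevalence, so neither locus is predictive on its own. It assumes Hardy-Weinberg genotype frequencies and independent loci; the function names and the example penetrance table are hypothetical and are not taken from GAMETES.

```python
def genotype_freqs(maf):
    """Hardy-Weinberg frequencies for 0, 1 and 2 copies of the minor allele."""
    q = maf
    p = 1.0 - q
    return [p * p, 2 * p * q, q * q]


def marginal_penetrances(table, maf_a, maf_b):
    """Return (row marginals, column marginals, prevalence) of a 3x3 table.

    table[i][j] is the disease probability given genotype i at locus A
    and genotype j at locus B (assumed independent loci).
    """
    fa = genotype_freqs(maf_a)
    fb = genotype_freqs(maf_b)
    rows = [sum(fb[j] * table[i][j] for j in range(3)) for i in range(3)]
    cols = [sum(fa[i] * table[i][j] for i in range(3)) for j in range(3)]
    prevalence = sum(fa[i] * rows[i] for i in range(3))
    return rows, cols, prevalence


def is_pure(table, maf_a, maf_b, tol=1e-9):
    """True when no single locus shows a marginal effect."""
    rows, cols, prevalence = marginal_penetrances(table, maf_a, maf_b)
    return all(abs(m - prevalence) <= tol for m in rows + cols)


# Hypothetical illustrative tables (not from the paper).
PURE_TABLE = [
    [0.0, 0.1, 0.0],
    [0.1, 0.0, 0.1],
    [0.0, 0.1, 0.0],
]
ADDITIVE_TABLE = [
    [0.1, 0.2, 0.3],
    [0.2, 0.3, 0.4],
    [0.3, 0.4, 0.5],
]

pure_result = is_pure(PURE_TABLE, 0.5, 0.5)
additive_result = is_pure(ADDITIVE_TABLE, 0.5, 0.5)
```

The check covers purity only. For two loci, strictness (no smaller subset of loci being predictive) follows from the same marginal test, but for n > 2 every proper subset of loci would also need to be examined.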

    Predicting the Difficulty of Pure, Strict, Epistatic Models: Metrics for Simulated Model Selection

    Background: Algorithms designed to detect complex genetic disease associations are initially evaluated using simulated datasets. Typical evaluations vary constraints that influence the correct detection of underlying models (i.e. number of loci, heritability, and minor allele frequency). Such studies neglect to account for model architecture (i.e. the unique specification and arrangement of penetrance values comprising the genetic model), which alone can influence the detectability of a model. In order to design a simulation study which efficiently takes architecture into account, a reliable metric is needed for model selection. Results: We evaluate three metrics as predictors of relative model detection difficulty derived from previous works: (1) penetrance table variance (PTV), (2) customized odds ratio (COR), and (3) our own Ease of Detection Measure (EDM), calculated from the penetrance values and respective genotype frequencies of each simulated genetic model. We evaluate the reliability of these metrics across three very different data search algorithms, each with the capacity to detect epistatic interactions. We find that a model’s EDM and COR are each stronger predictors of model detection success than heritability. Conclusions: This study formally identifies and evaluates metrics which quantify model detection difficulty. We utilize these metrics to intelligently select models from a population of potential architectures. This allows for an improved simulation study design which accounts for differences in detection difficulty attributed to model architecture. We implement the calculation and utilization of EDM and COR into GAMETES, an algorithm which rapidly and precisely generates pure, strict, n-locus epistatic models.
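The abstract does not give formulas for PTV, COR or EDM, so the sketch below does not attempt them. It only illustrates the baseline they are compared against: heritability computed from penetrance values and genotype frequencies. It assumes the broad-sense definition often used for penetrance-table models, h² = Σ p_ij (f_ij − K)² / (K(1 − K)), with Hardy-Weinberg frequencies and independent loci; the function names and the example table are hypothetical.

```python
def genotype_freqs(maf):
    """Hardy-Weinberg frequencies for 0, 1 and 2 copies of the minor allele."""
    q = maf
    p = 1.0 - q
    return [p * p, 2 * p * q, q * q]


def heritability(table, maf_a, maf_b):
    """Broad-sense heritability of a two-locus penetrance table (assumed definition).

    K is the population prevalence; the numerator is the frequency-weighted
    variance of the penetrance values around K.
    """
    fa = genotype_freqs(maf_a)
    fb = genotype_freqs(maf_b)
    cells = [(fa[i] * fb[j], table[i][j]) for i in range(3) for j in range(3)]
    prevalence = sum(freq * pen for freq, pen in cells)
    variance = sum(freq * (pen - prevalence) ** 2 for freq, pen in cells)
    return variance / (prevalence * (1.0 - prevalence))


# Hypothetical illustrative table (not from the paper).
EXAMPLE_TABLE = [
    [0.0, 0.1, 0.0],
    [0.1, 0.0, 0.1],
    [0.0, 0.1, 0.0],
]

h2 = heritability(EXAMPLE_TABLE, 0.5, 0.5)
```

Two tables with the same heritability can still arrange their penetrance values very differently, which is the architecture effect the abstract argues a metric such as EDM or COR should capture.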

    Detecting Gene-Gene Interactions Using a Permutation-Based Random Forest Method

    Identifying gene-gene interactions is essential to understand disease susceptibility and to detect genetic architectures underlying complex diseases. Here, we aimed to develop a permutation-based methodology relying on a machine learning method, random forest (RF), to detect gene-gene interactions. Our approach, called permuted random forest (pRF), identifies the top interacting single nucleotide polymorphism (SNP) pairs by estimating how much the power of a random forest classification model is influenced by removing pairwise interactions.

    Discovering causal interactions using Bayesian network scoring and information gain

    Background: The problem of learning causal influences from data has recently attracted much attention. Standard statistical methods can have difficulty learning discrete causes that interact to affect a target, because the assumptions in these methods often do not model discrete causal relationships well. An important task then is to learn such interactions from data. Motivated by the problem of learning epistatic interactions from datasets developed in genome-wide association studies (GWAS), researchers conceived new methods for learning discrete interactions. However, many of these methods do not differentiate a model representing a true interaction from a model representing non-interacting causes with strong individual effects. The recent algorithm MBS-IGain addresses this difficulty by using Bayesian network learning and information gain to discover interactions from high-dimensional datasets. However, MBS-IGain requires marginal effects to detect interactions containing more than two causes. If the dataset is not high-dimensional, we can avoid this shortcoming by doing an exhaustive search. Results: We develop Exhaustive-IGain, which is like MBS-IGain but does an exhaustive search. We compare the performance of Exhaustive-IGain to MBS-IGain using low-dimensional simulated datasets based on interactions with marginal effects and ones based on interactions without marginal effects. Their performance is similar on the datasets based on marginal effects. However, Exhaustive-IGain compellingly outperforms MBS-IGain on the datasets based on 3- and 4-cause interactions without marginal effects. We apply Exhaustive-IGain to investigate how clinical variables interact to affect breast cancer survival, and obtain results that agree with the judgements of a breast cancer oncologist.
    Conclusions: We conclude that the combined use of information gain and Bayesian network scoring enables us to discover higher order interactions with no marginal effects if we perform an exhaustive search. We further conclude that Exhaustive-IGain can be effective when applied to real data.
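To make the idea of an interaction without marginal effects concrete, the sketch below computes one common information-theoretic quantity: the information that two causes carry jointly about a target, minus what each carries alone. This is an assumed, generic definition for illustration; the abstract does not state the exact information-gain formula used by MBS-IGain or Exhaustive-IGain, and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """I(X; Y) = H(X) + H(Y) - H(X, Y), estimated from paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def interaction_gain(x1, x2, y):
    """Joint information about y minus the two single-cause contributions."""
    joint = list(zip(x1, x2))
    return (
        mutual_information(joint, y)
        - mutual_information(x1, y)
        - mutual_information(x2, y)
    )


# XOR target: neither cause is informative alone, both together are.
X1 = [0, 0, 1, 1]
X2 = [0, 1, 0, 1]
Y = [a ^ b for a, b in zip(X1, X2)]

gain = interaction_gain(X1, X2, Y)
marginal_1 = mutual_information(X1, Y)
marginal_2 = mutual_information(X2, Y)
```

With an XOR target, a search that first screens causes by their individual association with the target would discard both variables, which is consistent with the abstract's point that an exhaustive search is needed when marginal effects are absent.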

    Discovering Higher-order SNP Interactions in High-dimensional Genomic Data

    In this thesis, a multifactor dimensionality reduction method based on associative classification is employed to identify higher-order SNP interactions and so improve the understanding of the genetic architecture of complex diseases. Further, this thesis explores the application of deep learning techniques to provide new clues for interaction analysis. The performance of the deep learning method is maximized by unifying deep neural networks with a random forest to identify interactions reliably in the presence of noise.

    Efficient strategies for epistasis detection in genome-wide data

    Genome-Wide Association Studies have been carried out with SNP array technology since 2005, identifying thousands of loci for a great many traits and diseases. There are now large data sources, such as UK Biobank, that provide medical and genetic data of hundreds of thousands of people. However, there is a shortfall in the heritability explained for the phenotypes that have been assessed. One of the explanations for this deficit is interactions between genes, called epistasis, that are not detected, so part of the causation is missed. In this thesis, I carry out a comprehensive review of the large number of available epistasis detection tools in the literature. This is followed by a simulation benchmarking study to assess the ability of a representative group of these tools to detect epistatic interactions. Of these tools, BOOST, MDR and MPI3SNP found the most interactions in this simulation study. Next, I set out three possible strategies for searching biobank-scale data in order to find a best-practices workflow. These were exhaustive searching, an approach tailored to the tools' strengths, and splitting the data into linkage disequilibrium-based haplotype blocks to reduce the computational load. A simulation study was devised, which favoured a mixed approach using both BOOST and MDR for different types of interactions. The final pipeline initially uses the BOOST algorithm to find pure epistatic interactions and filter out insignificant pairs of SNPs. Those remaining variants with large single-locus effect sizes are assessed with MDR for impure interactions. The interactions that are identified are assessed for significance, effect size and heritability explained. Finally, validation is carried out across each interacting pair, incorporating numerous sources of a priori knowledge. This was applied to Atrial Fibrillation, Alzheimer's Disease and Parkinson's Disease, three diseases that have previously been assessed for interactions.
    Although no statistically significant results were identified, this approach demonstrated an increased amount of heritability explained, showing that some of the missing heritability could be accounted for this way. A downstream analysis method was devised, finding genes in linkage with the interacting loci, applying a number of functional annotations and searching STRING-db for evidence of known interactions. Finally, the study was extended to examine rare variants in the rare disease congenital hypothyroidism. As a systemic disorder, it could potentially have pathological interacting mutations. After variant calling, four de novo variants were identified, potentially explaining the condition. Six related interactions were found, one of which was not present in the parents and so possibly explains the condition. The mutations, present in TG and PDIA4, have evidence of an interaction in STRING-db, and both genes are involved in thyroid hormone synthesis according to the KEGG database. These contributions provide a novel, tested pipeline for identifying epistasis from GWAS data, as well as a corpus of simulated data for future researchers. A robust methodology is applied for testing resulting interactions statistically, as well as an approach for validating interactions by incorporating numerous data sources to find significant commonalities between variants.

    Computational Methods for Compositional Epistasis Detection

    In genetics, the term “epistasis” refers to the phenomenon that the effect of one gene or single-nucleotide polymorphism (SNP) is dependent on the presence of others. Various forms of epistasis exist, and the understanding of them is limited. In recent years, the failure to replicate single-locus effects in genome-wide association studies (GWAS) has motivated the exploration of epistasis in human complex disease. This thesis is thus dedicated to the study of computational approaches for two-way compositional epistasis (SNP-SNP interaction) detection. Epistasis of this sort is best described by disease models, which can be simply understood as disease probability patterns associated with the genotype combinations of SNP-pairs. Because the epistasis detection problem requires determination of proper disease models to capture the compositional epistasis effect, it is more complicated than a typical variable selection task. Three projects are pursued in this thesis. The first two target epistasis that is characterized by a set of “two-locus, two-allele, two-phenotype and complete-penetrance” (TTTC) disease models, and the third one extends to more general epistasis. There are theoretically 2^9 = 512 TTTC disease models. For a given SNP-pair, the first step of the problem is to find a proper TTTC model to capture its epistasis effect. It is found that existing methods that use data to determine best-fitting disease models prior to screening may be too greedy. Motivated by this, the first project proposes a less greedy strategy by limiting the search of disease models to a set of prototypes. The prototypes are determined a priori. Specifically, a distance metric is defined and used to cluster all disease models, and then a “representative” from each cluster is selected to form the prototypes. Compared to the existing approaches, the proposed method provides a more satisfying balance between precision and recall in epistasis detection.
    If one uses data to determine a best-fitting disease model for a pair of SNPs, the nominal statistical evidence of association between the SNP-pair and the disease outcome is inflated. Therefore, the second project aims to directly correct inflation of this type. To make it feasible for genome-wide studies, a first-order correction method is proposed that can be applied in practice with no additional computational cost. Simulation studies are performed on two popular existing methods, which show that the correction is quite effective in improving overall epistasis detection. The TTTC disease models can be viewed as coding two risk levels, i.e., high and low risk. Compared to them, some other disease models code multiple risk levels, which capture more general epistasis patterns. Two methods are proposed in the third project, which are centered on epistasis detection using multi-level risk disease models. One method is inspired by the fused lasso under a regression-based framework, and adopts a post-model-selection test to account for inflation incurred during disease model searching. The other one makes sequential splits of the genotype combinations of a SNP-pair and uses a stopping criterion to determine the final disease model; after that, it also applies a first-order correction to the test statistic to effectively account for inflation. It is shown that the two methods, despite entirely different starting frameworks, are equivalent in terms of the disease model searching process. Subsequent simulation studies show that the use of multi-level disease models achieves better detection efficiency, in terms of a balance between precision and recall, than the two-level ones. In summary, it is a rather complicated task to uncover the underlying mechanism of locus interaction effects, and endeavours are only beginning to be made.
    The epistasis detection methods in this thesis are practically useful at the genome-wide level, complementing single-SNP screening in genome-wide association studies. Moreover, the first-order correction for inflation is simple and effective, which makes it practically valuable for epistasis detection methods involving inflated test statistics.
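The count of 2^9 = 512 TTTC disease models quoted in this abstract follows from labelling each of the nine two-locus genotype combinations as high or low risk. The sketch below enumerates them, under the assumption that a model is simply such a labelling. The Hamming distance shown is only a hypothetical stand-in for comparing two models; the abstract does not say which distance metric the thesis uses for clustering.

```python
from itertools import product

# The nine genotype combinations of a SNP pair, coded by minor-allele count.
GENOTYPES = [(a, b) for a in range(3) for b in range(3)]


def enumerate_tttc_models():
    """Every assignment of low (0) or high (1) risk to the nine genotype cells."""
    return [
        dict(zip(GENOTYPES, labels))
        for labels in product((0, 1), repeat=len(GENOTYPES))
    ]


def hamming_distance(model_a, model_b):
    """Number of genotype cells labelled differently (hypothetical metric)."""
    return sum(model_a[g] != model_b[g] for g in GENOTYPES)


models = enumerate_tttc_models()
model_count = len(models)
# The first model is all low risk and the last is all high risk.
max_distance = hamming_distance(models[0], models[-1])
```

The enumeration includes degenerate labellings, such as all cells low risk, because it counts every theoretical pattern, as the quoted figure does.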