366 research outputs found

    Global tests of P-values for multifactor dimensionality reduction models in selection of optimal number of target genes

    Background: Multifactor Dimensionality Reduction (MDR) is a popular and successful data mining method developed to characterize and detect nonlinear complex gene-gene interactions (epistasis) that are associated with disease susceptibility. Because MDR uses a combinatorial search strategy to detect interactions, several filtration techniques have been developed to remove genes (SNPs) that have no interactive effects prior to analysis. However, the cutoff values implemented for these filtration methods are arbitrary, so different choices of cutoff values lead to different selections of genes (SNPs). Methods: We suggest incorporating a global test of p-values into filtration procedures to identify the optimal number of genes/SNPs for further MDR analysis and demonstrate this approach using a ReliefF filter technique. We compare the performance of different global testing procedures in this context, including the Kolmogorov-Smirnov test, the inverse chi-square test, the inverse normal test, the logit test, the Wilcoxon test and Tippett’s test. Additionally, we demonstrate the approach on a real data application with a candidate gene study of drug response in Juvenile Idiopathic Arthritis. Results: Extensive simulations of correlated p-values show that the inverse chi-square test is the most appropriate approach to incorporate into the screening procedure to determine the optimal number of SNPs for the final MDR analysis. The Kolmogorov-Smirnov test shows highly inflated Type I error when p-values are highly correlated or when p-values peak near the center of the histogram. Tippett’s test has very low power when the effect size of GxG interactions is small. Conclusions: The proposed global tests can serve as a screening approach prior to individual tests to prevent false discovery. Strong power in small sample sizes and well-controlled Type I error in the absence of GxG interactions make global tests highly recommended in epistasis studies.
Keywords: P-value, Global tests, ReliefF, Multifactor dimensionality reduction
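
As a rough illustration of three of the global tests named in this abstract, the sketch below combines a list of p-values with the inverse chi-square (Fisher), inverse normal and Tippett methods. It assumes independent p-values, whereas the abstract studies correlated ones; the function names are hypothetical, and this is not the authors' screening procedure.

```python
import math
from statistics import NormalDist


def fisher_inverse_chi_square(pvalues):
    """Inverse chi-square (Fisher) combination, assuming independent p-values."""
    k = len(pvalues)
    # Half of the test statistic X = -2 * sum(ln p_i); X ~ chi-square with 2k df.
    half_stat = -sum(math.log(p) for p in pvalues)
    # Closed-form chi-square survival function for even degrees of freedom:
    # P(X > x) = exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    term, total = 1.0, 1.0
    for j in range(1, k):
        term *= half_stat / j
        total += term
    return math.exp(-half_stat) * total


def inverse_normal(pvalues):
    """Inverse normal (Stouffer) combination, assuming independent p-values."""
    nd = NormalDist()
    z = sum(nd.inv_cdf(1.0 - p) for p in pvalues) / math.sqrt(len(pvalues))
    return 1.0 - nd.cdf(z)


def tippett(pvalues):
    """Tippett's minimum-p combination, assuming independent p-values."""
    return 1.0 - (1.0 - min(pvalues)) ** len(pvalues)
```

A quick sanity check on such a sketch: given a single p-value, all three functions return that p-value unchanged.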

    Risk score modeling of multiple gene to gene interactions using aggregated-multifactor dimensionality reduction

    BACKGROUND: Multifactor Dimensionality Reduction (MDR) has been widely applied to detect gene-gene (GxG) interactions associated with complex diseases. Existing MDR methods summarize disease risk by a dichotomous predisposing model (high-risk/low-risk) from one optimal GxG interaction, which does not take the accumulated effects of multiple GxG interactions into account. RESULTS: We propose an Aggregated-Multifactor Dimensionality Reduction (A-MDR) method that exhaustively searches for and detects significant GxG interactions to generate an epistasis-enriched gene network. An aggregated epistasis-enriched risk score, which takes into account multiple GxG interactions simultaneously, replaces the dichotomous predisposing risk variable and provides higher resolution in the quantification of disease susceptibility. We evaluate this new A-MDR approach in a broad range of simulations. We also present the results of applying the A-MDR method to a data set derived from Juvenile Idiopathic Arthritis patients treated with methotrexate (MTX), which revealed several GxG interactions in the folate pathway that were associated with treatment response. The epistasis-enriched risk score that pooled information from 82 significant GxG interactions distinguished MTX responders from non-responders with 82% accuracy. CONCLUSIONS: The proposed A-MDR is innovative in the MDR framework in investigating aggregated effects among GxG interactions. New measures (pOR, pRR and pChi) are proposed to detect multiple GxG interactions.
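
For readers unfamiliar with the dichotomous high-risk/low-risk model that this abstract says A-MDR replaces, here is a minimal sketch of that labeling step for one pair of SNPs. It assumes the common MDR convention that a genotype cell is high-risk when its case:control ratio meets or exceeds a threshold, here defaulting to the overall case:control ratio. The function name and data layout are hypothetical. The aggregated risk score and the pOR, pRR and pChi measures are not defined in the abstract, so they are not sketched.

```python
from collections import Counter


def mdr_risk_labels(genotype_pairs, status, threshold=None):
    """Label each two-locus genotype cell as 'high' or 'low' risk.

    genotype_pairs: list of (snp1_genotype, snp2_genotype) tuples, one per subject.
    status: list of 1 (case) or 0 (control), aligned with genotype_pairs.
    Assumes at least one case and one control are present.
    """
    cases = Counter(g for g, s in zip(genotype_pairs, status) if s == 1)
    controls = Counter(g for g, s in zip(genotype_pairs, status) if s == 0)
    if threshold is None:
        # Assumed default: the overall case:control ratio in the sample.
        threshold = sum(cases.values()) / sum(controls.values())
    labels = {}
    for cell in set(cases) | set(controls):
        n_case, n_ctrl = cases[cell], controls[cell]
        # A cell with cases but no controls is treated as an infinite ratio.
        ratio = float("inf") if n_ctrl == 0 else n_case / n_ctrl
        labels[cell] = "high" if ratio >= threshold else "low"
    return labels
```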

    Detecting purely epistatic multi-locus interactions by an omnibus permutation test on ensembles of two-locus analyses

    Background: Purely epistatic multi-locus interactions cannot generally be detected via single-locus analysis in case-control studies of complex diseases. Recently, many two-locus and multi-locus analysis techniques have been shown to be promising for epistasis detection. However, exhaustive multi-locus analysis requires prohibitively large computational effort when problems involve large-scale or genome-wide data. Furthermore, there is no explicit proof that a combination of multiple two-locus analyses can lead to the correct identification of multi-locus interactions. Results: The proposed 2LOmb algorithm performs an omnibus permutation test on ensembles of two-locus analyses. The algorithm consists of four main steps: two-locus analysis, a permutation test, global p-value determination and a progressive search for the best ensemble. 2LOmb is benchmarked against an exhaustive two-locus analysis technique, a set association approach, a correlation-based feature selection (CFS) technique and a tuned ReliefF (TuRF) technique. The simulation results indicate that 2LOmb produces a low false-positive error. Moreover, 2LOmb has the best performance in terms of its ability to identify all causative single nucleotide polymorphisms (SNPs) and a low number of output SNPs in purely epistatic two-, three- and four-locus interaction problems. The interaction models constructed from the 2LOmb outputs via a multifactor dimensionality reduction (MDR) method are also included for the confirmation of epistasis detection. 2LOmb is subsequently applied to a type 2 diabetes mellitus (T2D) data set, which was obtained as part of the UK genome-wide genetic epidemiology study by the Wellcome Trust Case Control Consortium (WTCCC). After primarily screening for SNPs that are located within or near 372 candidate genes and exhibit no marginal single-locus effects, the T2D data set is reduced to 7,065 SNPs from 370 genes. The 2LOmb search in the reduced T2D data reveals that four intronic SNPs in PGM1 (phosphoglucomutase 1), two intronic SNPs in LMX1A (LIM homeobox transcription factor 1, alpha), two intronic SNPs in PARK2 (Parkinson disease (autosomal recessive, juvenile) 2, parkin) and three intronic SNPs in GYS2 (glycogen synthase 2 (liver)) are associated with the disease. The 2LOmb result suggests that there is no interaction between each pair of the identified genes that can be described by purely epistatic two-locus interaction models. Moreover, there are no interactions between these four genes that can be described by purely epistatic multi-locus interaction models with marginal two-locus effects. The findings provide an alternative explanation for the aetiology of T2D in a UK population. Conclusion: An omnibus permutation test on ensembles of two-locus analyses can detect purely epistatic multi-locus interactions with marginal two-locus effects. The study also reveals that SNPs from large-scale or genome-wide case-control data which are discarded after single-locus analysis detects no association can still be useful for genetic epidemiology studies.
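
The permutation-test step mentioned in this abstract can be illustrated, under loose assumptions, for a single two-locus analysis. The sketch below uses a Pearson chi-square statistic on joint-genotype cells versus case-control status and shuffles the status labels to obtain an empirical p-value. The choice of statistic, the function names and the parameters are assumptions for illustration; this is not the 2LOmb algorithm, which also involves global p-value determination and a progressive ensemble search.

```python
import random
from collections import Counter


def chi_square(cells, status):
    """Pearson chi-square statistic for joint-genotype cell vs. status."""
    n = len(status)
    cell_totals = Counter(cells)
    status_totals = Counter(status)
    observed = Counter(zip(cells, status))
    stat = 0.0
    for c, n_c in cell_totals.items():
        for s, n_s in status_totals.items():
            expected = n_c * n_s / n
            stat += (observed[(c, s)] - expected) ** 2 / expected
    return stat


def permutation_p_value(cells, status, n_perm=999, seed=0):
    """Empirical p-value from shuffling case-control labels n_perm times."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = chi_square(cells, status)
    shuffled = list(status)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if chi_square(cells, shuffled) >= observed:
            hits += 1
    # The +1 terms count the observed labeling as one of the permutations.
    return (hits + 1) / (n_perm + 1)
```

Here `cells` would hold one joint genotype per subject (for example a tuple of the two SNP genotypes) and `status` the matching case-control labels.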

    DETECTING CANCER-RELATED GENES AND GENE-GENE INTERACTIONS BY MACHINE LEARNING METHODS

    To understand the underlying molecular mechanisms of cancer, and thereby improve our understanding of its pathogenesis as well as its prevention, diagnosis and treatment, it is necessary to explore the activities of cancer-related genes and the interactions among these genes. In this dissertation, I use machine learning and computational methods to identify differential gene relations and detect gene-gene interactions. To identify gene pairs that have different relationships in normal versus cancer tissues, I develop an integrative method based on the bootstrapping K-S test to evaluate a large number of microarray datasets. The experimental results demonstrate that my method can find meaningful alterations in gene relations. For gene-gene interaction detection, I propose two Bayesian Network based methods, DASSO-MB (Detection of ASSOciations using Markov Blanket) and EpiBN (Epistatic interaction detection using Bayesian Network model), to address the two critical challenges: searching and scoring. DASSO-MB is based on the concept of the Markov Blanket in Bayesian Networks. In EpiBN, I develop a new scoring function, which can reflect higher-order gene-gene interactions and detect the true number of disease markers, and apply a fast Branch-and-Bound (B&B) algorithm to learn the structure of the Bayesian Network. Both DASSO-MB and EpiBN outperform some other commonly used methods and are scalable to genome-wide data.

    Designing Data-Driven Learning Algorithms: A Necessity to Ensure Effective Post-Genomic Medicine and Biomedical Research

    Advances in sequencing technology have significantly contributed to shaping the area of genetics and enabled the identification of genetic variants associated with complex traits through genome-wide association studies. This has provided insights into genetic medicine, in which genetic factors influence variability in disease and treatment outcomes. On the other hand, the missing or hidden heritability has suggested that the host's quality of life and other environmental factors may also influence differences in disease risk and drug/treatment responses in genomic medicine, and orient biomedical research, even though this may be highly constrained by genetic capabilities. It is expected that combining these different factors can yield a paradigm shift in personalized medicine and lead to more effective medical treatment. With existing “big data” initiatives and high-performance computing infrastructures, there is a need for data-driven learning algorithms and models that enable the selection and prioritization of relevant genetic variants (post-genomic medicine) and trigger effective translation into clinical practice. In this chapter, we survey and discuss existing machine learning algorithms and post-genomic analysis models supporting the process of identifying valuable markers.

    THREE METHODS TO INCREASE THE LIKELY TO IDENTIFY GENE INVOLVED IN COMPLEX DISEASE

    A large part of human pathology consists of complex diseases, such as heart disease, obesity, cancer, diabetes, and many common psychiatric and neurological conditions. The common feature of all these conditions is a multifactorial etiology that involves both genetic and environmental factors. The common disease-common variant (CDCV) hypothesis posits that common, interacting alleles underlie most common diseases, in association with environmental factors. Furthermore, according to the thrifty genotype hypothesis, such alleles have been subjected to selective pressure, mainly those involved in metabolic diseases such as T2DM and obesity. Although the concept of gene-environment interaction is central to ecogenetics and has long been recognized by geneticists (Haldane 1946), there are relatively few detailed descriptions of gene-environment interaction in the biomedical literature. This gap may be explained by the difficulty of collecting environmental information of sufficient quality and of analyzing it. Indeed, when the number of factors to analyze is large, the curse of dimensionality and the multiple testing problem become overwhelming. This thesis tests the hypothesis that knowledge-driven approaches may improve the ability to identify genes involved in complex disease. Three approaches are presented, each leading to the identification of a factor or of an interaction of factors. The study of a complex disease consists of three steps: (1) selection of candidate genes, (2) collection of genetic and non-genetic information and (3) statistical analysis of the data. It is shown that each of these steps may be improved by considering the biological background. The first study examined the possibility of exploiting evolutionary information to identify genes involved in type 2 diabetes, based on the thrifty genotype hypothesis. A gene, ACO1, was identified and successfully associated with the disease. In the second study, we analyzed the case of a gene, PPARγ, that has been inconsistently associated with obesity. We hypothesized that the inconsistent association may be due to its relationship with the environment. We then jointly analyzed the genotype of the gene and comprehensive nutritional information about a cohort and demonstrated an interaction: the genotype of PPARγ modulated the response to diet. Ala carriers gained more weight than ProPro individuals at the same caloric intake. In the third study, we implemented a software tool to create simulated populations based on gene-environment interactions. The system used genetic information to simulate realistic populations. We used these simulated populations to evaluate the statistical methods most frequently used to study case-control samples. Afterward, we built an ensemble of these methods and applied it to a real sample. We showed that the ensemble performed better than each single method under small sample sizes.

    Neural networks for genetic epidemiology: past, present, and future

    During the past two decades, the field of human genetics has experienced an information explosion. The completion of the human genome project and the development of high throughput SNP technologies have created a wealth of data; however, the analysis and interpretation of these data have created a research bottleneck. While technology facilitates the measurement of hundreds or thousands of genes, statistical and computational methodologies are lacking for the analysis of these data. New statistical methods and variable selection strategies must be explored for identifying disease susceptibility genes for common, complex diseases. Neural networks (NN) are a class of pattern recognition methods that have been successfully implemented for data mining and prediction in a variety of fields. The application of NN for statistical genetics studies is an active area of research. Neural networks have been applied in both linkage and association analysis for the identification of disease susceptibility genes