15 research outputs found

    Fractal Characterizations of MAX Statistical Distribution in Genetic Association Studies

    Two non-integer parameters are defined for MAX statistics, which are maxima of $d$ simpler test statistics. The first parameter, $d_{MAX}$, is the fractional number of tests, representing the equivalent number of independent tests in MAX. If the $d$ tests are dependent, $d_{MAX} < d$. The second parameter is the fractional degrees of freedom $k$ of the chi-square distribution $\chi^2_k$ that fits the MAX null distribution. These two parameters, $d_{MAX}$ and $k$, can be independently defined, and $k$ can be non-integer even if $d_{MAX}$ is an integer. We illustrate these two parameters using the example of MAX2 and MAX3 statistics in genetic case-control studies. We speculate that $k$ is related to the amount of ambiguity of the model inferred by the test. In case-control genetic association, tests with low $k$ (e.g. $k=1$) are able to provide definitive information about the disease model, whereas tests with high $k$ (e.g. $k=2$) are completely uncertain about the disease model. Similar to Heisenberg's uncertainty principle, the ability to infer the disease model and the ability to detect significant association may not be simultaneously optimized, and $k$ seems to measure the level of their balance.

    Application of Volcano Plots in Analyses of mRNA Differential Expressions with Microarrays

    A volcano plot displays unstandardized signal (e.g. log-fold-change) against noise-adjusted/standardized signal (e.g. t-statistic or -log10(p-value) from the t test). We review the basic and an interactive use of the volcano plot, and its crucial role in understanding the regularized t-statistic. The joint filtering gene selection criterion based on regularized statistics has a curved discriminant line in the volcano plot, as compared to the two perpendicular lines for the "double filtering" criterion. This review attempts to provide a unifying framework for discussions on alternative measures of differential expression, improved methods for estimating variance, and visual display of a microarray analysis result. We also discuss the possibility of applying volcano plots to other fields beyond microarray. Comment: 8 figures
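
    As a rough illustration of the two axes and the two selection criteria described in this abstract, the following minimal sketch computes one gene's volcano-plot coordinates and applies either "double filtering" or a single cutoff on a regularized statistic. The function names, the Welch-style standard error, and the regularized form diff / (se + s0) are assumptions made for illustration, not taken from the paper; plotting is omitted.

```python
import math
from statistics import mean, variance


def volcano_point(group_a, group_b, s0=0.0):
    """Return (difference of means, t-like statistic) for one gene.

    Inputs are assumed to be log-scale expression values, so the difference
    of means plays the role of a log-fold-change (x axis of the plot).
    s0 is a hypothetical regularization constant added to the standard
    error; s0 = 0 gives an ordinary Welch-type t-statistic (y axis).
    """
    diff = mean(group_b) - mean(group_a)
    se = math.sqrt(variance(group_a) / len(group_a)
                   + variance(group_b) / len(group_b))
    return diff, diff / (se + s0)


def double_filter(diff, t, min_diff, min_t):
    """Double filtering: two perpendicular cutoffs, one per axis."""
    return abs(diff) >= min_diff and abs(t) >= min_t


def joint_filter(regularized_t, min_stat):
    """Joint filtering: a single cutoff on the regularized statistic.

    Because the regularized statistic mixes diff and se, this single cutoff
    is not a straight line when drawn in (diff, ordinary t) coordinates.
    """
    return abs(regularized_t) >= min_stat


# Example with made-up values for a single gene.
control = [1.0, 2.0, 3.0]
treated = [3.0, 4.0, 5.0]
diff, t = volcano_point(control, treated)
_, t_reg = volcano_point(control, treated, s0=1.0)
print(diff, t, t_reg)
print(double_filter(diff, t, min_diff=1.0, min_t=2.0))
print(joint_filter(t_reg, min_stat=2.0))
```

    With s0 > 0 the statistic shrinks most for genes whose standard error is small, which is why the two criteria can select different genes.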

    Association Mapping Approach into Type 2 Diabetes using Biomarkers and Clinical Data

    The global growth in incidence of Type 2 Diabetes (T2D) has become a major international health concern. As such, understanding the aetiology of Type 2 Diabetes is vital. This paper investigates a variety of statistical methodologies at various levels of complexity to analyse genotype data and identify biomarkers that show evidence of increased susceptibility to T2D and related traits. A critical overview of several selected statistical methods for population-based association mapping, particularly case-control genetic association analysis, is presented. The dataset used in this paper, which includes 3435 female subjects (cases and controls) with genotype information across 879071 Single Nucleotide Polymorphisms (SNPs), is discussed. Quality control steps are applied to the dataset in a pre-processing phase to remove samples and markers that failed the quality control test. Association analysis is discussed to address which statistical method is appropriate for the dataset. Our genetic association analysis produces promising results: the allelic association test showed one SNP above the genome-wide significance threshold of 5×10^-8, namely rs10519107 (Odds Ratio (OR) = 0.7409, P-value (P) = 1.813×10^-9), while several SNPs are above the suggestive association threshold of 5×10^-6 and could be worth further investigation. Furthermore, Logistic Regression analysis adjusted for multiple confounding factors indicated that none of the genotyped SNPs passed the genome-wide significance threshold of 5×10^-8. Nevertheless, four SNPs (rs10519107, rs4368343, rs6848779, rs11729955) passed the suggestive association threshold
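
    To make the allelic association test and the two thresholds quoted in this abstract concrete, here is a minimal sketch under stated assumptions: genotype counts are given as (AA, Aa, aa), the test is the ordinary Pearson chi-square on the 2x2 allele-count table with 1 degree of freedom, and the function names are hypothetical. It is not the pipeline used in the paper, and the counts in the example are made up.

```python
import math

GENOME_WIDE = 5e-8   # genome-wide significance threshold quoted in the abstract
SUGGESTIVE = 5e-6    # suggestive association threshold quoted in the abstract


def allelic_test(cases, controls):
    """Allelic chi-square test from genotype counts (AA, Aa, aa).

    Returns (odds ratio for allele 'a', chi-square, p-value).
    Zero cells are not handled in this sketch.
    """
    # Every subject contributes two alleles.
    case_a = cases[1] + 2 * cases[2]
    case_A = 2 * cases[0] + cases[1]
    ctrl_a = controls[1] + 2 * controls[2]
    ctrl_A = 2 * controls[0] + controls[1]

    n = case_a + case_A + ctrl_a + ctrl_A
    denom = ((case_a + case_A) * (ctrl_a + ctrl_A)
             * (case_a + ctrl_a) * (case_A + ctrl_A))
    chi2 = n * (case_a * ctrl_A - case_A * ctrl_a) ** 2 / denom
    odds_ratio = (case_a * ctrl_A) / (case_A * ctrl_a)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return odds_ratio, chi2, p_value


def classify(p_value):
    """Label a p-value against the two thresholds."""
    if p_value < GENOME_WIDE:
        return "genome-wide significant"
    if p_value < SUGGESTIVE:
        return "suggestive"
    return "not significant"


# Made-up genotype counts, for illustration only.
odds_ratio, chi2, p = allelic_test(cases=(50, 40, 10), controls=(30, 50, 20))
print(odds_ratio, chi2, p, classify(p))
print(classify(1.813e-9))  # the P-value reported for rs10519107
```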

    Exploring Case-Control Genetic Association Tests Using Phase Diagrams

    Background: Using a new concept called the "phase diagram", we compare two commonly used genotype-based tests for case-control genetic analysis: one is the Cochran-Armitage trend test (CAT test at $x=0.5$, or CAT0.5), and the other (called MAX2) is the maximum of two chi-square test results, one from the two-by-two genotype count table that combines the baseline homozygotes and heterozygotes, and another from the table that combines heterozygotes with risk homozygotes. CAT0.5 is more suitable for multiplicative disease models and MAX2 is better for dominant/recessive models. Methods: We define the CAT0.5-MAX2 phase diagram on the disease model space such that regions where MAX2 is more powerful than CAT0.5 are separated from regions where CAT0.5 is more powerful, and the task is to choose the appropriate parameterization to make the separation possible. Results: We find that using the difference of allele frequencies ($\delta_p$) and the difference of Hardy-Weinberg disequilibrium coefficients ($\delta_\epsilon$) can separate the two phases well, and the phase boundaries are determined by the angle $\tan^{-1}(\delta_p/\delta_\epsilon)$, which is an improvement over disease model selection using $\delta_\epsilon$ only. Conclusions: We argue that phase diagrams similar to the one for CAT0.5-MAX2 have graphical appeal in understanding the power performance of various tests, clarifying simulation schemes, summarizing case-control datasets, and guessing the possible mode of inheritance
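
    The two statistics compared in this abstract can be sketched as follows. This is a minimal illustration, assuming genotype counts ordered as (baseline homozygote, heterozygote, risk homozygote), uncorrected Pearson chi-squares for the two collapsed tables, and the common form of the Cochran-Armitage statistic with genotype scores (0, x, 1); the function names and example counts are hypothetical and the paper's exact conventions may differ.

```python
def chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def max2(cases, controls):
    """MAX2: the larger chi-square of the two collapsed 2x2 genotype tables."""
    r0, r1, r2 = cases
    s0, s1, s2 = controls
    # Table combining baseline homozygotes with heterozygotes.
    chi_a = chi2_2x2(r0 + r1, r2, s0 + s1, s2)
    # Table combining heterozygotes with risk homozygotes.
    chi_b = chi2_2x2(r0, r1 + r2, s0, s1 + s2)
    return max(chi_a, chi_b)


def cat_trend(cases, controls, x=0.5):
    """Cochran-Armitage trend chi-square with genotype scores (0, x, 1)."""
    scores = (0.0, x, 1.0)
    n_cases = sum(cases)
    n_controls = sum(controls)
    n = n_cases + n_controls
    totals = [r + s for r, s in zip(cases, controls)]
    # Score-weighted difference between cases and controls.
    u = sum(w * (n_controls * r - n_cases * s)
            for w, r, s in zip(scores, cases, controls))
    sum_w = sum(w * t for w, t in zip(scores, totals))
    sum_w2 = sum(w * w * t for w, t in zip(scores, totals))
    var_u = (n_cases * n_controls / n) * (n * sum_w2 - sum_w ** 2)
    if var_u == 0:
        return 0.0
    return u * u / var_u


# Made-up genotype counts, for illustration only.
cases = (30, 30, 40)
controls = (40, 40, 20)
print(cat_trend(cases, controls), max2(cases, controls))
```

    Note that the two statistics have different null distributions, so comparing their raw values does not by itself say which test is more powerful for a given dataset.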

    SAERMA: Stacked Autoencoder Rule Mining Algorithm for the Interpretation of Epistatic Interactions in GWAS for Extreme Obesity

    One of the most important challenges in the analysis of high-throughput genetic data is the development of efficient computational methods to identify statistically significant Single Nucleotide Polymorphisms (SNPs). Genome-wide association studies (GWAS) use single-locus analysis where each SNP is independently tested for association with phenotypes. The limitation with this approach, however, is its inability to explain genetic variation in complex diseases. Alternative approaches are required to model the intricate relationships between SNPs. Our proposed approach extends GWAS by combining deep learning stacked autoencoders (SAEs) and association rule mining (ARM) to identify epistatic interactions between SNPs. Following traditional GWAS quality control and association analysis, the most significant SNPs are selected and used in the subsequent analysis to investigate epistasis. SAERMA controls the classification results produced in the final fully connected multi-layer feedforward artificial neural network (MLP) by manipulating the interestingness measures, support and confidence, in the rule generation process. The best classification results were achieved with 204 SNPs compressed to 100 units (77% AUC, 77% SE, 68% SP, 53% Gini, logloss=0.58, and MSE=0.20), although it was possible to achieve 73% AUC (77% SE, 63% SP, 45% Gini, logloss=0.62, and MSE=0.21) with 50 hidden units - both supported by close model interpretation

    Deep Learning Classification of Polygenic Obesity using Genome Wide Association Study SNPs

    In this paper, association results from genome-wide association studies (GWAS) are combined with a deep learning framework to test the predictive capacity of statistically significant single nucleotide polymorphisms (SNPs) associated with the obesity phenotype. Our approach demonstrates the potential of deep learning as a powerful framework for GWAS analysis that can capture information about SNPs and the important interactions between them. Basic statistical methods and techniques for the analysis of genetic SNP data from population-based genome-wide studies have been considered. Statistical association testing between individual SNPs and obesity was conducted under an additive model using logistic regression. Four subsets of loci after quality-control (QC) and association analysis were selected: P-values lower than 1x10^-5 (5 SNPs), 1x10^-4 (32 SNPs), 1x10^-3 (248 SNPs) and 1x10^-2 (2465 SNPs). A deep learning classifier is initialised using these sets of SNPs and fine-tuned to classify obese and non-obese observations. Using a deep learning classifier model and genetic variants with P-value < 1x10^-2 (2465 SNPs) it was possible to obtain results (SE=0.9604, SP=0.9712, Gini=0.9817, LogLoss=0.1150, AUC=0.9908 and MSE=0.0300). As the P-value increased, an evident deterioration in performance was observed. Results demonstrate that single SNP analysis fails to capture the cumulative effect of less significant variants and their overall contribution to the outcome in disease prediction, which is captured using a deep learning framework

    Copy-number-variation and copy-number-alteration region detection by cumulative plots

    Background: Regions with copy number variations (in germline cells) or copy number alterations (in somatic cells) are of great interest for human disease gene mapping and cancer studies. They represent a new type of mutation and are larger in scale than single nucleotide polymorphisms. Using genotyping microarrays for copy number variation detection has become standard, and there is a need for improved analysis methods. Results: We apply the cumulative plot to the detection of regions with copy number variation/alteration, on samples taken from a chronic lymphocytic leukemia patient. Two sets of whole-genome genotyping of 317k single nucleotide polymorphisms, one from the normal cell and another from the cancer cell, are analyzed. We demonstrate the utility of the cumulative plot in detecting a 9Mb (9 x 10^6 bases) hemizygous deletion and a 1Mb homozygous deletion on chromosome 13. We also show the possibility of detecting smaller copy number variation/alteration regions below the 100kb range. Conclusions: As a graphic tool, the cumulative plot is an intuitive and scale-free (window-less) way of detecting copy number variation/alteration regions, especially when such regions are small