Fractal Characterizations of MAX Statistical Distribution in Genetic Association Studies
Two non-integer parameters are defined for MAX statistics, which are maxima
of simpler test statistics. The first parameter is the
fractional number of tests, representing the equivalent number of independent
tests in MAX. If the tests are dependent, it is smaller than the actual number
of tests. The second parameter is the fractional degrees of freedom of the
chi-square distribution that fits the MAX null distribution. These two
parameters can be independently defined, and the fractional degrees of freedom
can be non-integer even if the fractional number of tests is an integer. We
illustrate these two parameters
using the example of MAX2 and MAX3 statistics in genetic case-control studies.
We speculate that the second parameter, the fractional degrees of freedom, is
related to the amount of ambiguity of the model inferred by the test. In
case-control genetic association studies, tests with a low value of this
parameter are able to provide definitive information about the disease
model, whereas tests with a high value are completely uncertain
about the disease model. Similar to Heisenberg's uncertainty principle, the
ability to infer the disease model and the ability to detect significant
association may not be simultaneously optimized, and this parameter seems to
measure the level of their balance.
Application of Volcano Plots in Analyses of mRNA Differential Expressions with Microarrays
A volcano plot displays an unstandardized signal (e.g. log-fold-change) against
a noise-adjusted/standardized signal (e.g. t-statistic or -log10(p-value) from
the t test). We review the basic and interactive uses of the volcano plot,
and its crucial role in understanding the regularized t-statistic. The joint
filtering gene selection criterion based on regularized statistics has a curved
discriminant line in the volcano plot, as compared to the two perpendicular
lines for the "double filtering" criterion. This review attempts to provide a
unifying framework for discussions on alternative measures of differential
expression, improved methods for estimating variance, and visual display of
microarray analysis results. We also discuss the possibility of applying volcano
plots to other fields beyond microarrays.
Comment: 8 figures
Association Mapping Approach into Type 2 Diabetes using Biomarkers and Clinical Data
The global growth in the incidence of Type 2 Diabetes (T2D) has become a major international health concern. As such, understanding the aetiology of Type 2 Diabetes is vital. This paper investigates a variety of statistical methodologies at various levels of complexity to analyse genotype data and identify biomarkers that show evidence of increased susceptibility to T2D and related traits. A critical overview of several selected statistical methods for population-based association mapping, particularly case-control genetic association analysis, is presented. The dataset used in this paper, which includes 3435 female subjects (cases and controls) with genotype information across 879071 Single Nucleotide Polymorphisms (SNPs), is discussed. Quality control steps are applied to the dataset in a pre-processing phase to remove samples and markers that fail the quality control tests. Association analysis is discussed to address which statistical method is appropriate for the dataset. Our genetic association analysis produces promising results: the allelic association test showed one SNP passing the genome-wide significance threshold of 5×10^-8, namely rs10519107 (Odds Ratio (OR) = 0.7409, P-value (P) = 1.813×10^-9), while several SNPs passed the suggestive association threshold of 5×10^-6 and could be worth further investigation. Furthermore, Logistic Regression analysis adjusted for multiple confounding factors indicated that none of the genotyped SNPs passed the genome-wide significance threshold of 5×10^-8. Nevertheless, four SNPs (rs10519107, rs4368343, rs6848779, rs11729955) passed the suggestive association threshold.
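As a rough illustration of the allelic association test and the two thresholds quoted in this abstract, the following Python sketch computes an odds ratio, a 1-degree-of-freedom chi-square statistic, and a P-value from a 2x2 table of allele counts. The function names and the allele counts are made up for illustration; they are not the study's data or its software.

```python
import math

GENOME_WIDE = 5e-8   # genome-wide significance threshold from the abstract
SUGGESTIVE = 5e-6    # suggestive association threshold from the abstract

def allelic_test(case_minor, case_major, ctrl_minor, ctrl_major):
    """Allelic test on a 2x2 table of allele counts.

    Returns (odds ratio, chi-square statistic, P-value with 1 df).
    """
    a, b, c, d = case_minor, case_major, ctrl_minor, ctrl_major
    n = a + b + c + d
    odds_ratio = (a * d) / (b * c)
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return odds_ratio, chi2, p_value

def classify(p_value):
    """Label a P-value against the two thresholds."""
    if p_value < GENOME_WIDE:
        return "genome-wide significant"
    if p_value < SUGGESTIVE:
        return "suggestive"
    return "not significant"

# Hypothetical allele counts, chosen only to exercise the code.
odds_ratio, chi2, p = allelic_test(300, 700, 400, 600)
print(round(odds_ratio, 4), round(chi2, 2), classify(p))
```

An odds ratio below 1, as in the made-up counts here and for the reported SNP, means the counted allele is less frequent in cases than in controls.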
Exploring Case-Control Genetic Association Tests Using Phase Diagrams
Background: Using a new concept called the "phase diagram", we compare two
commonly used genotype-based tests for case-control genetic analysis. One is
the Cochran-Armitage trend test (the CAT test with the heterozygote score set
to 0.5, or CAT0.5); the other (called MAX2) is the maximum of two chi-square
test results: one from the
two-by-two genotype count table that combines the baseline homozygotes and
heterozygotes, and another from the table that combines heterozygotes with risk
homozygotes. CAT0.5 is more suitable for multiplicative disease models and MAX2
is better for dominant/recessive models.
Methods: We define the CAT0.5-MAX2 phase diagram on the disease model space
such that regions where MAX2 is more powerful than CAT0.5 are separated from
regions where CAT0.5 is more powerful; the task is to choose the
appropriate parameterization to make the separation possible.
Results: We find that using the difference of allele frequencies
and the difference of Hardy-Weinberg disequilibrium coefficients
can separate the two phases well, and the phase boundaries
are determined by the angle defined by these two quantities, which is an
improvement over disease model selection using only one of them.
Conclusions: We argue that phase diagrams similar to the one for CAT0.5-MAX2
have graphical appeal for understanding the power performance of various tests,
clarifying simulation schemes, summarizing case-control datasets, and guessing
the possible mode of inheritance.
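The two tests compared in this abstract, and the two quantities used to parameterize the phase diagram, can be sketched as follows. This is a minimal illustration that assumes the standard textbook form of the Cochran-Armitage trend statistic and the usual definition of the Hardy-Weinberg disequilibrium coefficient (risk-homozygote frequency minus squared risk-allele frequency). The function names and genotype counts are made up, and the formula for the phase-boundary angle is not reproduced because it is not given in this listing.

```python
def trend_test(cases, controls, scores=(0.0, 0.5, 1.0)):
    """Cochran-Armitage trend statistic for genotype counts ordered as
    (baseline homozygote, heterozygote, risk homozygote)."""
    totals = [r + s for r, s in zip(cases, controls)]
    n_cases, n_controls = sum(cases), sum(controls)
    n = n_cases + n_controls
    xr = sum(x * r for x, r in zip(scores, cases))
    xn = sum(x * m for x, m in zip(scores, totals))
    xxn = sum(x * x * m for x, m in zip(scores, totals))
    return (n * (n * xr - n_cases * xn) ** 2
            / (n_cases * n_controls * (n * xxn - xn ** 2)))

def chi2_2x2(a, b, c, d):
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def max2(cases, controls):
    """Maximum of the two collapsed-table chi-square statistics."""
    r0, r1, r2 = cases
    s0, s1, s2 = controls
    het_with_risk = chi2_2x2(r0, r1 + r2, s0, s1 + s2)
    het_with_baseline = chi2_2x2(r0 + r1, r2, s0 + s1, s2)
    return max(het_with_risk, het_with_baseline)

def allele_freq_and_hwd(genotypes):
    """Risk-allele frequency p and HWD coefficient P(risk homozygote) - p^2."""
    n0, n1, n2 = genotypes
    total = n0 + n1 + n2
    p = (n1 + 2 * n2) / (2 * total)
    return p, n2 / total - p * p

# Hypothetical genotype counts with a recessive-looking pattern.
cases, controls = (40, 40, 20), (45, 50, 5)
p_case, hwd_case = allele_freq_and_hwd(cases)
p_ctrl, hwd_ctrl = allele_freq_and_hwd(controls)
print(round(trend_test(cases, controls), 3), round(max2(cases, controls), 3))
print(round(p_case - p_ctrl, 3), round(hwd_case - hwd_ctrl, 3))
```

The raw statistics are not directly comparable as evidence, since MAX2 does not follow the same null distribution as a single 1-degree-of-freedom test; a power comparison of the kind described in the abstract would need P-values or simulations.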
SAERMA: Stacked Autoencoder Rule Mining Algorithm for the Interpretation of Epistatic Interactions in GWAS for Extreme Obesity
One of the most important challenges in the analysis of high-throughput genetic data is the development of efficient computational methods to identify statistically significant Single Nucleotide Polymorphisms (SNPs). Genome-wide association studies (GWAS) use single-locus analysis where each SNP is independently tested for association with phenotypes. The limitation of this approach, however, is its inability to explain genetic variation in complex diseases. Alternative approaches are required to model the intricate relationships between SNPs. Our proposed approach extends GWAS by combining deep learning stacked autoencoders (SAEs) and association rule mining (ARM) to identify epistatic interactions between SNPs. Following traditional GWAS quality control and association analysis, the most significant SNPs are selected and used in the subsequent analysis to investigate epistasis. SAERMA controls the classification results produced in the final fully connected multi-layer feedforward artificial neural network (MLP) by manipulating the interestingness measures, support and confidence, in the rule generation process. The best classification results were achieved with 204 SNPs compressed to 100 units (77% AUC, 77% SE, 68% SP, 53% Gini, logloss=0.58, and MSE=0.20), although it was possible to achieve 73% AUC (77% SE, 63% SP, 45% Gini, logloss=0.62, and MSE=0.21) with 50 hidden units, both supported by close model interpretation.
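The two interestingness measures named in this abstract, support and confidence, have standard definitions in association rule mining, sketched below in Python. The item labels, the toy records, and the function names are hypothetical; this is not SAERMA's implementation, only the textbook definitions applied to genotype-style items.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    items = set(itemset)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent)."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the rule never fires, so report zero confidence
    return support(transactions, set(antecedent) | set(consequent)) / base

# Each record is a set of items: hypothetical SNP genotypes plus a label.
records = [
    {"snp1=AA", "snp2=AG", "case"},
    {"snp1=AA", "snp2=AG", "case"},
    {"snp1=AA", "snp2=GG", "control"},
    {"snp1=AG", "snp2=AG", "control"},
]

rule_lhs = {"snp1=AA", "snp2=AG"}
print(support(records, rule_lhs))               # 0.5
print(confidence(records, rule_lhs, {"case"}))  # 1.0
```

Raising the minimum support or confidence required of a rule reduces the number of rules generated, which is the kind of control over rule generation that the abstract refers to.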
Deep Learning Classification of Polygenic Obesity using Genome Wide Association Study SNPs
In this paper, association results from genome-wide association studies (GWAS) are combined with a deep learning framework to test the predictive capacity of statistically significant single nucleotide polymorphisms (SNPs) associated with the obesity phenotype. Our approach demonstrates the potential of deep learning as a powerful framework for GWAS analysis that can capture information about SNPs and the important interactions between them. Basic statistical methods and techniques for the analysis of genetic SNP data from population-based genome-wide studies have been considered. Statistical association testing between individual SNPs and obesity was conducted under an additive model using logistic regression. Four subsets of loci after quality control (QC) and association analysis were selected: P-values lower than 1x10^-5 (5 SNPs), 1x10^-4 (32 SNPs), 1x10^-3 (248 SNPs) and 1x10^-2 (2465 SNPs). A deep learning classifier is initialised using these sets of SNPs and fine-tuned to classify obese and non-obese observations. Using a deep learning classifier model and genetic variants with P-value < 1x10^-2 (2465 SNPs), it was possible to obtain the following results: SE=0.9604, SP=0.9712, Gini=0.9817, LogLoss=0.1150, AUC=0.9908 and MSE=0.0300. As the P-value increased, an evident deterioration in performance was observed. Results demonstrate that single SNP analysis fails to capture the cumulative effect of less significant variants and their overall contribution to the outcome in disease prediction, which is captured using a deep learning framework.
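The performance measures reported in this abstract (SE, SP, AUC, Gini, LogLoss, MSE) can be computed from labels and predicted probabilities as in the Python sketch below. It assumes the common convention Gini = 2 x AUC - 1, which is an assumption here, although it agrees with the reported AUC and Gini values up to rounding. The labels, probabilities, and function name are made up.

```python
import math

def classification_metrics(y_true, y_prob, threshold=0.5):
    """Sensitivity, specificity, AUC, Gini, log loss and MSE for binary labels."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    n = len(y_true)
    sensitivity = sum(p >= threshold for p in pos) / len(pos)
    specificity = sum(p < threshold for p in neg) / len(neg)
    # AUC as the fraction of (positive, negative) pairs ranked correctly,
    # counting ties as half.
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    auc = wins / (len(pos) * len(neg))
    eps = 1e-15  # keeps log() finite for probabilities of exactly 0 or 1
    logloss = -sum(y * math.log(max(p, eps))
                   + (1 - y) * math.log(max(1 - p, eps))
                   for y, p in zip(y_true, y_prob)) / n
    mse = sum((y - p) ** 2 for y, p in zip(y_true, y_prob)) / n
    return {"SE": sensitivity, "SP": specificity, "AUC": auc,
            "Gini": 2 * auc - 1, "LogLoss": logloss, "MSE": mse}

# Made-up labels (1 = obese, 0 = non-obese) and predicted probabilities.
labels = [1, 1, 1, 0, 0, 0]
probs = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
for name, value in classification_metrics(labels, probs).items():
    print(name, round(value, 4))
```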
Copy-number-variation and copy-number-alteration region detection by cumulative plots
Background: Regions with copy number variations (in germline cells) or copy
number alterations (in somatic cells) are of great interest for human disease
gene mapping and cancer studies. They represent a new type of mutation and are
larger in scale than single nucleotide polymorphisms. Using genotyping
microarrays for copy number variation detection has become standard, and there
is a need for improving analysis methods. Results: We apply the cumulative plot
to the detection of regions with copy number variation/alteration, on samples
taken from a chronic lymphocytic leukemia patient. Two sets of whole-genome
genotyping of 317k single nucleotide polymorphisms, one from the normal cell
and another from the cancer cell, are analyzed. We demonstrate the utility of
the cumulative plot in detecting a 9Mb (9 x 10^6 bases) hemizygous deletion and
a 1Mb homozygous deletion on chromosome 13. We also show the possibility of
detecting smaller copy number variation/alteration regions below the 100kb range.
Conclusions: As a graphic tool, the cumulative plot is an intuitive and
scale-free (window-less) way of detecting copy number variation/alteration
regions, especially when such regions are small.
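The idea of a window-less cumulative plot can be illustrated with a short Python sketch. It assumes that each SNP contributes one numeric signal (for example an intensity log ratio) and that the curve is the running sum of the signal after subtracting its mean, so that a deleted region shows up as a sustained downward slope. The input values, the function names, and the minimum-sum segment search used to locate the region are illustrative choices, not the procedure of the paper.

```python
from itertools import accumulate

def cumulative_curve(values):
    """Running sum of the signal after subtracting its overall mean."""
    baseline = sum(values) / len(values)
    return list(accumulate(v - baseline for v in values))

def steepest_descent_segment(values):
    """Contiguous segment with the most negative centered sum.

    Returns inclusive (start, end) indices, or None if no segment is
    negative. No window size is needed, only one pass over the data.
    """
    baseline = sum(values) / len(values)
    best_sum, best = 0.0, None
    current_sum, current_start = 0.0, 0
    for i, v in enumerate(values):
        if current_sum >= 0:          # start a fresh segment here
            current_sum, current_start = 0.0, i
        current_sum += v - baseline
        if current_sum < best_sum:
            best_sum, best = current_sum, (current_start, i)
    return best

# Made-up signal: 25 markers, with markers 10..14 in a deleted region.
signal = [0.0] * 10 + [-1.0] * 5 + [0.0] * 10
curve = cumulative_curve(signal)
print(steepest_descent_segment(signal))  # (10, 14)
```

On this toy signal the curve rises up to marker 9, falls through markers 10 to 14, and rises again, so the turning points of the curve mark the boundaries of the region whatever its length.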