224 research outputs found
A Quadratically Regularized Functional Canonical Correlation Analysis for Identifying the Global Structure of Pleiotropy with NGS Data
Investigating the pleiotropic effects of genetic variants can increase
statistical power, provide important information for a deep understanding of
the complex genetic architecture of disease, and offer powerful tools for
designing effective treatments with fewer side effects. However, the current
paradigm for multiple-phenotype association analysis lacks breadth (the number
of phenotypes and genetic variants jointly analyzed) and depth (the
hierarchical structure of phenotypes and genotypes). A key issue in
high-dimensional pleiotropic analysis is to effectively extract informative
internal representations and features from high-dimensional genotype and
phenotype data.
To explore multiple levels of representation of genetic variants, learn the
internal patterns involved in disease development, and overcome critical
barriers to developing novel statistical methods and computational algorithms
for genetic pleiotropic analysis, we propose a new framework, referred to as
quadratically regularized functional CCA (QRFCCA), for association analysis.
It combines three approaches: (1) quadratically regularized matrix
factorization, (2) functional data analysis, and (3) canonical correlation
analysis (CCA). Large-scale simulations show that QRFCCA achieves much higher
power than the nine competing statistics while controlling the type I error
rate at appropriate levels. To further evaluate performance,
QRFCCA and nine other statistics are applied to the whole genome sequencing
dataset from the TwinsUK study. We identify a total of 79 genes with rare
variants and 67 genes with common variants significantly associated with the 46
traits using QRFCCA. The results show that QRFCCA substantially outperforms
the nine other statistics.
Comment: 64 pages, including 12 figures
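The abstract describes QRFCCA as combining quadratically regularized matrix
factorization, functional data analysis, and CCA. Below is a minimal sketch of
the ridge- (quadratically) regularized CCA core, assuming genotype functional
scores as X and phenotypes as Y; the function names, regularization values,
and toy sizes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

def regularized_cca(X, Y, lam_x=0.1, lam_y=0.1):
    """Ridge-regularized CCA: canonical correlations and weight matrices."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)   # centered genotype features (e.g. FPCA scores)
    Yc = Y - Y.mean(axis=0)   # centered phenotype matrix
    Sxx = Xc.T @ Xc / n + lam_x * np.eye(X.shape[1])  # regularized covariance
    Syy = Yc.T @ Yc / n + lam_y * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / n                               # cross-covariance
    Wx, Wy = inv_sqrt(Sxx), inv_sqrt(Syy)             # whitening transforms
    U, rho, Vt = np.linalg.svd(Wx @ Sxy @ Wy)  # singular values = canon. corr.
    return rho, Wx @ U, Wy @ Vt.T

# Toy run: 200 subjects, 50 genotype features, 46 traits (sizes illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
Y = rng.normal(size=(200, 46))
rho, A, B = regularized_cca(X, Y)
print(rho[:5])   # leading canonical correlations
```

The quadratic (ridge) penalty keeps the covariance matrices invertible when
the number of features approaches or exceeds the sample size, which is the
typical regime for NGS data.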
Changes from Classical Statistics to Modern Statistics and Data Science
Coordinate systems are a foundation for every quantitative science, for
engineering, and for medicine. Classical physics and statistics are built on
the Cartesian coordinate system, and classical probability and
hypothesis-testing theory can be applied only to Euclidean data. However,
modern real-world data come from natural language processing, mathematical
formulas, social networks, transportation and sensor networks, computer
vision, automation, and biomedical measurements. The Euclidean assumption is
not appropriate for non-Euclidean data. This perspective addresses the urgent
need to overcome these fundamental limitations and encourages extending
classical probability theory, hypothesis testing, diffusion models, and
stochastic differential equations from Euclidean space to non-Euclidean space.
Artificial intelligence, including natural language processing, computer
vision, graph neural networks, manifold learning, manifold regression and
inference theory, and compositional diffusion models for the automatic
compositional generation of concepts and for demystifying machine learning
systems, has developed rapidly. Differential manifold theory also provides
the mathematical foundations of deep learning and data science. We urgently
need to shift the paradigm for data analysis from classical Euclidean data
analysis to both Euclidean and non-Euclidean data analysis, and to develop
innovative methods for describing, estimating, and inferring the non-Euclidean
geometries of modern real-world datasets. A general framework for the
integrated analysis of Euclidean and non-Euclidean data, together with
composite AI, decision intelligence, and edge AI, provides powerful ideas and
strategies for fundamentally advancing AI. We expect to marry statistics with
AI, develop a unified theory of modern statistics, and drive the next
generation of AI and data science.
Comment: 37 pages
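To make the Euclidean-versus-non-Euclidean point concrete, here is a small
illustrative sketch (my own example, not from the paper): the arithmetic mean
of points on the unit sphere leaves the sphere, whereas the Fréchet mean,
which minimizes the summed squared geodesic distances, stays on the manifold.

```python
import numpy as np

def frechet_mean_sphere(points, iters=100, lr=0.5):
    """Gradient descent for the Fréchet mean of unit vectors on the sphere."""
    mu = points[0] / np.linalg.norm(points[0])
    for _ in range(iters):
        # Riemannian log map: tangent vectors at mu pointing toward each point
        dots = np.clip(points @ mu, -1.0, 1.0)
        theta = np.arccos(dots)                    # geodesic distances
        tang = points - dots[:, None] * mu         # projections onto tangent plane
        norms = np.linalg.norm(tang, axis=1)
        safe = norms > 1e-12
        logs = np.zeros_like(points)
        logs[safe] = (theta[safe] / norms[safe])[:, None] * tang[safe]
        grad = logs.mean(axis=0)                   # Riemannian gradient direction
        g = np.linalg.norm(grad)
        if g < 1e-12:
            break
        # Exponential map: move along the geodesic in direction grad
        mu = np.cos(lr * g) * mu + np.sin(lr * g) * grad / g
    return mu

pts = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
print(np.linalg.norm(pts.mean(axis=0)))  # < 1: Euclidean mean falls off sphere
print(frechet_mean_sphere(pts))          # stays on the sphere, ~[1,1,1]/sqrt(3)
```

The same pattern, replacing arithmetic averages with intrinsic (geodesic)
quantities, underlies the manifold extensions of regression, inference, and
diffusion models that the perspective calls for.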
Implication of next-generation sequencing on association studies
Background: Next-generation sequencing technologies can effectively detect the entire spectrum of genomic variation and provide a powerful tool for the systematic exploration of common, low-frequency, and rare variants across the entire genome. However, the current paradigm for genome-wide association studies (GWAS) is to catalogue and genotype common variants (MAF > 5%). Methods and study designs for testing the association of low-frequency (0.5% < MAF ≤ 5%) and rare (MAF ≤ 0.5%) variation have not been thoroughly investigated. The 1000 Genomes Project represents one such endeavour to characterize human genetic variation down to the MAF = 1% level as a foundation for association studies. In this report, we explore different strategies and study designs for GWAS in the near future, based on both the low-coverage pilot data and the exon pilot data of the 1000 Genomes Project.
Results: We investigated the linkage disequilibrium (LD) pattern among common and low-frequency SNPs and its implications for association studies. We found that LD between pairs of low-frequency alleles, and between low-frequency and common alleles, is much weaker than LD between pairs of common alleles. We examined various tagging designs, with and without statistical imputation, and compared their power against de novo resequencing for mapping causal variants under various disease models. We used the low-coverage pilot data, which contain ~14M SNPs, as a hypothetical genotype-array platform (Pilot 14M) to interrogate its impact on tag-SNP selection, mapping coverage, and the power of association tests. Even after imputation, 45.4% of low-frequency SNPs remained untaggable, and only 67.7% of low-frequency variation was covered by the Pilot 14M array.
Conclusions: These results suggest that GWAS based on SNP arrays are ill-suited for association studies of low-frequency variation.
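The weak LD between low-frequency and common alleles has a simple arithmetic
explanation: the maximum attainable r^2 between two SNPs is bounded by how
closely their allele frequencies match. A small illustrative calculation
follows (the function and example frequencies are my own, not from the paper).

```python
def max_r2(p, q):
    """Upper bound on LD r^2 between two SNPs with minor allele freqs p, q."""
    p, q = min(p, q), max(p, q)
    # D is maximized when the rarer allele occurs only on haplotypes carrying
    # the commoner allele: D_max = p(1 - q).
    d_max = p * (1 - q)
    return d_max**2 / (p * (1 - p) * q * (1 - q))

print(max_r2(0.30, 0.35))   # two common SNPs: high r^2 attainable (~0.80)
print(max_r2(0.005, 0.30))  # rare vs common: r^2 cannot exceed ~0.012
```

Since a common tag SNP can never exceed this bound for a much rarer causal
allele, arrays built from common variants necessarily leave a large share of
low-frequency variation untagged, consistent with the 45.4% figure reported.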