Optimal Methods for Using Posterior Probabilities in Association Testing
Objective: The use of haplotypes to impute the genotypes of unmeasured single nucleotide variants continues to rise in popularity. Simulation results suggest that the use of the dosage as a one-dimensional summary statistic of imputation posterior probabilities may be optimal both in terms of statistical power and computational efficiency; however, little theoretical understanding is available to explain and unify these simulation results. In our analysis, we provide a theoretical foundation for the use of the dosage as a one-dimensional summary statistic of genotype posterior probabilities from any technology. Methods: We analytically evaluate the dosage, mode and the more general set of all one-dimensional summary statistics of two-dimensional (three posterior probabilities that must sum to 1) genotype posterior probability vectors. Results: We prove that the dosage is an optimal one-dimensional summary statistic under a typical linear disease model and is robust to violations of this model. Simulation results confirm our theoretical findings. Conclusions: Our analysis provides a strong theoretical basis for the use of the dosage as a one-dimensional summary statistic of genotype posterior probability vectors in related tests of genetic association across a wide variety of genetic disease models
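As an illustration of the two summary statistics compared in this abstract, the following minimal sketch computes the dosage (the expected allele count, 0·p0 + 1·p1 + 2·p2) and the mode (the most probable genotype) from posterior probability vectors. The function names and the example values are hypothetical and do not come from the paper.

```python
import numpy as np

def dosage(posteriors):
    """Expected allele count per individual: 0*p0 + 1*p1 + 2*p2."""
    p = np.asarray(posteriors, dtype=float)
    if p.ndim != 2 or p.shape[1] != 3:
        raise ValueError("expected an (n_individuals, 3) array")
    if not np.allclose(p.sum(axis=1), 1.0):
        raise ValueError("each posterior vector must sum to 1")
    return p @ np.array([0.0, 1.0, 2.0])

def mode_call(posteriors):
    """Most probable genotype (0, 1 or 2 copies of the allele)."""
    return np.argmax(np.asarray(posteriors, dtype=float), axis=1)

# Hypothetical posterior vectors for three individuals.
example = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.0, 1.0, 0.0],
])
example_dosage = dosage(example)
example_mode = mode_call(example)
```

The dosage keeps the uncertainty in the posterior vector as a fractional allele count, whereas the mode discards it by rounding to a single genotype call.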
Taming Nonconvexity in Kernel Feature Selection---Favorable Properties of the Laplace Kernel
Kernel-based feature selection is an important tool in nonparametric
statistics. Despite many practical applications of kernel-based feature
selection, there is little statistical theory available to support the method.
A core challenge is that the objective functions of the optimization problems used to
define kernel-based feature selection are nonconvex. The literature has only
studied the statistical properties of the global optima, which is a
mismatch, given that the gradient-based algorithms available for nonconvex
optimization are only able to guarantee convergence to local minima. Studying
the full landscape associated with kernel-based methods, we show that feature
selection objectives using the Laplace kernel (and other $\ell_1$ kernels) come
with statistical guarantees that other kernels, including the ubiquitous
Gaussian kernel (or other $\ell_2$ kernels) do not possess. Based on a sharp
characterization of the gradient of the objective function, we show that
$\ell_1$ kernels eliminate unfavorable stationary points that appear when using
an $\ell_2$ kernel. Armed with this insight, we establish statistical
guarantees for kernel-based feature selection which do not require
reaching the global minima. In particular, we establish model-selection
consistency of $\ell_1$-kernel-based feature selection in recovering main
effects and hierarchical interactions in the nonparametric setting with samples.
Comment: 33 pages main text
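To make the contrast between the two kernel families concrete, here is a minimal sketch of a Laplace kernel, which depends on the absolute coordinate differences, next to a Gaussian kernel, which depends on the squared differences. The per-feature nonnegative weights `beta` are an assumed parametrization for illustration. The abstract does not state the objective used in the paper, so this is not the authors' formulation.

```python
import numpy as np

def laplace_kernel(x, y, beta):
    """exp(-sum_j beta_j * |x_j - y_j|); beta is an assumed weight vector."""
    x, y, beta = (np.asarray(a, dtype=float) for a in (x, y, beta))
    return float(np.exp(-np.sum(beta * np.abs(x - y))))

def gaussian_kernel(x, y, beta):
    """exp(-sum_j beta_j * (x_j - y_j)**2); beta is an assumed weight vector."""
    x, y, beta = (np.asarray(a, dtype=float) for a in (x, y, beta))
    return float(np.exp(-np.sum(beta * (x - y) ** 2)))
```

In this parametrization, setting `beta[j]` to zero removes feature `j` from the kernel, which is one way a weight vector can encode feature selection.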
Kernel Learning in Ridge Regression "Automatically" Yields Exact Low Rank Solution
We consider kernels of the form $(x, x') \mapsto \phi(\|x - x'\|_\Sigma^2)$
parametrized by $\Sigma$. For such kernels, we study a variant of the kernel
ridge regression problem which simultaneously optimizes the prediction function
and the parameter $\Sigma$ of the reproducing kernel Hilbert space. The
eigenspace of the $\Sigma$ learned from this kernel ridge regression problem
can inform us which directions in covariate space are important for prediction.
Assuming that the covariates have nonzero explanatory power for the response
only through a low dimensional subspace (central mean subspace), we find that
the global minimizer of the finite sample kernel learning objective is also low
rank with high probability. More precisely, the rank of the minimizing $\Sigma$
is with high probability bounded by the dimension of the central mean subspace.
This phenomenon is interesting because the low rankness property is achieved
without using any explicit regularization of $\Sigma$, e.g., nuclear norm
penalization.
Our theory establishes a correspondence between the observed phenomenon and the
notion of low rank set identifiability from the optimization literature. The
low rankness property of the finite sample solutions exists because the
population kernel learning objective grows "sharply" when moving away from its
minimizers in any direction perpendicular to the central mean subspace.
Comment: Add code links and correct a figur
Geometric Framework for Evaluating Rare Variant Tests of Association
The wave of next-generation sequencing data has arrived. However, many questions still remain about how to best analyze sequence data, particularly the contribution of rare genetic variants to human disease. Numerous statistical methods have been proposed to aggregate association signals across multiple rare variant sites in an effort to increase statistical power; however, the precise relation between the tests is often not well understood. We present a geometric representation for rare variant data in which rare allele counts in case and control samples are treated as vectors in Euclidean space. The geometric framework facilitates a rigorous classification of existing rare variant tests into two broad categories: tests for a difference in the lengths of the case and control vectors, and joint tests for a difference in either the lengths or angles of the two vectors. We demonstrate that genetic architecture of a trait, including the number and frequency of risk alleles, directly relates to the behavior of the length and joint tests. Hence, the geometric framework allows prediction of which tests will perform best under different disease models. Furthermore, the structure of the geometric framework immediately suggests additional classes and types of rare variant tests. We consider two general classes of tests which show robustness to noncausal and protective variants. The geometric framework introduces a novel and unique method to assess current rare variant methodology and provides guidelines for both applied and theoretical researchers
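The geometric view described above can be sketched as follows: rare allele counts at each variant site form one vector for cases and one for controls, and the two quantities the abstract refers to are the vector lengths and the angle between them. This sketch assumes the ordinary Euclidean norm. The function name and example counts are hypothetical, and the code computes only the geometric quantities, not any of the test statistics from the paper.

```python
import numpy as np

def lengths_and_angle(case_counts, control_counts):
    """Return (case length, control length, angle in radians)."""
    a = np.asarray(case_counts, dtype=float)
    b = np.asarray(control_counts, dtype=float)
    if a.ndim != 1 or a.shape != b.shape:
        raise ValueError("both inputs must be 1-D with one entry per site")
    len_a = float(np.linalg.norm(a))
    len_b = float(np.linalg.norm(b))
    if len_a == 0.0 or len_b == 0.0:
        raise ValueError("angle is undefined for an all-zero count vector")
    # Clip guards against rounding slightly outside [-1, 1].
    cosine = np.clip(a @ b / (len_a * len_b), -1.0, 1.0)
    return len_a, len_b, float(np.arccos(cosine))
```

Two vectors with equal lengths but a large angle, for example rare alleles concentrated at different sites in cases and controls, would differ only in the angle component.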
Powerful Method for Including Genotype Uncertainty in Tests of Hardy-Weinberg Equilibrium
The use of posterior probabilities to summarize genotype uncertainty is pervasive across genotype, sequencing and imputation platforms. Prior work in many contexts has shown the utility of incorporating genotype uncertainty (posterior probabilities) in downstream statistical tests. Typical approaches to incorporating genotype uncertainty when testing Hardy-Weinberg equilibrium tend to lack calibration in the type I error rate, especially as genotype uncertainty increases. We propose a new approach in the spirit of genomic control that properly calibrates the type I error rate, while yielding improved power to detect deviations from Hardy-Weinberg Equilibrium. We demonstrate the improved performance of our method on both simulated and real genotypes
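For context, the classical one-degree-of-freedom chi-square test of Hardy-Weinberg equilibrium on hard-called genotype counts can be sketched as below. This is only the textbook baseline without genotype uncertainty, not the method proposed in this abstract. The function name is hypothetical.

```python
import math

def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square HWE test on genotype counts; returns (statistic, p-value)."""
    n = n_aa + n_ab + n_bb
    if n <= 0:
        raise ValueError("at least one genotype is required")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        raise ValueError("test is undefined for a monomorphic site")
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

For instance, counts of 25, 50 and 25 match the Hardy-Weinberg proportions for an allele frequency of 0.5 exactly, while counts of 40, 20 and 40 show a strong heterozygote deficit at the same allele frequency.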
Comment: A Fruitful Resolution to Simpson's Paradox via Multiresolution Inference
Simpson's Paradox is really a Simple Paradox if one at all. Peeling away the paradox is as easy (or hard) as avoiding a comparison of apples and oranges, a concept requiring no mention of causality. We show how the commonly adopted notation has committed the gross-ery mistake of tagging unlike fruit with alike labels. Hence, the "fruitful" question to ask is not "Do we condition on the third variable?" but rather "Are two fruits, which appear similar, actually similar at their core?" We introduce the concept of intrinsic similarity to escape this bind. The notion of "core" depends on how deep one looks; the multiresolution inference framework provides a natural way to define intrinsic similarity at the resolution appropriate for the treatment. To harvest the fruits of this insight, we will need to estimate intrinsic similarity, which often results in an indirect conditioning on the "third variable." A ripening estimation theory shows that the standard treatment comparisons, unconditional or conditional on the third variable, are low hanging fruit but often rotten. We pose assumptions to pluck away higher-resolution (more conditional) comparisons; the multiresolution framework allows us to rigorously assess the price of these assumptions against the resulting yield. One such assessment gives us Simpson's Warning: less conditioning is most likely to lead to serious bias when Simpson's Paradox appears.
Rorc restrains the potency of ST2+ regulatory T cells in ameliorating intestinal graft-versus-host disease
Soluble stimulation-2 (ST2) is increased during graft-versus-host disease (GVHD), while Tregs that express ST2 prevent GVHD through unknown mechanisms. Transplantation of Foxp3- T cells and Tregs that were collected and sorted from different Foxp3 reporter mice indicated that in mice that developed GVHD, ST2+ Tregs were thymus derived and predominantly localized to the intestine. ST2-/- Treg transplantation was associated with reduced total intestinal Treg frequency and activation. ST2-/- versus WT intestinal Treg transcriptomes showed decreased Treg functional markers and, reciprocally, increased Rorc expression. Rorc-/- T cell transplantation enhanced the frequency and function of intestinal ST2+ Tregs and reduced GVHD through decreased gut-infiltrating soluble ST2-producing type 1 and increased IL-4/IL-10-producing type 2 T cells. Cotransfer of ST2+ Tregs sorted from Rorc-/- mice with WT CD25-depleted T cells decreased GVHD severity and mortality, increased intestinal ST2+KLRG1+ Tregs, and decreased type 1 T cells after transplantation, indicating an intrinsic mechanism. Ex vivo IL-33-stimulated Tregs (TregIL-33) expressed higher amphiregulin and displayed better immunosuppression, and adoptive transfer prevented GVHD better than control Tregs or TregIL-33 cultured with IL-23/IL-17. Amphiregulin blockade by neutralizing antibody in vivo abolished the protective effect of TregIL-33. Our data show that inverse expression of ST2 and RORγt in intestinal Tregs determines GVHD and that TregIL-33 has potential as a cellular therapy avenue for preventing GVHD
Hierarchical accompanying and inhibiting patterns on the spatial arrangement of taxis' local hotspots
Due to the large volume of recording, the complete spontaneity, and the
flexible pick-up and drop-off locations, taxi data portrays a realistic and
detailed picture of urban space use to a certain extent. The spatial
arrangement of pick-up and drop-off hotspots reflects the organizational space,
which has received attention in urban structure studies. Previous studies
mainly explore the hotspots at a large scale by visual analysis or some simple
indexes, where the hotspots usually cover the entire central business district,
train stations, or dense residential areas, reaching a radius of hundreds or
even thousands of meters. However, the spatial arrangement patterns of
small-scale hotspots, reflecting the specific popular pick-up and drop-off
locations, have not received much attention. Using two taxi trajectory datasets
in Wuhan and Beijing, China, this study quantitatively explores the spatial
arrangement of fine-grained pick-up and drop-off local hotspots with different
levels of popularity, where the sizes are adaptively set as 90m*90m in Wuhan
and 105m*105m in Beijing according to the local hotspot identification method.
Results show that popular hotspots tend to be surrounded by less popular
hotspots, but the existence of less popular hotspots is inhibited in regions
with a large number of popular hotspots. We use the terms hierarchical
accompanying and inhibiting patterns for these two spatial configurations.
Finally, to uncover the underlying mechanism, a KNN-based model is proposed to
reproduce the spatial distribution of other less popular hotspots according to
the most popular ones. These findings help decision-makers construct reasonable
urban minimum units for precise traffic and disease control, as well as plan a
more humane spatial arrangement of points of interest