Efficient test for nonlinear dependence of two continuous variables
Power comparison in the simulation study across Gaussian noise levels (mean = 0; variance = 1/9, 1/4, 4 and 9). (XLSX 11 kb)
knnAUC: an open-source R package for detecting nonlinear dependence between one continuous variable and one binary variable
Abstract
Background
Testing the dependence of two variables is one of the fundamental tasks in statistics. In this work, we developed an open-source R package (knnAUC) for detecting nonlinear dependence between one continuous variable X and one binary dependent variable Y (0 or 1).
Results
We addressed this problem with knnAUC (the k-nearest neighbors AUC test; the R package is available at https://sourceforge.net/projects/knnauc/). In the knnAUC software framework, we first resampled the dataset into training and testing sets according to a sample ratio (between 0 and 1), and then built a k-nearest neighbors classifier to obtain yhat, the estimated probability that y = 1, for each observation in the testing set, whose true labels are testy. Finally, we calculated the AUC (area under the receiver operating characteristic curve) estimator and tested whether it is greater than 0.5. To evaluate knnAUC against seven other popular methods, we performed extensive simulations with the eight methods and compared their false positive rates and statistical power using both simulated and real datasets (chronic hepatitis B datasets and kidney cancer RNA-seq datasets).
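The procedure described above can be sketched in a few lines of Python. This is an illustrative sketch, not the knnAUC package itself: the function names, the default values of `ratio`, `k` and `n_perm`, and in particular the use of a label-permutation test to check whether the AUC exceeds 0.5 are assumptions made for the example; the abstract does not say how the package performs that test.

```python
import random

def knn_prob(train_x, train_y, x, k):
    """Estimate P(y = 1 | x) as the share of 1-labels among the k nearest training points."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k

def auc(scores, labels):
    """AUC via pairwise comparison of positives and negatives; ties count one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def knn_auc_test(x, y, ratio=0.5, k=5, n_perm=200, seed=0):
    """Split, fit kNN, score the test set, and test AUC > 0.5 by permuting test labels.

    The permutation test is an assumption of this sketch, not taken from the source.
    """
    if not 0 < ratio < 1:
        raise ValueError("ratio must lie strictly between 0 and 1")
    rng = random.Random(seed)
    idx = list(range(len(x)))
    rng.shuffle(idx)
    n_train = int(ratio * len(x))
    train, test = idx[:n_train], idx[n_train:]
    train_x = [x[i] for i in train]
    train_y = [y[i] for i in train]
    test_y = [y[i] for i in test]
    yhat = [knn_prob(train_x, train_y, x[i], k) for i in test]
    observed = auc(yhat, test_y)
    # Null distribution: shuffling the test labels breaks any link to yhat.
    perm = test_y[:]
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(perm)
        if auc(yhat, perm) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)

# Usage: a U-shaped dependence that a linear method would miss.
data_rng = random.Random(1)
xs = [data_rng.uniform(-1, 1) for _ in range(200)]
ys = [1 if abs(v) > 0.5 else 0 for v in xs]
auc_hat, p_value = knn_auc_test(xs, ys)
print(auc_hat, p_value)
```

Because y depends on x only through |x|, a test for linear association would have little power here, whereas the held-out AUC of the kNN classifier should be well above 0.5.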
Conclusions
We concluded that knnAUC is an efficient R package for testing nonlinear dependence between one continuous variable and one binary dependent variable, especially in computational biology.
https://deepblue.lib.umich.edu/bitstream/2027.42/146514/1/12859_2018_Article_2427.pd
Quotient correlation: A sample based alternative to Pearson's correlation
The quotient correlation is defined here as an alternative to Pearson's
correlation that is more intuitive and flexible in cases where the tail
behavior of data is important. It measures nonlinear dependence where the
regular correlation coefficient is generally not applicable. One of its most
useful features is a test statistic that has high power when testing nonlinear
dependence in cases where Fisher's Z-transformation test may fail to
reach the right conclusion. Unlike most asymptotic test statistics, which are
either normal or chi-squared, this test statistic has a limiting gamma
distribution (henceforth, the gamma test statistic). Beyond the common
uses of correlation, the quotient correlation can easily and intuitively be
adjusted to values at tails. This adjustment generates two new concepts--the
tail quotient correlation and the tail independence test statistics, which are
also gamma statistics. Due to the fact that there is no analogue of the
correlation coefficient in extreme value theory, and there does not exist an
efficient tail independence test statistic, these two new concepts may open up
a new field of study. In addition, an alternative to Spearman's rank
correlation, a rank based quotient correlation, is also defined. The advantages
of using these new concepts are illustrated with simulated data and a real data
analysis of internet traffic.
Comment: Published at http://dx.doi.org/10.1214/009053607000000866 in the
Annals of Statistics (http://www.imstat.org/aos/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
Conditional independence testing based on a nearest-neighbor estimator of conditional mutual information
Conditional independence testing is a fundamental problem underlying causal
discovery and a particularly challenging task in the presence of nonlinear and
high-dimensional dependencies. Here a fully non-parametric test for continuous
data based on conditional mutual information combined with a local permutation
scheme is presented. Through a nearest neighbor approach, the test efficiently
adapts also to non-smooth distributions due to strongly nonlinear dependencies.
Numerical experiments demonstrate that the test reliably simulates the null
distribution even for small sample sizes and with high-dimensional conditioning
sets. The test is better calibrated than kernel-based tests utilizing an
analytical approximation of the null distribution, especially for non-smooth
densities, and reaches the same or higher power levels. Combining the local
permutation scheme with the kernel tests leads to better calibration, but
suffers in power. For smaller sample sizes and lower dimensions, the test is
faster than random Fourier feature-based kernel tests if the permutation scheme
is (embarrassingly) parallelized, but the runtime increases more sharply with
sample size and dimensionality. Thus, more theoretical research to analytically
approximate the null distribution and speed up the estimation for larger sample
sizes is desirable.
Comment: 17 pages, 12 figures, 1 table
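The local permutation idea in this abstract can be illustrated with a simplified sketch. To simulate the null hypothesis that X is independent of Y given Z, the values of X are shuffled only among samples whose Z values are close, so the X-Z relationship is roughly preserved while any X-Y link is broken. The function name `local_permutation`, the scalar Z, the default `k`, and the rule for handling already-used neighbors are assumptions of this example and may differ from the paper's scheme.

```python
import random

def local_permutation(x, z, k=5, seed=0):
    """Replace each x[i] by x[j], where j is one of the k nearest neighbors of i in z.

    Each donor j is used at most once where possible; if all of a sample's
    neighbors are already taken, a repeat is allowed (a simplifying assumption).
    """
    rng = random.Random(seed)
    n = len(x)
    neighbors = []
    for i in range(n):
        order = sorted(range(n), key=lambda j: abs(z[j] - z[i]))
        nn = order[:k]          # the sample itself (distance 0) is among its neighbors
        rng.shuffle(nn)         # randomize which neighbor is tried first
        neighbors.append(nn)
    used = set()
    permuted = [None] * n
    visit = list(range(n))
    rng.shuffle(visit)          # random visiting order avoids favoring early samples
    for i in visit:
        choice = next((j for j in neighbors[i] if j not in used), None)
        if choice is None:
            choice = neighbors[i][0]
        used.add(choice)
        permuted[i] = x[choice]
    return permuted

# Usage: with x equal to the sample index, the output shows each sample's donor.
demo_rng = random.Random(2)
z_demo = [demo_rng.random() for _ in range(50)]
x_demo = list(range(50))
print(local_permutation(x_demo, z_demo, k=5)[:10])
```

Repeating this shuffle many times and recomputing the test statistic on each shuffled sample gives an empirical null distribution; since the repetitions are independent, they can be run in parallel.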
Nonlinear Models Using Dirichlet Process Mixtures
We introduce a new nonlinear model for classification, in which we model the
joint distribution of response variable, y, and covariates, x,
non-parametrically using Dirichlet process mixtures. We keep the relationship
between y and x linear within each component of the mixture. The overall
relationship becomes nonlinear if the mixture contains more than one component.
We use simulated data to compare the performance of this new approach to a
simple multinomial logit (MNL) model, an MNL model with quadratic terms, and a
decision tree model. We also evaluate our approach on a protein fold
classification problem, and find that our model provides substantial
improvement over previous methods, which were based on Neural Networks (NN) and
Support Vector Machines (SVM). Folding classes of protein have a hierarchical
structure. We extend our method to classification problems where a class
hierarchy is available. We find that using the prior information regarding the
hierarchical structure of protein folds can result in higher predictive
accuracy.
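A small numeric sketch shows how a relationship that is linear within each mixture component becomes nonlinear overall. For simplicity it uses a fixed two-component mixture with a binary response instead of a Dirichlet process mixture with a multinomial logit, and every number in `COMPONENTS` is made up for illustration.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def normal_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

# Hypothetical components: (weight, mean of x, sd of x, logit intercept, logit slope).
COMPONENTS = [
    (0.5, -2.0, 1.0, -3.0, -3.0),  # left cluster: P(y = 1) rises as x decreases
    (0.5,  2.0, 1.0, -3.0,  3.0),  # right cluster: P(y = 1) rises as x increases
]

def prob_y1(x, components=COMPONENTS):
    """P(y = 1 | x) under the joint mixture: responsibility-weighted logistic curves."""
    dens = [w * normal_pdf(x, mu, sd) for w, mu, sd, _, _ in components]
    total = sum(dens)
    return sum(d / total * sigmoid(a + b * x)
               for d, (_, _, _, a, b) in zip(dens, components))

# Usage: the overall curve is U-shaped although each component is linear in the logit.
for x in (-2.0, 0.0, 2.0):
    print(x, round(prob_y1(x), 3))
```

Because the mixture models the joint distribution of x and y, the component weights seen at a given x depend on x itself, and that dependence is what bends the overall curve.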
Design Issues for Generalized Linear Models: A Review
Generalized linear models (GLMs) have been used quite effectively in the
modeling of a mean response under nonstandard conditions, where discrete as
well as continuous data distributions can be accommodated. The choice of design
for a GLM is a very important task in the development and building of an
adequate model. However, one major problem that handicaps the construction of a
GLM design is its dependence on the unknown parameters of the fitted model.
Several approaches have been proposed in the past 25 years to solve this
problem. These approaches, however, have provided only partial solutions that
apply in only some special cases, and the problem, in general, remains largely
unresolved. The purpose of this article is to focus attention on the
aforementioned dependence problem. We provide a survey of various existing
techniques dealing with the dependence problem. This survey includes
discussions concerning locally optimal designs, sequential designs, Bayesian
designs and the quantile dispersion graph approach for comparing designs for
GLMs.
Comment: Published at http://dx.doi.org/10.1214/088342306000000105 in
Statistical Science (http://www.imstat.org/sts/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
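The dependence problem discussed in this abstract can be seen in a small example. For a logistic regression with one covariate, the Fisher information of a design depends on the unknown coefficients through the variance term p(1 - p), so the design that maximizes the determinant of the information matrix (the D-criterion) changes with the parameter values. The two designs and the two parameter settings below are hypothetical and chosen only for illustration.

```python
import math

def logistic_information_det(points, beta0, beta1):
    """Determinant of the 2x2 Fisher information for an equally weighted design.

    Model: logit P(y = 1 | x) = beta0 + beta1 * x.
    """
    w = 1.0 / len(points)
    m00 = m01 = m11 = 0.0
    for x in points:
        p = 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))
        v = p * (1.0 - p)       # variance function of the Bernoulli response
        m00 += w * v
        m01 += w * v * x
        m11 += w * v * x * x
    return m00 * m11 - m01 ** 2

design_a = [-1.0, 1.0]
design_b = [1.0, 3.0]

# Usage: the preferred design flips when the assumed parameters change.
for beta0, beta1 in [(0.0, 1.0), (-2.0, 1.0)]:
    det_a = logistic_information_det(design_a, beta0, beta1)
    det_b = logistic_information_det(design_b, beta0, beta1)
    print((beta0, beta1), round(det_a, 4), round(det_b, 4))
```

Under the first parameter setting design A has the larger determinant, and under the second design B does, which is why locally optimal designs require a prior guess of the parameters.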