Efficient test for nonlinear dependence of two continuous variables

Author: Hongbao Cao
Li Jin
Momiao Xiong
Yi Li
Yi Wang
Yin Yao Shugart
Publication venue: Springer Nature
Publication date: 01/01/2015

Power comparison from the simulation study across Gaussian noise levels (mean = 0; variance = 1/9, 1/4, 4, and 9). (XLSX 11 kb)

Springer - Publisher Connector

FigShare

Efficient test for nonlinear dependence of two continuous variables

Author: A Gretton
AC Aitken
B Li
B Murrell
B Stroustrup
CF Dietrich
CGAR Network
D Albanese
D Reshef
DN Reshef
DS Burke
F Galton
FE Croxton
GE Wilding
GJ Székely
H Kirikoshi
H Scheffe
Hongbao Cao
J Cohen
J Jiang
JB Kinney
JL Myers
K Pearson
L Grosse
L Tierney
Li Jin
MG Kendall
Momiao Xiong
MR Kosorok
MZ Dieter
N Lockyer
NS Altman
P Good
P Huber
PS Horn
R Natrajan
SA Ha
SJ Devlin
T Zhang
WS Cleveland
Y Tanaka
Yi Li
Yi Wang
Yin Yao Shugart
Publication venue: Springer Science and Business Media LLC

Crossref

knnAUC: an open-source R package for detecting nonlinear dependence between one continuous variable and one binary variable

Author: Hao Meng
Jin Li
Li Yi
Liu Jie
Liu Xiaoyu
Ma Yanyun
Shugart Yin Y
Wang Jiucun
Wang Yi
Xiong Momiao
Yuan Zhenghong
Zhou Weichen
Publication venue: Springer Science and Business Media LLC
Publication date: 01/11/2018

Background: Testing the dependence of two variables is one of the fundamental tasks in statistics. In this work, we developed an open-source R package (knnAUC) for detecting nonlinear dependence between one continuous variable X and one binary dependent variable Y (0 or 1). Results: We addressed this problem with knnAUC (k-nearest neighbors AUC test; the R package is available at https://sourceforge.net/projects/knnauc/). In the knnAUC software framework, we first resampled a dataset to obtain training and testing datasets according to the sample ratio (from 0 to 1), and then constructed a k-nearest neighbors classifier to obtain yhat (the estimated probability that y = 1) for testy (the true labels of the testing dataset). Finally, we calculated the AUC (area under the receiver operating characteristic curve) estimator and tested whether it is greater than 0.5. To evaluate knnAUC against seven other popular methods, we performed extensive simulations and compared the false positive rates and statistical power of the eight methods using both simulated and real datasets (chronic hepatitis B datasets and kidney cancer RNA-seq datasets). Conclusions: We concluded that knnAUC is an efficient R package for testing nonlinear dependence between one continuous variable and one binary dependent variable, especially in computational biology. https://deepblue.lib.umich.edu/bitstream/2027.42/146514/1/12859_2018_Article_2427.pd
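The pipeline described in this abstract (split, k-nearest-neighbors probability estimate, AUC, test against 0.5) can be sketched roughly as follows. This is an illustrative Python sketch, not the knnAUC R package: the function names `auc_score` and `knn_auc_test`, the default parameters, and in particular the permutation test used to judge whether the AUC exceeds 0.5 are assumptions, since the abstract does not say how that test is carried out.

```python
import numpy as np

def auc_score(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count 1/2)."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, float)
    pos, neg = scores[labels == 1], scores[labels == 0]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("both classes must be present")
    diff = pos[:, None] - neg[None, :]
    return float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))

def knn_auc_test(x, y, k=10, train_ratio=0.5, n_perm=200, seed=0):
    """Hypothetical helper: returns (AUC on the test split, permutation p-value)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, float)
    y = np.asarray(y, int)
    # Random split into training and testing parts according to train_ratio.
    order = rng.permutation(len(x))
    n_train = int(round(train_ratio * len(x)))
    train, test = order[:n_train], order[n_train:]
    # kNN estimate of P(y = 1 | x): mean label of the k nearest training points.
    dist = np.abs(x[test][:, None] - x[train][None, :])
    nearest = np.argsort(dist, axis=1)[:, :k]
    yhat = y[train][nearest].mean(axis=1)
    auc = auc_score(y[test], yhat)
    # Assumed test of AUC > 0.5: shuffle the test labels to build a null distribution.
    null = np.array([auc_score(rng.permutation(y[test]), yhat) for _ in range(n_perm)])
    p_value = (1 + np.sum(null >= auc)) / (1 + n_perm)
    return auc, float(p_value)

# Usage: a U-shaped (nonlinear) relationship that a linear score would miss.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 400)
y = (np.abs(x) > 0.5).astype(int)
print(knn_auc_test(x, y))
```

Holding the fitted `yhat` fixed and shuffling only the test labels keeps the null simulation cheap; a rank-based test on the AUC would be another reasonable choice.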

Directory of Open Access Journals

Deep Blue Documents at the University of Michigan

Quotient correlation: A sample based alternative to Pearson's correlation

Author: Zhang Zhengjun
Publication venue: Institute of Mathematical Statistics
Publication date: 01/01/2008

The quotient correlation is defined here as an alternative to Pearson's correlation that is more intuitive and flexible in cases where the tail behavior of data is important. It measures nonlinear dependence where the regular correlation coefficient is generally not applicable. One of its most useful features is a test statistic that has high power when testing nonlinear dependence in cases where Fisher's $Z$-transformation test may fail to reach the right conclusion. Unlike most asymptotic test statistics, which are either normal or $\chi^2$, this test statistic has a limiting gamma distribution (henceforth, the gamma test statistic). Beyond the common uses of correlation, the quotient correlation can easily and intuitively be adjusted to values at the tails. This adjustment generates two new concepts, the tail quotient correlation and the tail independence test statistics, which are also gamma statistics. Because there is no analogue of the correlation coefficient in extreme value theory, and no efficient tail independence test statistic exists, these two new concepts may open up a new field of study. In addition, an alternative to Spearman's rank correlation, a rank-based quotient correlation, is also defined. The advantages of using these new concepts are illustrated with simulated data and a real data analysis of internet traffic. Comment: Published at http://dx.doi.org/10.1214/009053607000000866 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)

arXiv.org e-Print Archive

CiteSeerX

Crossref

Conditional independence testing based on a nearest-neighbor estimator of conditional mutual information

Author: Runge Jakob
Publication date: 05/09/2017

Conditional independence testing is a fundamental problem underlying causal discovery and a particularly challenging task in the presence of nonlinear and high-dimensional dependencies. Here a fully non-parametric test for continuous data based on conditional mutual information combined with a local permutation scheme is presented. Through a nearest-neighbor approach, the test also adapts efficiently to non-smooth distributions due to strongly nonlinear dependencies. Numerical experiments demonstrate that the test reliably simulates the null distribution even for small sample sizes and with high-dimensional conditioning sets. The test is better calibrated than kernel-based tests utilizing an analytical approximation of the null distribution, especially for non-smooth densities, and reaches the same or higher power levels. Combining the local permutation scheme with the kernel tests leads to better calibration, but suffers in power. For smaller sample sizes and lower dimensions, the test is faster than random Fourier feature-based kernel tests if the permutation scheme is (embarrassingly) parallelized, but the runtime increases more sharply with sample size and dimensionality. Thus, more theoretical research to analytically approximate the null distribution and speed up the estimation for larger sample sizes is desirable. Comment: 17 pages, 12 figures, 1 table
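To make the combination of a nearest-neighbor conditional mutual information (CMI) estimate and a local permutation scheme more concrete, here is a small Python sketch. It is a simplified illustration under stated assumptions, not the paper's implementation: the estimator is a brute-force k-nearest-neighbor CMI estimate in the max-norm, the permutation step draws each replacement value with replacement from the `k_perm` nearest neighbors in Z (a simplification), and all function names and defaults are hypothetical.

```python
import numpy as np

def _chebyshev(a):
    """Pairwise max-norm distances between the rows of a 2-D array."""
    return np.max(np.abs(a[:, None, :] - a[None, :, :]), axis=2)

def _as_columns(v):
    v = np.asarray(v, float)
    return v.reshape(len(v), -1)

def knn_cmi(x, y, z, k=5):
    """Brute-force kNN estimate of I(X;Y|Z) in nats."""
    x, y, z = _as_columns(x), _as_columns(y), _as_columns(z)
    n = len(x)
    # psi[m - 1] equals digamma(m) for integers m = 1..n.
    psi = -np.euler_gamma + np.concatenate(([0.0], np.cumsum(1.0 / np.arange(1, n))))
    # Distance to the k-th neighbour in the joint space (column 0 is the point itself).
    eps = np.sort(_chebyshev(np.hstack([x, y, z])), axis=1)[:, k]

    def count(*parts):
        # Points strictly closer than eps in a subspace, the point itself included.
        c = np.sum(_chebyshev(np.hstack(parts)) < eps[:, None], axis=1)
        return np.maximum(c, 1)

    terms = psi[count(z) - 1] - psi[count(x, z) - 1] - psi[count(y, z) - 1]
    return float(psi[k - 1] + np.mean(terms))

def neighbour_table(z, k_perm):
    """Indices of the k_perm nearest points in Z for every sample (self included)."""
    return np.argsort(_chebyshev(_as_columns(z)), axis=1)[:, :k_perm]

def cmi_local_permutation_test(x, y, z, k=5, k_perm=5, n_perm=100, seed=0):
    """Hypothetical helper: returns (CMI estimate, local-permutation p-value)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, float)
    observed = knn_cmi(x, y, z, k)
    table = neighbour_table(z, k_perm)
    rows = np.arange(len(x))
    null = np.empty(n_perm)
    for b in range(n_perm):
        # Replace each x_i by the x value of a random neighbour of z_i, which
        # breaks the X-Y link while roughly preserving the X-Z relationship.
        picked = table[rows, rng.integers(0, k_perm, size=len(x))]
        null[b] = knn_cmi(x[picked], y, z, k)
    return observed, float((1 + np.sum(null >= observed)) / (1 + n_perm))
```

The brute-force distance matrices cost O(n^2) memory and time per evaluation, so this sketch is only practical for a few hundred samples; a tree-based neighbor search would be the natural replacement for larger data.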

arXiv.org e-Print Archive

Institute of Transport Research:Publications

Nonlinear Models Using Dirichlet Process Mixtures

Author: Neal Radford M.
Shahbaba Babak
Publication date: 01/01/2007

We introduce a new nonlinear model for classification, in which we model the joint distribution of response variable, y, and covariates, x, non-parametrically using Dirichlet process mixtures. We keep the relationship between y and x linear within each component of the mixture. The overall relationship becomes nonlinear if the mixture contains more than one component. We use simulated data to compare the performance of this new approach to a simple multinomial logit (MNL) model, an MNL model with quadratic terms, and a decision tree model. We also evaluate our approach on a protein fold classification problem, and find that our model provides substantial improvement over previous methods, which were based on Neural Networks (NN) and Support Vector Machines (SVM). Protein folding classes have a hierarchical structure. We extend our method to classification problems where a class hierarchy is available. We find that using the prior information regarding the hierarchical structure of protein folds can result in higher predictive accuracy.

arXiv.org e-Print Archive

CiteSeerX

Design Issues for Generalized Linear Models: A Review

Author: Ghosh Malay
Khuri André I.
Mukherjee Bhramar
Sinha Bikas K.
Publication venue: Institute of Mathematical Statistics
Publication date: 01/01/2006

Generalized linear models (GLMs) have been used quite effectively in the modeling of a mean response under nonstandard conditions, where discrete as well as continuous data distributions can be accommodated. The choice of design for a GLM is a very important task in the development and building of an adequate model. However, one major problem that handicaps the construction of a GLM design is its dependence on the unknown parameters of the fitted model. Several approaches have been proposed in the past 25 years to solve this problem. These approaches, however, have provided only partial solutions that apply in only some special cases, and the problem, in general, remains largely unresolved. The purpose of this article is to focus attention on the aforementioned dependence problem. We provide a survey of various existing techniques dealing with the dependence problem. This survey includes discussions concerning locally optimal designs, sequential designs, Bayesian designs and the quantile dispersion graph approach for comparing designs for GLMs. Comment: Published at http://dx.doi.org/10.1214/088342306000000105 in Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org)
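The dependence problem mentioned in this abstract can be seen in a small numerical example. The Python sketch below assumes a two-parameter logistic regression (intercept and slope) and compares two candidate designs by the determinant of the Fisher information matrix (a D-optimality style criterion). The function names, design points, and parameter values are illustrative assumptions and are not taken from the article.

```python
import numpy as np

def logistic_information(points, weights, beta):
    """Fisher information of a logistic model with intercept and slope for a weighted design."""
    points = np.asarray(points, float)
    weights = np.asarray(weights, float)
    X = np.column_stack([np.ones_like(points), points])
    p = 1.0 / (1.0 + np.exp(-(X @ np.asarray(beta, float))))
    # Each design point contributes weight * p * (1 - p) * x x^T, so the
    # information matrix depends on the unknown beta through p.
    return X.T @ ((weights * p * (1.0 - p))[:, None] * X)

def d_criterion(points, weights, beta):
    """Determinant of the information matrix (larger is better)."""
    return float(np.linalg.det(logistic_information(points, weights, beta)))

wide = ([-1.0, 1.0], [0.5, 0.5])    # hypothetical design A
narrow = ([-0.5, 0.5], [0.5, 0.5])  # hypothetical design B

for beta in ([0.0, 1.0], [0.0, 3.0]):
    print(beta, d_criterion(*wide, beta), d_criterion(*narrow, beta))
```

In this example the wider design has the larger determinant when the slope is 1, while the narrower design wins when the slope is 3, so the preferred design cannot be chosen without some knowledge of the parameters.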

arXiv.org e-Print Archive

CiteSeerX

Crossref