Measuring dependence powerfully and equitably
Given a high-dimensional data set we often wish to find the strongest
relationships within it. A common strategy is to evaluate a measure of
dependence on every variable pair and retain the highest-scoring pairs for
follow-up. This strategy works well if the statistic used is equitable [Reshef
et al. 2015a], i.e., if, for some measure of noise, it assigns similar scores
to equally noisy relationships regardless of relationship type (e.g., linear,
exponential, periodic).
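The screening strategy described above can be sketched in a few lines. This is a minimal illustration, not the authors' method: the helper names `abs_pearson` and `top_pairs` are hypothetical, and absolute Pearson correlation is used only as a stand-in score. It is not equitable, so in practice it would be replaced by the statistic of interest.

```python
from itertools import combinations

import numpy as np


def abs_pearson(x, y):
    """Stand-in dependence score: |Pearson r|. Illustration only; not equitable."""
    return abs(np.corrcoef(x, y)[0, 1])


def top_pairs(data, score=abs_pearson, k=5):
    """Score every column pair of `data` (n_samples x n_vars); return the k best.

    Each result is a tuple (score, i, j) with i < j indexing columns.
    """
    n_vars = data.shape[1]
    scored = [
        (score(data[:, i], data[:, j]), i, j)
        for i, j in combinations(range(n_vars), 2)
    ]
    scored.sort(reverse=True)  # highest-scoring pairs first
    return scored[:k]


# Toy data: column 1 depends on column 0, column 2 is independent noise.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
data = np.column_stack(
    [x, 2 * x + 0.1 * rng.normal(size=500), rng.normal(size=500)]
)
best = top_pairs(data, k=1)
print(best)
```

Passing the score as a function keeps the pairwise loop independent of the statistic, so a different measure of dependence can be substituted without changing the screening code.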
In this paper, we introduce and characterize a population measure of
dependence called MIC*. We show three ways that MIC* can be viewed: as the
population value of MIC, a highly equitable statistic from [Reshef et al.
2011], as a canonical "smoothing" of mutual information, and as the supremum of
an infinite sequence defined in terms of optimal one-dimensional partitions of
the marginals of the joint distribution. Based on this theory, we introduce an
efficient approach for computing MIC* from the density of a pair of random
variables, and we define a new consistent estimator MICe for MIC* that is
efficiently computable. In contrast, there is no known polynomial-time
algorithm for computing the original equitable statistic MIC. We show through
simulations that MICe has better bias-variance properties than MIC. We then
introduce and prove the consistency of a second statistic, TICe, that is a
trivial side-product of the computation of MICe and whose goal is powerful
independence testing rather than equitability.
We show in simulations that MICe and TICe have good equitability and power
against independence respectively. The analyses here complement a more in-depth
empirical evaluation of several leading measures of dependence [Reshef et al.
2015b] that shows state-of-the-art performance for MICe and TICe.
Comment: Yakir A. Reshef and David N. Reshef are co-first authors, Pardis C.
Sabeti and Michael M. Mitzenmacher are co-last authors. This paper, together
with arXiv:1505.02212, subsumes arXiv:1408.4908. v3 includes new analyses and
expositio
An Empirical Study of Leading Measures of Dependence
In exploratory data analysis, we are often interested in identifying
promising pairwise associations for further analysis while filtering out
weaker, less interesting ones. This can be accomplished by computing a measure
of dependence on all variable pairs and examining the highest-scoring pairs,
provided the measure of dependence used assigns similar scores to equally noisy
relationships of different types. This property, called equitability, is
formalized in Reshef et al. [2015b]. In addition to equitability, measures of
dependence can also be assessed by the power of their corresponding
independence tests as well as their runtime.
Here we present extensive empirical evaluation of the equitability, power
against independence, and runtime of several leading measures of dependence.
These include two statistics introduced in Reshef et al. [2015a]: MICe, which
has equitability as its primary goal, and TICe, which has power against
independence as its goal. Regarding equitability, our analysis finds that MICe
is the most equitable method on functional relationships in most of the
settings we considered, although mutual information estimation proves the most
equitable at large sample sizes in some specific settings. Regarding power
against independence, we find that TICe, along with Heller and Gorfine's S^DDP,
is the state of the art on the relationships we tested. Our analyses also show
a trade-off between power against independence and equitability consistent with
the theory in Reshef et al. [2015b]. In terms of runtime, MICe and TICe are
significantly faster than many other measures of dependence tested, and
computing either one makes computing the other trivial. This suggests that a
fast and useful strategy for achieving a combination of power against
independence and equitability may be to filter relationships by TICe and then
to examine the MICe of only the significant ones.
Comment: David N. Reshef and Yakir A. Reshef are co-first authors, Pardis C.
Sabeti and Michael M. Mitzenmacher are co-last author
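The filter-then-rank strategy suggested in the abstract above can be sketched as follows. This is a hedged illustration under stated assumptions: `permutation_pvalue` and `filter_then_rank` are hypothetical helper names, significance is assessed with a simple permutation test, no multiple-testing correction is applied, and absolute Pearson correlation stands in for both statistics. In the setting of the abstract, `test_stat` would be TICe and `rank_stat` would be MICe.

```python
from itertools import combinations

import numpy as np


def abs_pearson(x, y):
    """Stand-in statistic (|Pearson r|); a placeholder for TICe or MICe."""
    return abs(np.corrcoef(x, y)[0, 1])


def permutation_pvalue(x, y, stat, n_perm=200, rng=None):
    """Permutation p-value for independence: shuffle y to simulate the null."""
    rng = np.random.default_rng() if rng is None else rng
    observed = stat(x, y)
    null = [stat(x, rng.permutation(y)) for _ in range(n_perm)]
    # The +1 terms keep the estimated p-value strictly positive.
    return (1 + sum(s >= observed for s in null)) / (1 + n_perm)


def filter_then_rank(data, test_stat, rank_stat, alpha=0.05, n_perm=200, seed=0):
    """Keep pairs significant under `test_stat`, then sort them by `rank_stat`.

    Note: no multiple-testing correction is applied in this sketch.
    """
    rng = np.random.default_rng(seed)
    kept = []
    for i, j in combinations(range(data.shape[1]), 2):
        x, y = data[:, i], data[:, j]
        if permutation_pvalue(x, y, test_stat, n_perm, rng) <= alpha:
            kept.append((rank_stat(x, y), i, j))
    return sorted(kept, reverse=True)


# Toy data: column 1 depends on column 0, column 2 is independent noise.
rng = np.random.default_rng(1)
x = rng.normal(size=300)
data = np.column_stack(
    [x, 2 * x + 0.1 * rng.normal(size=300), rng.normal(size=300)]
)
ranked = filter_then_rank(data, abs_pearson, abs_pearson)
print(ranked)
```

Separating the testing statistic from the ranking statistic mirrors the division of goals described in the abstract: one statistic is chosen for power against independence, the other for equitability.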
Validation of Association
Recognizing, quantifying and visualizing associations between two variables
is increasingly important. This paper investigates how a new function-valued
measure of dependence, the quantile dependence function, can be used to
construct tests for independence and to provide an easily interpretable
diagnostic plot of existing departures from the null model. The dependence
function is designed to detect general dependence structure between variables
in quantiles of the joint distribution. It gives insight into how the
dependence structure changes in different parts of the joint distribution. We
define new estimators of the dependence function, discuss some of their
properties, and apply them to construct new tests of independence. Numerical
evidence is given of the tests' benefits over three recognized independence
tests introduced in recent years. In real-data analysis, we illustrate
the use of our tests and the graphical presentation of the underlying
dependence structure.
Comment: 40 pages, 3 figures, 1 tabl
Equitability, interval estimation, and statistical power
For analysis of a high-dimensional dataset, a common approach is to test a
null hypothesis of statistical independence on all variable pairs using a
non-parametric measure of dependence. However, because this approach attempts
to identify any non-trivial relationship no matter how weak, it often
identifies too many relationships to be useful. What is needed is a way of
identifying a smaller set of relationships that merit detailed further
analysis.
Here we formally present and characterize equitability, a property of
measures of dependence that aims to overcome this challenge. Notionally, an
equitable statistic is a statistic that, given some measure of noise, assigns
similar scores to equally noisy relationships of different types [Reshef et al.
2011]. We begin by formalizing this idea via a new object called the
interpretable interval, which functions as an interval estimate of the amount
of noise in a relationship of unknown type. We define an equitable statistic as
one with small interpretable intervals.
We then draw on the equivalence of interval estimation and hypothesis testing
to show that under moderate assumptions an equitable statistic is one that
yields well powered tests for distinguishing not only between trivial and
non-trivial relationships of all kinds but also between non-trivial
relationships of different strengths. This means that equitability allows us to
specify a threshold relationship strength x_0 and to search for relationships
of all kinds with strength greater than x_0. Thus, equitability can be
thought of as a strengthening of power against independence that enables
fruitful analysis of data sets with a small number of strong, interesting
relationships and a large number of weaker ones. We conclude with a
demonstration of how our two equivalent characterizations of equitability can
be used to evaluate the equitability of a statistic in practice.
Comment: Yakir A. Reshef and David N. Reshef are co-first authors, Pardis C.
Sabeti and Michael M. Mitzenmacher are co-last authors. This paper, together
with arXiv:1505.02212, subsumes arXiv:1408.490
Theoretical Foundations of Equitability and the Maximal Information Coefficient
The maximal information coefficient (MIC) is a tool for finding the strongest
pairwise relationships in a data set with many variables (Reshef et al., 2011).
MIC is useful because it gives similar scores to equally noisy relationships of
different types. This property, called equitability, is important for
analyzing high-dimensional data sets.
Here we formalize the theory behind both equitability and MIC in the language
of estimation theory. This formalization has a number of advantages. First, it
allows us to show that equitability is a generalization of power against
statistical independence. Second, it allows us to compute and discuss the
population value of MIC, which we call MIC_*. In doing so we generalize and
strengthen the mathematical results proven in Reshef et al. (2011) and clarify
the relationship between MIC and mutual information. Introducing MIC_* also
enables us to reason about the properties of MIC more abstractly: for instance,
we show that MIC_* is continuous and that there is a sense in which it is a
canonical "smoothing" of mutual information. We also prove an alternate,
equivalent characterization of MIC_* that we use to state new estimators of it
as well as an algorithm for explicitly computing it when the joint probability
density function of a pair of random variables is known. Our hope is that this
paper provides a richer theoretical foundation for MIC and equitability going
forward.
This paper will be accompanied by a forthcoming companion paper that performs
extensive empirical analysis and comparison to other methods and discusses the
practical aspects of both equitability and the use of MIC and its related
statistics.
Comment: 46 pages, 3 figures, 2 tables. This paper has been subsumed by
arXiv:1505.02213 and arXiv:1505.02212. Please cite those papers instea
Place Matters for Health in Alameda County: Ensuring Opportunities for Good Health for All, A Report on Health Inequities in Alameda County, California
This report provides a comprehensive analysis of the range of social, economic, and environmental conditions in Alameda County and documents their relationship to the health status of the county's residents
A Visual Analytics System for Multi-model Comparison on Clinical Data Predictions
There is a growing trend of applying machine learning methods to medical
datasets in order to predict patients' future status. Although some of these
methods achieve high performance, challenges still exist in comparing and
evaluating different models through their interpretable information. Such
analytics can help clinicians improve evidence-based medical decision making.
In this work, we develop a visual analytics system that compares multiple
models' prediction criteria and evaluates their consistency. With our system,
users can generate knowledge on different models' inner criteria and how
confidently we can rely on each model's prediction for a certain patient.
Through a case study of a publicly available clinical dataset, we demonstrate
the effectiveness of our visual analytics system to assist clinicians and
researchers in comparing and quantitatively evaluating different machine
learning methods.
Comment: This is the author's version of the article that has been accepted to
PacificVis 2020 Visualization Meets AI Worksho
Using tours to visually investigate properties of new projection pursuit indexes with application to problems in physics
Projection pursuit is used to find interesting low-dimensional projections of
high-dimensional data by optimizing an index over all possible projections.
Most indexes have been developed to detect departure from known distributions,
such as normality, or to find separations between known groups. Here, we are
interested in finding projections revealing potentially complex bivariate
patterns, using new indexes constructed from scagnostics and a maximum
information coefficient, with a purpose to detect unusual relationships between
model parameters describing physics phenomena. The performance of these indexes
is examined with respect to ideal behaviour, using simulated data, and then
applied to problems from gravitational wave astronomy. The implementation
builds upon the projection pursuit tools available in the R package, tourr,
with indexes constructed from code in the R packages, scagnostics, minerva and
mbgraphic.
Comment: 39 pages, 13 figure
Four simple axioms of dependence measures
Recently new methods for measuring and testing dependence have appeared in the literature. One way to evaluate and compare these measures with each other and with classical ones is to consider what are reasonable and natural axioms that should hold for any measure of dependence. We propose four natural axioms for dependence measures and establish which axioms hold or fail to hold for several widely applied methods. All of the proposed axioms are satisfied by distance correlation. We prove that if a dependence measure is defined for all bounded nonconstant real valued random variables and is invariant with respect to all one-to-one measurable transformations of the real line, then the dependence measure cannot be weakly continuous. This implies that the classical maximal correlation cannot be continuous and thus its application is problematic. The recently introduced maximal information coefficient has the same disadvantage. The lack of weak continuity means that as the sample size increases the empirical values of a dependence measure do not necessarily converge to the population value