Measuring dependence powerfully and equitably

Author: Finucane Hilary K.
Mitzenmacher Michael M.
Reshef David N.
Reshef Yakir A.
Sabeti Pardis C.
Publication date: 06/07/2016

Given a high-dimensional data set we often wish to find the strongest relationships within it. A common strategy is to evaluate a measure of dependence on every variable pair and retain the highest-scoring pairs for follow-up. This strategy works well if the statistic used is equitable [Reshef et al. 2015a], i.e., if, for some measure of noise, it assigns similar scores to equally noisy relationships regardless of relationship type (e.g., linear, exponential, periodic). In this paper, we introduce and characterize a population measure of dependence called MIC*. We show three ways that MIC* can be viewed: as the population value of MIC, a highly equitable statistic from [Reshef et al. 2011], as a canonical "smoothing" of mutual information, and as the supremum of an infinite sequence defined in terms of optimal one-dimensional partitions of the marginals of the joint distribution. Based on this theory, we introduce an efficient approach for computing MIC* from the density of a pair of random variables, and we define a new consistent estimator MICe for MIC* that is efficiently computable. In contrast, there is no known polynomial-time algorithm for computing the original equitable statistic MIC. We show through simulations that MICe has better bias-variance properties than MIC. We then introduce and prove the consistency of a second statistic, TICe, that is a trivial side-product of the computation of MICe and whose goal is powerful independence testing rather than equitability. We show in simulations that MICe and TICe have good equitability and power against independence respectively. The analyses here complement a more in-depth empirical evaluation of several leading measures of dependence [Reshef et al. 2015b] that shows state-of-the-art performance for MICe and TICe.

Comment: Yakir A. Reshef and David N. Reshef are co-first authors, Pardis C. Sabeti and Michael M. Mitzenmacher are co-last authors. This paper, together with arXiv:1505.02212, subsumes arXiv:1408.4908. v3 includes new analyses and exposition.

arXiv.org e-Print Archive
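
The screening strategy described in the abstract above (score every variable pair, keep the highest-scoring pairs) can be sketched as follows. This is only an illustration under stated assumptions: the score used here is a plug-in mutual information estimate on a fixed equal-width grid, chosen as a simple stand-in. It is not MIC, MIC* or MICe, and it is not claimed to be equitable. The function names `binned_mutual_information` and `top_pairs` and the synthetic data are hypothetical.

```python
import itertools
import numpy as np

def binned_mutual_information(x, y, bins=8):
    """Plug-in mutual information (in nats) from an equal-width 2-D histogram.

    Stand-in score for illustration only; not MIC or MICe.
    """
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    joint = counts / counts.sum()
    px = joint.sum(axis=1, keepdims=True)   # marginal of x, shape (bins, 1)
    py = joint.sum(axis=0, keepdims=True)   # marginal of y, shape (1, bins)
    nonzero = joint > 0
    return float(np.sum(joint[nonzero] * np.log(joint[nonzero] / (px @ py)[nonzero])))

def top_pairs(data, score, k=3):
    """Score every column pair of `data` and return the k highest-scoring pairs."""
    n_cols = data.shape[1]
    scored = [
        (score(data[:, i], data[:, j]), i, j)
        for i, j in itertools.combinations(range(n_cols), 2)
    ]
    scored.sort(reverse=True)
    return scored[:k]

# Synthetic example: column 1 is a noisy parabola of column 0, the rest is noise.
rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(-1, 1, n)
data = np.column_stack([
    x,
    x**2 + 0.05 * rng.normal(size=n),
    rng.normal(size=n),
    rng.normal(size=n),
])
ranking = top_pairs(data, binned_mutual_information, k=1)
print(ranking)  # the top entry should be the (0, 1) column pair
```

Any other dependence score with the same two-argument signature could be passed to `top_pairs` in place of the stand-in.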

An Empirical Study of Leading Measures of Dependence

Author: Mitzenmacher Michael M.
Reshef David N.
Reshef Yakir A.
Sabeti Pardis C.
Publication date: 12/05/2015

In exploratory data analysis, we are often interested in identifying promising pairwise associations for further analysis while filtering out weaker, less interesting ones. This can be accomplished by computing a measure of dependence on all variable pairs and examining the highest-scoring pairs, provided the measure of dependence used assigns similar scores to equally noisy relationships of different types. This property, called equitability, is formalized in Reshef et al. [2015b]. In addition to equitability, measures of dependence can also be assessed by the power of their corresponding independence tests as well as their runtime. Here we present extensive empirical evaluation of the equitability, power against independence, and runtime of several leading measures of dependence. These include two statistics introduced in Reshef et al. [2015a]: MICe, which has equitability as its primary goal, and TICe, which has power against independence as its goal. Regarding equitability, our analysis finds that MICe is the most equitable method on functional relationships in most of the settings we considered, although mutual information estimation proves the most equitable at large sample sizes in some specific settings. Regarding power against independence, we find that TICe, along with Heller and Gorfine's S^DDP, is the state of the art on the relationships we tested. Our analyses also show a trade-off between power against independence and equitability consistent with the theory in Reshef et al. [2015b]. In terms of runtime, MICe and TICe are significantly faster than many other measures of dependence tested, and computing either one makes computing the other trivial. This suggests that a fast and useful strategy for achieving a combination of power against independence and equitability may be to filter relationships by TICe and then to examine the MICe of only the significant ones.

Comment: David N. Reshef and Yakir A. Reshef are co-first authors, Pardis C. Sabeti and Michael M. Mitzenmacher are co-last authors.

arXiv.org e-Print Archive
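
The two-stage strategy suggested at the end of the abstract above (filter by an independence test, then rank only the significant pairs) might look roughly like the sketch below. It makes several assumptions: a generic permutation test stands in for the TICe-based test, absolute Spearman correlation stands in for both TICe and MICe, and the names `permutation_p_value` and `filter_then_rank` are hypothetical. It shows the shape of the workflow, not the authors' implementation.

```python
import numpy as np

def abs_rank_correlation(x, y):
    """Absolute Spearman correlation; a stand-in statistic, not TICe or MICe."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return abs(np.corrcoef(rank_x, rank_y)[0, 1])

def permutation_p_value(x, y, statistic, n_perm=199, seed=0):
    """Permutation p-value for the null hypothesis that x and y are independent."""
    rng = np.random.default_rng(seed)
    observed = statistic(x, y)
    exceed = sum(statistic(x, rng.permutation(y)) >= observed for _ in range(n_perm))
    return (exceed + 1) / (n_perm + 1)

def filter_then_rank(pairs, test_stat, rank_stat, alpha=0.05):
    """Keep pairs whose test is significant, then sort them by `rank_stat`.

    `pairs` maps a label to an (x, y) tuple of 1-D arrays.
    Returns a list of (score, label), best first.
    """
    survivors = [
        (rank_stat(x, y), name)
        for name, (x, y) in pairs.items()
        if permutation_p_value(x, y, test_stat) <= alpha
    ]
    return sorted(survivors, reverse=True)

rng = np.random.default_rng(1)
x = rng.normal(size=200)
pairs = {
    "strong": (x, x + 0.1 * rng.normal(size=200)),
    "weak": (x, x + 2.0 * rng.normal(size=200)),
}
result = filter_then_rank(pairs, abs_rank_correlation, abs_rank_correlation)
print(result)
```

In practice the test statistic and the ranking statistic would differ, and when many pairs are tested the threshold `alpha` would normally be adjusted for multiple comparisons.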

Validation of Association

Author: Bogdan Ćmiel
Teresa Ledwina
Publication date: 13/04/2019

Recognizing, quantifying and visualizing associations between two variables is increasingly important. This paper investigates how a new function-valued measure of dependence, the quantile dependence function, can be used to construct tests for independence and to provide an easily interpretable diagnostic plot of existing departures from the null model. The dependence function is designed to detect general dependence structure between variables in quantiles of the joint distribution. It gives an insight into how the dependence structure changes in different parts of the joint distribution. We define new estimators of the dependence function, discuss some of their properties, and apply them to construct new tests of independence. Numerical evidence is given on the test's benefits against three recognized independence tests introduced in the previous years. In real-data analysis, we illustrate the use of our tests and the graphical presentation of the underlying dependence structure.

Comment: 40 pages, 3 figures, 1 table.

arXiv.org e-Print Archive

Equitability, interval estimation, and statistical power

Author: Mitzenmacher Michael M.
Reshef David N.
Reshef Yakir A.
Sabeti Pardis C.
Publication date: 12/05/2015

For analysis of a high-dimensional dataset, a common approach is to test a null hypothesis of statistical independence on all variable pairs using a non-parametric measure of dependence. However, because this approach attempts to identify any non-trivial relationship no matter how weak, it often identifies too many relationships to be useful. What is needed is a way of identifying a smaller set of relationships that merit detailed further analysis. Here we formally present and characterize equitability, a property of measures of dependence that aims to overcome this challenge. Notionally, an equitable statistic is a statistic that, given some measure of noise, assigns similar scores to equally noisy relationships of different types [Reshef et al. 2011]. We begin by formalizing this idea via a new object called the interpretable interval, which functions as an interval estimate of the amount of noise in a relationship of unknown type. We define an equitable statistic as one with small interpretable intervals. We then draw on the equivalence of interval estimation and hypothesis testing to show that under moderate assumptions an equitable statistic is one that yields well powered tests for distinguishing not only between trivial and non-trivial relationships of all kinds but also between non-trivial relationships of different strengths. This means that equitability allows us to specify a threshold relationship strength x_0 and to search for relationships of all kinds with strength greater than x_0. Thus, equitability can be thought of as a strengthening of power against independence that enables fruitful analysis of data sets with a small number of strong, interesting relationships and a large number of weaker ones. We conclude with a demonstration of how our two equivalent characterizations of equitability can be used to evaluate the equitability of a statistic in practice.

Comment: Yakir A. Reshef and David N. Reshef are co-first authors, Pardis C. Sabeti and Michael M. Mitzenmacher are co-last authors. This paper, together with arXiv:1505.02212, subsumes arXiv:1408.490

arXiv.org e-Print Archive

Theoretical Foundations of Equitability and the Maximal Information Coefficient

Author: Mitzenmacher Michael
Reshef David N.
Reshef Yakir A.
Sabeti Pardis C.
Publication date: 12/05/2015

The maximal information coefficient (MIC) is a tool for finding the strongest pairwise relationships in a data set with many variables (Reshef et al., 2011). MIC is useful because it gives similar scores to equally noisy relationships of different types. This property, called equitability, is important for analyzing high-dimensional data sets. Here we formalize the theory behind both equitability and MIC in the language of estimation theory. This formalization has a number of advantages. First, it allows us to show that equitability is a generalization of power against statistical independence. Second, it allows us to compute and discuss the population value of MIC, which we call MIC_*. In doing so we generalize and strengthen the mathematical results proven in Reshef et al. (2011) and clarify the relationship between MIC and mutual information. Introducing MIC_* also enables us to reason about the properties of MIC more abstractly: for instance, we show that MIC_* is continuous and that there is a sense in which it is a canonical "smoothing" of mutual information. We also prove an alternate, equivalent characterization of MIC_* that we use to state new estimators of it as well as an algorithm for explicitly computing it when the joint probability density function of a pair of random variables is known. Our hope is that this paper provides a richer theoretical foundation for MIC and equitability going forward. This paper will be accompanied by a forthcoming companion paper that performs extensive empirical analysis and comparison to other methods and discusses the practical aspects of both equitability and the use of MIC and its related statistics.

Comment: 46 pages, 3 figures, 2 tables. This paper has been subsumed by arXiv:1505.02213 and arXiv:1505.02212. Please cite those papers instead.

arXiv.org e-Print Archive

Place Matters for Health in Alameda County: Ensuring Opportunities for Good Health for All, A Report on Health Inequities in Alameda County, California

Publication venue: Joint Center for Political and Economic Studies
Publication date: 11/11/2012

This report provides a comprehensive analysis of the range of social, economic, and environmental conditions in Alameda County and documents their relationship to the health status of the county's residents.

A Visual Analytics System for Multi-model Comparison on Clinical Data Predictions

Author: Choi Yong K.
Fujiwara Takanori
Kim Katherine K.
Li Yiran
Ma Kwan-Liu
Publication date: 23/03/2020

There is a growing trend of applying machine learning methods to medical datasets in order to predict patients' future status. Although some of these methods achieve high performance, challenges still exist in comparing and evaluating different models through their interpretable information. Such analytics can help clinicians improve evidence-based medical decision making. In this work, we develop a visual analytics system that compares multiple models' prediction criteria and evaluates their consistency. With our system, users can generate knowledge on different models' inner criteria and how confidently we can rely on each model's prediction for a certain patient. Through a case study of a publicly available clinical dataset, we demonstrate the effectiveness of our visual analytics system to assist clinicians and researchers in comparing and quantitatively evaluating different machine learning methods.

Comment: This is the author's version of the article that has been accepted to PacificVis 2020 Visualization Meets AI Workshop.

arXiv.org e-Print Archive

Using tours to visually investigate properties of new projection pursuit indexes with application to problems in physics

Author: Cook Dianne
Laa Ursula
Publication date: 13/01/2020

Projection pursuit is used to find interesting low-dimensional projections of high-dimensional data by optimizing an index over all possible projections. Most indexes have been developed to detect departure from known distributions, such as normality, or to find separations between known groups. Here, we are interested in finding projections revealing potentially complex bivariate patterns, using new indexes constructed from scagnostics and a maximum information coefficient, with a purpose to detect unusual relationships between model parameters describing physics phenomena. The performance of these indexes is examined with respect to ideal behaviour, using simulated data, and then applied to problems from gravitational wave astronomy. The implementation builds upon the projection pursuit tools available in the R package, tourr, with indexes constructed from code in the R packages, scagnostics, minerva and mbgraphic.

Comment: 39 pages, 13 figures.

arXiv.org e-Print Archive

Policy implications for inclusive growth in the Republic of Korea

Author: Lee Sophia Seung-yoon
Lee Young Youn
Publication venue: [Seoul]
Publication date: 01/01/2013

Four simple axioms of dependence measures

Author: Móri Tamás F.
Székely Gábor J.
Publication venue: Springer Science and Business Media LLC
Publication date: 18/07/2018

Recently new methods for measuring and testing dependence have appeared in the literature. One way to evaluate and compare these measures with each other and with classical ones is to consider what are reasonable and natural axioms that should hold for any measure of dependence. We propose four natural axioms for dependence measures and establish which axioms hold or fail to hold for several widely applied methods. All of the proposed axioms are satisfied by distance correlation. We prove that if a dependence measure is defined for all bounded nonconstant real valued random variables and is invariant with respect to all one-to-one measurable transformations of the real line, then the dependence measure cannot be weakly continuous. This implies that the classical maximal correlation cannot be continuous and thus its application is problematic. The recently introduced maximal information coefficient has the same disadvantage. The lack of weak continuity means that as the sample size increases the empirical values of a dependence measure do not necessarily converge to the population value.
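
For readers unfamiliar with distance correlation, the measure the abstract above singles out, here is a minimal sketch of its usual sample version for two scalar variables, computed from double-centered pairwise distance matrices. It assumes one-dimensional inputs of equal length and uses the simple biased (V-statistic) form. The function names are hypothetical and nothing here is taken from the paper itself.

```python
import numpy as np

def _centered_distances(v):
    """Pairwise |v_i - v_j| matrix with row, column and grand means removed."""
    d = np.abs(v[:, None] - v[None, :])
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()

def distance_correlation(x, y):
    """Sample distance correlation of two 1-D arrays (biased V-statistic form)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    a = _centered_distances(x)
    b = _centered_distances(y)
    dcov2 = (a * b).mean()                           # squared distance covariance
    denom = np.sqrt((a * a).mean() * (b * b).mean())  # product of distance std. devs squared
    if denom == 0:
        return 0.0                                   # a constant input carries no dependence
    return float(np.sqrt(max(dcov2, 0.0) / denom))

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 500)
dcor_linear = distance_correlation(x, 2 * x + 1)
dcor_parabola = distance_correlation(x, x ** 2)
dcor_independent = distance_correlation(x, rng.uniform(-1, 1, 500))
print(dcor_linear, dcor_parabola, dcor_independent)
```

The parabola case illustrates why such measures are of interest: Pearson correlation between `x` and `x ** 2` is near zero for symmetric `x`, while the distance correlation is clearly positive.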