Distinguishing cause from effect using observational data: methods and benchmarks
The discovery of causal relationships from purely observational data is a
fundamental problem in science. The most elementary form of such a causal
discovery problem is to decide whether X causes Y or, alternatively, Y causes
X, given joint observations of two variables X, Y. An example is to decide
whether altitude causes temperature, or vice versa, given only joint
measurements of both variables. Even under the simplifying assumptions of no
confounding, no feedback loops, and no selection bias, such bivariate causal
discovery problems are challenging. Nevertheless, several approaches for
addressing those problems have been proposed in recent years. We review two
families of such methods: Additive Noise Methods (ANM) and Information
Geometric Causal Inference (IGCI). We present the benchmark CauseEffectPairs
that consists of data for 100 different cause-effect pairs selected from 37
datasets from various domains (e.g., meteorology, biology, medicine,
engineering, and economics) and motivate our decisions regarding the "ground
truth" causal directions of all pairs. We evaluate the performance of several
bivariate causal discovery methods on these real-world benchmark data and in
addition on artificially simulated data. Our empirical results on real-world
data indicate that certain methods are indeed able to distinguish cause from
effect using only purely observational data, although more benchmark data would
be needed to obtain statistically significant conclusions. One of the best
performing methods overall is the additive-noise method originally proposed by
Hoyer et al. (2009), which obtains an accuracy of 63 ± 10% and an AUC of
0.74 ± 0.05 on the real-world benchmark. As the main theoretical contribution of
this work we prove the consistency of that method.
Comment: 101 pages, second revision submitted to Journal of Machine Learning Research
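The additive-noise idea reviewed in this abstract can be sketched in a few lines: regress each variable on the other and prefer the direction in which the residuals look independent of the regressor. The sketch below is a minimal illustration, not the method as implemented by the authors: it substitutes a polynomial fit for their nonparametric regression and a biased Gaussian-kernel HSIC estimate (with an arbitrary bandwidth) for their independence test; the function names `hsic` and `anm_score` are illustrative.

```python
import numpy as np

def hsic(a, b, sigma=1.0):
    """Biased HSIC estimate with Gaussian kernels: a rough measure of
    dependence between samples a and b (0 when independent)."""
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    n = len(a)
    K = np.exp(-0.5 * ((a[:, None] - a[None, :]) / sigma) ** 2)
    L = np.exp(-0.5 * ((b[:, None] - b[None, :]) / sigma) ** 2)
    H = np.eye(n) - np.full((n, n), 1.0 / n)  # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def anm_score(cause, effect, degree=3):
    """Regress effect on cause (polynomial stand-in for nonparametric
    regression) and measure how dependent the residuals are on the cause."""
    residuals = effect - np.polyval(np.polyfit(cause, effect, degree), cause)
    return hsic(cause, residuals)

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 500)          # cause
y = x ** 3 + rng.normal(0, 1, 500)   # effect: additive-noise model X -> Y

forward = anm_score(x, y)    # residuals ~ independent of x: small score
backward = anm_score(y, x)   # wrong direction: residuals depend on y
print("inferred direction:", "X->Y" if forward < backward else "Y->X")
```

The direction with the smaller residual-dependence score is accepted as causal; in a full implementation the HSIC value would be turned into a significance test rather than compared directly.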
Quantifying dependencies for sensitivity analysis with multivariate input sample data
We present a novel method for quantifying dependencies in multivariate
datasets, based on estimating the Rényi entropy by minimum spanning trees
(MSTs). The length of the MSTs can be used to order pairs of variables from
strongly to weakly dependent, making it a useful tool for sensitivity analysis
with dependent input variables. It is well-suited for cases where the input
distribution is unknown and only a sample of the inputs is available. We
introduce an estimator to quantify dependency based on the MST length, and
investigate its properties with several numerical examples. To reduce the
computational cost of constructing the exact MST for large datasets, we explore
methods to compute approximations to the exact MST, and find the multilevel
approach introduced recently by Zhong et al. (2015) to be the most accurate. We
apply our proposed method to an artificial test case based on the Ishigami
function, as well as to a real-world test case involving sediment transport in
the North Sea. The results are consistent with prior knowledge and heuristic
understanding, as well as with variance-based analysis using Sobol indices in
the case where these indices can be computed.
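The core quantity in this abstract, the length of a Euclidean MST as a dependence measure, is easy to demonstrate: a strongly dependent pair concentrates its points near a curve, giving a shorter MST (lower Rényi entropy) than an independent pair with the same standardized marginals. The sketch below is a minimal illustration of that ordering, not the authors' estimator; the standardization and sample sizes are arbitrary choices.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def mst_length(x, y):
    """Total edge length of the Euclidean MST over standardized (x, y)
    points. Shorter length <-> lower Renyi entropy <-> stronger dependence."""
    pts = np.column_stack([(x - x.mean()) / x.std(),
                           (y - y.mean()) / y.std()])
    dist = squareform(pdist(pts))            # pairwise Euclidean distances
    return minimum_spanning_tree(dist).sum()

rng = np.random.default_rng(1)
n = 400
u = rng.normal(size=n)
strong = mst_length(u, u + 0.1 * rng.normal(size=n))       # nearly collinear
weak = mst_length(rng.normal(size=n), rng.normal(size=n))  # independent pair
print(strong < weak)
```

Ranking variable pairs by this length, from shortest to longest, orders them from strongly to weakly dependent, which is the sensitivity-analysis use described above.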
On the consistency of Multithreshold Entropy Linear Classifier
Multithreshold Entropy Linear Classifier (MELC) is a recently proposed
classifier that employs information-theoretic concepts to build a
multithreshold maximum-margin model. In this paper we analyze its consistency
over multithreshold linear models and show that its objective function
upper-bounds the number of misclassified points, much as the hinge loss does
in support vector machines. For further confirmation we also conduct
numerical experiments on five datasets.
Comment: Presented at Theoretical Foundations of Machine Learning 2015
(http://tfml.gmum.net); final version published in Schedae Informaticae Journal
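The hinge-loss analogy invoked here is the standard surrogate-loss bound: for a margin m = y·f(x), the hinge loss max(0, 1 − m) is at least 1 whenever the point is misclassified (m ≤ 0), so its sum upper-bounds the misclassification count. A minimal numerical check of that inequality, with an arbitrary set of example margins:

```python
import numpy as np

# Margins m = y * f(x): positive means the point is classified correctly.
margins = np.array([-1.5, -0.2, 0.0, 0.3, 2.0])

zero_one = (margins <= 0).astype(float)   # 0/1 misclassification indicator
hinge = np.maximum(0.0, 1.0 - margins)    # hinge (surrogate) loss

# Pointwise, hinge >= 0/1 loss, so sum(hinge) bounds the error count.
print(bool(np.all(hinge >= zero_one)))  # prints: True
```

The abstract's claim is that MELC's entropy-based objective plays the same bounding role for multithreshold linear models as the hinge loss does in this two-class setting.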