459 research outputs found
Linear causal model discovery using the MML criterion
Determining the causal structure of a domain is a key task in the area of Data Mining and Knowledge Discovery. The algorithm proposed by Wallace et al. [15] has demonstrated a strong ability to discover Linear Causal Models from given data sets. However, some experiments showed that this algorithm had difficulty discovering linear relations with small deviation, and that it occasionally gave a negative message length, which should not be allowed. In this paper, a more efficient and precise MML encoding scheme is proposed to describe the model structure and the nodes in a Linear Causal Model. The estimation of different parameters is also derived. Empirical results show that the new algorithm outperformed the previous MML-based algorithm in terms of both speed and precision.
Distinguishing cause from effect using observational data: methods and benchmarks
The discovery of causal relationships from purely observational data is a
fundamental problem in science. The most elementary form of such a causal
discovery problem is to decide whether X causes Y or, alternatively, Y causes
X, given joint observations of two variables X, Y. An example is to decide
whether altitude causes temperature, or vice versa, given only joint
measurements of both variables. Even under the simplifying assumptions of no
confounding, no feedback loops, and no selection bias, such bivariate causal
discovery problems are challenging. Nevertheless, several approaches for
addressing those problems have been proposed in recent years. We review two
families of such methods: Additive Noise Methods (ANM) and Information
Geometric Causal Inference (IGCI). We present the benchmark CauseEffectPairs
that consists of data for 100 different cause-effect pairs selected from 37
datasets from various domains (e.g., meteorology, biology, medicine,
engineering, economy, etc.) and motivate our decisions regarding the "ground
truth" causal directions of all pairs. We evaluate the performance of several
bivariate causal discovery methods on these real-world benchmark data and in
addition on artificially simulated data. Our empirical results on real-world
data indicate that certain methods are indeed able to distinguish cause from
effect using only purely observational data, although more benchmark data would
be needed to obtain statistically significant conclusions. One of the best
performing methods overall is the additive-noise method originally proposed by
Hoyer et al. (2009), which obtains an accuracy of 63 ± 10% and an AUC of
0.74 ± 0.05 on the real-world benchmark. As the main theoretical contribution of
this work we prove the consistency of that method.
Comment: 101 pages, second revision submitted to Journal of Machine Learning
Researc
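The additive-noise idea described in the abstract above can be sketched as follows: regress each variable on the other and prefer the direction whose residuals look more independent of the putative cause. This is only an illustrative sketch, not the benchmarked implementation. It swaps in polynomial regression and a simple biased HSIC statistic with a fixed kernel width, and the function names (`hsic`, `residual_dependence`, `anm_direction`), the polynomial degree, and the synthetic data are all assumptions made for this example.

```python
import numpy as np


def _standardize(v):
    """Rescale to zero mean and unit variance so a fixed kernel width is reasonable."""
    return (v - v.mean()) / v.std()


def _gram(v, sigma=1.0):
    """Gaussian-kernel Gram matrix of a 1-D sample."""
    d = v[:, None] - v[None, :]
    return np.exp(-d ** 2 / (2.0 * sigma ** 2))


def hsic(a, b):
    """Biased HSIC estimate; values near zero suggest independence."""
    n = len(a)
    K = _gram(_standardize(a))
    L = _gram(_standardize(b))
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return float(np.trace(K @ H @ L @ H)) / n ** 2


def residual_dependence(cause, effect, degree=3):
    """Fit effect = f(cause) + noise with a polynomial f and score how
    strongly the residuals still depend on the cause."""
    coeffs = np.polyfit(cause, effect, degree)
    residuals = effect - np.polyval(coeffs, cause)
    return hsic(cause, residuals)


def anm_direction(x, y, degree=3):
    """Prefer the direction with the smaller residual dependence."""
    forward = residual_dependence(x, y, degree)
    backward = residual_dependence(y, x, degree)
    label = "X->Y" if forward < backward else "Y->X"
    return label, forward, backward


# Synthetic example (assumed data): Y is a nonlinear function of X plus
# independent, non-Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, 400)
y = x ** 3 + rng.uniform(-1.0, 1.0, 400)
direction, forward, backward = anm_direction(x, y)
print(direction, forward, backward)
```

On real data the score difference can be small, so a practical decision rule would normally attach a significance test or a confidence value to the comparison rather than rely on the raw inequality.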
Discovering linear causal model from incomplete data
One common drawback of algorithms for learning Linear Causal Models is that they cannot deal with incomplete data sets. This is unfortunate, since many real problems involve missing data or even hidden variables. In this paper, based on multiple imputation, we propose a three-step process to learn linear causal models from incomplete data sets. Experimental results indicate that this algorithm is better than the single imputation method (EM algorithm) and the simple list deletion method, and that at lower missing rates it can even find models better than those produced by the greedy learning algorithm MLGS working on a complete data set. In addition, the method is amenable to parallel or distributed processing, which is an important characteristic for data mining in large data sets.
Ensemble parameter estimation for graphical models
Parameter estimation is one of the key issues involved in the discovery of graphical models from data. Current state-of-the-art methods have demonstrated their abilities on different kinds of graphical models. In this paper, we introduce ensemble learning into the process of parameter estimation, and examine ensemble parameter estimation methods for different kinds of graphical models with both complete and incomplete data sets. We provide experimental results which show that the ensemble method can improve on the base parameter estimation method in terms of accuracy. In addition, the method is amenable to parallel or distributed processing, which is an important characteristic for data mining in large data sets.
An examination on the performance of MML causal induction
This paper presents an examination report on the performance of the improved MML-based causal model discovery algorithm. We first describe our improvement to the causal discovery algorithm, which introduces a new encoding scheme for measuring the cost of describing the causal structure. The Stirling function is also applied to reduce the computational complexity, so that the algorithm works more efficiently. This is followed by a detailed examination report on the performance of our improved discovery algorithm. The experimental results of the current version of the discovery system show that: (1) the current version is capable of discovering what was discovered by the previous system; (2) the current system is capable of discovering more complicated causal networks with a large number of variables; (3) the new version works more efficiently than the previous version in terms of time complexity.
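The abstract above mentions a Stirling function used to cut computational cost. Assuming this refers to Stirling's approximation of log-factorials (the abstract does not say so explicitly), the sketch below shows how such a term can be evaluated in constant time and compares it with the exact value from the standard library. The function names are hypothetical, and where such a term would enter the paper's encoding scheme is not stated in the listing.

```python
import math


def log_factorial_stirling(n):
    """Stirling's approximation: ln n! ~ n ln n - n + 0.5 ln(2 pi n)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return 0.0  # ln 0! = 0; the formula itself is undefined at n = 0
    return n * math.log(n) - n + 0.5 * math.log(2.0 * math.pi * n)


def log_factorial_exact(n):
    """Reference value via the log-gamma function: ln n! = lgamma(n + 1)."""
    return math.lgamma(n + 1)


for n in (5, 10, 50, 200):
    approx = log_factorial_stirling(n)
    exact = log_factorial_exact(n)
    print(n, round(approx, 4), round(exact, 4), round(exact - approx, 5))
```

The absolute error shrinks roughly like 1/(12n), so the approximation is already close for the moderately large counts that arise when a structure over many variables is encoded.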
Granger Causal Inference in Multivariate Hawkes Processes by Minimum Message Length
Multivariate Hawkes processes (MHPs) are versatile probabilistic tools used
to model various real-life phenomena: earthquakes, operations on stock markets,
neuronal activity, virus propagation and many others. In this paper, we focus
on MHPs with exponential decay kernels and estimate connectivity graphs, which
represent the Granger causal relations between their components. We approach
this inference problem by proposing an optimization criterion and model
selection algorithm based on the minimum message length (MML) principle. MML
compares Granger causal models using the Occam's razor principle in the
following way: even when models have a comparable goodness-of-fit to the
observed data, the one generating the most concise explanation of the data is
preferred. While most state-of-the-art methods using lasso-type penalization
tend to overfit in scenarios with short time horizons, the proposed
MML-based method achieves high F1 scores in these settings. We conduct a
numerical study comparing the proposed algorithm to other related classical and
state-of-the-art methods, where we achieve the highest F1 scores in specific sparse
graph settings. We also illustrate the proposed method on G7 sovereign bond
data and obtain causal connections that are in agreement with the expert
knowledge available in the literature.
Comment: 23 pages, 5 figure
MML Probabilistic Principal Component Analysis
Principal component analysis (PCA) is perhaps the most widely used method for data
dimensionality reduction. A key question in PCA decomposition of data is
deciding how many factors to retain. This manuscript describes a new approach
to automatically selecting the number of principal components based on the
Bayesian minimum message length method of inductive inference. We also derive a
new estimate of the isotropic residual variance and demonstrate, via numerical
experiments, that it improves on the usual maximum likelihood approach
Telling Cause from Effect using MDL-based Local and Global Regression
We consider the fundamental problem of inferring the causal direction between
two univariate numeric random variables X and Y from observational data.
The two-variable case is especially difficult to solve since it is not possible
to use standard conditional independence tests between the variables.
To tackle this problem, we follow an information theoretic approach based on
Kolmogorov complexity and use the Minimum Description Length (MDL) principle to
provide a practical solution. In particular, we propose a compression scheme to
encode local and global functional relations using MDL-based regression. We
infer that X causes Y if it is shorter to describe Y as a function of X
than the inverse direction. In addition, we introduce Slope, an efficient
linear-time algorithm that, as we show through thorough empirical evaluation on both
synthetic and real-world data, outperforms the state of the art by a
wide margin.
Comment: 10 pages, To appear in ICDM1
Learning Causal Models for Noisy Biological Data Mining: An Application to Ovarian Cancer Detection
Singapore Management University Office of Researc