    We Are Not Your Real Parents: Telling Causal from Confounded using MDL

    Given data over variables (X_1, ..., X_m, Y) we consider the problem of finding out whether X jointly causes Y or whether they are all confounded by an unobserved latent variable Z. To do so, we take an information-theoretic approach based on Kolmogorov complexity. In a nutshell, we follow the postulate that first encoding the true cause, and then the effects given that cause, results in a shorter description than any other encoding of the observed variables. The ideal score is not computable, and hence we have to approximate it. We propose to do so using the Minimum Description Length (MDL) principle. We compare the MDL scores under the model where X causes Y and under the model where there exists a latent variable Z confounding both X and Y, and show that our scores are consistent. To find potential confounders we propose using latent factor modeling, in particular, probabilistic PCA (PPCA). Empirical evaluation on both synthetic and real-world data shows that our method, CoCa, performs very well -- even when the true generating process of the data is far from the assumptions made by the models we use. Moreover, it is robust, as its accuracy goes hand in hand with its confidence.
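
    To make the comparison concrete, the following is a minimal sketch, not the authors' actual CoCa scorer: it contrasts an MDL-style code length for "X causes Y" (encode X, then Y given X via a linear fit) against one where a single latent factor generates both, with plain one-component PCA standing in for probabilistic PCA. The Gaussian coding and the fixed per-value precision are simplifying assumptions.

```python
# Minimal sketch of the CoCa-style comparison (illustrative only):
# the hypothesis with the shorter total code length is preferred.
import numpy as np

PRECISION_BITS = 10  # assumed fixed discretization cost per real value

def code_bits(values):
    # Bits to encode values under their MLE Gaussian at fixed precision.
    v = np.asarray(values).ravel()
    n, var = len(v), v.var() + 1e-12
    return n * (0.5 * np.log2(2 * np.pi * var) + 0.5 / np.log(2)
                + PRECISION_BITS)

def score_causal(X, Y):
    # First encode the cause X, then the effect Y given X (linear fit).
    A = np.c_[np.ones_like(X), X]
    beta, *_ = np.linalg.lstsq(A, Y, rcond=None)
    return code_bits(X - X.mean()) + code_bits(Y - A @ beta)

def score_confounded(X, Y):
    # Encode X and Y as noisy views of one latent factor Z; the latent
    # scores themselves must be transmitted too.
    D = np.c_[X, Y]
    D = D - D.mean(axis=0)
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    Z = U[:, :1] * s[:1]            # 1-d latent scores (PCA, not PPCA)
    return code_bits(Z) + code_bits(D - Z @ Vt[:1])

rng = np.random.default_rng(0)
X = rng.normal(size=500)
Y = 2.0 * X + rng.normal(scale=0.5, size=500)  # here X really causes Y
print(score_causal(X, Y), score_confounded(X, Y))  # causal should be lower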

    VoG: Summarizing and Understanding Large Graphs

    How can we succinctly describe a million-node graph with a few simple sentences? How can we measure the "importance" of a set of discovered subgraphs in a large graph? These are exactly the problems we focus on. Our main ideas are to construct a "vocabulary" of subgraph-types that often occur in real graphs (e.g., stars, cliques, chains), and from a set of subgraphs, find the most succinct description of a graph in terms of this vocabulary. We measure success in a well-founded way by means of the Minimum Description Length (MDL) principle: a subgraph is included in the summary if it decreases the total description length of the graph. Our contributions are three-fold: (a) formulation: we provide a principled encoding scheme to choose vocabulary subgraphs; (b) algorithm: we develop VoG, an efficient method to minimize the description cost; and (c) applicability: we report experimental results on multi-million-edge real graphs, including Flickr and the Notre Dame web graph.
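
    The selection rule is easy to state in code. Below is a minimal sketch, not VoG's actual encoding: candidate structures are kept only if they shrink the total description length, with a naive node-id cost for the model and a log-binomial cost for the error matrix. Candidate generation and the real vocabulary costs are omitted.

```python
# Minimal sketch of MDL-based graph summarization (illustrative only).
import numpy as np
from math import comb, log2

def error_bits(n_cells, n_errors):
    # Bits to transmit which of n_cells the model gets wrong.
    return log2(n_cells + 1) + (log2(comb(n_cells, n_errors))
                                if n_errors else 0)

def total_cost(adj, structures):
    n = adj.shape[0]
    model = np.zeros_like(adj)
    cost_model = 0.0
    for kind, nodes in structures:          # e.g. ("clique", [0, 1, 2])
        cost_model += log2(n) * len(nodes)  # naive: id each member node
        if kind == "clique":
            for u in nodes:
                for v in nodes:
                    if u != v:
                        model[u, v] = 1
    errors = int(np.abs(adj - model).sum())
    return cost_model + error_bits(adj.size, errors)

def summarize(adj, candidates):
    # Greedy MDL: keep a structure only if it reduces the total cost.
    summary = []
    best = total_cost(adj, summary)
    for cand in candidates:
        cost = total_cost(adj, summary + [cand])
        if cost < best:
            summary.append(cand)
            best = cost
    return summary, best

# Toy graph: a 4-clique on nodes 0-3 plus one stray edge (4, 5).
adj = np.zeros((6, 6), dtype=int)
for u in range(4):
    for v in range(4):
        if u != v:
            adj[u, v] = 1
adj[4, 5] = adj[5, 4] = 1
print(summarize(adj, [("clique", [0, 1, 2, 3]), ("clique", [2, 3, 4])]))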

    Stratigraphy, Sedimentology, and Moisture Contents in a Small Loess Watershed in Tama County, Iowa

    A traverse across a small first-order watershed in loess has been studied. The loess is Wisconsin in age and has a vertical tripartition that can be explained on a regional basis. Clear stratification is present in the middle loess increment. Moisture distribution patterns correlate highly with differences in particle-size distribution. The explanatory physical phenomena must be moisture-tension relationships.

    Differentiable Pattern Set Mining


    MDL4BMF: Minimum Description Length for Boolean Matrix Factorization

    Matrix factorizations—where a given data matrix is approximated by a product of two or more factor matrices—are powerful data mining tools. Among other tasks, matrix factorizations are often used to separate global structure from noise. This, however, requires solving the 'model order selection problem' of determining where fine-grained structure stops, and noise starts, i.e., what is the proper size of the factor matrices. Boolean matrix factorization (BMF)—where data, factors, and matrix product are Boolean—has received increased attention from the data mining community in recent years. The technique has desirable properties, such as high interpretability and natural sparsity. However, so far no method for selecting the correct model order for BMF has been available. In this paper we propose to use the Minimum Description Length (MDL) principle for this task. Besides solving the problem, this well-founded approach has numerous benefits: e.g., it is automatic, does not require a likelihood function, is fast, and, as experiments show, is highly accurate. We formulate the description length function for BMF in general—making it applicable to any BMF algorithm. We discuss how to construct an appropriate encoding: starting from a simple and intuitive approach, we arrive at a highly efficient data-to-model based encoding for BMF. We extend an existing algorithm for BMF to use MDL to identify the best Boolean matrix factorization, analyze the complexity of the problem, and perform an extensive experimental evaluation to study its behavior.
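
    To make the model order selection concrete, here is a minimal sketch rather than the paper's data-to-model encoding: each Boolean matrix is coded naively by its number of ones, a crude greedy stand-in produces rank-k factors, and the rank minimizing the total description length is selected.

```python
# Minimal sketch of MDL-based model order selection for BMF
# (illustrative; not the paper's encoding or algorithm).
import numpy as np
from math import comb, log2

def bool_bits(M):
    # Bits for a Boolean matrix: send the number of ones, then which
    # of the cells they occupy (log-binomial code).
    p, e = M.size, int(M.sum())
    return log2(p + 1) + (log2(comb(p, e)) if e else 0)

def description_length(A, B, C):
    # Boolean product: OR over k of AND(B[:, k], C[k, :]).
    recon = (B @ C > 0).astype(int)
    return bool_bits(B) + bool_bits(C) + bool_bits((A != recon).astype(int))

def greedy_bmf(A, k):
    # Crude rank-k factors: the k most frequent distinct nonzero rows
    # serve as patterns (a placeholder for a real BMF algorithm such
    # as Asso); a pattern is used wherever it fits inside a data row.
    rows, counts = np.unique(A[A.sum(axis=1) > 0], axis=0,
                             return_counts=True)
    C = rows[np.argsort(-counts)][:k]
    B = (A[:, None, :] >= C[None, :, :]).all(axis=2).astype(int)
    return B, C

rng = np.random.default_rng(1)
# Noise-free planted rank-2 Boolean data.
Bt = (rng.random((40, 2)) < 0.4).astype(int)
Ct = (rng.random((2, 30)) < 0.4).astype(int)
A = ((Bt @ Ct) > 0).astype(int)

scores = {k: description_length(A, *greedy_bmf(A, k)) for k in range(1, 6)}
print(min(scores, key=scores.get), scores)  # MDL should pick rank 2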

    Causal Inference by Stochastic Complexity

    The algorithmic Markov condition states that the most likely causal direction between two random variables X and Y can be identified as that direction with the lowest Kolmogorov complexity. Due to the halting problem, however, this notion is not computable. We hence propose to do causal inference by stochastic complexity. That is, we propose to approximate Kolmogorov complexity via the Minimum Description Length (MDL) principle, using a score that is minimax optimal with regard to the model class under consideration. This means that even in an adversarial setting, such as when the true distribution is not in this class, we still obtain the optimal encoding for the data relative to the class. We instantiate this framework, which we call CISC, for pairs of univariate discrete variables, using the class of multinomial distributions. Experiments show that CISC is highly accurate on synthetic, benchmark, as well as real-world data, outperforming the state of the art by a margin, and scales extremely well with regard to sample and domain sizes.
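
    For discrete data, the stochastic complexity of a multinomial can be computed exactly. The sketch below is a simplified reading of the CISC idea rather than its reference implementation: it codes each variable and each conditional group with its NML code length, computing the regret via the Kontkanen-Myllymaki recurrence, and prefers the cheaper direction. Treating the observed value set as the domain is a simplifying assumption.

```python
# Minimal sketch of causal direction by stochastic complexity
# (illustrative simplification of the CISC idea).
import numpy as np
from math import comb, log2

def multinomial_regret(n, K):
    # Exact NML parametric complexity C(n, K) of a K-ary multinomial
    # via the Kontkanen-Myllymaki recurrence. (For much larger n,
    # compute this in log space to avoid float overflow.)
    if n == 0 or K == 1:
        return 1.0
    c_prev = 1.0                                    # C(n, 1)
    c_curr = sum(comb(n, h) * (h / n) ** h * ((n - h) / n) ** (n - h)
                 for h in range(n + 1))             # C(n, 2)
    for k in range(3, K + 1):
        c_prev, c_curr = c_curr, c_curr + n * c_prev / (k - 2)
    return c_curr

def stochastic_complexity(x):
    # NML code length in bits: empirical NLL plus log regret. The
    # domain size is taken as the number of observed values.
    values, counts = np.unique(np.asarray(x), return_counts=True)
    n = int(counts.sum())
    nll = -np.sum(counts * np.log2(counts / n))
    return nll + log2(multinomial_regret(n, len(values)))

def conditional_sc(y, x):
    # SC(Y | X): code Y separately within each group X = v.
    return sum(stochastic_complexity(y[x == v]) for v in np.unique(x))

def cisc_direction(x, y):
    # Prefer the direction with the shorter two-part description.
    xy = stochastic_complexity(x) + conditional_sc(y, x)
    yx = stochastic_complexity(y) + conditional_sc(x, y)
    return "X->Y" if xy < yx else "Y->X" if yx < xy else "undecided"

rng = np.random.default_rng(2)
x = rng.integers(0, 8, size=1000)
y = x // 4                          # Y is a coarse function of X
print(cisc_direction(x, y))         # expect X->Y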

    Federated Learning from Small Datasets

    Federated learning allows multiple parties to collaboratively train a joint model without sharing local data. This enables applications of machine learning in settings of inherently distributed, undisclosable data such as in the medical domain. In practice, joint training is usually achieved by aggregating local models, for which local training objectives have to be in expectation similar to the joint (global) objective. Often, however, local datasets are so small that local objectives differ greatly from the global objective, causing federated learning to fail. We propose a novel approach that intertwines model aggregations with permutations of local models. The permutations expose each local model to a daisy chain of local datasets, resulting in more efficient training in data-sparse domains. This enables training on extremely small local datasets, such as patient data across hospitals, while retaining the training efficiency and privacy benefits of federated learning.
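
    The following is a minimal sketch of the daisy-chaining idea, not the authors' exact protocol: between periodic federated averaging steps, local models are permuted among clients, so each model trains on a chain of small local datasets. The linear model, the SGD step, and the round counts are illustrative assumptions.

```python
# Minimal sketch of federated averaging interleaved with model
# permutations ("daisy-chaining"); illustrative only.
import numpy as np

rng = np.random.default_rng(3)

def local_step(w, X, y, lr=0.1):
    # One epoch of logistic-regression SGD on a client's small dataset.
    for i in range(len(X)):
        p = 1 / (1 + np.exp(-X[i] @ w))
        w = w - lr * (p - y[i]) * X[i]
    return w

# Tiny local datasets drawn from one shared ground-truth model.
d, n_clients, n_local = 5, 10, 8
w_true = rng.normal(size=d)
data = []
for _ in range(n_clients):
    X = rng.normal(size=(n_local, d))
    y = (X @ w_true + rng.normal(scale=0.1, size=n_local) > 0).astype(float)
    data.append((X, y))

models = [np.zeros(d) for _ in range(n_clients)]
for rnd in range(1, 201):
    models = [local_step(w, X, y) for w, (X, y) in zip(models, data)]
    if rnd % 10 == 0:                # aggregation round: average models
        mean = np.mean(models, axis=0)
        models = [mean.copy() for _ in range(n_clients)]
    else:                            # daisy-chain round: permute models
        models = [models[i] for i in rng.permutation(n_clients)]

# Evaluate the aggregated model on fresh data.
Xt = rng.normal(size=(2000, d))
yt = (Xt @ w_true > 0).astype(float)
acc = (((Xt @ np.mean(models, axis=0)) > 0) == yt).mean()
print(f"test accuracy: {acc:.2f}")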

    All the World's a (Hyper)Graph: A Data Drama

    We introduce Hyperbard, a dataset of diverse relational data representations derived from Shakespeare's plays. Our representations range from simple graphs capturing character co-occurrence in single scenes to hypergraphs encoding complex communication settings and character contributions as hyperedges with edge-specific node weights. By making multiple intuitive representations readily available for experimentation, we facilitate rigorous representation robustness checks in graph learning, graph mining, and network analysis, highlighting the advantages and drawbacks of specific representations. Leveraging the data released in Hyperbard, we demonstrate that many solutions to popular graph mining problems are highly dependent on the representation choice, thus calling current graph curation practices into question. As an homage to our data source, and asserting that science can also be art, we present all our points in the form of a play.
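
    To make the representation range concrete, here is a minimal sketch using made-up records, not Hyperbard's actual files or schema: the same scene data yields both a pairwise co-occurrence graph and a hypergraph whose hyperedges keep the edge-specific node weights that the pairwise view discards.

```python
# Minimal sketch of two representation types (toy data; not the
# Hyperbard file format or API).
import itertools
import networkx as nx

# (act, scene) -> {character: number of lines spoken}; toy records.
scenes = {
    (1, 1): {"Romeo": 12, "Benvolio": 9},
    (1, 2): {"Romeo": 7, "Juliet": 15, "Nurse": 4},
    (2, 1): {"Juliet": 10, "Nurse": 6},
}

# Simple graph: characters co-occur if they share a scene; the edge
# weight counts shared scenes.
G = nx.Graph()
for cast in scenes.values():
    for u, v in itertools.combinations(sorted(cast), 2):
        w = G.get_edge_data(u, v, {"weight": 0})["weight"]
        G.add_edge(u, v, weight=w + 1)

# Hypergraph: one hyperedge per scene, keeping per-character line
# counts as edge-specific node weights.
hyperedges = [(scene, dict(cast)) for scene, cast in scenes.items()]

print(sorted(G.edges(data=True)))
print(hyperedges[1])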

    Differentially Describing Groups of Graphs

    How does neural connectivity in autistic children differ from neural connectivity in healthy children or autistic youths? What patterns in global trade networks are shared across classes of goods, and how do these patterns change over time? Answering questions like these requires us to differentially describe groups of graphs: Given a set of graphs and a partition of these graphs into groups, discover what graphs in one group have in common, how they systematically differ from graphs in other groups, and how multiple groups of graphs are related. We refer to this task as graph group analysis, which seeks to describe similarities and differences between graph groups by means of statistically significant subgraphs. To perform graph group analysis, we introduce Gragra, which uses maximum entropy modeling to identify a non-redundant set of subgraphs with statistically significant associations to one or more graph groups. Through an extensive set of experiments on a wide range of synthetic and real-world graph groups, we confirm that Gragra works well in practice.
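
    As a toy illustration of the task, not of Gragra's maximum entropy model, the sketch below checks whether a single candidate subgraph occurs significantly more often in one graph group than another, with Fisher's exact test standing in for the paper's significance machinery.

```python
# Minimal sketch of testing a subgraph's association with a graph
# group (illustrative stand-in for Gragra's model).
import numpy as np
from math import comb

def fisher_one_sided(a, b, c, d):
    # P(X >= a) for the 2x2 table [[a, b], [c, d]] under the
    # hypergeometric null.
    n, row1, col1 = a + b + c + d, a + b, a + c
    return sum(comb(col1, k) * comb(n - col1, row1 - k)
               for k in range(a, min(row1, col1) + 1)) / comb(n, row1)

def contains(adj, edges):
    return all(adj[u, v] for u, v in edges)

rng = np.random.default_rng(4)

def random_graph(n, p, planted=None):
    adj = (rng.random((n, n)) < p).astype(int)
    adj = np.triu(adj, 1)
    adj += adj.T                      # undirected, no self loops
    if planted:
        for u, v in planted:
            adj[u, v] = adj[v, u] = 1
    return adj

triangle = [(0, 1), (1, 2), (0, 2)]
group_a = [random_graph(10, 0.1, planted=triangle) for _ in range(30)]
group_b = [random_graph(10, 0.1) for _ in range(30)]

a = sum(contains(g, triangle) for g in group_a)   # hits in group A
c = sum(contains(g, triangle) for g in group_b)   # hits in group B
p = fisher_one_sided(a, 30 - a, c, 30 - c)
print(f"triangle in A: {a}/30, in B: {c}/30, one-sided p = {p:.2g}")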