1,551 research outputs found
A PAC-Bayesian Analysis of Graph Clustering and Pairwise Clustering
We formulate weighted graph clustering as a prediction problem: given a
subset of edge weights we analyze the ability of graph clustering to predict
the remaining edge weights. This formulation enables practical and theoretical
comparison of different approaches to graph clustering as well as comparison of
graph clustering with other possible ways to model the graph. We adapt the
PAC-Bayesian analysis of co-clustering (Seldin and Tishby, 2008; Seldin, 2009)
to derive a PAC-Bayesian generalization bound for graph clustering. The bound
shows that graph clustering should optimize a trade-off between empirical data
fit and the mutual information that clusters preserve on the graph nodes. A
similar trade-off derived from information-theoretic considerations was already
shown to produce state-of-the-art results in practice (Slonim et al., 2005;
Yom-Tov and Slonim, 2009). This paper supports the empirical evidence by
providing a better theoretical foundation, suggesting formal generalization
guarantees, and offering a more accurate way to deal with finite sample issues.
We derive a bound minimization algorithm and show that it provides good results
in real-life problems and that the derived PAC-Bayesian bound is reasonably
tight
A PAC-Bayesian Analysis of Co-clustering, Graph Clustering, and Pairwise Clustering
We review briefly the PAC-Bayesian analysis of co-clustering (Seldin and Tishby, 2008, 2009, 2010), which provided generalization guarantees and regularization terms absent in the preceding formulations of this problem and achieved state-of-the-art prediction results in MovieLens collaborative filtering task. Inspired by this analysis we formulate weighted graph clustering1 as a prediction problem: given a subset of edge weights we analyze the ability of graph clustering to predict the remaining edge weights. This formulation enables practical and theoretical comparison of different approaches to graph clustering as well as comparison of graph clustering with other possible ways to model the graph. Following the lines of (Seldin and Tishby, 2010) we derive PAC-Bayesian generalization bounds for graph clustering. The bounds show that graph clustering should optimize a trade-off between empirical data fit and the mutual information that clusters preserve on the graph nodes. A similar trade-off derived from information-theoretic considerations was already shown to produce state-of-the-art results in practice (Slonim et al., 2005; Yom-Tov and Slonim, 2009). This paper supports the empirical evidence by providing a better theoretical foundation, suggesting formal generalization guarantees, and offering a more accurate way to deal with finite sample issues
A General Framework for Updating Belief Distributions
We propose a framework for general Bayesian inference. We argue that a valid
update of a prior belief distribution to a posterior can be made for parameters
which are connected to observations through a loss function rather than the
traditional likelihood function, which is recovered under the special case of
using self information loss. Modern application areas make it is increasingly
challenging for Bayesians to attempt to model the true data generating
mechanism. Moreover, when the object of interest is low dimensional, such as a
mean or median, it is cumbersome to have to achieve this via a complete model
for the whole data distribution. More importantly, there are settings where the
parameter of interest does not directly index a family of density functions and
thus the Bayesian approach to learning about such parameters is currently
regarded as problematic. Our proposed framework uses loss-functions to connect
information in the data to functionals of interest. The updating of beliefs
then follows from a decision theoretic approach involving cumulative loss
functions. Importantly, the procedure coincides with Bayesian updating when a
true likelihood is known, yet provides coherent subjective inference in much
more general settings. Connections to other inference frameworks are
highlighted.Comment: This is the pre-peer reviewed version of the article "A General
Framework for Updating Belief Distributions", which has been accepted for
publication in the Journal of Statistical Society - Series B. This article
may be used for non-commercial purposes in accordance with Wiley Terms and
Conditions for Self-Archivin
Optimal Data Collection For Informative Rankings Expose Well-Connected Graphs
Given a graph where vertices represent alternatives and arcs represent
pairwise comparison data, the statistical ranking problem is to find a
potential function, defined on the vertices, such that the gradient of the
potential function agrees with the pairwise comparisons. Our goal in this paper
is to develop a method for collecting data for which the least squares
estimator for the ranking problem has maximal Fisher information. Our approach,
based on experimental design, is to view data collection as a bi-level
optimization problem where the inner problem is the ranking problem and the
outer problem is to identify data which maximizes the informativeness of the
ranking. Under certain assumptions, the data collection problem decouples,
reducing to a problem of finding multigraphs with large algebraic connectivity.
This reduction of the data collection problem to graph-theoretic questions is
one of the primary contributions of this work. As an application, we study the
Yahoo! Movie user rating dataset and demonstrate that the addition of a small
number of well-chosen pairwise comparisons can significantly increase the
Fisher informativeness of the ranking. As another application, we study the
2011-12 NCAA football schedule and propose schedules with the same number of
games which are significantly more informative. Using spectral clustering
methods to identify highly-connected communities within the division, we argue
that the NCAA could improve its notoriously poor rankings by simply scheduling
more out-of-conference games.Comment: 31 pages, 10 figures, 3 table
Detection of regulator genes and eQTLs in gene networks
Genetic differences between individuals associated to quantitative phenotypic
traits, including disease states, are usually found in non-coding genomic
regions. These genetic variants are often also associated to differences in
expression levels of nearby genes (they are "expression quantitative trait
loci" or eQTLs for short) and presumably play a gene regulatory role, affecting
the status of molecular networks of interacting genes, proteins and
metabolites. Computational systems biology approaches to reconstruct causal
gene networks from large-scale omics data have therefore become essential to
understand the structure of networks controlled by eQTLs together with other
regulatory genes, and to generate detailed hypotheses about the molecular
mechanisms that lead from genotype to phenotype. Here we review the main
analytical methods and softwares to identify eQTLs and their associated genes,
to reconstruct co-expression networks and modules, to reconstruct causal
Bayesian gene and module networks, and to validate predicted networks in
silico.Comment: minor revision with typos corrected; review article; 24 pages, 2
figure
- …