Bayesian Cluster Enumeration Criterion for Unsupervised Learning
We derive a new Bayesian Information Criterion (BIC) by formulating the
problem of estimating the number of clusters in an observed data set as
maximization of the posterior probability of the candidate models. Given that
some mild assumptions are satisfied, we provide a general BIC expression for a
broad class of data distributions. This serves as a starting point when
deriving the BIC for specific distributions. Along this line, we provide a
closed-form BIC expression for multivariate Gaussian distributed variables. We
show that incorporating the data structure of the clustering problem into the
derivation of the BIC results in an expression whose penalty term is different
from that of the original BIC. We propose a two-step cluster enumeration
algorithm. First, a model-based unsupervised learning algorithm partitions the
data according to a given set of candidate models. Subsequently, the number of
clusters is determined as the one associated with the model for which the
proposed BIC is maximal. The performance of the proposed two-step algorithm is
tested using synthetic and real data sets.
Comment: 14 pages, 7 figures
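The two-step procedure described in this abstract can be sketched as follows. This is only an illustration under loud assumptions: the abstract does not give the proposed criterion, so the sketch scores each candidate with the classical Schwarz BIC (log-likelihood minus half the parameter count times log N), not the paper's modified penalty. A hard k-means partition of 1-D data stands in for the model-based learning step, and the function names `kmeans_1d`, `classical_bic`, and `enumerate_clusters` are hypothetical.

```python
import math
import random


def kmeans_1d(xs, k, iters=50):
    """Step 1 stand-in: Lloyd's algorithm on scalars, quantile-initialised."""
    s = sorted(xs)
    centres = [s[int((i + 0.5) * len(s) / k)] for i in range(k)]
    groups = []
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda c: (x - centres[c]) ** 2)
            groups[nearest].append(x)
        # Keep the old centre if a cluster happens to be empty.
        centres = [sum(g) / len(g) if g else centres[i]
                   for i, g in enumerate(groups)]
    return [g for g in groups if g]  # drop empty clusters


def classical_bic(groups):
    """Classical BIC (to be maximised) of a hard partition under a
    1-D Gaussian model per cluster. Assumption: Schwarz penalty, which
    is NOT the penalty term derived in the paper."""
    n_total = sum(len(g) for g in groups)
    loglik = 0.0
    for g in groups:
        n = len(g)
        mu = sum(g) / n
        var = max(sum((x - mu) ** 2 for x in g) / n, 1e-6)  # variance floor
        loglik += n * math.log(n / n_total)  # cluster weight term
        loglik += sum(-0.5 * math.log(2.0 * math.pi * var)
                      - (x - mu) ** 2 / (2.0 * var) for x in g)
    q = 3 * len(groups) - 1  # means + variances + free weights
    return loglik - 0.5 * q * math.log(n_total)


def enumerate_clusters(xs, candidates):
    """Step 2: return the candidate k whose partition maximises the BIC."""
    scores = {k: classical_bic(kmeans_1d(xs, k)) for k in candidates}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    random.seed(0)
    data = [random.gauss(mu, 1.0) for mu in (0.0, 20.0, 40.0) for _ in range(100)]
    best_k, scores = enumerate_clusters(data, range(1, 7))
    print(best_k, scores)
```

Swapping in a different penalty only requires changing the last line of `classical_bic`, which is where a distribution-specific criterion such as the one the abstract describes would plug in.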
Evaluating Overfit and Underfit in Models of Network Community Structure
A common data mining task on networks is community detection, which seeks an
unsupervised decomposition of a network into structural groups based on
statistical regularities in the network's connectivity. Although many methods
exist, the No Free Lunch theorem for community detection implies that each
makes some kind of tradeoff, and no algorithm can be optimal on all inputs.
Thus, different algorithms will over- or underfit on different inputs, finding
more, fewer, or just different communities than is optimal, and evaluation
methods that use a metadata partition as a ground truth will produce misleading
conclusions about general accuracy. Here, we present a broad evaluation of
over- and underfitting in community detection, comparing the behavior of 16
state-of-the-art community detection algorithms on a novel and structurally
diverse corpus of 406 real-world networks. We find that (i) algorithms vary
widely both in the number of communities they find and in their corresponding
composition, given the same input, (ii) algorithms can be clustered into
distinct high-level groups based on similarities of their outputs on real-world
networks, and (iii) these differences induce wide variation in accuracy on link
prediction and link description tasks. We introduce a new diagnostic for
evaluating overfitting and underfitting in practice, and use it to roughly
divide community detection methods into general and specialized learning
algorithms. Across methods and inputs, Bayesian techniques based on the
stochastic block model and a minimum description length approach to
regularization represent the best general learning approach, but can be
outperformed under specific circumstances. These results introduce both a
theoretically principled approach to evaluate over- and underfitting in models
of network community structure and a realistic benchmark by which new methods
may be evaluated and compared.
Comment: 22 pages, 13 figures, 3 tables
Multimodal nested sampling: an efficient and robust alternative to MCMC methods for astronomical data analysis
In performing a Bayesian analysis of astronomical data, two difficult
problems often emerge. First, in estimating the parameters of some model for
the data, the resulting posterior distribution may be multimodal or exhibit
pronounced (curving) degeneracies, which can cause problems for traditional
MCMC sampling methods. Second, in selecting between a set of competing models,
calculation of the Bayesian evidence for each model is computationally
expensive. The nested sampling method introduced by Skilling (2004) has
greatly reduced the computational expense of calculating evidences and also
produces posterior inferences as a by-product. This method has been applied
successfully in cosmological applications by Mukherjee et al. (2006), but their
implementation was efficient only for unimodal distributions without pronounced
degeneracies. Shaw et al. (2007) recently introduced a clustered nested
sampling method which is significantly more efficient in sampling from
multimodal posteriors and also determines the expectation and variance of the
final evidence from a single run of the algorithm, hence providing a further
increase in efficiency. In this paper, we build on the work of Shaw et al. and
present three new methods for sampling and evidence evaluation from
distributions that may contain multiple modes and significant degeneracies; we
also present an even more efficient technique for estimating the uncertainty on
the evaluated evidence. These methods lead to a further substantial improvement
in sampling efficiency and robustness, and are applied to toy problems to
demonstrate the accuracy and economy of the evidence calculation and parameter
estimation. Finally, we discuss the use of these methods in performing Bayesian
object detection in astronomical datasets.
Comment: 14 pages, 11 figures, submitted to MNRAS, some major additions to the
previous version in response to the referee's comments
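For orientation, the basic nested sampling idea that this abstract builds on can be sketched on a toy problem. This is a minimal sketch of the plain single-mode scheme, not the clustered or multimodal methods the paper presents. It assumes a uniform prior on [-5, 5] and a standard normal likelihood, so the evidence is known to be close to 0.1. Because this toy likelihood decreases with |theta|, the constrained prior can be sampled exactly; a real problem would need a general constrained sampler. The names `likelihood` and `nested_sampling` are hypothetical.

```python
import math
import random


def likelihood(theta):
    """Toy likelihood: standard normal density in theta."""
    return math.exp(-0.5 * theta * theta) / math.sqrt(2.0 * math.pi)


def nested_sampling(n_live=400, n_iter=4000, half_width=5.0, seed=0):
    """Estimate the evidence Z = integral of L(theta) * prior(theta)
    for a uniform prior on [-half_width, half_width]."""
    rng = random.Random(seed)
    live = [rng.uniform(-half_width, half_width) for _ in range(n_live)]
    live_l = [likelihood(t) for t in live]
    evidence = 0.0
    x_prev = 1.0  # prior mass still enclosed by the likelihood contour
    for i in range(1, n_iter + 1):
        worst = min(range(n_live), key=live_l.__getitem__)
        x_i = math.exp(-i / n_live)  # expected shrinkage of the prior mass
        evidence += live_l[worst] * (x_prev - x_i)
        x_prev = x_i
        # Replace the worst point by a prior draw with higher likelihood.
        # Toy-specific shortcut: L > L_worst is equivalent to |theta| < bound.
        bound = abs(live[worst])
        live[worst] = rng.uniform(-bound, bound)
        live_l[worst] = likelihood(live[worst])
    # Add the contribution of the remaining live points.
    evidence += x_prev * sum(live_l) / n_live
    return evidence


if __name__ == "__main__":
    print(nested_sampling())
```

The run-to-run scatter of the estimate shrinks as `n_live` grows, at the price of proportionally more iterations to reach the same remaining prior mass.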
Bayesian outlier detection in Capital Asset Pricing Model
We propose a novel Bayesian optimisation procedure for outlier detection in
the Capital Asset Pricing Model. We use a parametric product partition model to
robustly estimate the systematic risk of an asset. We assume that the returns
follow independent normal distributions and we impose a partition structure on
the parameters of interest. The partition structure imposed on the parameters
induces a corresponding clustering of the returns. We identify via an
optimisation procedure the partition that best separates standard observations
from the atypical ones. The methodology is illustrated with reference to a real
data set, for which we also provide a microeconomic interpretation of the
detected outliers.