Evaluating Overfit and Underfit in Models of Network Community Structure
A common data mining task on networks is community detection, which seeks an
unsupervised decomposition of a network into structural groups based on
statistical regularities in the network's connectivity. Although many methods
exist, the No Free Lunch theorem for community detection implies that each
makes some kind of tradeoff, and no algorithm can be optimal on all inputs.
Thus, different algorithms will over- or underfit on different inputs, finding
more, fewer, or just different communities than is optimal, and evaluation
methods that use a metadata partition as a ground truth will produce misleading
conclusions about general accuracy. Here, we present a broad evaluation of over-
and underfitting in community detection, comparing the behavior of 16
state-of-the-art community detection algorithms on a novel and structurally
diverse corpus of 406 real-world networks. We find that (i) algorithms vary
widely both in the number of communities they find and in their corresponding
composition, given the same input, (ii) algorithms can be clustered into
distinct high-level groups based on similarities of their outputs on real-world
networks, and (iii) these differences induce wide variation in accuracy on link
prediction and link description tasks. We introduce a new diagnostic for
evaluating overfitting and underfitting in practice, and use it to roughly
divide community detection methods into general and specialized learning
algorithms. Across methods and inputs, Bayesian techniques based on the
stochastic block model and a minimum description length approach to
regularization represent the best general learning approach, but can be
outperformed under specific circumstances. These results introduce both a
theoretically principled approach to evaluating over- and underfitting in models
of network community structure and a realistic benchmark by which new methods
may be evaluated and compared.
Comment: 22 pages, 13 figures, 3 tables
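As a minimal illustration of finding (i), that different algorithms can return different numbers and compositions of communities on the same input, the sketch below runs two standard methods from networkx on a toy graph. The algorithms and graph are illustrative choices, not the paper's 16-method benchmark or 406-network corpus.

```python
import networkx as nx
from networkx.algorithms.community import (
    greedy_modularity_communities,
    label_propagation_communities,
)

# Two community detection algorithms with different inductive biases,
# applied to the same input graph (Zachary's karate club).
G = nx.karate_club_graph()

mod_comms = list(greedy_modularity_communities(G))  # modularity maximization
lp_comms = list(label_propagation_communities(G))   # label propagation

# The two methods need not agree on the number of communities, let alone
# their composition, echoing the over/underfitting tradeoff discussed above.
print("greedy modularity:", len(mod_comms), "communities")
print("label propagation:", len(lp_comms), "communities")
```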
Multivariate Bernoulli distribution
In this paper, we consider the multivariate Bernoulli distribution as a model
to estimate the structure of graphs with binary nodes. This distribution is
discussed in the framework of the exponential family, and its statistical
properties regarding independence of the nodes are demonstrated. Importantly,
the model can estimate not only the main effects and pairwise interactions
among the nodes but also higher-order interactions, allowing for the
existence of complex clique effects. We compare the multivariate Bernoulli
model with existing graphical inference models, namely the Ising model and
the multivariate Gaussian model, in which only pairwise interactions are
considered. In contrast, the multivariate Bernoulli distribution has the
interesting property that independence and uncorrelatedness of the component
random variables are equivalent. Both the
marginal and conditional distributions of a subset of variables in the
multivariate Bernoulli distribution still follow the multivariate Bernoulli
distribution. Furthermore, the multivariate Bernoulli logistic model is
developed under generalized linear model theory by utilizing the canonical link
function in order to include covariate information on the nodes, edges and
cliques. We also consider variable selection techniques such as LASSO in the
logistic model to impose sparsity structure on the graph. Finally, we discuss
extending the smoothing spline ANOVA approach to the multivariate Bernoulli
logistic model to enable estimation of non-linear effects of the predictor
variables.
Comment: Published at http://dx.doi.org/10.3150/12-BEJSP10 in Bernoulli
(http://isi.cbs.nl/bernoulli/) by the International Statistical
Institute/Bernoulli Society (http://isi.cbs.nl/BS/bshome.htm)
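To make the exponential-family structure concrete, the bivariate case can be written out as a worked example; the notation (f_1, f_2, f_{12}) is chosen here for illustration rather than quoted from the paper.

```latex
% Bivariate Bernoulli pmf in exponential-family form, y_1, y_2 \in \{0,1\}:
% f_1, f_2 are main effects and f_{12} is the pairwise interaction.
P(Y_1 = y_1, Y_2 = y_2)
  = \exp\!\bigl( f_1 y_1 + f_2 y_2 + f_{12}\, y_1 y_2 - A(f) \bigr),
\qquad
A(f) = \log\!\bigl( 1 + e^{f_1} + e^{f_2} + e^{f_1 + f_2 + f_{12}} \bigr).
% Independence of Y_1 and Y_2 holds exactly when f_{12} = 0, which is the
% bivariate instance of the independence/uncorrelatedness equivalence above.
```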
Limitations of the Empirical Fisher Approximation for Natural Gradient Descent
Natural gradient descent, which preconditions a gradient descent update with
the Fisher information matrix of the underlying statistical model, is a way to
capture partial second-order information. Several highly visible works have
advocated an approximation known as the empirical Fisher, drawing connections
between approximate second-order methods and heuristics like Adam. We dispute
this argument by showing that the empirical Fisher, unlike the Fisher, does
not generally capture second-order information. We further argue that the
conditions under which the empirical Fisher approaches the Fisher (and the
Hessian) are unlikely to be met in practice, and that, even on simple
optimization problems, the pathologies of the empirical Fisher can have
undesirable effects.
Comment: V3: Minor corrections (typographic errors)
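For reference, the distinction at issue can be stated compactly in standard notation (a conventional formulation, not quoted from the paper): the Fisher averages outer products of score vectors over the model's own predictive distribution, whereas the empirical Fisher substitutes the observed labels.

```latex
% Fisher information: expectation over the model's predictive distribution.
F(\theta) = \sum_{n} \mathbb{E}_{y \sim p(y \mid x_n, \theta)}
  \left[ \nabla_\theta \log p(y \mid x_n, \theta)\,
         \nabla_\theta \log p(y \mid x_n, \theta)^{\top} \right]

% Empirical Fisher: the observed label y_n replaces the expectation, so the
% two coincide only when the model's predictions match the data distribution.
\widetilde{F}(\theta) = \sum_{n}
  \nabla_\theta \log p(y_n \mid x_n, \theta)\,
  \nabla_\theta \log p(y_n \mid x_n, \theta)^{\top}
```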
A Review of Codebook Models in Patch-Based Visual Object Recognition
The codebook model-based approach, while ignoring any structural aspect in vision, nonetheless provides state-of-the-art performance on current datasets. The key role of a visual codebook is to provide a way to map low-level features into a fixed-length vector in histogram space, to which standard classifiers can be directly applied. The discriminative power of such a visual codebook determines the quality of the codebook model, whereas the size of the codebook controls the complexity of the model. Thus, the construction of a codebook is an important step, usually done by cluster analysis. However, clustering is a process that retains regions of high density in a distribution, so the resulting codebook need not have discriminant properties; clustering is also recognised as a computational bottleneck of such systems. In our recent work, we proposed a resource-allocating codebook that constructs a discriminant codebook in a one-pass design procedure and slightly outperforms more traditional approaches at drastically reduced computing times. In this review we survey several approaches proposed over the last decade, covering their feature detectors, descriptors, codebook construction schemes, choice of classifiers for recognising objects, and the datasets used to evaluate the proposed methods.
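The pipeline described above (cluster local descriptors into a visual vocabulary, then encode each image as a fixed-length histogram of visual words) can be sketched as follows. This uses plain k-means and synthetic descriptors as stand-ins, not the resource-allocating codebook from the review.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(descriptors, codebook_size=256, seed=0):
    """Cluster pooled local descriptors (n_desc x dim) into visual words."""
    kmeans = KMeans(n_clusters=codebook_size, n_init=10, random_state=seed)
    kmeans.fit(descriptors)
    return kmeans

def encode_image(kmeans, image_descriptors):
    """Map an image's descriptors to a normalized histogram of visual words."""
    words = kmeans.predict(image_descriptors)
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

# Synthetic stand-in for SIFT-like 128-dim descriptors from a training set.
rng = np.random.default_rng(0)
train_descriptors = rng.normal(size=(5000, 128))
codebook = build_codebook(train_descriptors, codebook_size=64)

# Each image becomes a fixed-length vector regardless of its descriptor count.
print(encode_image(codebook, rng.normal(size=(300, 128))).shape)  # (64,)
```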
Fossil evidence for spin alignment of SDSS galaxies in filaments
We search for and find fossil evidence that the spin axes of galaxies in
cosmic web filaments are not randomly oriented relative to their host
filaments. This would indicate that the action of large-scale tidal torques
effected the alignments of galaxies located in cosmic filaments. To
this end, we constructed a catalogue of clean filaments containing edge-on
galaxies. We started by applying the Multiscale Morphology Filter (MMF)
technique to the galaxies in a redshift-distortion corrected version of the
Sloan Digital Sky Survey DR5. From that sample we extracted those 426 filaments
that contained edge-on galaxies (b/a < 0.2). These filaments were then visually
classified relative to a variety of quality criteria. Statistical analysis
using "feature measures" indicates that the distribution of orientations of
these edge-on galaxies relative to their parent filament deviates significantly
from what would be expected on the basis of a random distribution of
orientations. The interpretation of this result may not be immediately
apparent, but it is easy to identify a population of 14 objects whose spin axes
are aligned perpendicular to the spine of the parent filament (cos θ <
0.2). The candidate objects are found in relatively less dense filaments. This
might be expected since galaxies in such locations suffer less interaction with
surrounding galaxies, and consequently better preserve their tidally induced
orientations relative to the parent filament. The technique of searching for
fossil evidence of alignment yields relatively few candidate objects, but it
does not suffer from the dilution effects inherent in correlation analysis of
large samples.
Comment: 20 pages, 19 figures, slightly revised and upgraded version, accepted
for publication by MNRAS. For high-res version see
http://www.astro.rug.nl/~weygaert/SpinAlignJones.rev.pd
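A minimal sketch of the kind of null comparison described above (not the paper's "feature measure" analysis): for isotropically oriented spin axes, |cos θ| relative to a fixed filament spine is uniform on [0, 1], so observed orientations can be tested against that random-orientation null.

```python
import numpy as np
from scipy import stats

def alignment_test(cos_theta):
    """KS test of |cos theta| values against the uniform (random) null."""
    return stats.kstest(np.abs(cos_theta), "uniform")

# Fake data for illustration: mostly random orientations plus an excess of
# perpendicular-aligned objects (cos theta near 0), mimicking the reported
# population; not the SDSS measurements themselves.
rng = np.random.default_rng(1)
sample = np.concatenate([rng.uniform(0, 1, 400), rng.uniform(0, 0.2, 26)])
print(alignment_test(sample))
```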