Evaluating Overfit and Underfit in Models of Network Community Structure
A common data mining task on networks is community detection, which seeks an
unsupervised decomposition of a network into structural groups based on
statistical regularities in the network's connectivity. Although many methods
exist, the No Free Lunch theorem for community detection implies that each
makes some kind of tradeoff, and no algorithm can be optimal on all inputs.
Thus, different algorithms will over- or underfit on different inputs, finding
more, fewer, or just different communities than is optimal, and evaluation
methods that use a metadata partition as a ground truth will produce misleading
conclusions about general accuracy. Here, we present a broad evaluation of over-
and underfitting in community detection, comparing the behavior of 16
state-of-the-art community detection algorithms on a novel and structurally
diverse corpus of 406 real-world networks. We find that (i) algorithms vary
widely both in the number of communities they find and in their corresponding
composition, given the same input, (ii) algorithms can be clustered into
distinct high-level groups based on similarities of their outputs on real-world
networks, and (iii) these differences induce wide variation in accuracy on link
prediction and link description tasks. We introduce a new diagnostic for
evaluating overfitting and underfitting in practice, and use it to roughly
divide community detection methods into general and specialized learning
algorithms. Across methods and inputs, Bayesian techniques based on the
stochastic block model and a minimum description length approach to
regularization represent the best general learning approach, but can be
outperformed under specific circumstances. These results introduce both a
theoretically principled approach to evaluating over- and underfitting in models
of network community structure and a realistic benchmark by which new methods
may be evaluated and compared.
Comment: 22 pages, 13 figures, 3 tables
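As a minimal illustration of finding (i), that different algorithms can return different numbers and compositions of communities on the same input, the sketch below runs two standard methods from networkx on a toy graph. The algorithms and graph are illustrative choices, not the paper's 16-method benchmark or 406-network corpus.

```python
import networkx as nx
from networkx.algorithms.community import (
    greedy_modularity_communities,
    label_propagation_communities,
)

# Two community detection algorithms with different inductive biases,
# applied to the same input graph (Zachary's karate club).
G = nx.karate_club_graph()

mod_comms = list(greedy_modularity_communities(G))  # modularity maximization
lp_comms = list(label_propagation_communities(G))   # label propagation

# The two methods need not agree on the number of communities, let alone
# their composition, echoing the over/underfitting tradeoff discussed above.
print("greedy modularity:", len(mod_comms), "communities")
print("label propagation:", len(lp_comms), "communities")
```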
Multivariate Bernoulli distribution
In this paper, we consider the multivariate Bernoulli distribution as a model
to estimate the structure of graphs with binary nodes. This distribution is
discussed in the framework of the exponential family, and its statistical
properties regarding independence of the nodes are demonstrated. Importantly,
the model can estimate not only the main effects and pairwise interactions
among the nodes but also higher-order interactions, allowing for the
existence of complex clique effects. We compare the multivariate Bernoulli
model with existing graphical inference models, namely the Ising model and
the multivariate Gaussian model, in which only pairwise interactions are
considered. In contrast, the multivariate Bernoulli distribution has the
interesting property that independence and uncorrelatedness of the component
random variables are equivalent. Both the
marginal and conditional distributions of a subset of variables in the
multivariate Bernoulli distribution still follow the multivariate Bernoulli
distribution. Furthermore, the multivariate Bernoulli logistic model is
developed under generalized linear model theory by utilizing the canonical link
function in order to include covariate information on the nodes, edges and
cliques. We also consider variable selection techniques such as LASSO in the
logistic model to impose sparsity structure on the graph. Finally, we discuss
extending the smoothing spline ANOVA approach to the multivariate Bernoulli
logistic model to enable estimation of non-linear effects of the predictor
variables.
Comment: Published at http://dx.doi.org/10.3150/12-BEJSP10 in Bernoulli
(http://isi.cbs.nl/bernoulli/) by the International Statistical
Institute/Bernoulli Society (http://isi.cbs.nl/BS/bshome.htm)
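To make the exponential-family structure concrete, the bivariate case can be written out as a worked example; the notation (f_1, f_2, f_{12}) is chosen here for illustration rather than quoted from the paper.

```latex
% Bivariate Bernoulli pmf in exponential-family form, y_1, y_2 \in \{0,1\}:
% f_1, f_2 are main effects and f_{12} is the pairwise interaction.
P(Y_1 = y_1, Y_2 = y_2)
  = \exp\!\bigl( f_1 y_1 + f_2 y_2 + f_{12}\, y_1 y_2 - A(f) \bigr),
\qquad
A(f) = \log\!\bigl( 1 + e^{f_1} + e^{f_2} + e^{f_1 + f_2 + f_{12}} \bigr).
% Independence of Y_1 and Y_2 holds exactly when f_{12} = 0, which is the
% bivariate instance of the independence/uncorrelatedness equivalence above.
```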
Limitations of the Empirical Fisher Approximation for Natural Gradient Descent
Natural gradient descent, which preconditions a gradient descent update with
the Fisher information matrix of the underlying statistical model, is a way to
capture partial second-order information. Several highly visible works have
advocated an approximation known as the empirical Fisher, drawing connections
between approximate second-order methods and heuristics like Adam. We dispute
this argument by showing that the empirical Fisher, unlike the Fisher, does
not generally capture second-order information. We further argue that the
conditions under which the empirical Fisher approaches the Fisher (and the
Hessian) are unlikely to be met in practice, and that, even on simple
optimization problems, the pathologies of the empirical Fisher can have
undesirable effects.
Comment: V3: Minor corrections (typographic errors)
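For reference, the distinction at issue can be stated compactly in standard notation (a conventional formulation, not quoted from the paper): the Fisher averages outer products of score vectors over the model's own predictive distribution, whereas the empirical Fisher substitutes the observed labels.

```latex
% Fisher information: expectation over the model's predictive distribution.
F(\theta) = \sum_{n} \mathbb{E}_{y \sim p(y \mid x_n, \theta)}
  \left[ \nabla_\theta \log p(y \mid x_n, \theta)\,
         \nabla_\theta \log p(y \mid x_n, \theta)^{\top} \right]

% Empirical Fisher: the observed label y_n replaces the expectation, so the
% two coincide only when the model's predictions match the data distribution.
\widetilde{F}(\theta) = \sum_{n}
  \nabla_\theta \log p(y_n \mid x_n, \theta)\,
  \nabla_\theta \log p(y_n \mid x_n, \theta)^{\top}
```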
A Review of Codebook Models in Patch-Based Visual Object Recognition
The codebook model-based approach, while ignoring any structural aspect in vision, nonetheless provides state-of-the-art performance on current datasets. The key role of a visual codebook is to provide a way to map low-level features into a fixed-length vector in histogram space, to which standard classifiers can be directly applied. The discriminative power of such a visual codebook determines the quality of the codebook model, whereas the size of the codebook controls the complexity of the model. Thus, the construction of a codebook is an important step, usually done by cluster analysis. However, clustering is a process that retains regions of high density in a distribution, so the resulting codebook need not have discriminant properties; clustering is also recognised as a computational bottleneck of such systems. In our recent work, we proposed a resource-allocating codebook that constructs a discriminant codebook in a one-pass design procedure and slightly outperforms more traditional approaches at drastically reduced computing times. In this review we survey several approaches proposed over the last decade, covering their feature detectors, descriptors, codebook construction schemes, choice of classifiers for recognising objects, and the datasets used to evaluate the proposed methods.
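The pipeline described above (cluster local descriptors into a visual vocabulary, then encode each image as a fixed-length histogram of visual words) can be sketched as follows. This uses plain k-means and synthetic descriptors as stand-ins, not the resource-allocating codebook from the review.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(descriptors, codebook_size=256, seed=0):
    """Cluster pooled local descriptors (n_desc x dim) into visual words."""
    kmeans = KMeans(n_clusters=codebook_size, n_init=10, random_state=seed)
    kmeans.fit(descriptors)
    return kmeans

def encode_image(kmeans, image_descriptors):
    """Map an image's descriptors to a normalized histogram of visual words."""
    words = kmeans.predict(image_descriptors)
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

# Synthetic stand-in for SIFT-like 128-dim descriptors from a training set.
rng = np.random.default_rng(0)
train_descriptors = rng.normal(size=(5000, 128))
codebook = build_codebook(train_descriptors, codebook_size=64)

# Each image becomes a fixed-length vector regardless of its descriptor count.
print(encode_image(codebook, rng.normal(size=(300, 128))).shape)  # (64,)
```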
Fossil evidence for spin alignment of SDSS galaxies in filaments
We search for and find fossil evidence that the spin axes of galaxies in
cosmic web filaments are not randomly oriented relative to their host
filaments. This would indicate that the action of large-scale tidal torques
effected the alignments of galaxies located in cosmic filaments. To
this end, we constructed a catalogue of clean filaments containing edge-on
galaxies. We started by applying the Multiscale Morphology Filter (MMF)
technique to the galaxies in a redshift-distortion corrected version of the
Sloan Digital Sky Survey DR5. From that sample we extracted those 426 filaments
that contained edge-on galaxies (b/a < 0.2). These filaments were then visually
classified relative to a variety of quality criteria. Statistical analysis
using "feature measures" indicates that the distribution of orientations of
these edge-on galaxies relative to their parent filament deviates significantly
from what would be expected on the basis of a random distribution of
orientations. The interpretation of this result may not be immediately
apparent, but it is easy to identify a population of 14 objects whose spin axes
are aligned perpendicular to the spine of the parent filament (cos θ <
0.2). The candidate objects are found in relatively less dense filaments. This
might be expected since galaxies in such locations suffer less interaction with
surrounding galaxies, and consequently better preserve their tidally induced
orientations relative to the parent filament. The technique of searching for
fossil evidence of alignment yields relatively few candidate objects, but it
does not suffer from the dilution effects inherent in correlation analysis of
large samples.
Comment: 20 pages, 19 figures, slightly revised and upgraded version, accepted
for publication by MNRAS. For high-res version see
http://www.astro.rug.nl/~weygaert/SpinAlignJones.rev.pd
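A minimal sketch of the kind of null comparison described above (not the paper's "feature measure" analysis): for isotropically oriented spin axes, |cos θ| relative to a fixed filament spine is uniform on [0, 1], so observed orientations can be tested against that random-orientation null.

```python
import numpy as np
from scipy import stats

def alignment_test(cos_theta):
    """KS test of |cos theta| values against the uniform (random) null."""
    return stats.kstest(np.abs(cos_theta), "uniform")

# Fake data for illustration: mostly random orientations plus an excess of
# perpendicular-aligned objects (cos theta near 0), mimicking the reported
# population; not the SDSS measurements themselves.
rng = np.random.default_rng(1)
sample = np.concatenate([rng.uniform(0, 1, 400), rng.uniform(0, 0.2, 26)])
print(alignment_test(sample))
```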