
    On Controlling the Size of Clusters in Probabilistic Clustering

    Classical model-based partitional clustering algorithms, such as k-means or mixtures of Gaussians, provide only loose and indirect control over the size of the resulting clusters. In this work, we present a family of probabilistic clustering models that can be steered towards clusters of a desired size by providing a prior distribution over the possible sizes, allowing the analyst to fine-tune exploratory analysis or to produce clusters of suitable size for downstream processing. Our formulation supports arbitrary multimodal prior distributions, generalizing previous work on clustering algorithms that search for clusters of equal size and algorithms designed for the microclustering task of finding small clusters. We provide practical methods for solving the problem, using integer programming for making the cluster assignments, and demonstrate that we can also automatically infer the number of clusters.
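    The integer-programming step lends itself to a compact illustration. The sketch below is not the authors' implementation: it handles only the degenerate case of a size prior concentrated on fixed target sizes, assigning points to given centroids with SciPy's MILP solver, and assign_with_sizes and its arguments are hypothetical names.

    import numpy as np
    from scipy.optimize import milp, LinearConstraint, Bounds

    def assign_with_sizes(dist, sizes):
        """dist: (n, k) point-to-centroid distances; sizes: target size per cluster."""
        n, k = dist.shape
        assert sum(sizes) == n, "target sizes must account for every point"
        c = dist.ravel()                 # cost of x[i, j]: assign point i to cluster j
        A_point = np.zeros((n, n * k))   # each point joins exactly one cluster
        for i in range(n):
            A_point[i, i * k:(i + 1) * k] = 1.0
        A_cluster = np.zeros((k, n * k))  # each cluster gets exactly its target size
        for j in range(k):
            A_cluster[j, j::k] = 1.0
        res = milp(c,
                   integrality=np.ones(n * k),   # binary assignment variables
                   bounds=Bounds(0, 1),
                   constraints=[LinearConstraint(A_point, 1, 1),
                                LinearConstraint(A_cluster, sizes, sizes)])
        return res.x.reshape(n, k).argmax(axis=1)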

    Effective Unsupervised Author Disambiguation with Relative Frequencies

    This work addresses the problem of author name homonymy in the Web of Science. Aiming for an efficient, simple and straightforward solution, we introduce a novel probabilistic similarity measure for author name disambiguation based on feature overlap. Using the ResearcherIDs available for a subset of the Web of Science, we evaluate the application of this measure in the context of agglomeratively clustering author mentions. We focus on a concise evaluation that shows clearly for which problem setups, and at which point during the clustering process, our approach works best. In contrast to most other works in this field, we are sceptical towards the performance of author name disambiguation methods in general, and we compare our approach to the trivial single-cluster baseline. Our results are presented separately for each correct clustering size, since, when all cases are treated together, the trivial baseline and more sophisticated approaches are hardly distinguishable in terms of evaluation results. Our model shows state-of-the-art performance for all correct clustering sizes, without any discriminative training and with only one convergence parameter to tune.
    Comment: Proceedings of JCDL 201
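    As a rough illustration of the pipeline described, the sketch below scores mention pairs by Jaccard overlap of their feature sets and clusters them agglomeratively with SciPy. The paper's actual probabilistic measure differs; disambiguate, its threshold, and the toy feature sets are invented for the example.

    from itertools import combinations
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    def disambiguate(mentions, threshold=0.5):
        """mentions: list of feature sets (coauthors, venues, ...) per author mention."""
        n = len(mentions)
        dist = np.zeros((n, n))
        for i, j in combinations(range(n), 2):
            union = max(len(mentions[i] | mentions[j]), 1)
            overlap = len(mentions[i] & mentions[j]) / union
            dist[i, j] = dist[j, i] = 1.0 - overlap   # Jaccard distance
        Z = linkage(squareform(dist), method="average")
        return fcluster(Z, t=threshold, criterion="distance")

    labels = disambiguate([{"smith_j", "icml"}, {"smith_j", "nips"}, {"jones_k"}])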

    Evaluating Overfit and Underfit in Models of Network Community Structure

    A common data mining task on networks is community detection, which seeks an unsupervised decomposition of a network into structural groups based on statistical regularities in the network's connectivity. Although many methods exist, the No Free Lunch theorem for community detection implies that each makes some kind of tradeoff, and no algorithm can be optimal on all inputs. Thus, different algorithms will overfit or underfit on different inputs, finding more, fewer, or simply different communities than is optimal, and evaluation methods that use a metadata partition as ground truth will produce misleading conclusions about general accuracy. Here, we present a broad evaluation of overfitting and underfitting in community detection, comparing the behavior of 16 state-of-the-art community detection algorithms on a novel and structurally diverse corpus of 406 real-world networks. We find that (i) algorithms vary widely in both the number of communities they find and their composition, given the same input; (ii) algorithms can be clustered into distinct high-level groups based on similarities of their outputs on real-world networks; and (iii) these differences induce wide variation in accuracy on link prediction and link description tasks. We introduce a new diagnostic for evaluating overfitting and underfitting in practice, and use it to roughly divide community detection methods into general and specialized learning algorithms. Across methods and inputs, Bayesian techniques based on the stochastic block model and a minimum description length approach to regularization represent the best general learning approach, but they can be outperformed under specific circumstances. These results provide both a theoretically principled approach to evaluating overfitting and underfitting in models of network community structure and a realistic benchmark by which new methods may be evaluated and compared.
    Comment: 22 pages, 13 figures, 3 tables
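    The variability in finding (i) is easy to reproduce on a toy graph. The snippet below is not part of the paper's 16-algorithm corpus study; it simply runs two detectors from NetworkX on the karate club network and prints how many communities each returns.

    import networkx as nx
    from networkx.algorithms import community

    G = nx.karate_club_graph()
    detectors = {
        "greedy_modularity": community.greedy_modularity_communities,
        "label_propagation": community.label_propagation_communities,
    }
    for name, detect in detectors.items():
        parts = list(detect(G))
        sizes = sorted(len(p) for p in parts)
        # Same input graph, yet the detectors disagree on how many groups exist.
        print(f"{name}: {len(parts)} communities, sizes {sizes}")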

    A self-learning algorithm for biased molecular dynamics

    A new self-learning algorithm for accelerated dynamics, reconnaissance metadynamics, is proposed that is able to work with a very large number of collective coordinates. Acceleration of the dynamics is achieved by constructing a bias potential in terms of a patchwork of one-dimensional, locally valid collective coordinates. These collective coordinates are obtained from trajectory analyses, so that they adapt to any new features encountered during the simulation. We show how this methodology can be used to enhance sampling in real chemical systems, citing examples both from the physics of clusters and from the biological sciences.
    Comment: 6 pages, 5 figures + 9 pages of supplementary information
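    For readers unfamiliar with metadynamics, the toy sketch below shows the underlying idea that reconnaissance metadynamics builds on: Gaussian hills deposited along a collective variable accumulate into a bias that pushes the system out of already-visited states. All names and parameter values are illustrative, and the one-dimensional setting is a drastic simplification of the patchwork construction.

    import numpy as np

    def bias(s, centers, height=1.0, width=0.1):
        """Sum of deposited Gaussian hills, evaluated at collective variable s."""
        s = np.asarray(s)[..., None]
        return height * np.exp(-0.5 * ((s - np.asarray(centers)) / width) ** 2).sum(-1)

    rng = np.random.default_rng(0)
    centers = []       # hill centers deposited so far
    s = 0.0            # current collective-variable value
    for step in range(1000):
        # Finite-difference force from the bias disfavors visited regions.
        grad = (bias(s + 1e-4, centers) - bias(s - 1e-4, centers)) / 2e-4
        s += -0.01 * grad + 0.05 * rng.normal()   # overdamped dynamics + noise
        if step % 50 == 0:
            centers.append(s)                     # deposit a new hill here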

    Colouring and breaking sticks: random distributions and heterogeneous clustering

    We begin by reviewing some probabilistic results about the Dirichlet process and its close relatives, focussing on their implications for statistical modelling and analysis. We then introduce a class of simple mixture models in which clusters are of different `colours', with statistical characteristics that are constant within colours but different between colours. Thus cluster identities are exchangeable only within colours. The basic form of our model is a variant on the familiar Dirichlet process, and we find that much of the standard modelling and computational machinery associated with the Dirichlet process may be readily adapted to our generalisation. The methodology is illustrated with an application to the partially parametric clustering of gene expression profiles.
    Comment: 26 pages, 3 figures. Chapter 13 of "Probability and Mathematical Genetics: Papers in Honour of Sir John Kingman" (Editors N.H. Bingham and C.M. Goldie), Cambridge University Press, 201
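    The "breaking sticks" of the title refers to the stick-breaking construction of the Dirichlet process, which is compact enough to sketch: weights w_k = v_k * prod_{j<k} (1 - v_j) with v_k ~ Beta(1, alpha). The function name and parameter values below are illustrative.

    import numpy as np

    def stick_breaking(alpha, k_max, seed=0):
        """Truncated stick-breaking sample of Dirichlet process weights."""
        rng = np.random.default_rng(seed)
        v = rng.beta(1.0, alpha, size=k_max)            # fractions broken off
        remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
        return v * remaining                             # weights, summing to ~1

    w = stick_breaking(alpha=2.0, k_max=20)
    print(w.round(3), w.sum())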

    A Survey on Soft Subspace Clustering

    Subspace clustering (SC) is a promising clustering technique that identifies clusters based on their associations with subspaces of high-dimensional spaces. SC can be classified into hard subspace clustering (HSC) and soft subspace clustering (SSC). While HSC algorithms have been extensively studied and are well accepted by the scientific community, SSC algorithms are relatively new but have gained increasing attention in recent years due to their better adaptability. In this paper, a comprehensive survey of existing SSC algorithms and recent developments is presented. The SSC algorithms are classified systematically into three main categories: conventional SSC (CSSC), independent SSC (ISSC) and extended SSC (XSSC). The characteristics of these algorithms are highlighted, and the potential future development of SSC is discussed.
    Comment: This paper has been published in Information Sciences Journal in 201
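    The idea separating SSC from HSC, continuous per-cluster feature weights rather than hard feature selection, can be sketched briefly. The entropy-style weight update below is one common SSC flavour (in the spirit of EWKM-type algorithms), not a method from any single surveyed paper, and all names are illustrative.

    import numpy as np

    def weighted_assign(X, centers, w):
        """Assign each row of X to the cluster with the smallest weighted distance."""
        # d[i, j] = sum_f w[j, f] * (X[i, f] - centers[j, f])**2
        d = np.einsum("jf,ijf->ij", w, (X[:, None, :] - centers[None]) ** 2)
        return d.argmin(axis=1)

    def update_weights(X, centers, labels, gamma=1.0):
        """Entropy-regularized update: features with small within-cluster spread
        receive larger weight (a softmax over -spread / gamma)."""
        k, f = centers.shape
        w = np.zeros((k, f))
        for j in range(k):
            spread = ((X[labels == j] - centers[j]) ** 2).sum(axis=0)
            e = np.exp(-spread / gamma)
            w[j] = e / e.sum()
        return w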

    Do Manufacturing Plants Cluster Across Rural Areas? Evidence from a Probabilistic Modeling Approach

    A statistical procedure for detecting "contagious" location patterns of manufacturing establishments is presented. Industries' tendencies to cluster establishments are ranked based on the dispersion parameter of the negative binomial distribution. Establishment data cover three-digit SIC manufacturing industries in nonmetro counties of BEA Component Economic Areas for 1981 and 1992. Findings indicate that virtually all manufacturing industries cluster establishments in nonmetro areas: approximately two-thirds of the industries had dispersion parameters indicating a high or moderate level of spatial concentration. The propensity to cluster plants in nonmetro CEAs was evident in both 1981 and 1992, though weaker in 1992. Much of the industry clustering in nonmetro areas appears to be attributable to local "natural advantages" rather than to inter-firm spillovers.
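    The dispersion parameter in question can be estimated by the method of moments, since under the NB2 parameterization Var(X) = mu + alpha * mu^2. The sketch below is not the paper's estimation procedure (which may use maximum likelihood); the county counts are made up for illustration.

    import numpy as np

    def nb_dispersion(counts):
        """Solve Var(X) = mu + alpha * mu^2 for alpha by the method of moments."""
        counts = np.asarray(counts, dtype=float)
        mu, var = counts.mean(), counts.var(ddof=1)
        return (var - mu) / mu**2   # alpha <= 0 suggests no overdispersion

    counts = [0, 0, 1, 0, 3, 12, 0, 7, 0, 1]   # made-up plant counts by county
    print(nb_dispersion(counts))                # larger alpha => stronger clustering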

    Bayesian modeling of networks in complex business intelligence problems

    Complex network data problems are increasingly common in many fields of application. Our motivation is drawn from strategic marketing studies monitoring customer choices of specific products, along with co-subscription networks encoding multiple purchasing behavior. Data are available for several agencies within the same insurance company, and our goal is to efficiently exploit co-subscription networks to inform targeted advertising of cross-sell strategies to currently mono-product customers. We address this goal by developing a Bayesian hierarchical model, which clusters agencies according to common mono-product customer choices and co-subscription networks. Within each cluster, we efficiently model customer behavior via a cluster-dependent mixture of latent eigenmodels. This formulation provides key information on mono-product customer choices and multiple purchasing behavior within each cluster, informing targeted cross-sell strategies. We develop simple algorithms for tractable inference, and assess performance in simulations and in an application to business intelligence.
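    A single latent eigenmodel, the building block the cluster-dependent mixture is assumed to combine, is easy to sketch: each node i has a latent position z_i, and an edge between i and j occurs with probability sigmoid(mu + z_i' diag(lambda) z_j). The code below is an illustrative forward simulation with invented names and values, not the authors' inference procedure.

    import numpy as np

    def edge_probs(Z, lam, mu):
        """Z: (n, d) latent positions; lam: (d,) eigenvalues; mu: baseline logit."""
        logits = mu + (Z * lam) @ Z.T    # entry ij = mu + z_i' diag(lam) z_j
        return 1.0 / (1.0 + np.exp(-logits))

    rng = np.random.default_rng(1)
    Z = rng.normal(size=(5, 2))
    P = edge_probs(Z, lam=np.array([1.0, -0.5]), mu=-1.0)
    print(P.round(2))                    # symmetric matrix of edge probabilities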