Jet analysis by Deterministic Annealing
We compare two jet clustering algorithms. The first is the standard Durham algorithm; the second is a global optimization scheme, Deterministic Annealing, widely used in clustering problems, which we adapt to jet identification in particle production at high-energy collisions; in particular, we study hadronic jets in WW production in high-energy electron-positron scattering. Our results are as follows. First, we find that the two procedures give essentially the same particle clustering. Second, we find that CPU time grows much faster with particle multiplicity for the Durham algorithm than for Deterministic Annealing. Since this result follows from the higher computational complexity of the Durham scheme, it should not depend on the particular process studied here and might be significant for jet physics at the LHC as well.
Comment: 15 pages, 4 figures
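The paper's jet-specific distance measure is not reproduced here; as a rough illustration of the annealing idea only, the sketch below shows deterministic annealing clustering with plain Euclidean distances in Python/NumPy. The function name, the initialization via explicit starting centroids, and the geometric cooling schedule are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def deterministic_annealing(points, init_centroids,
                            t_start=10.0, t_stop=0.01, cooling=0.9):
    """Soft clustering by deterministic annealing (illustrative sketch).

    At high temperature T every point is shared among all centroids;
    as T cools, the Gibbs weights harden toward a k-means-like partition.
    """
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(init_centroids, dtype=float).copy()
    t = t_start
    while t > t_stop:
        # squared Euclidean distances, shape (n_points, n_centroids)
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        # Gibbs association probabilities at temperature t
        # (row minimum subtracted for numerical stability)
        w = np.exp(-(d2 - d2.min(axis=1, keepdims=True)) / t)
        w /= w.sum(axis=1, keepdims=True)
        # re-estimate each centroid as the probability-weighted mean of all points
        centroids = (w[:, :, None] * points[:, None, :]).sum(axis=0) \
            / w.sum(axis=0)[:, None]
        t *= cooling
    return centroids, w.argmax(axis=1)
```

Each temperature step is a single vectorized pass over all points, which is the source of the favorable scaling with multiplicity mentioned above; a pairwise recombination scheme like Durham must instead repeatedly scan all particle pairs.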
Communication-optimal distributed clustering
Clustering large datasets is a fundamental problem with a number of
applications in machine learning. Data is often collected on different sites
and clustering needs to be performed in a distributed manner with low
communication. We would like the quality of the clustering in the distributed
setting to match that in the centralized setting for which all the data resides
on a single site. In this work, we study both graph and geometric clustering
problems in two distributed models: (1) a point-to-point model, and (2) a model
with a broadcast channel. We give protocols in both models which we show are
nearly optimal by proving almost matching communication lower bounds. Our work
highlights the surprising power of a broadcast channel for clustering problems;
roughly speaking, to spectrally cluster points or vertices in a graph
distributed across servers, for a worst-case partitioning the communication
complexity in a point-to-point model is , while in the broadcast
model it is . A similar phenomenon holds for the geometric setting as
well. We implement our algorithms and demonstrate this phenomenon on real-life datasets, showing that our algorithms are also very efficient in practice.
Comment: A preliminary version of this paper appeared at the 30th Annual Conference on Neural Information Processing Systems (NIPS), 2016
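The paper's communication-optimal protocols are not reproduced here. To illustrate the general flavor of low-communication distributed clustering in the point-to-point model, here is a standard two-round summary scheme, not the paper's method: each site clusters its own points and ships only k weighted local centroids to a coordinator, which clusters the weighted summaries. All names and parameters are illustrative assumptions:

```python
import numpy as np

def kmeans(points, k, weights=None, iters=50, seed=0):
    """Plain (weighted) Lloyd iterations; used both at the sites and the coordinator."""
    points = np.asarray(points, dtype=float)
    if weights is None:
        weights = np.ones(len(points))
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)].copy()
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iters):
        # assign each point to its nearest centroid
        labels = ((points[:, None, :] - centroids[None, :, :]) ** 2) \
            .sum(axis=2).argmin(axis=1)
        for j in range(k):
            mask = labels == j
            if mask.any():
                centroids[j] = np.average(points[mask], axis=0,
                                          weights=weights[mask])
    return centroids, labels

def distributed_kmeans(sites, k):
    """Each site sends only k weighted centroids (O(k*d) words per site)
    to the coordinator, instead of its raw points."""
    summaries, wts = [], []
    for pts in sites:
        c, lab = kmeans(pts, k)
        summaries.append(c)
        wts.append(np.bincount(lab, minlength=k).astype(float))
    merged, w = np.vstack(summaries), np.concatenate(wts)
    keep = w > 0  # drop empty local clusters
    return kmeans(merged[keep], k, weights=w[keep])[0]
```

The point of the sketch is the communication pattern: per-site cost is independent of the number of points, at the price of approximation; the paper's contribution is protocols whose cost is provably near the lower bound.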
Solution-space structure of (some) optimization problems
We study numerically the cluster structure of random ensembles of two NP-hard
optimization problems originating in computational complexity, the vertex-cover
problem and the number partitioning problem. We use branch-and-bound type
algorithms to obtain exact solutions of these problems for moderate system
sizes. Using two methods, direct neighborhood-based clustering and hierarchical
clustering, we investigate the structure of the solution space. The main result
is that the correspondence between solution structure and the phase diagrams of
the problems is not unique. Namely, for vertex cover we observe a drastic
change of the solution space from large single clusters to multiple nested
levels of clusters. In contrast, for the number-partitioning problem, the phase space always looks very simple, resembling a random distribution of the lowest-energy configurations. This holds in the ``easy''/solvable phase as well
as in the ``hard''/unsolvable phase.
Comment: 10 pages, 5 figures, Fig. 4 in reduced quality to reduce size, Proceedings of the International Workshop on Statistical-Mechanical Informatics 2007, Kyoto (Japan), September 16-19, 2007
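The abstract does not spell out the direct neighborhood-based clustering it uses; a common way to realize it, sketched here under the assumption that solutions are bit strings and that neighbors differ in at most a given number of variables, is to take connected components of the neighborhood graph. The function names and the default threshold are illustrative:

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions in which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))

def solution_clusters(solutions, max_dist=1):
    """Neighborhood-based clustering: two solutions belong to the same
    cluster if they differ in at most `max_dist` variables; clusters are
    the connected components of the resulting neighborhood graph,
    found here with a small union-find structure."""
    parent = list(range(len(solutions)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(solutions)), 2):
        if hamming(solutions[i], solutions[j]) <= max_dist:
            parent[find(i)] = find(j)  # union the two components

    groups = {}
    for i in range(len(solutions)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```

On an exhaustively enumerated ground-state set this directly yields the cluster counts and sizes whose scaling with system size is the object of study.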
Data Mining Using the Crossing Minimization Paradigm
Our capacity to generate, record, and store multi-dimensional, apparently unstructured data is increasing rapidly, while the cost of data storage is falling. The recorded data is not perfect: noise is introduced from different sources, most basically as incorrectly recorded values and missing values. The formal study of discovering useful hidden information in data is called Data Mining.
Because of their size and complexity, practical data mining problems are best attempted by automatic means.
Data Mining can be categorized into two types: supervised learning (classification) and unsupervised learning (clustering). Clustering only the records in a database (or data matrix) gives a global view of the data and is called one-way clustering. For a detailed analysis, or a local view, the records and the attributes must be clustered simultaneously; this is known as biclustering, co-clustering, or two-way clustering.
In this dissertation, a novel, fast, white-noise-tolerant data mining solution is proposed based on the Crossing Minimization (CM) paradigm; the solution works for one-way as well as two-way clustering and discovers overlapping biclusters. For decades the CM paradigm has traditionally been used in graph drawing and in VLSI (Very Large Scale Integration) circuit design to reduce wire length and congestion. The utility of the proposed technique is demonstrated by comparing it with other biclustering techniques on simulated noisy data as well as real data from agriculture, biology, and other domains.
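The dissertation's own CM algorithm is not reproduced here; the classic workhorse for crossing minimization, and a plausible building block for the clustering use described above, is the barycenter heuristic. The sketch below applies it to the bipartite graph of a 0/1 data matrix, alternately reordering rows and columns so that related ones drift together and bicluster blocks surface along the diagonal. Function names and the sweep count are illustrative assumptions:

```python
import numpy as np

def barycenter_biorder(m, sweeps=5):
    """Barycenter heuristic for crossing minimization on the bipartite
    graph of a 0/1 data matrix: alternately sort rows and columns by the
    mean current position of their 1-entries. Returns row and column
    permutations that expose block structure."""
    m = np.asarray(m, dtype=float)
    rows = np.arange(m.shape[0])
    cols = np.arange(m.shape[1])
    for _ in range(sweeps):
        sub = m[np.ix_(rows, cols)]
        # barycenter of each row = mean position of its 1-entries
        rb = (sub * np.arange(sub.shape[1])).sum(axis=1) \
            / np.maximum(sub.sum(axis=1), 1)
        rows = rows[np.argsort(rb, kind="stable")]
        sub = m[np.ix_(rows, cols)]
        # same for the columns, in the updated row order
        cb = (sub.T * np.arange(sub.shape[0])).sum(axis=1) \
            / np.maximum(sub.T.sum(axis=1), 1)
        cols = cols[np.argsort(cb, kind="stable")]
    return rows, cols
```

Each sweep is linear in the number of 1-entries plus the sorting cost, which is what makes CM-style methods fast compared with search-based biclustering.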
Two other interesting and hard problems addressed in this dissertation are (i) the Minimum Attribute Subset Selection (MASS) problem and (ii) the Bandwidth Minimization (BWM) problem for sparse matrices. The proposed CM technique is shown to give very convincing results on these problems using real public-domain data.
Pakistan is the fourth-largest supplier of cotton in the world. An apparent anomaly was observed between cotton yield and pesticide consumption in Pakistan during 1989-97, with unexpected periods of negative correlation. By applying the proposed CM technique for one-way clustering to real Agro-Met data (2001-2002), a possible explanation of this anomaly is presented in this thesis.