Jet analysis by Deterministic Annealing
We compare two jet clustering algorithms. The first is the standard Durham algorithm; the second is a global optimization scheme, Deterministic Annealing, widely used in clustering problems, which we adapt to jet identification in particle production at high-energy collisions; in particular, we study hadronic jets in WW production in high-energy electron-positron scattering. Our results are as follows. First, we find that the two procedures give essentially the same particle clustering. Second, we find that CPU time grows much faster with particle multiplicity for the Durham algorithm than for Deterministic Annealing. Since this result follows from the higher computational complexity of the Durham scheme, it should not depend on the particular process studied here and might be significant for jet physics at the LHC as well.
Comment: 15 pages, 4 figures
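The paper's jet-specific distance measure is not reproduced here; as a rough illustration of the annealing idea only, the sketch below shows deterministic annealing clustering with plain Euclidean distances in Python/NumPy. The function name, the initialization via explicit starting centroids, and the geometric cooling schedule are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def deterministic_annealing(points, init_centroids,
                            t_start=10.0, t_stop=0.01, cooling=0.9):
    """Soft clustering by deterministic annealing (illustrative sketch).

    At high temperature T every point is shared among all centroids;
    as T cools, the Gibbs weights harden toward a k-means-like partition.
    """
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(init_centroids, dtype=float).copy()
    t = t_start
    while t > t_stop:
        # squared Euclidean distances, shape (n_points, n_centroids)
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        # Gibbs association probabilities at temperature t
        # (row minimum subtracted for numerical stability)
        w = np.exp(-(d2 - d2.min(axis=1, keepdims=True)) / t)
        w /= w.sum(axis=1, keepdims=True)
        # re-estimate each centroid as the probability-weighted mean of all points
        centroids = (w[:, :, None] * points[:, None, :]).sum(axis=0) \
            / w.sum(axis=0)[:, None]
        t *= cooling
    return centroids, w.argmax(axis=1)
```

Each temperature step is a single vectorized pass over all points, which is the source of the favorable scaling with multiplicity mentioned above; a pairwise recombination scheme like Durham must instead repeatedly scan all particle pairs.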
Communication-optimal distributed clustering
Clustering large datasets is a fundamental problem with a number of
applications in machine learning. Data is often collected on different sites
and clustering needs to be performed in a distributed manner with low
communication. We would like the quality of the clustering in the distributed
setting to match that in the centralized setting for which all the data resides
on a single site. In this work, we study both graph and geometric clustering
problems in two distributed models: (1) a point-to-point model, and (2) a model
with a broadcast channel. We give protocols in both models which we show are
nearly optimal by proving almost matching communication lower bounds. Our work
highlights the surprising power of a broadcast channel for clustering problems;
roughly speaking, to spectrally cluster points or vertices in a graph
distributed across servers, for a worst-case partitioning the communication
complexity in a point-to-point model is , while in the broadcast
model it is . A similar phenomenon holds for the geometric setting as
well. We implement our algorithms and demonstrate this phenomenon on real-life datasets, showing that our algorithms are also very efficient in practice.
Comment: A preliminary version of this paper appeared at the 30th Annual Conference on Neural Information Processing Systems (NIPS), 2016
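The paper's communication-optimal protocols are not reproduced here. To illustrate the general flavor of low-communication distributed clustering in the point-to-point model, here is a standard two-round summary scheme, not the paper's method: each site clusters its own points and ships only k weighted local centroids to a coordinator, which clusters the weighted summaries. All names and parameters are illustrative assumptions:

```python
import numpy as np

def kmeans(points, k, weights=None, iters=50, seed=0):
    """Plain (weighted) Lloyd iterations; used both at the sites and the coordinator."""
    points = np.asarray(points, dtype=float)
    if weights is None:
        weights = np.ones(len(points))
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)].copy()
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iters):
        # assign each point to its nearest centroid
        labels = ((points[:, None, :] - centroids[None, :, :]) ** 2) \
            .sum(axis=2).argmin(axis=1)
        for j in range(k):
            mask = labels == j
            if mask.any():
                centroids[j] = np.average(points[mask], axis=0,
                                          weights=weights[mask])
    return centroids, labels

def distributed_kmeans(sites, k):
    """Each site sends only k weighted centroids (O(k*d) words per site)
    to the coordinator, instead of its raw points."""
    summaries, wts = [], []
    for pts in sites:
        c, lab = kmeans(pts, k)
        summaries.append(c)
        wts.append(np.bincount(lab, minlength=k).astype(float))
    merged, w = np.vstack(summaries), np.concatenate(wts)
    keep = w > 0  # drop empty local clusters
    return kmeans(merged[keep], k, weights=w[keep])[0]
```

The point of the sketch is the communication pattern: per-site cost is independent of the number of points, at the price of approximation; the paper's contribution is protocols whose cost is provably near the lower bound.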
Solution-space structure of (some) optimization problems
We study numerically the cluster structure of random ensembles of two NP-hard
optimization problems originating in computational complexity, the vertex-cover
problem and the number partitioning problem. We use branch-and-bound type
algorithms to obtain exact solutions of these problems for moderate system
sizes. Using two methods, direct neighborhood-based clustering and hierarchical
clustering, we investigate the structure of the solution space. The main result
is that the correspondence between solution structure and the phase diagrams of
the problems is not unique. Namely, for vertex cover we observe a drastic
change of the solution space from large single clusters to multiple nested
levels of clusters. In contrast, for the number-partitioning problem, the phase space always looks very simple, resembling a random distribution of the lowest-energy configurations. This holds in the ``easy''/solvable phase as well
as in the ``hard''/unsolvable phase.
Comment: 10 pages, 5 figures, Fig. 4 in reduced quality to reduce size, Proceedings of the International Workshop on Statistical-Mechanical Informatics 2007, Kyoto (Japan), September 16-19, 2007
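The abstract does not spell out the direct neighborhood-based clustering it uses; a common way to realize it, sketched here under the assumption that solutions are bit strings and that neighbors differ in at most a given number of variables, is to take connected components of the neighborhood graph. The function names and the default threshold are illustrative:

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions in which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))

def solution_clusters(solutions, max_dist=1):
    """Neighborhood-based clustering: two solutions belong to the same
    cluster if they differ in at most `max_dist` variables; clusters are
    the connected components of the resulting neighborhood graph,
    found here with a small union-find structure."""
    parent = list(range(len(solutions)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(solutions)), 2):
        if hamming(solutions[i], solutions[j]) <= max_dist:
            parent[find(i)] = find(j)  # union the two components

    groups = {}
    for i in range(len(solutions)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```

On an exhaustively enumerated ground-state set this directly yields the cluster counts and sizes whose scaling with system size is the object of study.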
Data Mining Using the Crossing Minimization Paradigm
Our capacity to generate, record, and store multi-dimensional, apparently unstructured data is increasing rapidly, while the cost of data storage is falling. The recorded data is not perfect: noise is introduced from different sources, most basically as incorrectly recorded values and missing values. The formal study of discovering useful hidden information in data is called Data Mining.
Because of their size and complexity, practical data mining problems are best attempted by automatic means.
Data Mining can be categorized into two types: supervised learning (classification) and unsupervised learning (clustering). Clustering only the records in a database (or data matrix) gives a global view of the data and is called one-way clustering. For a detailed analysis, or a local view, the records and the attributes must be clustered simultaneously; this is known as biclustering, co-clustering, or two-way clustering.
In this dissertation, a novel, fast, white-noise-tolerant data mining solution is proposed based on the Crossing Minimization (CM) paradigm; the solution works for one-way as well as two-way clustering and discovers overlapping biclusters. For decades the CM paradigm has traditionally been used in graph drawing and in VLSI (Very Large Scale Integration) circuit design to reduce wire length and congestion. The utility of the proposed technique is demonstrated by comparing it with other biclustering techniques on simulated noisy data as well as real data from agriculture, biology, and other domains.
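The dissertation's own CM algorithm is not reproduced here; the classic workhorse for crossing minimization, and a plausible building block for the clustering use described above, is the barycenter heuristic. The sketch below applies it to the bipartite graph of a 0/1 data matrix, alternately reordering rows and columns so that related ones drift together and bicluster blocks surface along the diagonal. Function names and the sweep count are illustrative assumptions:

```python
import numpy as np

def barycenter_biorder(m, sweeps=5):
    """Barycenter heuristic for crossing minimization on the bipartite
    graph of a 0/1 data matrix: alternately sort rows and columns by the
    mean current position of their 1-entries. Returns row and column
    permutations that expose block structure."""
    m = np.asarray(m, dtype=float)
    rows = np.arange(m.shape[0])
    cols = np.arange(m.shape[1])
    for _ in range(sweeps):
        sub = m[np.ix_(rows, cols)]
        # barycenter of each row = mean position of its 1-entries
        rb = (sub * np.arange(sub.shape[1])).sum(axis=1) \
            / np.maximum(sub.sum(axis=1), 1)
        rows = rows[np.argsort(rb, kind="stable")]
        sub = m[np.ix_(rows, cols)]
        # same for the columns, in the updated row order
        cb = (sub.T * np.arange(sub.shape[0])).sum(axis=1) \
            / np.maximum(sub.T.sum(axis=1), 1)
        cols = cols[np.argsort(cb, kind="stable")]
    return rows, cols
```

Each sweep is linear in the number of 1-entries plus the sorting cost, which is what makes CM-style methods fast compared with search-based biclustering.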
Two other interesting and hard problems addressed in this dissertation are (i) the Minimum Attribute Subset Selection (MASS) problem and (ii) the Bandwidth Minimization (BWM) problem for sparse matrices. The proposed CM technique is shown to give very convincing results on these problems using real public-domain data.
Pakistan is the fourth-largest supplier of cotton in the world. An apparent anomaly was observed between cotton yield and pesticide consumption in Pakistan during 1989-97, with unexpected periods of negative correlation. By applying the proposed CM technique for one-way clustering to real Agro-Met data (2001-2002), a possible explanation of this anomaly is presented in this thesis.