225,662 research outputs found

    Jet analysis by Deterministic Annealing

    Get PDF
    We perform a comparison of two jet clusterization algorithms. The first one is the standard Durham algorithm and the second one is a global optimization scheme, Deterministic Annealing, often used in clusterization problems, and adapted to the problem of jet identification in particle production by high energy collisions; in particular we study hadronic jets in WW production by high energy electron positron scattering. Our results are as follows. First, we find that the two procedures give basically the same output as far as the particle clusterization is concerned. Second, we find that the increase of CPU time with the particle multiplicity is much faster for the Durham jet clustering algorithm in comparison with Deterministic Annealing. Since this result follows from the higher computational complexity of the Durham scheme, it should not depend on the particular process studied here and might be significant for jet physics at LHC as well.Comment: 15 pages, 4 figure

    Communication-optimal distributed clustering

    Get PDF
    Clustering large datasets is a fundamental problem with a number of applications in machine learning. Data is often collected on different sites and clustering needs to be performed in a distributed manner with low communication. We would like the quality of the clustering in the distributed setting to match that in the centralized setting for which all the data resides on a single site. In this work, we study both graph and geometric clustering problems in two distributed models: (1) a point-to-point model, and (2) a model with a broadcast channel. We give protocols in both models which we show are nearly optimal by proving almost matching communication lower bounds. Our work highlights the surprising power of a broadcast channel for clustering problems; roughly speaking, to spectrally cluster nn points or nn vertices in a graph distributed across ss servers, for a worst-case partitioning the communication complexity in a point-to-point model is nsn \cdot s, while in the broadcast model it is n+sn + s. A similar phenomenon holds for the geometric setting as well. We implement our algorithms and demonstrate this phenomenon on real life datasets, showing that our algorithms are also very efficient in practice.Comment: A preliminary version of this paper appeared at the 30th Annual Conference on Neural Information Processing Systems (NIPS), 201

    Solution-space structure of (some) optimization problems

    Full text link
    We study numerically the cluster structure of random ensembles of two NP-hard optimization problems originating in computational complexity, the vertex-cover problem and the number partitioning problem. We use branch-and-bound type algorithms to obtain exact solutions of these problems for moderate system sizes. Using two methods, direct neighborhood-based clustering and hierarchical clustering, we investigate the structure of the solution space. The main result is that the correspondence between solution structure and the phase diagrams of the problems is not unique. Namely, for vertex cover we observe a drastic change of the solution space from large single clusters to multiple nested levels of clusters. In contrast, for the number-partitioning problem, the phase space looks always very simple, similar to a random distribution of the lowest-energy configurations. This holds in the ``easy''/solvable phase as well as in the ``hard''/unsolvable phase.Comment: 10 pages, 5 figures, Fig. 4 in reduced quality to reduce size, Proceedings of the International Workshop on Statistical-Mechanical Informatics 2007, Kyoto (Japan) September 16-19, 200

    Data Mining Using the Crossing Minimization Paradigm

    Get PDF
    Our ability and capacity to generate, record and store multi-dimensional, apparently unstructured data is increasing rapidly, while the cost of data storage is going down. The data recorded is not perfect, as noise gets introduced in it from different sources. Some of the basic forms of noise are incorrect recording of values and missing values. The formal study of discovering useful hidden information in the data is called Data Mining. Because of the size, and complexity of the problem, practical data mining problems are best attempted using automatic means. Data Mining can be categorized into two types i.e. supervised learning or classification and unsupervised learning or clustering. Clustering only the records in a database (or data matrix) gives a global view of the data and is called one-way clustering. For a detailed analysis or a local view, biclustering or co-clustering or two-way clustering is required involving the simultaneous clustering of the records and the attributes. In this dissertation, a novel fast and white noise tolerant data mining solution is proposed based on the Crossing Minimization (CM) paradigm; the solution works for one-way as well as two-way clustering for discovering overlapping biclusters. For decades the CM paradigm has traditionally been used for graph drawing and VLSI (Very Large Scale Integration) circuit design for reducing wire length and congestion. The utility of the proposed technique is demonstrated by comparing it with other biclustering techniques using simulated noisy, as well as real data from Agriculture, Biology and other domains. Two other interesting and hard problems also addressed in this dissertation are (i) the Minimum Attribute Subset Selection (MASS) problem and (ii) Bandwidth Minimization (BWM) problem of sparse matrices. The proposed CM technique is demonstrated to provide very convincing results while attempting to solve the said problems using real public domain data. Pakistan is the fourth largest supplier of cotton in the world. An apparent anomaly has been observed during 1989-97 between cotton yield and pesticide consumption in Pakistan showing unexpected periods of negative correlation. By applying the indigenous CM technique for one-way clustering to real Agro-Met data (2001-2002), a possible explanation of the anomaly has been presented in this thesis
    corecore