783 research outputs found

    Enabling computation of correlation bounds for finite-dimensional quantum systems via symmetrisation

    We present a technique that reduces, by several orders of magnitude, the computational requirements of evaluating semidefinite relaxations that bound the set of quantum correlations arising from finite-dimensional Hilbert spaces. The technique, which we make publicly available through a user-friendly software package, exploits symmetries present in the optimisation problem to reduce the number of variables and the block sizes in semidefinite relaxations. It is widely applicable to problems encountered in quantum information theory and enables computations that were previously too demanding. We demonstrate its advantages and general applicability in several physical problems. In particular, we use it to robustly certify the non-projectiveness of high-dimensional measurements in a black-box scenario based on self-tests of d-dimensional symmetric informationally complete POVMs. Comment: A. T. and D. R. contributed equally to this project.

    Design and analysis of algorithms for similarity search based on intrinsic dimension

    Similarity search is one of the most fundamental operations in data mining tasks such as classification, cluster analysis, and anomaly detection. It has been used in numerous fields of application such as multimedia, information retrieval, recommender systems and pattern recognition. Specifically, a similarity query aims to retrieve from the database the objects most similar to a query object, where the underlying similarity measure is usually expressed as a distance function. The cost of processing similarity queries has typically been assessed in terms of the representational dimension of the data involved, that is, the number of features used to represent individual data objects. A high representational dimension generally results in a significant increase in the processing cost of similarity queries. This relation is often attributed to an amalgamation of phenomena, collectively referred to as the curse of dimensionality. However, the observed effects of dimensionality in practice may not be as severe as expected. This has led to the development of models that quantify the complexity of data in terms of some measure of intrinsic dimensionality. The generalized expansion dimension (GED) is one such model; it estimates the intrinsic dimension in the vicinity of a query point q by observing the ranks and distances of pairs of neighbors with respect to q. This dissertation is mainly concerned with the design and analysis of search algorithms based on the GED model. In particular, three variants of the similarity search problem are considered: adaptive similarity search, flexible aggregate similarity search, and subspace similarity search. The good practical performance of the proposed algorithms demonstrates the effectiveness of dimensionality-driven design of search algorithms.
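
As a rough illustration of how an intrinsic dimension can be read off from the ranks and distances of two neighbors of q, the sketch below uses the commonly cited two-ball form of the expansion-dimension estimate, ln(k2/k1) / ln(r2/r1), where r1 < r2 are the distances from q to its k1-th and k2-th nearest neighbors. The function names and the choice of Euclidean distance are assumptions made for this example; the code is not taken from the dissertation.

```python
import math


def ged_estimate(k1, r1, k2, r2):
    """Two-ball expansion-dimension estimate: ln(k2/k1) / ln(r2/r1).

    k1 < k2 are neighbor ranks and r1 < r2 the matching distances from q.
    """
    if not (0 < k1 < k2 and 0 < r1 < r2):
        raise ValueError("need 0 < k1 < k2 and 0 < r1 < r2")
    return math.log(k2 / k1) / math.log(r2 / r1)


def ged_from_neighbors(query, points, k1, k2):
    """Estimate the dimension around `query` from its k1-th and k2-th neighbors.

    `points` must not contain the query itself (hypothetical helper).
    """
    dists = sorted(math.dist(query, p) for p in points)
    return ged_estimate(k1, dists[k1 - 1], k2, dists[k2 - 1])
```

For points spread evenly along a line the estimate comes out near 1, and for counts that grow with the square of the radius it comes out near 2, whatever the number of coordinates used to store the points.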

    Coping With New Challenges for Density-Based Clustering

    Knowledge Discovery in Databases (KDD) is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. The core step of the KDD process is the application of a data mining algorithm in order to produce a particular enumeration of patterns and relationships in large databases. Clustering is one of the major data mining tasks and aims at grouping the data objects into meaningful classes (clusters) such that the similarity of objects within clusters is maximized, and the similarity of objects from different clusters is minimized. Among many other approaches, the density-based clustering notion underlying the algorithm DBSCAN and its hierarchical extension OPTICS, proposed recently, is one of the most successful approaches to clustering. In this thesis, our aim is to advance the state of the art in clustering, especially density-based clustering, by identifying novel challenges for density-based clustering and proposing innovative and solid solutions for these challenges. We describe the development of the industrial prototype BOSS (Browsing OPTICS plots for Similarity Search), which is a first step towards developing a comprehensive, scalable and distributed computing solution designed to make the efficiency and analytical capabilities of OPTICS available to a broader audience. The development of BOSS requires several key enhancements of OPTICS, which are addressed in this thesis. We develop incremental versions of OPTICS to efficiently reconstruct the hierarchical clustering structure in frequently updated databases, in particular when a set of objects is inserted into or deleted from the database. We empirically show that these incremental algorithms yield significant speed-up factors over the original OPTICS algorithm.
Furthermore, we propose a novel algorithm for automatic extraction of clusters from hierarchical clustering representations that outperforms comparative methods, and introduce two novel approaches for selecting meaningful representatives, using the density-based concepts of OPTICS and producing better results than the related medoid approach. Another major challenge for density-based clustering is to cope with high-dimensional data. Many of today's real-world data sets contain a large number of measurements (or features) for a single data object. Usually, global feature reduction techniques cannot be applied to these data sets. Thus, the task of feature selection must be combined with and incorporated into the clustering process. In this thesis, we present original extensions and enhancements of the density-based clustering notion to cope with high-dimensional data. In particular, we propose an algorithm called SUBCLU (density-based SUBspace CLUstering) that extends DBSCAN to the problem of subspace clustering. SUBCLU efficiently computes all clusters that would have been found if DBSCAN were applied to all possible subspaces of the feature space. An experimental evaluation on real-world data sets illustrates that SUBCLU is more effective than existing subspace clustering algorithms because it is able to find clusters of arbitrary size and shape, and produces deterministic results. A semi-hierarchical extension of SUBCLU called RIS (Ranking Interesting Subspaces) is proposed that does not compute the subspace clusters directly, but generates a list of subspaces ranked by their clustering characteristics. A hierarchical clustering algorithm can be applied to these interesting subspaces in order to compute a hierarchical (subspace) clustering. A comparative evaluation of RIS and SUBCLU shows that RIS in combination with OPTICS can achieve an information gain over SUBCLU.
In addition, we propose the algorithm 4C (Computing Correlation Connected Clusters) that extends the concepts of DBSCAN to compute density-based correlation clusters. 4C benefits from an innovative, well-defined and effective clustering model, outperforming related approaches in terms of clustering quality on real-world data sets.
German abstract (translated): Knowledge Discovery in Databases (KDD) is the process of (semi-)automatically extracting knowledge from databases that is valid, previously unknown, and potentially useful for a given application. The central step of the KDD process is data mining. One of the most important data mining tasks is clustering, in which the objects of a database are partitioned into groups (clusters) such that objects within a cluster are as similar as possible and objects from different clusters are as dissimilar as possible. Among a multitude of clustering approaches, the density-based clustering model and the algorithms DBSCAN and OPTICS built on it are among the most successful clustering methods. In this dissertation we aim to advance the state of the art in clustering, and in density-based clustering in particular. To this end, we identify new challenges for the density-based clustering model and propose innovative solutions to them. The first focus of this work is the development of the industrial prototype BOSS (Browsing OPTICS plots for Similarity Search). BOSS is a first contribution towards a comprehensive, scalable and distributed software solution that makes the efficiency advantages and analytical capabilities of the density-based hierarchical clustering algorithm OPTICS available to a broad audience.
Developing BOSS requires three key extensions of OPTICS. We develop an incremental version of OPTICS that efficiently reorganizes the hierarchical clustering structure after a database update (insertion or deletion of a set of objects). Experiments with synthetic and real data show that the proposed incremental algorithms achieve significant speed-up factors over the original OPTICS algorithm. Furthermore, we propose a new algorithm for automatic cluster extraction from hierarchical representations and two innovative methods for automatically selecting suitable cluster representatives. In tests on several real databases, our new techniques achieve better results than the competing methods. High-dimensional feature spaces pose a further challenge for clustering methods. Thanks to modern data-collection techniques, real data sets often contain a very large number of features. Some of these features are often subject to noise or dependencies and usually cannot be filtered out in advance, because these effects are pronounced differently in different parts of the database. The choice of features must therefore be coupled with the data mining method. In this work we present innovative extensions of the density-based clustering model for high-dimensional data. We develop SUBCLU (density-based SUBspace CLUstering), a subspace clustering algorithm based on DBSCAN. SUBCLU efficiently produces all clusters that would be found if DBSCAN were applied to all possible subspaces of the data set. Experiments on real data show that SUBCLU is more effective than comparable algorithms.
RIS (Ranking Interesting Subspaces), a semi-hierarchical extension of SUBCLU, is proposed; it no longer computes the subspace clusters directly but produces a list of subspaces ranked by their clustering quality. This allows hierarchical partitionings to be produced on selected subspaces. Experiments show that RIS in combination with OPTICS achieves an information gain over SUBCLU. In addition, we present the novel correlation clustering algorithm 4C (Computing Correlation Connected Clusters). 4C is based on an innovative and well-defined clustering model and, in our experiments with real data, achieves better results than comparable clustering approaches.
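
For readers unfamiliar with the density-based notion that DBSCAN, OPTICS and SUBCLU build on, the following is a minimal sketch of plain DBSCAN: a point with at least `min_pts` points (itself included) within distance `eps` is a core point, and clusters grow by collecting everything reachable through core points. It is a brute-force illustration with Euclidean distance and hypothetical function names, not the thesis's implementation, and it covers none of the incremental, subspace or correlation extensions described above.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, p in enumerate(points) if math.dist(points[i], p) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned to a cluster or marked as noise
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)  # j is a core point: keep growing
    return labels
```

Each call to `region_query` scans the whole data set, so this sketch is quadratic in the number of points; practical implementations usually rely on a spatial index instead.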

    A study of two problems in data mining: projective clustering and multiple tables association rules mining.

    Ng Ka Ka. Thesis (M.Phil.)--Chinese University of Hong Kong, 2002. Includes bibliographical references (leaves 114-120). Abstracts in English and Chinese.

    Abstract
    Acknowledgement
    Chapter I --- Projective Clustering
    Chapter 1 --- Introduction to Projective Clustering
    Chapter 2 --- Related Work to Projective Clustering
        Chapter 2.1 --- CLARANS - Graph Abstraction and Bounded Optimization
            Chapter 2.1.1 --- Graph Abstraction
            Chapter 2.1.2 --- Bounded Optimized Random Search
        Chapter 2.2 --- OptiGrid - Grid Partitioning Approach and Density Estimation Function
            Chapter 2.2.1 --- Empty Space Phenomenon
            Chapter 2.2.2 --- Density Estimation Function
            Chapter 2.2.3 --- Upper Bound Property
        Chapter 2.3 --- CLIQUE and ENCLUS - Subspace Clustering
            Chapter 2.3.1 --- Monotonicity Property of Subspaces
        Chapter 2.4 --- PROCLUS Projective Clustering
        Chapter 2.5 --- ORCLUS - Generalized Projective Clustering
            Chapter 2.5.1 --- Singular Value Decomposition SVD
        Chapter 2.6 --- An "Optimal" Projective Clustering
    Chapter 3 --- EPC: Efficient Projective Clustering
        Chapter 3.1 --- Motivation
        Chapter 3.2 --- Notations and Definitions
            Chapter 3.2.1 --- Density Estimation Function
            Chapter 3.2.2 --- 1-d Histogram
            Chapter 3.2.3 --- 1-d Dense Region
            Chapter 3.2.4 --- Signature Q
        Chapter 3.3 --- The overall framework
        Chapter 3.4 --- Major Steps
            Chapter 3.4.1 --- Histogram Generation
            Chapter 3.4.2 --- Adaptive discovery of dense regions
            Chapter 3.4.3 --- Count the occurrences of signatures
            Chapter 3.4.4 --- Find the most frequent signatures
            Chapter 3.4.5 --- Refine the top 3m signatures
        Chapter 3.5 --- Time and Space Complexity
    Chapter 4 --- EPCH: An extension and generalization of EPC
        Chapter 4.1 --- Motivation of the extension
        Chapter 4.2 --- Distinguish clusters by their projections in different subspaces
        Chapter 4.3 --- EPCH: a generalization of EPC by building histogram with higher dimensionality
            Chapter 4.3.1 --- Multidimensional histograms construction and dense regions detection
            Chapter 4.3.2 --- Compressing data objects to signatures
            Chapter 4.3.3 --- Merging Similar Signature Entries
            Chapter 4.3.4 --- Associating membership degree
            Chapter 4.3.5 --- The choice of Dimensionality d of the Histogram
        Chapter 4.4 --- Implementation of EPC2
        Chapter 4.5 --- Time and Space Complexity of EPCH
    Chapter 5 --- Experimental Results
        Chapter 5.1 --- Clustering Quality Measurement
        Chapter 5.2 --- Synthetic Data Generation
        Chapter 5.3 --- Experimental setup
        Chapter 5.4 --- Comparison between EPC and PROCLUS
        Chapter 5.5 --- Comparison between EPCH and ORCLUS
            Chapter 5.5.1 --- Dimensionality of the original space and the associated subspace
            Chapter 5.5.2 --- Projection not parallel to original axes
            Chapter 5.5.3 --- Data objects belong to more than one cluster under fuzzy clustering
        Chapter 5.6 --- Scalability of EPC
        Chapter 5.7 --- Scalability of EPC2
    Chapter 6 --- Conclusion
    Chapter II --- Multiple Tables Association Rules Mining
    Chapter 7 --- Introduction to Multiple Tables Association Rule Mining
        Chapter 7.1 --- Problem Statement
    Chapter 8 --- Related Work to Multiple Tables Association Rules Mining
        Chapter 8.1 --- Apriori - A Bottom-up approach to generate candidate sets
        Chapter 8.2 --- VIPER - Vertical Mining with various optimization techniques
            Chapter 8.2.1 --- Vertical TID Representation and Mining
            Chapter 8.2.2 --- FORC
        Chapter 8.3 --- Frequent Itemset Counting across Multiple Tables
    Chapter 9 --- The Proposed Method
        Chapter 9.1 --- Notations
        Chapter 9.2 --- Converting Dimension Tables to internal representation
        Chapter 9.3 --- The idea of discovering frequent itemsets without joining
        Chapter 9.4 --- Overall Steps
        Chapter 9.5 --- Binding multiple Dimension Tables
        Chapter 9.6 --- Prefix Tree for FT
        Chapter 9.7 --- Maintaining frequent itemsets in FI-trees
        Chapter 9.8 --- Frequency Counting
    Chapter 10 --- Experiments
        Chapter 10.1 --- Synthetic Data Generation
        Chapter 10.2 --- Experimental Findings
    Chapter 11 --- Conclusion and Future Works
    Bibliography

    HARP: A practical projected clustering algorithm

    In high-dimensional data, clusters can exist in subspaces that hide themselves from traditional clustering methods. A number of algorithms have been proposed to identify such projected clusters, but most of them rely on some user parameters to guide the clustering process. The clustering accuracy can be seriously degraded if incorrect values are used. Unfortunately, in real situations, it is rarely possible for users to supply the parameter values accurately, which causes practical difficulties in applying these algorithms to real data. In this paper, we analyze the major challenges of projected clustering and suggest why these algorithms need to depend heavily on user parameters. Based on the analysis, we propose a new algorithm that exploits the clustering status to adjust the internal thresholds dynamically without the assistance of user parameters. According to the results of extensive experiments on real and synthetic data, the new method has excellent accuracy and usability. It outperformed the other algorithms even when correct parameter values were artificially supplied to them. The encouraging results suggest that projected clustering can be a practical tool for various kinds of real applications.

    Algorithms for data mining

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2006. Includes bibliographical references (p. 81-89). Data of massive size are now available in a wide variety of fields and come with great promise. In theory, these massive data sets allow data mining and exploration on a scale previously unimaginable. However, in practice, it can be difficult to apply classic data mining techniques to such massive data sets due to their sheer size. In this thesis, we study three algorithmic problems in data mining with consideration to the analysis of massive data sets. Our work is both theoretical and experimental: we design algorithms and prove guarantees for their performance and also give experimental results on real data sets. The three problems we study are: 1) finding a matrix of low rank that approximates a given matrix, 2) clustering high-dimensional points into subsets whose points lie in the same subspace, and 3) clustering objects by pairwise similarities/distances. By Grant J. Wang. Ph.D.

    Improving k-nn search and subspace clustering based on local intrinsic dimensionality

    In several novel applications such as multimedia and recommender systems, data is often represented as object feature vectors in high-dimensional spaces. High-dimensional data is a persistent challenge for state-of-the-art algorithms because of the so-called curse of dimensionality. As the dimensionality increases, the discriminative ability of similarity measures diminishes to the point where many data analysis algorithms that depend on them, such as similarity search and clustering, lose their effectiveness. One way to handle this challenge is to select the most important features, which is essential for providing compact object representations as well as improving the overall search and clustering performance. Compact feature vectors can further reduce the storage space and the computational complexity of search and learning tasks. Support-Weighted Intrinsic Dimensionality (support-weighted ID) is a promising new feature selection criterion that estimates the contribution of each feature to the overall intrinsic dimensionality. Support-weighted ID identifies relevant features locally for each object, and penalizes those features that have locally lower discriminative power as well as higher density. In effect, support-weighted ID measures the ability of each feature to locally discriminate between objects in the dataset. Based on support-weighted ID, this dissertation makes three main research contributions. First, it proposes NNWID-Descent, a similarity graph construction method that uses the support-weighted ID criterion to identify and retain relevant features locally for each object and to enhance the overall graph quality.
Second, with the aim of improving the accuracy and performance of cluster analysis, this dissertation introduces k-LIDoids, a subspace clustering algorithm that extends the utility of support-weighted ID within a clustering framework in order to gradually select the subset of informative and important features per cluster. k-LIDoids constructs clusters while also finding a low-dimensional subspace for each cluster. Finally, using the compact object and cluster representations from NNWID-Descent and k-LIDoids, this dissertation defines LID-Fingerprint, a new binary fingerprinting and multi-level indexing framework for high-dimensional data. LID-Fingerprint can be used to hide information as a defence against passive adversaries, as well as to provide efficient and secure similarity search and retrieval for data stored in the cloud. When compared to other state-of-the-art algorithms, the good practical performance provides evidence for the effectiveness of the proposed algorithms for data in high-dimensional spaces.

    Low-Density Cluster Separators for Large, High-Dimensional, Mixed and Non-Linearly Separable Data.

    The location of groups of similar observations (clusters) in data is a well-studied problem with many practical applications. There is a wide range of approaches to clustering, which rely on different definitions of similarity and are appropriate for datasets with different characteristics. Despite a rich literature, there exist a number of open problems in clustering, and limitations to existing algorithms. This thesis develops methodology for clustering high-dimensional, mixed datasets with complex clustering structures, using low-density cluster separators that bi-partition datasets using cluster boundaries that pass through regions of minimal density, separating the regions of high probability density associated with clusters. The bi-partitions arising from a succession of minimum density cluster separators are combined using divisive hierarchical and partitional algorithms, to locate a complete clustering while estimating the number of clusters. The proposed algorithms locate cluster separators using one-dimensional arbitrarily oriented subspaces, circumventing the challenges associated with clustering in high-dimensional spaces. This requires continuous observations; thus, to extend the applicability of the proposed algorithms to mixed datasets, methods for producing an appropriate continuous representation of datasets containing non-continuous features are investigated. The exact evaluation of the density intersected by a cluster boundary is restricted to linear separators. This limitation is lifted by a non-linear mapping of the original observations into a feature space, in which a linear separator permits the correct identification of non-linearly separable clusters in the original dataset. In large, high-dimensional datasets, searching for one-dimensional subspaces that result in a minimum density separator is computationally expensive.
Therefore, a computationally efficient approach to low-density cluster separation using approximately optimal projection directions is proposed, which searches over a collection of one-dimensional random projections for an appropriate subspace for cluster identification. The proposed approaches produce high-quality partitions that are competitive with well-established and state-of-the-art algorithms.
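
The idea of searching random one-dimensional projections for a low-density separator can be sketched as follows. This is a deliberately simplified illustration, not the thesis's algorithm: it draws random unit directions, projects the data, estimates the projected density with a fixed-bandwidth Gaussian kernel, and keeps the direction and split point with the lowest density. Restricting candidate split points to the interquartile range of the projections is an assumption added here as a crude guard against trivial splits in the tails; the function names and the defaults for `n_directions`, `n_candidates` and `h` are likewise hypothetical.

```python
import math
import random


def kde(values, x, h):
    """Gaussian kernel density estimate of the 1-D sample `values` at x."""
    c = 1.0 / (len(values) * h * math.sqrt(2.0 * math.pi))
    return c * sum(math.exp(-0.5 * ((x - v) / h) ** 2) for v in values)


def random_unit_vector(dim, rng):
    """Direction drawn uniformly from the unit sphere in `dim` dimensions."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def project(w, x):
    """Scalar projection of point x onto direction w."""
    return sum(a * b for a, b in zip(w, x))


def low_density_split(data, n_directions=50, n_candidates=25, h=0.5, seed=0):
    """Return (direction, split point, 0/1 labels) of the sparsest split found."""
    rng = random.Random(seed)
    dim = len(data[0])
    best = None  # (density at split, direction, split point)
    for _ in range(n_directions):
        w = random_unit_vector(dim, rng)
        proj = sorted(project(w, x) for x in data)
        lo = proj[len(proj) // 4]        # lower quartile of the projections
        hi = proj[(3 * len(proj)) // 4]  # upper quartile of the projections
        for t in range(n_candidates):
            b = lo + (hi - lo) * t / (n_candidates - 1)
            density = kde(proj, b, h)
            if best is None or density < best[0]:
                best = (density, w, b)
    _, w, b = best
    labels = [int(project(w, x) > b) for x in data]
    return w, b, labels
```

Applying the split recursively to each side would give a divisive hierarchy of the kind the abstract describes, although a stopping rule for estimating the number of clusters is not part of this sketch.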