7 research outputs found

    Spectral Clustering: An Empirical Study of Approximation Algorithms and its Application to the Attrition Problem

    Clustering is the problem of separating a set of objects into groups (called clusters) so that objects within the same cluster are more similar to each other than to those in different clusters. Spectral clustering is a well-known clustering method that uses the spectrum of the data similarity matrix to perform this separation. Since the method relies on solving an eigenvector problem, it is computationally expensive for large datasets. To overcome this constraint, approximation methods have been developed which aim to reduce running time while maintaining accurate classification. In this article, we summarize and experimentally evaluate several approximation methods for spectral clustering. From an applications standpoint, we employ spectral clustering to solve the so-called attrition problem, where one aims to distinguish, within a set of employees, those who are likely to leave the company voluntarily from those who are not. Our study sheds light on the empirical performance of existing approximate spectral clustering methods and shows the applicability of these methods to an important business-optimization problem.
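To make the "eigenvector problem" in the abstract above concrete, here is a minimal sketch of exact (not approximate) spectral bipartitioning. It is an illustration only, not the method evaluated in the article: the function name `spectral_bipartition`, the Gaussian similarity, and the bandwidth parameter `sigma` are assumptions chosen for the example.

```python
import numpy as np

def spectral_bipartition(X, sigma=1.0):
    """Split the rows of X into two clusters using the second eigenvector
    of the normalized graph Laplacian (hypothetical helper for illustration)."""
    # Gaussian similarity matrix W (an assumed choice of similarity).
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)

    # Symmetric normalized Laplacian: L = I - D^{-1/2} W D^{-1/2}.
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L = np.eye(len(X)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

    # The full eigendecomposition is the expensive step for large datasets.
    # np.linalg.eigh returns eigenvalues in ascending order.
    _, vecs = np.linalg.eigh(L)

    # Rescale the second eigenvector by D^{-1/2} and split by sign.
    second = vecs[:, 1] * d_inv_sqrt
    return (second > 0).astype(int)
```

The dense eigendecomposition costs on the order of n^3 operations for n points, which is the kind of cost that approximation methods aim to reduce.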

    A Review on Data Clustering Algorithms for Mixed Data

    Clustering is the unsupervised classification of patterns into groups (clusters). The clustering problem has been addressed in many contexts and by researchers in many disciplines; this reflects its broad appeal and usefulness as one of the steps in exploratory data analysis. In general, clustering is a method of dividing data into groups of similar objects. One of the significant research areas in data mining is the development of methods that update knowledge by using existing knowledge, since this can generally improve mining efficiency, especially for very large databases. Data mining uncovers hidden, previously unknown, and potentially useful information from large amounts of data. This paper presents a general survey of various clustering algorithms. In addition, the paper describes the efficiency of the Self-Organizing Map (SOM) algorithm in improving mixed-data clustering.

    Applications of a Graph Theoretic Based Clustering Framework in Computer Vision and Pattern Recognition

    Recently, several clustering algorithms have been used to solve a variety of problems from different disciplines. This dissertation aims to address different challenging tasks in computer vision and pattern recognition by casting them as clustering problems. We propose novel approaches to multi-target tracking, visual geo-localization, and outlier detection using a unified underlying clustering framework, i.e., dominant set clustering and its extensions, and report superior results over several state-of-the-art approaches. Comment: doctoral dissertation

    Efficient out-of-sample extension of dominant-set clusters

    Dominant sets are a new graph-theoretic concept that has proven to be relevant in pairwise data clustering problems, such as image segmentation. They generalize the notion of a maximal clique to edge-weighted graphs and have intriguing, non-trivial connections to continuous quadratic optimization and spectral-based grouping. We address the problem of grouping out-of-sample examples after the clustering process has taken place. This may serve either to drastically reduce the computational burden associated with processing very large data sets, or to deal efficiently with dynamic situations in which data sets need to be updated continually. We show that the very notion of a dominant set offers a simple and efficient way of doing this. Numerical experiments on various grouping problems show the effectiveness of the approach.

    Algorithmic Results for Clustering and Refined Physarum Analysis

    In the first part of this thesis, we study the Binary $\ell_0$-Rank-$k$ problem, which, given a binary matrix $A$ and a positive integer $k$, seeks a rank-$k$ binary matrix $B$ minimizing the number of non-zero entries of $A-B$. A central open question is whether this problem admits a polynomial time approximation scheme. We give an affirmative answer to this question by designing the first randomized almost-linear time approximation scheme for constant $k$ over the reals, $\mathbb{F}_2$, and the Boolean semiring. In addition, we give novel algorithms for important variants of $\ell_0$-low rank approximation. The second part of this dissertation studies a popular and successful heuristic, known as Approximate Spectral Clustering (ASC), for partitioning the nodes of a graph $G$ into clusters with small conductance. We give a comprehensive analysis, showing that ASC runs efficiently and yields a good approximation of an optimal $k$-way node partition of $G$. In the final part of this thesis, we present two results on slime mold computations: i) the continuous undirected Physarum dynamics converges for undirected linear programs with a non-negative cost vector; and ii) for the discrete directed Physarum dynamics, we give a refined analysis that yields strengthened and close to optimal convergence rate bounds, and shows that the model can be initialized with any strongly dominating point.
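As a compact restatement of the Binary $\ell_0$-Rank-$k$ problem described in this abstract, the objective can be sketched as below. The matrix dimensions $n \times m$ and the notation $\|\cdot\|_0$ for the number of non-zero entries are assumptions introduced here for illustration; the rank is understood over whichever structure is considered (the reals, $\mathbb{F}_2$, or the Boolean semiring).

```latex
\[
  \min_{\substack{B \in \{0,1\}^{n \times m} \\ \operatorname{rank}(B) = k}}
  \; \|A - B\|_0 ,
  \qquad
  \|M\|_0 := \bigl|\{(i,j) : M_{ij} \neq 0\}\bigr| .
\]
```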