8 research outputs found

    Approximation algorithms for clustering and facility location problems

    Get PDF
    In this thesis we design and analyze algorithms for various facility location and clustering problems. The problems we study are NP-Hard and therefore, assuming P is not equal NP, there do not exist polynomial time algorithms to solve them optimally. One approach to cope with the intractability of these problems is to design approximation algorithms which run in polynomial-time and output a near-optimal solution for all instances of the problem. However these algorithms do not always work well in practice. Often heuristics with no explicit approximation guarantee perform quite well. To bridge this gap between theory and practice, and to design algorithms that are tuned for instances arising in practice, there is an increasing emphasis on beyond worst-case analysis. In this thesis we consider both these approaches. In the first part we design worst case approximation algorithms for Uniform Submodular Facility Location (USFL), and Capacitated k-center (CapKCenter) problems. USFL is a generalization of the well-known Uncapacitated Facility Location problem. In USFL the cost of opening a facility is a submodular function of the clients assigned to it (the function is identical for all facilities). We show that a natural greedy algorithm (which gives constant factor approximation for Uncapacitated Facility Location and other facility location problems) has a lower bound of log(n), where n is the number of clients. We present an O(log^2 k) approximation algorithm where k is the number of facilities. The algorithm is based on rounding a convex relaxation. We further consider several special cases of the problem and give improved approximation bounds for them. The CapKCenter problem is an extension of the well-known k-center problem: each facility has a maximum capacity on the number of clients that can be assigned to it. We obtain a 9-approximation for this problem via a linear programming (LP) rounding procedure. Our result, combined with previously known lower bounds, almost settles the integrality gap for a natural LP relaxation. In the second part we consider several well-known clustering problems like k-center, k-median, k-means and their corresponding outlier variants. We use beyond worst-case analysis due to the practical relevance of these problems. In particular we show that when the input instances are 2-perturbation resilient (i.e. the optimal solution does not change when the distances change by a multiplicative factor of 2), the LP integrality gap for k-center (and also asymmetric k-center) is 1. We further introduce a model of perturbation resilience for clustering with outliers. Under this new model, we show that previous results (including our LP integrality result) known for clustering under perturbation resilience also extend for clustering with outliers. This leads to a dynamic programming based heuristic for k-means with outliers (k-means-outlier) which gives an optimal solution when the instance is 2-perturbation resilient. We propose two more algorithms for k-means-outlier — a sampling based algorithm which gives an O(1) approximation when the optimal clusters are not “too small”, and an LP rounding algorithm which gives an O(1) approximation at the expense of violating the number of clusters and outliers by a small constant. We empirically study our proposed algorithms on several clustering datasets

    Theoretical Analysis of Hierarchical Clustering and the Shadow Vertex Algorithm

    Get PDF
    Agglomerative clustering (AC) is a very popular greedy method for computing hierarchical clusterings in practice, yet its theoretical properties have been studied relatively little. We consider AC with respect to the most popular objective functions, especially the diameter function, the radius function and the k-means function. Given a finite set P of points in Rd, AC starts with each point from P in a cluster of its own and then iteratively merges two clusters from the current clustering that minimize the respective objective function when merged into a single cluster. We study the problem of partitioning P into k clusters such that the largest diameter of the clusters is minimized and we prove that AC computes an O(1)-approximation for this problem for any metric that is induced by a norm, assuming that the dimension d is a constant. This improves the best previously known bound of O(log k) due to Ackermann et al. Our bound also carries over to the k-center and the continuous k-center problem. Moreover we study the behavior of agglomerative clustering for the hierarchical k-means problem. We show that AC computes a 2-approximation with respect to the k-means objective function if the optimal k-clustering is well separated. If additionally the optimal clustering also satisfies a balance condition, then AC fully recovers the optimum solution. These results hold in arbitrary dimension. We accompany our positive results with a lower bound of Ω((3/2)^d) for data sets in Rd that holds if no separation is guaranteed, and with lower bounds when the guaranteed separation is not sufficiently strong. Finally, we show that AC produces an O(1)-approximative clustering for one-dimensional data sets. Apart from AC we provide improved and in some cases new general upper and lower bounds on the existence of hierarchical clusterings. For the objective function discrete radius we provide a new lower bound of 2 and improve the upper bound of 4. For the k-means objective function we state a lower bound of 32 on the existence of hierarchical clusterings. This improves the best previously known bound of 576. The simplex algorithm is probably the most popular algorithm for solving linear pro grams in practice. It is determined by so called pivot rules. The shadow vertex simplex algorithm is a popular pivot rule which has gained attention in recent years because it was shown to have polynomial running time in the model of smoothed complexity. In the second part of the dissertation we show that the shadow vertex simplex algorithm can be used to solve linear programs in strongly polynomial time with respect to the number n of variables, the number m of constraints, and 1/δ, where δ is a parameter that measures the flatness of the vertices of the polyhedron. This extends a previous result that the shadow vertex algorithm finds paths of polynomial length (w.r.t. n, m, and 1/δ) between two given vertices of a polyhedron. Our result also complements a result due to Eisenbrand and Vempala who have shown that a certain version of the random edge pivot rule solves linear programs with a running time that is strongly polynomial in the number of variables n and 1/δ, but independent of the number m of constraints. Even though the running time of our algorithm depends on m, it is significantly faster for the important special case of totally unimodular linear programs, for which 1/δ is smaller or equal than n and which have only O(n2) constraints

    On algorithms for large-scale graph and clustering problems

    Get PDF
    Gegenstand dieser Arbeit sind algorithmische Methoden der modernen Datenanalyse. Dabei werden vorwiegend zwei übergeordnete Themen behandelt: Datenstromalgorithmen mit Kompressionseigenschaften und Approximationsalgorithmen für Clusteringverfahren. Datenstromalgorithmen verarbeiten einen Datensatz sequentiell und haben das Ziel, Eigenschaften des Datensatzes (approximativ) zu bestimmen, ohne dabei den gesamten Datensatz abzuspeichern. Unter Clustering versteht man die Partitionierung eines Datensatzes in verschiedene Gruppen. Das erste dargestellte Problem betrifft Matching in Graphen. Hier besteht der Datensatz aus einer Folge von Einfüge- und Löschoperationen von Kanten. Die Aufgabe besteht darin, die Größe des so genannten Maximum Matchings so genau wie möglich zu bestimmen. Es wird ein Algorithmus vorgestellt, der, unter der Annahme, dass das Matching höchstens die Größe k hat, die exakte Größe bestimmt und dabei k² Speichereinheiten benötigt. Dieser Algorithmus lässt sich weiterhin verwenden um eine konstante Approximation der Matchinggröße in planaren Graphen zu bestimmen. Des Weiteren werden untere Schranken für den benötigten Speicherplatz bestimmt und eine Reduktion von gewichtetem Matching zu ungewichteten Matching durchgeführt. Anschließend werden Datenstromalgorithmen für die Nachbarschaftssuche betrachtet, wobei die Aufgabe darin besteht, für n gegebene Mengen die Paare mit hoher Ähnlichkeit in nahezu Linearzeit zu finden. Dabei ist der Jaccard Index |A ∩ B|/|A U B| das Ähnlichkeitsmaß für zwei Mengen A und B. In der Arbeit wird eine Datenstruktur beschrieben, die dies erstmalig in dynamischen Datenströmen mit geringem Speicherplatzverbrauch leistet. Dabei werden Zufallszahlen mit nur 2-facher Unabhängigkeit verwendet, was eine sehr effiziente Implementierung ermöglicht. Das dritte Problem befindet sich an der Schnittstelle zwischen den beiden Themen dieser Arbeit und betrifft das k-center Clustering Problem in Datenströmen mit einem Zeitfenster. Die Aufgabe besteht darin k Zentren zu finden, sodass die maximale Distanz unter allen Punkten zu dem jeweils nächsten Zentrum minimiert wird. Ergebnis sind ein 6-Approximationalgorithmus für ein beliebiges k und ein optimaler 4-Approximationsalgorithmus für k = 2. Die entwickelten Techniken lassen sich ebenfalls auf das Durchmesserproblem anwenden und ermöglichen für dieses Problem einen optimalen Algorithmus. Danach werden Clusteringprobleme bezüglich der Jaccard Distanz analysiert. Dabei sind wieder eine Menge N von Teilmengen aus einer Grundgesamtheit U sind und die Aufgabe besteht darin eine Teilmenge CC zu finden, die max 1-|X ∩ C|/|X U C| minimiert. Es wird gezeigt, dass zwar eine exakte Lösung des Problems NP-schwer ist, es aber gleichzeitig eine PTAS gibt. Abschließend wird die weit verbreitete lokale Suchheuristik für k-median und k-means Clustering untersucht. Obwohl es im Allgemeinen schwer ist, diese Probleme exakt oder auch nur approximativ zu lösen, gelten sie in der Praxis als relativ gut handhabbar, was andeutet, dass die Härteresultate auf pathologischen Eingaben beruhen. Auf Grund dieser Diskrepanz gab es in der Vergangenheit praxisrelevante Datensätze zu charakterisieren. Für drei der wichtigsten Charakterisierungen wird das Verhalten einer lokalen Suchheuristik untersucht mit dem Ergebnis, dass die lokale Suchheuristik in diesen Fällen optimale oder fast optimale Cluster ermittelt

    Perturbation Resilient Clustering for k-Center and Related Problems via LP Relaxations

    Get PDF
    We consider clustering in the perturbation resilience model that has been studied since the work of Bilu and Linial [Yonatan Bilu and Nathan Linial, 2010] and Awasthi, Blum and Sheffet [Awasthi et al., 2012]. A clustering instance I is said to be alpha-perturbation resilient if the optimal solution does not change when the pairwise distances are modified by a factor of alpha and the perturbed distances satisfy the metric property - this is the metric perturbation resilience property introduced in [Angelidakis et al., 2017] and a weaker requirement than prior models. We make two high-level contributions. - We show that the natural LP relaxation of k-center and asymmetric k-center is integral for 2-perturbation resilient instances. We belive that demonstrating the goodness of standard LP relaxations complements existing results [Maria{-}Florina Balcan et al., 2016; Angelidakis et al., 2017] that are based on new algorithms designed for the perturbation model. - We define a simple new model of perturbation resilience for clustering with outliers. Using this model we show that the unified MST and dynamic programming based algorithm proposed in [Angelidakis et al., 2017] exactly solves the clustering with outliers problem for several common center based objectives (like k-center, k-means, k-median) when the instances is 2-perturbation resilient. We further show that a natural LP relxation is integral for 2-perturbation resilient instances of k-center with outliers
    corecore