89 research outputs found

    Approximation algorithms for clustering and facility location problems

    Get PDF
    In this thesis we design and analyze algorithms for various facility location and clustering problems. The problems we study are NP-Hard and therefore, assuming P is not equal NP, there do not exist polynomial time algorithms to solve them optimally. One approach to cope with the intractability of these problems is to design approximation algorithms which run in polynomial-time and output a near-optimal solution for all instances of the problem. However these algorithms do not always work well in practice. Often heuristics with no explicit approximation guarantee perform quite well. To bridge this gap between theory and practice, and to design algorithms that are tuned for instances arising in practice, there is an increasing emphasis on beyond worst-case analysis. In this thesis we consider both these approaches. In the first part we design worst case approximation algorithms for Uniform Submodular Facility Location (USFL), and Capacitated k-center (CapKCenter) problems. USFL is a generalization of the well-known Uncapacitated Facility Location problem. In USFL the cost of opening a facility is a submodular function of the clients assigned to it (the function is identical for all facilities). We show that a natural greedy algorithm (which gives constant factor approximation for Uncapacitated Facility Location and other facility location problems) has a lower bound of log(n), where n is the number of clients. We present an O(log^2 k) approximation algorithm where k is the number of facilities. The algorithm is based on rounding a convex relaxation. We further consider several special cases of the problem and give improved approximation bounds for them. The CapKCenter problem is an extension of the well-known k-center problem: each facility has a maximum capacity on the number of clients that can be assigned to it. We obtain a 9-approximation for this problem via a linear programming (LP) rounding procedure. Our result, combined with previously known lower bounds, almost settles the integrality gap for a natural LP relaxation. In the second part we consider several well-known clustering problems like k-center, k-median, k-means and their corresponding outlier variants. We use beyond worst-case analysis due to the practical relevance of these problems. In particular we show that when the input instances are 2-perturbation resilient (i.e. the optimal solution does not change when the distances change by a multiplicative factor of 2), the LP integrality gap for k-center (and also asymmetric k-center) is 1. We further introduce a model of perturbation resilience for clustering with outliers. Under this new model, we show that previous results (including our LP integrality result) known for clustering under perturbation resilience also extend for clustering with outliers. This leads to a dynamic programming based heuristic for k-means with outliers (k-means-outlier) which gives an optimal solution when the instance is 2-perturbation resilient. We propose two more algorithms for k-means-outlier — a sampling based algorithm which gives an O(1) approximation when the optimal clusters are not “too small”, and an LP rounding algorithm which gives an O(1) approximation at the expense of violating the number of clusters and outliers by a small constant. We empirically study our proposed algorithms on several clustering datasets

    Clustering Under Perturbation Stability in Near-Linear Time

    Get PDF
    We consider the problem of center-based clustering in low-dimensional Euclidean spaces under the perturbation stability assumption. An instance is ?-stable if the underlying optimal clustering continues to remain optimal even when all pairwise distances are arbitrarily perturbed by a factor of at most ?. Our main contribution is in presenting efficient exact algorithms for ?-stable clustering instances whose running times depend near-linearly on the size of the data set when ? ? 2 + ?3. For k-center and k-means problems, our algorithms also achieve polynomial dependence on the number of clusters, k, when ? ? 2 + ?3 + ? for any constant ? > 0 in any fixed dimension. For k-median, our algorithms have polynomial dependence on k for ? > 5 in any fixed dimension; and for ? ? 2 + ?3 in two dimensions. Our algorithms are simple, and only require applying techniques such as local search or dynamic programming to a suitably modified metric space, combined with careful choice of data structures

    Robust Communication-Optimal Distributed Clustering Algorithms

    Get PDF
    In this work, we study the k-median and k-means clustering problems when the data is distributed across many servers and can contain outliers. While there has been a lot of work on these problems for worst-case instances, we focus on gaining a finer understanding through the lens of beyond worst-case analysis. Our main motivation is the following: for many applications such as clustering proteins by function or clustering communities in a social network, there is some unknown target clustering, and the hope is that running a k-median or k-means algorithm will produce clusterings which are close to matching the target clustering. Worst-case results can guarantee constant factor approximations to the optimal k-median or k-means objective value, but not closeness to the target clustering. Our first result is a distributed algorithm which returns a near-optimal clustering assuming a natural notion of stability, namely, approximation stability [Awasthi and Balcan, 2014], even when a constant fraction of the data are outliers. The communication complexity is O~(sk+z) where s is the number of machines, k is the number of clusters, and z is the number of outliers. Next, we show this amount of communication cannot be improved even in the setting when the input satisfies various non-worst-case assumptions. We give a matching Omega(sk+z) lower bound on the communication required both for approximating the optimal k-means or k-median cost up to any constant, and for returning a clustering that is close to the target clustering in Hamming distance. These lower bounds hold even when the data satisfies approximation stability or other common notions of stability, and the cluster sizes are balanced. Therefore, Omega(sk+z) is a communication bottleneck, even for real-world instances

    Resilience for Asynchronous Iterative Methods for Sparse Linear Systems

    Get PDF
    Large scale simulations are used in a variety of application areas in science and engineering to help forward the progress of innovation. Many spend the vast majority of their computational time attempting to solve large systems of linear equations; typically arising from discretizations of partial differential equations that are used to mathematically model various phenomena. The algorithms used to solve these problems are typically iterative in nature, and making efficient use of computational time on High Performance Computing (HPC) clusters involves constantly improving these iterative algorithms. Future HPC platforms are expected to encounter three main problem areas: scalability of code, reliability of hardware, and energy efficiency of the platform. The HPC resources that are expected to run the large programs are planned to consist of billions of processing units that come from more traditional multicore processors as well as a variety of different hardware accelerators. This growth in parallelism leads to the presence of all three problems. Previously, work on algorithm development has focused primarily on creating fault tolerance mechanisms for traditional iterative solvers. Recent work has begun to revisit using asynchronous methods for solving large scale applications, and this dissertation presents research into fault tolerance for fine-grained methods that are asynchronous in nature. Classical convergence results for asynchronous methods are revisited and modified to account for the possible occurrence of a fault, and a variety of techniques for recovery from the effects of a fault are proposed. Examples of how these techniques can be used are shown for various algorithms, including an analysis of a fine-grained algorithm for computing incomplete factorizations. Lastly, numerous modeling and simulation tools for the further construction of iterative algorithms for HPC applications are developed, including numerical models for simulating faults and a simulation framework that can be used to extrapolate the performance of algorithms towards future HPC systems

    Exploring the mechanisms behind nondiffusive transport in a simple turbulence model

    Get PDF
    Thesis (Ph.D.) University of Alaska Fairbanks, 2017Elements for nondiffusive transport have been identified in a plasma turbulence model based on the slab drift-wave model. Motivated by the self-organized criticality paradigm, a standard set of drift-wave equations in doubly-periodic spatial domain has been elevated to include a flux-driven background profile with critical gradients. The profile is maintained by the turbulence induced flux from the source to the sink. Tracers that follow the Lagrangian trajectories are the primary transport characterization technique. The competition between down-gradient relaxations and self-generated flows highlights the dual reactions to local steepening of profile gradients, which leads to different transport regimes. An additional external sheared flow further inhibits down-gradient transfer and acts as another critical threshold condition that can lead to flow-driven instabilities. Superdiffusive transport is observed primarily when radial relaxation events dominate while subdiffusive character become more prominent with self-generated and external poloidal flows. Diffusive transport exists when the superdiffusive and subdiffusive components are in balance. The interplay between turbulent relaxation and self-generated sheared poloidal flows, that form the basis for the transport explored in this model, is absent unless a flux-driven setup is used. Most of the rich dynamics were not present when running the simplified model without an equation for background profile evolution. Nondiffusive transport characteristics can also be recovered from a passive scalar field that is advected by the turbulent flow with an inherent diffusivity. The spread of a highly localized cloud of tracers and a passive scalar field reasserts the equivalence between the Lagrangian and quasi-Lagrangian frames. The coincidence between the passive scalar field with the tracers provide a regime of validity where existing experimental technique can be used to characterize transport from two-dimensional experimental data. The results from this work highlight the key features of flux-driven turbulent transport leading to nondiffusive transport. Specifcally, the dual reactions to the local steepening of profile gradients exposes the multiscale feature of turbulent transport that becomes more apparent under a flux-driven profile. The quantification of nondiffusive transport characteristics from the evolution of a passive scalar can have important implication towards the fundamental understanding of fluid turbulence and turbulent transport

    27th Annual European Symposium on Algorithms: ESA 2019, September 9-11, 2019, Munich/Garching, Germany

    Get PDF

    Efficient Algorithms for Mumford-Shah and Potts Problems

    Get PDF
    In this work, we consider Mumford-Shah and Potts models and their higher order generalizations. Mumford-Shah and Potts models are among the most well-known variational approaches to edge-preserving smoothing and partitioning of images. Though their formulations are intuitive, their application is not straightforward as it corresponds to solving challenging, particularly non-convex, minimization problems. The main focus of this thesis is the development of new algorithmic approaches to Mumford-Shah and Potts models, which is to this day an active field of research. We start by considering the situation for univariate data. We find that switching to higher order models can overcome known shortcomings of the classical first order models when applied to data with steep slopes. Though the existing approaches to the first order models could be applied in principle, they are slow or become numerically unstable for higher orders. Therefore, we develop a new algorithm for univariate Mumford-Shah and Potts models of any order and show that it solves the models in a stable way in O(n^2). Furthermore, we develop algorithms for the inverse Potts model. The inverse Potts model can be seen as an approach to jointly reconstructing and partitioning images that are only available indirectly on the basis of measured data. Further, we give a convergence analysis for the proposed algorithms. In particular, we prove the convergence to a local minimum of the underlying NP-hard minimization problem. We apply the proposed algorithms to numerical data to illustrate their benefits. Next, we apply the multi-channel Potts prior to the reconstruction problem in multi-spectral computed tomography (CT). To this end, we propose a new superiorization approach, which perturbs the iterates of the conjugate gradient method towards better results with respect to the Potts prior. In numerical experiments, we illustrate the benefits of the proposed approach by comparing it to the existing Potts model approach from the literature as well as to the existing total variation type methods. Hereafter, we consider the second order Mumford-Shah model for edge-preserving smoothing of images which –similarly to the univariate case– improves upon the classical Mumford-Shah model for images with linear color gradients. Based on reformulations in terms of Taylor jets, i.e. specific fields of polynomials, we derive discrete second order Mumford-Shah models for which we develop an efficient algorithm using an ADMM scheme. We illustrate the potential of the proposed method by comparing it with existing methods for the second order Mumford-Shah model. Further, we illustrate its benefits in connection with edge detection. Finally, we consider the affine-linear Potts model for the image partitioning problem. As many images possess linear trends within homogeneous regions, the classical Potts model frequently leads to oversegmentation. The affine-linear Potts model accounts for that problem by allowing for linear trends within segments. We lift the corresponding minimization problem to the jet space and develop again an ADMM approach. In numerical experiments, we show that the proposed algorithm achieves lower energy values as well as faster runtimes than the method of comparison, which is based on the iterative application of the graph cut algorithm (with α-expansion moves)

    Public policy modeling and applications

    Full text link
    • …
    corecore