
    Guruswami-Sinop Rounding without Higher Level Lasserre

    Guruswami and Sinop give an O(1/delta) approximation guarantee for the non-uniform Sparsest Cut problem by solving O(r) levels of Lasserre semidefinite constraints, provided that the generalized eigenvalues of the Laplacians of the cost and demand graphs satisfy a certain spectral condition, namely, that the (r+1)-th generalized eigenvalue is at least OPT/(1-delta). Their key idea is a rounding technique that first maps a vector-valued solution to [0,1] using appropriately scaled projections onto Lasserre vectors. In this paper, we show that similar projections and analysis can be obtained using only l_2^2 triangle inequality constraints. This yields an O(r/delta^2) approximation guarantee for the non-uniform Sparsest Cut problem by adding only l_2^2 triangle inequality constraints to the usual semidefinite program, provided that the same spectral condition holds, i.e., the (r+1)-th generalized eigenvalue is at least OPT/(1-delta).
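
    The scaled-projection idea in this abstract can be illustrated with a small sketch. The following is a minimal, hypothetical illustration, not the authors' actual rounding procedure: each SDP vector is projected onto the span of a chosen seed set, and the squared projected norm, scaled by the vector's own squared norm, lands in [0,1] and can then be threshold-rounded. The function name, the seed-set choice, and the thresholding step are all invented for illustration.

    ```python
    import numpy as np

    def scaled_projection_rounding(V, seeds, threshold=0.5):
        """Illustrative rounding sketch: map SDP vectors to [0,1] via
        scaled projections onto a seed subspace, then threshold.

        V     : (n, D) array whose rows are the SDP solution vectors v_i.
        seeds : indices of the vectors spanning the projection subspace.
        """
        # Orthonormal basis Q for span{v_s : s in seeds}.
        Q, _ = np.linalg.qr(V[seeds].T)          # Q has shape (D, |seeds|)
        proj_sq = np.sum((V @ Q) ** 2, axis=1)   # ||P_S v_i||^2 for each i
        norms_sq = np.sum(V ** 2, axis=1)
        # Projection can only shrink norms, so each ratio lies in [0, 1].
        y = np.divide(proj_sq, norms_sq, out=np.zeros_like(proj_sq),
                      where=norms_sq > 0)
        return y >= threshold                    # one side of the rounded cut

    # Toy usage on random unit vectors.
    rng = np.random.default_rng(0)
    V = rng.normal(size=(10, 5))
    V /= np.linalg.norm(V, axis=1, keepdims=True)
    print(scaled_projection_rounding(V, seeds=[0, 1]))
    ```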

    Improved Outlier Robust Seeding for k-means

    The k-means objective is a popular clustering objective, although it is inherently non-robust and sensitive to outliers. Its popular seeding or initialization, called k-means++, uses D^2 sampling and comes with a provable O(log k) approximation guarantee [AV2007]. However, in the presence of adversarial noise or outliers, D^2 sampling is more likely to pick centers from distant outliers instead of inlier clusters, and therefore its approximation guarantee w.r.t. the k-means solution on the inliers does not hold. Assuming that the outliers constitute a constant fraction of the given data, we propose a simple variant of the D^2 sampling distribution, which makes it robust to outliers. Our algorithm runs in O(ndk) time, outputs O(k) clusters, discards marginally more points than the optimal number of outliers, and comes with a provable O(1) approximation guarantee. Our algorithm can also be modified to output exactly k clusters instead of O(k) clusters, while keeping its running time linear in n and d. This is an improvement over previous results for robust k-means based on LP relaxation and rounding [Charikar, KrishnaswamyLS18] and robust k-means++ [DeshpandeKP20]. Our empirical results show the advantage of our algorithm over k-means++ [AV2007], uniform random seeding, greedy sampling for k-means [tkmeanspp], and robust k-means++ [DeshpandeKP20], on standard real-world and synthetic data sets used in previous work. Our proposal is easily amenable to scalable, faster, parallel implementations of k-means++ [Bahmani, BachemL017] and is of independent interest for coreset constructions in the presence of outliers [feldman2007ptas, langberg2010universal, feldman2011unified].
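
    As a rough illustration of D^2 sampling and one way to make it less outlier-sensitive, here is a minimal sketch. Mixing the D^2 distribution with a uniform distribution via a parameter `alpha` is an assumption made for illustration, not necessarily the exact variant proposed in the paper.

    ```python
    import numpy as np

    def robust_d2_seeding(X, k, alpha=0.5, rng=None):
        """Sketch of D^2 (k-means++) seeding with a uniform-mixing tweak.

        X     : (n, d) data matrix.
        k     : number of centers to pick.
        alpha : weight on the D^2 term; alpha=1 recovers plain k-means++.
                This mixing form is an illustrative assumption.
        """
        rng = np.random.default_rng(rng)
        n = X.shape[0]
        centers = [X[rng.integers(n)]]        # first center uniformly at random
        for _ in range(k - 1):
            # Squared distance from each point to its nearest chosen center.
            diffs = X[:, None, :] - np.array(centers)[None]
            d2 = np.min((diffs ** 2).sum(-1), axis=1)
            # Mix the D^2 distribution with uniform so a single far-away
            # outlier cannot dominate the sampling probabilities.
            p = alpha * d2 / d2.sum() + (1 - alpha) / n
            centers.append(X[rng.choice(n, p=p)])
        return np.array(centers)

    # Toy usage: two inlier clusters plus one far-away outlier.
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 1, (50, 2)),
                   rng.normal(10, 1, (50, 2)),
                   [[1000.0, 1000.0]]])
    print(robust_d2_seeding(X, k=2, alpha=0.5, rng=1))
    ```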

    Mining Query Plans for Finding Candidate Queries and Sub-Queries for Materialized Views in BI Systems Without Cube Generation

    Materialized views are important for optimizing Business Intelligence (BI) systems that are designed without data cubes. Selecting candidate queries for materialized views from a large number of queries is a challenging task. Most past work finds frequent queries in the historical workload and creates materialized views from them, either by manually analyzing the workload or by applying approximate string matching to the query text. Most existing methods suggest complete queries but ignore query components, such as sub-queries, when creating materialized views. This paper presents a novel method to determine on which queries and query components materialized views should be created to optimize aggregate and join queries, by mining a database of query execution plans represented as binary trees. The proposed algorithm shows a significant improvement in the number of optimized queries because it selects queries to optimize using the query's execution plan tree rather than the query text used by traditional methods. To select a correct set of queries to optimize with materialized views, the paper proposes an efficient, specialized frequent-tree-component mining algorithm with novel heuristics to prune the search space. The frequent components found are used to determine the set of candidate queries for the creation of materialized views. Experiments on standard, real, and synthetic data sets, together with the theoretical analysis, show that the proposed method optimizes a large number of queries with fewer materialized views and achieves a significant performance improvement over traditional methods.
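
    To make the frequent-tree-component idea concrete, here is a minimal sketch, assuming plan trees are nested tuples of the form (operator, left, right). The enumeration and the support threshold are illustrative; the paper's specialized algorithm and pruning heuristics are not reproduced here.

    ```python
    from collections import Counter

    def subtrees(plan):
        """Yield every subtree of a binary plan tree.

        A plan is either a leaf string (a base table) or a tuple
        (operator, left_child, right_child).
        """
        yield plan
        if isinstance(plan, tuple):
            _, left, right = plan
            yield from subtrees(left)
            yield from subtrees(right)

    def frequent_components(plans, min_support):
        """Count subtree occurrences across a workload of plan trees and
        keep those meeting the support threshold -- candidate components
        for materialized views."""
        counts = Counter(t for plan in plans for t in subtrees(plan))
        return {t: c for t, c in counts.items()
                if c >= min_support and isinstance(t, tuple)}

    # Toy workload: two plans sharing a join component.
    join_ab = ("JOIN", "A", "B")
    plans = [("AGG", join_ab, "C"), ("JOIN", join_ab, "D")]
    print(frequent_components(plans, min_support=2))
    # {('JOIN', 'A', 'B'): 2}
    ```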

    The Importance of Modeling Data Missingness in Algorithmic Fairness: A Causal Perspective

    Training datasets for machine learning often exhibit some form of missingness. For example, to learn a model for deciding whom to give a loan, the available training data includes individuals who were given a loan in the past, but not those who were denied one. This missingness, if ignored, nullifies any fairness guarantee of the training procedure when the model is deployed. Using causal graphs, we characterize the missingness mechanisms in different real-world scenarios. We show conditions under which various distributions, used in popular fairness algorithms, can or cannot be recovered from the training data. Our theoretical results imply that many of these algorithms cannot guarantee fairness in practice. Modeling missingness also helps to identify correct design principles for fair algorithms. For example, in multi-stage settings where decisions are made in multiple screening rounds, we use our framework to derive the minimal distributions required to design a fair algorithm. Our proposed algorithm decentralizes the decision-making process and still achieves performance similar to that of the optimal algorithm, which requires centralization and non-recoverable distributions.

    Comment: To appear in the Proceedings of AAAI 202
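
    The loan example can be made concrete with a small simulation. A minimal sketch, entirely hypothetical in its data-generating assumptions: repayment outcomes are only observed for applicants approved by a biased historical policy, so statistics computed on the observed rows differ from those on the full population.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000

    # Hypothetical population: a protected attribute and a credit score.
    group = rng.integers(0, 2, n)                       # 0 or 1
    score = rng.normal(0, 1, n)
    repays = rng.random(n) < 1 / (1 + np.exp(-score))   # depends on score only

    # The historical policy approved loans using score AND group (biased),
    # so repayment labels are missing for everyone it rejected.
    approved = score + 0.8 * (group == 0) + rng.normal(0, 0.5, n) > 0.5

    # Repayment rate per group: full population vs. observed (approved) rows.
    for g in (0, 1):
        full = repays[group == g].mean()
        observed = repays[approved & (group == g)].mean()
        print(f"group {g}: full-population rate {full:.3f}, "
              f"observed-only rate {observed:.3f}")
    ```

    Because the historical policy conditions on the group, the observed repayment rates diverge across groups even though the true repayment mechanism depends only on the score, which is exactly the kind of distortion the abstract argues nullifies fairness guarantees when missingness is ignored.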

    Embedding Approximately Low-Dimensional l_2^2 Metrics into l_1

    Goemans showed that any n points x_1, ..., x_n in d dimensions satisfying l_2^2 triangle inequalities can be embedded into l_1 with worst-case distortion at most sqrt{d}. We consider an extension of this theorem to the case when the points are approximately low-dimensional, as opposed to exactly low-dimensional, and prove the following analogous theorem, albeit with average distortion guarantees: there exists an l_2^2-to-l_1 embedding with average distortion at most the stable rank, sr(M), of the matrix M consisting of the columns {x_i - x_j}_{i<j}. Average distortion embeddings suffice for applications such as the SPARSEST CUT problem. Our embedding gives an approximation algorithm for the SPARSEST CUT problem on low threshold-rank graphs, where earlier work was inspired by the Lasserre SDP hierarchy, and improves on a previous result of the first and third author [Deshpande and Venkat, in Proc. 17th APPROX, 2014]. Our ideas give a new perspective on the l_2^2 metric, an alternate proof of Goemans' theorem, and a simpler proof of the average distortion sqrt{d} bound.
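
    For reference, the stable rank in this bound has a standard closed form, sr(M) = ||M||_F^2 / ||M||_2^2 (squared Frobenius norm over squared spectral norm). The snippet below computes it for the pairwise-difference matrix described in the abstract; the variable names and the toy data are chosen purely for illustration.

    ```python
    import numpy as np
    from itertools import combinations

    def stable_rank(M):
        """sr(M) = ||M||_F^2 / ||M||_2^2; always between 1 and rank(M)."""
        fro_sq = np.sum(M ** 2)
        spec_sq = np.linalg.norm(M, ord=2) ** 2   # largest singular value, squared
        return fro_sq / spec_sq

    # Points that are approximately 2-dimensional inside R^5.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 5))
    X[:, 2:] *= 0.05                              # little energy off the top plane

    # Matrix whose columns are the pairwise differences x_i - x_j, i < j.
    M = np.stack([X[i] - X[j] for i, j in combinations(range(len(X)), 2)], axis=1)
    print(f"stable rank of the difference matrix: {stable_rank(M):.2f}")
    ```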