Guruswami-Sinop Rounding without Higher Level Lasserre
Guruswami and Sinop give an O(1/delta) approximation guarantee for the non-uniform Sparsest Cut problem by solving O(r) levels of the Lasserre semidefinite programming hierarchy, provided that the generalized eigenvalues of the Laplacians of the cost and demand graphs satisfy a certain spectral condition, namely, that the (r+1)-th generalized eigenvalue is at least OPT/(1-delta). Their key idea is a rounding technique that first maps a vector-valued solution to [0,1] using appropriately scaled projections onto Lasserre vectors. In this paper, we show that similar projections and analysis can be obtained using only l_2^2 triangle inequality constraints. This yields an O(r/delta^2) approximation guarantee for the non-uniform Sparsest Cut problem by adding only l_2^2 triangle inequality constraints to the usual semidefinite program, provided that the same spectral condition holds, i.e., the (r+1)-th generalized eigenvalue is at least OPT/(1-delta).
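In the abstract's notation, the spectral condition and the two resulting guarantees can be summarized as follows (writing $\lambda_{r+1}$ for the $(r+1)$-th generalized eigenvalue of the cost and demand Laplacians):

```latex
\[
\lambda_{r+1}(L_{\mathrm{cost}}, L_{\mathrm{demand}}) \;\ge\; \frac{\mathrm{OPT}}{1-\delta}
\quad\Longrightarrow\quad
\begin{cases}
O(1/\delta)\text{-approximation} & \text{via the } O(r)\text{-level Lasserre SDP,}\\
O(r/\delta^{2})\text{-approximation} & \text{via the basic SDP plus } \ell_2^2 \text{ triangle inequalities.}
\end{cases}
\]
```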
Improved Outlier Robust Seeding for k-means
$k$-means is a popular clustering objective, although it is inherently
non-robust and sensitive to outliers. Its popular seeding or initialization,
called $k$-means++, uses $D^{2}$ sampling and comes with a provable $O(\log k)$
approximation guarantee \cite{AV2007}. However, in the presence of adversarial
noise or outliers, $D^{2}$ sampling is more likely to pick centers from distant
outliers instead of inlier clusters, and therefore its approximation guarantee
\textit{w.r.t.}\ the $k$-means solution on the inliers does not hold.
Assuming that the outliers constitute a constant fraction of the given data,
we propose a simple variant of the $D^{2}$ sampling distribution that makes it
robust to the outliers. Our algorithm runs in time linear in the number of
points $n$ and the dimension $d$, outputs $O(k)$ clusters, discards marginally
more points than the optimal number of outliers, and comes with a provable
approximation guarantee. Our algorithm can also be modified to output exactly
$k$ clusters instead of $O(k)$ clusters, while keeping its running time linear
in $n$ and $d$. This is an improvement over previous results for robust
$k$-means based on LP relaxation and rounding \cite{Charikar},
\cite{KrishnaswamyLS18} and \textit{robust $k$-means++} \cite{DeshpandeKP20}.
Our empirical results show the advantage of our algorithm over
$k$-means++~\cite{AV2007}, uniform random seeding, greedy sampling for
$k$-means~\cite{tkmeanspp}, and robust $k$-means++~\cite{DeshpandeKP20}, on
standard real-world and synthetic data sets used in previous work. Our
proposal is easily amenable to the scalable, faster, parallel implementations
of $k$-means++ \cite{Bahmani,BachemL017} and is of independent interest for
coreset constructions in the presence of outliers
\cite{feldman2007ptas,langberg2010universal,feldman2011unified}.
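As a rough illustration of the kind of seeding change involved (not the paper's exact distribution): plain $k$-means++ picks each new center with probability proportional to $D(x)^2$, the squared distance to the nearest chosen center, which lets a single far-away outlier absorb almost all of the probability mass. The sketch below caps each point's weight at a threshold, a hypothetical stand-in for the paper's robustified distribution; `robust_seeding` and `threshold` are illustrative names, not the authors' API.

```python
import random

def d2(point, centers):
    """Squared distance from `point` to its nearest chosen center."""
    return min(sum((p - c) ** 2 for p, c in zip(point, ctr)) for ctr in centers)

def robust_seeding(data, k, threshold, rng=random.Random(0)):
    """k-means++-style seeding with capped D^2 weights.

    Capping each weight at `threshold` (a hypothetical robustification,
    not the paper's exact distribution) prevents distant outliers from
    receiving almost all of the sampling probability.
    """
    centers = [rng.choice(data)]
    while len(centers) < k:
        weights = [min(d2(x, centers), threshold) for x in data]
        total = sum(weights)
        if total == 0:  # all points coincide with chosen centers
            centers.append(rng.choice(data))
            continue
        r = rng.uniform(0, total)  # sample proportionally to capped weights
        acc = 0.0
        for x, w in zip(data, weights):
            acc += w
            if acc >= r:
                centers.append(x)
                break
    return centers
```

With a small threshold, an extreme outlier such as (1000, 1000) contributes no more weight than any moderately distant inlier, removing the failure mode described above.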
Mining Query Plans for Finding Candidate Queries and Sub-Queries for Materialized Views in BI Systems Without Cube Generation
Materialized views are important for optimizing Business Intelligence (BI) systems that are designed without data cubes. Selecting candidate queries for materialized views from a large number of queries is a challenging task. Most past work identifies frequent queries in the historical workload and creates materialized views from them, either by manually analyzing the workload or by applying approximate string matching to the query text. Moreover, most existing methods suggest only complete queries and ignore query components such as sub-queries when creating materialized views. This paper presents a novel method to determine on which queries and query components materialized views should be created in order to optimize aggregate and join queries, by mining a database of query execution plans represented as binary trees. The proposed algorithm optimizes significantly more queries because it selects the queries to be optimized using their execution-plan trees rather than the query text used by traditional methods. To select the right set of queries, the paper proposes an efficient, specialized frequent-tree-component mining algorithm with novel heuristics to prune the search space. The mined frequent components determine the candidate queries on which materialized views are created. Experiments on standard, real, and synthetic data sets, together with the theoretical analysis, show that the proposed method optimizes a large number of queries with fewer materialized views and achieves a significant performance improvement over traditional methods.
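The core idea of mining frequent components from execution-plan trees can be sketched as follows; this is a simplified illustration, not the paper's algorithm (in particular, it has none of the pruning heuristics). Plans are binary trees `(operator, left, right)`, every subtree is serialized to a canonical string, and any subtree occurring in at least `min_support` distinct plans becomes a materialized-view candidate.

```python
from collections import Counter

def collect_subtrees(node, out):
    """Serialize `node` canonically and record every subtree rooted in it."""
    if node is None:
        return "-"
    op, left, right = node
    s = f"({op} {collect_subtrees(left, out)} {collect_subtrees(right, out)})"
    out.add(s)  # a set: count each distinct component once per plan
    return s

def frequent_components(plans, min_support):
    """Subtrees appearing in at least `min_support` distinct plans."""
    counts = Counter()
    for plan in plans:
        seen = set()
        collect_subtrees(plan, seen)
        counts.update(seen)
    return {s for s, c in counts.items() if c >= min_support}
```

For example, two plans that aggregate and sort the same join both contain the subtree `(JOIN (SCAN A - -) (SCAN B - -))`, so that join is reported as a shared candidate even though the full query texts differ, which is exactly what text-matching approaches miss.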
The Importance of Modeling Data Missingness in Algorithmic Fairness: A Causal Perspective
Training datasets for machine learning often have some form of missingness.
For example, to learn a model for deciding whom to give a loan, the available
training data includes individuals who were given a loan in the past, but not
those who were not. This missingness, if ignored, nullifies any fairness
guarantee of the training procedure when the model is deployed. Using causal
graphs, we characterize the missingness mechanisms in different real-world
scenarios. We show conditions under which various distributions, used in
popular fairness algorithms, can or cannot be recovered from the training
data. Our theoretical results imply that many of these algorithms cannot
guarantee fairness in practice. Modeling missingness also helps to identify
correct design principles for fair algorithms. For example, in multi-stage
settings where decisions are made in multiple screening rounds, we use our
framework to derive the minimal distributions required to design a fair
algorithm. Our proposed algorithm decentralizes the decision-making process and
still achieves similar performance to the optimal algorithm that requires
centralization and non-recoverable distributions.

Comment: To appear in the Proceedings of AAAI 202
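The loan example can be made concrete with a tiny simulation (all numbers and the policy rule are hypothetical): repayment outcomes are observed only for applicants whom a past policy approved, so any quantity estimated from the observed rows, including the per-group statistics that fairness constraints rely on, is computed against the wrong distribution.

```python
# Hypothetical population of applicants: (score, group, repaid).
# Outcomes are observed only when a past policy approved the applicant
# ("selective labels"), so the training set is a biased sample.
population = [
    (0.9, "A", 1), (0.8, "A", 1), (0.4, "A", 0), (0.3, "A", 0),
    (0.9, "B", 1), (0.7, "B", 1), (0.4, "B", 1), (0.2, "B", 0),
]

def past_policy(score, group):
    # The past policy approved group A more readily -- the source of bias.
    return score >= (0.5 if group == "A" else 0.8)

observed = [(s, g, y) for s, g, y in population if past_policy(s, g)]

def repay_rate(rows, group):
    ys = [y for _, g, y in rows if g == group]
    return sum(ys) / len(ys)

# Per-group repayment rate: full population vs. what training data shows.
full_B = repay_rate(population, "B")  # 3 of 4 repaid -> 0.75
obs_B = repay_rate(observed, "B")     # only score >= 0.8 observed -> 1.0
```

The observed repayment rate for group B is inflated because only its strongest applicants were ever approved; a model or fairness constraint fit to `observed` alone never sees the rejected applicants whose outcomes are missing.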
Embedding Approximately Low-Dimensional l_2^2 Metrics into l_1
Goemans showed that any n points x_1, ..., x_n in d dimensions satisfying l_2^2 triangle inequalities can be embedded into l_1 with worst-case distortion at most sqrt{d}. We consider an extension of this theorem to the case when the points are approximately low-dimensional, as opposed to exactly low-dimensional, and prove the following analogous theorem, albeit with average distortion guarantees: there exists an l_2^2-to-l_1 embedding with average distortion at most the stable rank, sr(M), of the matrix M consisting of the columns {x_i - x_j}_{i<j}. Average distortion embeddings suffice for applications such as the SPARSEST CUT problem. Our embedding gives an approximation algorithm for the SPARSEST CUT problem on low threshold-rank graphs, where earlier work was inspired by the Lasserre SDP hierarchy, and improves on a previous result of the first and third author [Deshpande and Venkat, in Proc. 17th APPROX, 2014]. Our ideas give a new perspective on the l_2^2 metric, an alternate proof of Goemans' theorem, and a simpler proof for average distortion sqrt{d}.
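For reference, the stable rank appearing in the bound is the standard quantity below; since it never exceeds the rank of $M$, and the columns $x_i - x_j$ live in $\mathbb{R}^d$, it is at most $d$, so the average-distortion bound degrades gracefully as the points become only approximately low-dimensional:

```latex
\[
\mathrm{sr}(M) \;=\; \frac{\|M\|_F^{2}}{\|M\|_{2}^{2}} \;\le\; \mathrm{rank}(M) \;\le\; d,
\qquad
M = \bigl[\, x_i - x_j \,\bigr]_{i<j}.
\]
```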