Assessing and Remedying Coverage for a Given Dataset
Data analysis impacts virtually every aspect of our society today. Often,
this analysis is performed on an existing dataset, possibly collected through a
process that the data scientists had limited control over. The existing data
analyzed may not include the complete universe, but it is expected to cover the
diversity of items in the universe. Lack of adequate coverage in the dataset
can result in undesirable outcomes such as biased decisions and algorithmic
racism, as well as creating vulnerabilities such as opening up room for
adversarial attacks.
In this paper, we assess the coverage of a given dataset over multiple
categorical attributes. We first provide efficient techniques for traversing
the combinatorial explosion of value combinations to identify any regions of
attribute space not adequately covered by the data. Then, we determine the
least amount of additional data that must be obtained to resolve this lack of
adequate coverage. We confirm the value of our proposal through both
theoretical analyses and comprehensive experiments on real data.
Comment: in ICDE 201
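To make the notion of coverage over categorical attributes concrete, the following is a minimal sketch that enumerates every value combination by brute force. It is not the efficient traversal the abstract refers to. The function name `uncovered_patterns`, the use of `None` as a wildcard, and the `threshold` parameter are assumptions made for illustration.

```python
from itertools import product

def uncovered_patterns(rows, domains, threshold):
    """Return patterns matched by fewer than `threshold` rows.

    A pattern is a tuple with one entry per attribute; `None` is a
    wildcard (hypothetical convention) that matches any value.
    """
    result = []
    # Each position is either a wildcard or one concrete value.
    choices = [[None] + list(domain) for domain in domains]
    for pattern in product(*choices):
        count = sum(
            all(p is None or p == v for p, v in zip(pattern, row))
            for row in rows
        )
        if count < threshold:
            result.append(pattern)
    return result

# Toy data: two categorical attributes with two values each.
rows = [("F", "A"), ("F", "B"), ("M", "A")]
domains = [("F", "M"), ("A", "B")]
gaps = uncovered_patterns(rows, domains, threshold=1)
```

The enumeration visits every combination, so its cost grows exponentially with the number of attributes; this is the combinatorial explosion the abstract mentions.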
Efficient Computation of Subspace Skyline over Categorical Domains
Platforms such as AirBnB, Zillow, Yelp, and related sites have transformed
the way we search for accommodation, restaurants, etc. The underlying datasets
in such applications have numerous attributes that are mostly Boolean or
Categorical. Discovering the skyline of such datasets over a subset of
attributes would identify entries that stand out while enabling numerous
applications. Only a few algorithms are designed to compute the skyline over
categorical attributes, and they are applicable only when the number of
attributes is small.
In this paper, we place the problem of skyline discovery over categorical
attributes into perspective and design efficient algorithms for two cases. (i)
In the absence of indices, we propose two algorithms, ST-S and ST-P, that
exploit the categorical characteristics of the datasets, organizing tuples in
a tree data structure, supporting efficient dominance tests over the candidate
set. (ii) We then consider the existence of widely used precomputed sorted
lists. After discussing several approaches, and studying their limitations, we
propose TA-SKY, a novel threshold style algorithm that utilizes sorted lists.
Moreover, we further optimize TA-SKY and explore its progressive nature, making
it suitable for applications with strict interactive requirements. In addition
to the extensive theoretical analysis of the proposed algorithms, we conduct a
comprehensive experimental evaluation on a combination of real (including the
entire AirBnB data collection) and synthetic datasets to study the practicality
of the proposed algorithms. The results showcase the superior performance of
our techniques, outperforming applicable approaches by orders of magnitude.
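As a point of reference for the dominance tests mentioned above, here is a minimal sketch of a naive quadratic subspace skyline. It is a baseline only, not ST-S, ST-P, or TA-SKY. The function names, the `attrs` parameter, and the convention that larger values are preferred are assumptions made for illustration.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` everywhere and better somewhere.

    Assumes larger values are preferred (e.g. 1 = feature present).
    """
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def subspace_skyline(tuples, attrs):
    """Return tuples not dominated by any other tuple on the attributes `attrs`."""
    def project(t):
        return tuple(t[i] for i in attrs)
    return [
        t for t in tuples
        if not any(dominates(project(o), project(t)) for o in tuples)
    ]

# Toy Boolean listings: (has_wifi, has_parking, allows_pets)
listings = [(1, 1, 0), (1, 0, 0), (0, 1, 1), (0, 0, 1)]
full = subspace_skyline(listings, attrs=[0, 1, 2])
sub = subspace_skyline(listings, attrs=[0, 1])
```

Every tuple is compared against every other tuple, which is exactly the cost that tree-based or sorted-list approaches aim to avoid.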
Discovering the Skyline of Web Databases
Many web databases are "hidden" behind proprietary search interfaces that
enforce the top-k output constraint, i.e., each query returns at most k of
all matching tuples, preferentially selected and returned according to a
proprietary ranking function. In this paper, we initiate research into the
novel problem of skyline discovery over top-k hidden web databases. Since
skyline tuples provide critical insights into the database and include the
top-ranked tuple for every possible ranking function following the monotonic
order of attribute values, skyline discovery from a hidden web database can
enable a wide variety of innovative third-party applications over one or
multiple web databases. Our research in the paper shows that the critical
factor affecting the cost of skyline discovery is the type of search interface
controls provided by the website. As such, we develop efficient algorithms for
the three most popular types, i.e., one-ended range, free range, and point
predicates, and then combine them to support web databases that feature a
mixture of these types. Rigorous theoretical analysis and extensive real-world
online and offline experiments demonstrate the effectiveness of our proposed
techniques and their superiority over baseline solutions.
RRR: Rank-Regret Representative
Selecting the best items in a dataset is a common task in data exploration.
However, the concept of "best" lies in the eyes of the beholder: different
users may consider different attributes more important, and hence arrive at
different rankings. Nevertheless, one can remove "dominated" items and create a
"representative" subset of the data set, comprising the "best items" in it. A
Pareto-optimal representative is guaranteed to contain the best item of each
possible ranking, but it can be almost as big as the full data. A smaller
representative can be found if we relax the requirement to include the best
item for every possible user, and instead just limit the users' "regret".
Existing work
defines regret as the loss in score by limiting consideration to the
representative instead of the full data set, for any chosen ranking function.
However, the score is often not a meaningful number and users may not
understand its absolute value. Sometimes small ranges in score can include
large fractions of the data set. In contrast, users do understand the notion of
rank ordering. Therefore, alternatively, we consider the position of the items
in the ranked list for defining the regret and propose the rank-regret
representative as the minimal subset of the data containing at least one of
the top-k of any possible ranking function. This problem is NP-complete. We
use the geometric interpretation of items to bound their ranks on ranges of
functions and to utilize combinatorial geometry notions for developing
effective and efficient approximation algorithms for the problem. Experiments
on real datasets demonstrate that we can efficiently find small subsets with
small rank-regrets.
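The rank-based regret described above can be illustrated with a minimal sketch that measures, for one ranking function, the best position reached by any item of a candidate subset. It assumes linear ranking functions (weighted sums) and evaluates only a handful of sampled weight vectors, so it yields a lower bound on the true worst-case rank-regret and is not the approximation algorithm from the abstract. All names are hypothetical.

```python
def rank_regret(items, subset, weights):
    """Best (smallest) 1-based rank of any subset item under one linear function."""
    def score(item):
        return sum(w * x for w, x in zip(weights, item))
    ranked = sorted(items, key=score, reverse=True)
    return min(ranked.index(item) + 1 for item in subset)

def sampled_rank_regret(items, subset, weight_samples):
    """Worst rank-regret over the sampled functions (a lower bound only)."""
    return max(rank_regret(items, subset, w) for w in weight_samples)

items = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.6), (0.2, 0.2)]
subset = [(0.6, 0.6)]
samples = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
worst = sampled_rank_regret(items, subset, samples)
```

In this toy example the single balanced item is never worse than second place on the sampled functions, which is the kind of guarantee a rank-regret representative is meant to provide for all ranking functions.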
A Nutritional Label for Rankings
Algorithmic decisions often result in scoring and ranking individuals to
determine credit worthiness, qualifications for college admissions and
employment, and compatibility as dating partners. While automatic and seemingly
objective, ranking algorithms can discriminate against individuals and
protected groups, and exhibit low diversity. Furthermore, ranked results are
often unstable --- small changes in the input data or in the ranking
methodology may lead to drastic changes in the output, making the result
uninformative and easy to manipulate. Similar concerns apply in cases where
items other than individuals are ranked, including colleges, academic
departments, or products.
In this demonstration we present Ranking Facts, a Web-based application that
generates a "nutritional label" for rankings. Ranking Facts is made up of a
collection of visual widgets that implement our latest research results on
fairness, stability, and transparency for rankings, and that communicate
details of the ranking methodology, or of the output, to the end user. We will
showcase Ranking Facts on real datasets from different domains, including
college rankings, criminal risk assessment, and financial services.
Comment: 4 pages, SIGMOD demo, 3 figures, ACM SIGMOD 201
Online Maximum Independent Set of Hyperrectangles
The maximum independent set problem is a classical NP-hard problem in
theoretical computer science. In this work, we study a special case where the
family of graphs considered is restricted to intersection graphs of sets of
axis-aligned hyperrectangles and the input is provided in an online fashion. We
prove bounds on the competitive ratio of an optimal online algorithm under the
adaptive offline, adaptive online, and oblivious adversary models, for several
classes of hyperrectangles and restrictions on the order of the input.
We are the first to present results on this problem under the oblivious
adversary model. We prove bounds on the competitive ratio for unit hypercubes,
σ-bounded hypercubes, unit-volume hypercubes, arbitrary hypercubes, and
arbitrary hyperrectangles, in both arbitrary and non-dominated order. We are
also the first to present results under the adaptive offline and adaptive
online adversary models with input in non-dominated order, proving bounds on
the competitive ratio for the same classes of hyperrectangles; for input in
arbitrary order, we present the first results on σ-bounded hypercubes,
unit-volume hyperrectangles, arbitrary hypercubes, and arbitrary
hyperrectangles. For input in dominating order, we show that the performance of
the naive greedy algorithm matches the performance of an optimal offline
algorithm in all cases. We also give lower bounds on the competitive ratio of a
probabilistic greedy algorithm under the oblivious adversary model. We conclude
by discussing several promising directions for future work.
Comment: 27 pages, 12 figure
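The naive greedy algorithm mentioned above can be sketched as follows: each arriving hyperrectangle is accepted if and only if it is disjoint from everything accepted so far. The representation of a box as a tuple of `(lo, hi)` intervals and the treatment of boxes that merely touch as disjoint are assumptions made for illustration, not details taken from the paper.

```python
def intersects(a, b):
    """True if two axis-aligned boxes overlap in every dimension.

    Each box is a sequence of (lo, hi) intervals, one per dimension.
    Boxes that only share a boundary are treated as disjoint (assumption).
    """
    return all(lo1 < hi2 and lo2 < hi1 for (lo1, hi1), (lo2, hi2) in zip(a, b))

def greedy_online(stream):
    """Accept each arriving box unless it intersects an already accepted one."""
    accepted = []
    for box in stream:
        if not any(intersects(box, kept) for kept in accepted):
            accepted.append(box)
    return accepted

stream = [
    ((0, 2), (0, 2)),
    ((1, 3), (1, 3)),  # overlaps the first box, so it is rejected
    ((2, 4), (0, 2)),  # only touches the first box, so it is accepted
    ((5, 6), (5, 6)),
]
chosen = greedy_online(stream)
```

Decisions are irrevocable, which is what makes the problem online: a rejected box cannot be recovered if later input would have made it a better choice.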
Designing Fair Ranking Schemes
Items from a database are often ranked based on a combination of multiple
criteria. A user may have the flexibility to accept combinations that weigh
these criteria differently, within limits. On the other hand, this choice of
weights can greatly affect the fairness of the produced ranking. In this paper,
we develop a system that helps users choose criterion weights that lead to
greater fairness.
We consider ranking functions that compute the score of each item as a
weighted sum of (numeric) attribute values, and then sort items on their score.
Each ranking function can be expressed as a vector of weights, or as a point in
a multi-dimensional space. For a broad range of fairness criteria, we show how
to efficiently identify regions in this space that satisfy these criteria.
Using this identification method, our system is able to tell users whether
their proposed ranking function satisfies the desired fairness criteria and, if
it does not, to suggest the smallest modification that does. We develop
user-controllable approximation and indexing techniques that are applied
during preprocessing, and support sub-second response times during the online
phase. Our extensive experiments on real datasets demonstrate that our methods
are able to find solutions that satisfy fairness criteria effectively and
efficiently.
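The idea of checking a weight vector against a fairness criterion and suggesting a nearby alternative can be sketched as below. This is a naive grid search over two-attribute weight vectors parameterized by an angle, not the region-identification method from the abstract. The fairness criterion (at least `min_protected` protected items in the top k), the item layout, and all function names are assumptions made for illustration.

```python
import math

def top_k(items, weights, k):
    """Rank items by weighted sum of their attributes and keep the first k."""
    def score(item):
        return sum(w * x for w, x in zip(weights, item["attrs"]))
    return sorted(items, key=score, reverse=True)[:k]

def is_fair(items, weights, k, min_protected):
    """Hypothetical criterion: enough protected items appear in the top k."""
    return sum(it["protected"] for it in top_k(items, weights, k)) >= min_protected

def nearest_fair_angle(items, angle, k, min_protected, steps=90):
    """Scan a grid of angles in [0, pi/2], closest to `angle` first."""
    candidates = [i * (math.pi / 2) / steps for i in range(steps + 1)]
    for theta in sorted(candidates, key=lambda t: abs(t - angle)):
        if is_fair(items, (math.cos(theta), math.sin(theta)), k, min_protected):
            return theta
    return None  # no grid point satisfies the criterion

items = [
    {"attrs": (0.9, 0.1), "protected": False},
    {"attrs": (0.8, 0.2), "protected": False},
    {"attrs": (0.2, 0.9), "protected": True},
    {"attrs": (0.3, 0.8), "protected": True},
]
suggestion = nearest_fair_angle(items, angle=0.0, k=2, min_protected=1)
```

The grid resolution `steps` trades accuracy for speed, loosely mirroring the user-controllable approximation mentioned in the abstract; a real system would avoid re-sorting the data for every candidate.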
