5 research outputs found
Efficient Computation of Subspace Skyline over Categorical Domains
Platforms such as AirBnB, Zillow, Yelp, and related sites have transformed
the way we search for accommodation, restaurants, etc. The underlying datasets
in such applications have numerous attributes that are mostly Boolean or
Categorical. Discovering the skyline of such datasets over a subset of
attributes would identify entries that stand out while enabling numerous
applications. There are only a few algorithms designed to compute the skyline
over categorical attributes, yet are applicable only when the number of
attributes is small.
In this paper, we place the problem of skyline discovery over categorical
attributes into perspective and design efficient algorithms for two cases. (i)
In the absence of indices, we propose two algorithms, ST-S and ST-P, that
exploits the categorical characteristics of the datasets, organizing tuples in
a tree data structure, supporting efficient dominance tests over the candidate
set. (ii) We then consider the existence of widely used precomputed sorted
lists. After discussing several approaches, and studying their limitations, we
propose TA-SKY, a novel threshold style algorithm that utilizes sorted lists.
Moreover, we further optimize TA-SKY and explore its progressive nature, making
it suitable for applications with strict interactive requirements. In addition
to the extensive theoretical analysis of the proposed algorithms, we conduct a
comprehensive experimental evaluation of the combination of real (including the
entire AirBnB data collection) and synthetic datasets to study the practicality
of the proposed algorithms. The results showcase the superior performance of
our techniques, outperforming applicable approaches by orders of magnitude
RRR: Rank-Regret Representative
Selecting the best items in a dataset is a common task in data exploration.
However, the concept of "best" lies in the eyes of the beholder: different
users may consider different attributes more important, and hence arrive at
different rankings. Nevertheless, one can remove "dominated" items and create a
"representative" subset of the data set, comprising the "best items" in it. A
Pareto-optimal representative is guaranteed to contain the best item of each
possible ranking, but it can be almost as big as the full data. Representative
can be found if we relax the requirement to include the best item for every
possible user, and instead just limit the users' "regret". Existing work
defines regret as the loss in score by limiting consideration to the
representative instead of the full data set, for any chosen ranking function.
However, the score is often not a meaningful number and users may not
understand its absolute value. Sometimes small ranges in score can include
large fractions of the data set. In contrast, users do understand the notion of
rank ordering. Therefore, alternatively, we consider the position of the items
in the ranked list for defining the regret and propose the {\em rank-regret
representative} as the minimal subset of the data containing at least one of
the top- of any possible ranking function. This problem is NP-complete. We
use the geometric interpretation of items to bound their ranks on ranges of
functions and to utilize combinatorial geometry notions for developing
effective and efficient approximation algorithms for the problem. Experiments
on real datasets demonstrate that we can efficiently find small subsets with
small rank-regrets
On Obtaining Stable Rankings
Decision making is challenging when there is more than one criterion to
consider. In such cases, it is common to assign a goodness score to each item
as a weighted sum of its attribute values and rank them accordingly. Clearly,
the ranking obtained depends on the weights used for this summation. Ideally,
one would want the ranked order not to change if the weights are changed
slightly. We call this property {\em stability} of the ranking. A consumer of a
ranked list may trust the ranking more if it has high stability. A producer of
a ranked list prefers to choose weights that result in a stable ranking, both
to earn the trust of potential consumers and because a stable ranking is
intrinsically likely to be more meaningful. In this paper, we develop a
framework that can be used to assess the stability of a provided ranking and to
obtain a stable ranking within an "acceptable" range of weight values (called
"the region of interest"). We address the case where the user cares about the
rank order of the entire set of items, and also the case where the user cares
only about the top- items. Using a geometric interpretation, we propose
algorithms that produce stable rankings. In addition to theoretical analyses,
we conduct extensive experiments on real datasets that validate our proposal