RRR: Rank-Regret Representative
Selecting the best items in a dataset is a common task in data exploration.
However, the concept of "best" lies in the eyes of the beholder: different
users may consider different attributes more important, and hence arrive at
different rankings. Nevertheless, one can remove "dominated" items and create a
"representative" subset of the data set, comprising the "best items" in it. A
Pareto-optimal representative is guaranteed to contain the best item of each
possible ranking, but it can be almost as big as the full data set. A smaller
representative can be found if we relax the requirement to include the best item for every
possible user, and instead just limit the users' "regret". Existing work
defines regret as the loss in score by limiting consideration to the
representative instead of the full data set, for any chosen ranking function.
However, the score is often not a meaningful number and users may not
understand its absolute value. Sometimes small ranges in score can include
large fractions of the data set. In contrast, users do understand the notion of
rank ordering. Therefore, alternatively, we consider the position of the items
in the ranked list for defining the regret and propose the {\em rank-regret
representative} as the minimal subset of the data containing at least one of
the top-$k$ items of any possible ranking function. This problem is NP-complete. We
use the geometric interpretation of items to bound their ranks on ranges of
functions and to utilize combinatorial geometry notions for developing
effective and efficient approximation algorithms for the problem. Experiments
on real datasets demonstrate that we can efficiently find small subsets with
small rank-regrets.
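To make the rank-based regret notion concrete, here is a minimal sketch (data and names are hypothetical, not the authors' code) that computes, for one linear ranking function, the best rank any member of a candidate subset attains in the ranking of the full data set:

```python
import numpy as np

def rank_regret(P, Q_idx, w):
    """Rank-regret of the subset given by row indices Q_idx of P, under
    one linear ranking function w: the best (smallest) rank that any item
    of the subset attains in the full ranking of P (1 = top)."""
    scores = P @ w                       # score of every item in P
    order = np.argsort(-scores)          # item indices, best first
    ranks = np.empty(len(P), dtype=int)
    ranks[order] = np.arange(1, len(P) + 1)
    return int(ranks[list(Q_idx)].min())

# Toy data: 4 items with 2 attributes; the subset holds only item 2.
P = np.array([[0.9, 0.1],
              [0.1, 0.9],
              [0.6, 0.6],
              [0.2, 0.2]])
print(rank_regret(P, [2], np.array([1.0, 0.0])))  # rank of item 2 when
                                                  # only attribute 0 counts
```

Here the single "balanced" item 2 achieves rank 2 under either extreme weight vector and rank 1 under the equal-weight vector, so this one-item subset already has a small rank-regret over those functions.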
Using Set Covering to Generate Databases for Holistic Steganalysis
Within an operational framework, covers used by a steganographer are likely
to come from different sensors and different processing pipelines than the ones
used by researchers for training their steganalysis models. Thus, a performance
gap is unavoidable when it comes to out-of-distribution covers, an extremely
frequent scenario called Cover Source Mismatch (CSM). Here, we explore a grid
of processing pipelines to study the origins of CSM, to better understand it,
and to better tackle it. A set-covering greedy algorithm is used to select
representative pipelines minimizing the maximum regret between the
representative and the pipelines within the set. Our main contribution is a
methodology for generating relevant bases able to tackle operational CSM.
Experimental validation highlights that, for a given number of training
samples, our set covering selection is a better strategy than selecting random
pipelines or using all the available pipelines. Our analysis also shows that
parameters such as denoising, sharpening, and downsampling are very important to
foster diversity. Finally, different benchmarks for classical and wild
databases show the good generalization property of the extracted databases.
Additional resources are available at
github.com/RonyAbecidan/HolisticSteganalysisWithSetCovering
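The greedy set-covering selection described above can be sketched roughly as follows. This is a toy illustration, not the authors' code: the regret matrix, coverage threshold, and budget are all hypothetical, and the sketch uses plain coverage maximization rather than the paper's max-regret objective.

```python
import numpy as np

def greedy_pipeline_cover(regret, threshold, budget):
    """Greedy set cover over processing pipelines (illustrative sketch).
    regret[i, j] is the regret incurred when pipeline j is represented by
    a model trained on pipeline i; j counts as covered by i when that
    regret is at most `threshold`. Each round picks the pipeline that
    covers the most still-uncovered pipelines."""
    n = regret.shape[0]
    covered = np.zeros(n, dtype=bool)
    chosen = []
    while len(chosen) < budget and not covered.all():
        # For each candidate i, count uncovered pipelines it would cover.
        gains = ((regret <= threshold) & ~covered).sum(axis=1)
        best = int(np.argmax(gains))
        if gains[best] == 0:             # nothing new can be covered
            break
        chosen.append(best)
        covered |= regret[best] <= threshold
    return chosen

# Toy regret matrix with two natural clusters, {0, 1} and {2, 3}.
R = np.array([[0, 1, 5, 5],
              [1, 0, 5, 5],
              [5, 5, 0, 1],
              [5, 5, 1, 0]], dtype=float)
print(greedy_pipeline_cover(R, threshold=1.0, budget=2))
```

With this matrix the greedy picks one representative per cluster, which matches the intuition that a few well-chosen pipelines can stand in for the whole grid.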
A Fully Dynamic Algorithm for k-Regret Minimizing Sets
Selecting a small set of representatives from a large database is important in many applications such as multi-criteria decision making, web search, and recommendation. The k-regret minimizing set (k-RMS) problem was recently proposed for representative tuple discovery. Specifically, for a large database P of tuples with multiple numerical attributes, the k-RMS problem returns a size-r subset Q of P such that, for any possible ranking function, the score of the top-ranked tuple in Q is not much worse than the score of the k-th-ranked tuple in P. Although the k-RMS problem has been extensively studied in the literature, existing methods are designed for the static setting and cannot maintain the result efficiently when the database is updated. To address this issue, we propose the first fully-dynamic algorithm for the k-RMS problem that can efficiently provide the up-to-date result w.r.t. any tuple insertion and deletion in the database with a provable guarantee. Experimental results on several real-world and synthetic datasets demonstrate that our algorithm runs up to four orders of magnitude faster than existing k-RMS algorithms while providing results of nearly equal quality.
Peer reviewed
Efficient Algorithms for k-Regret Minimizing Sets
A regret minimizing set Q is a small size representation of a much larger database P so that user queries executed on Q return answers whose scores are not much worse than those on the full dataset. In particular, a k-regret minimizing set has the property that the regret ratio between the score of the top-1 item in Q and the score of the top-k item in P is minimized, where the score of an item is the inner product of the item's attributes with a user's weight (preference) vector. The problem is challenging because we want to find a single representative set Q whose regret ratio is small with respect to all possible user weight vectors.
We show that k-regret minimization is NP-Complete for all dimensions d>=3, settling an open problem from Chester et al. [VLDB 2014]. Our main algorithmic contributions are two approximation algorithms, both with provable guarantees, one based on coresets and another based on hitting sets. We perform extensive experimental evaluation of our algorithms, using both real-world and synthetic data, and compare their performance against the solution proposed in [VLDB 2014]. The results show that our algorithms are significantly faster and scalable to much larger sets than the greedy algorithm of Chester et al. for answers of comparable quality.
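To make the k-regret ratio concrete, the sketch below (hypothetical names and data; the papers bound the ratio over all nonnegative weight vectors, not a finite sample) evaluates a subset against a handful of sampled user weight vectors:

```python
import numpy as np

def k_regret_ratio(P, Q, k, weights):
    """Worst-case k-regret ratio of subset Q over a finite sample of
    user weight vectors: 1 - top1_score(Q, w) / topk_score(P, w),
    clipped at zero. Illustrative sketch only."""
    worst = 0.0
    for w in weights:
        top1_q = float((Q @ w).max())
        topk_p = float(np.sort(P @ w)[-k])   # k-th highest score in P
        worst = max(worst, max(0.0, 1.0 - top1_q / topk_p))
    return worst

P = np.array([[0.9, 0.1], [0.1, 0.9], [0.6, 0.6], [0.2, 0.2]])
Q = P[[2]]                                   # subset = the "balanced" item
ws = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.5, 0.5])]
print(k_regret_ratio(P, Q, k=1, weights=ws))
```

Under the two extreme weight vectors the subset loses 0.6 versus 0.9, a regret ratio of 1/3, and under the balanced vector it loses nothing, so the sampled worst case is 1/3.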
Happiness Maximizing Sets under Group Fairness Constraints (Technical Report)
Finding a happiness maximizing set (HMS) from a database, i.e., selecting a
small subset of tuples that preserves the best score with respect to any
nonnegative linear utility function, is an important problem in multi-criteria
decision-making. When an HMS is extracted from a set of individuals to assist
data-driven algorithmic decisions such as hiring and admission, it is crucial
to ensure that the HMS can fairly represent different groups of candidates
without bias and discrimination. However, although the HMS problem was
extensively studied in the database community, existing algorithms do not take
group fairness into account and may provide solutions that under-represent some
groups.
In this paper, we propose and investigate a fair variant of HMS (FairHMS)
that not only maximizes the minimum happiness ratio but also guarantees that
the number of tuples chosen from each group falls within predefined lower and
upper bounds. Similar to the vanilla HMS problem, we show that FairHMS is
NP-hard in three and higher dimensions. Therefore, we first propose an exact
interval cover-based algorithm called IntCov for FairHMS on two-dimensional
databases. Then, we propose a bicriteria approximation algorithm called
BiGreedy for FairHMS on multi-dimensional databases by transforming it into a
submodular maximization problem under a matroid constraint. We also design an
adaptive sampling strategy to improve the practical efficiency of BiGreedy.
Extensive experiments on real-world and synthetic datasets confirm the efficacy
and efficiency of our proposal.
Comment: Technical report; a shorter version to appear in PVLDB 16(2).