75 research outputs found
Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees
The tasks of extracting (top-$K$) Frequent Itemsets (FI's) and Association
Rules (AR's) are fundamental primitives in data mining and database
applications. Exact algorithms for these problems exist and are widely used,
but their running time is hindered by the need of scanning the entire dataset,
possibly multiple times. High quality approximations of FI's and AR's are
sufficient for most practical uses, and a number of recent works explored the
application of sampling for fast discovery of approximate solutions to the
problems. However, these works do not provide satisfactory performance
guarantees on the quality of the approximation, due to the difficulty of
bounding the probability of under- or over-sampling any one of an unknown
number of frequent itemsets. In this work we circumvent this issue by applying
the statistical concept of \emph{Vapnik-Chervonenkis (VC) dimension} to develop
a novel technique for providing tight bounds on the sample size that guarantees
approximation within user-specified parameters. Our technique applies both to
absolute and to relative approximations of (top-$K$) FI's and AR's. The
resulting sample size is linearly dependent on the VC-dimension of a range
space associated with the dataset to be mined. The main theoretical
contribution of this work is a proof that the VC-dimension of this range space
is upper bounded by an easy-to-compute characteristic quantity of the dataset
which we call \emph{d-index}, and is the maximum integer $d$ such that the
dataset contains at least $d$ transactions of length at least $d$ such that no
one of them is a superset of or equal to another. We show that this bound is
strict for a large class of datasets.
Comment: 19 pages, 7 figures. A shorter version of this paper appeared in the
proceedings of ECML PKDD 201
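The d-index described in the abstract above requires the selected transactions to form an antichain (none contains or equals another). A minimal sketch, assuming we only want a cheap upper bound: if the containment condition is dropped and only duplicate transactions are discarded, the quantity reduces to an h-index-style computation over transaction lengths. The function name `d_index_upper_bound` is hypothetical and this is not the authors' implementation.

```python
from typing import Hashable, Iterable


def d_index_upper_bound(transactions: Iterable[Iterable[Hashable]]) -> int:
    """Largest d such that at least d distinct transactions have length >= d.

    Assumption: this ignores the antichain (no-superset) requirement of the
    d-index, except that equal transactions are counted once, so the result
    can only overestimate the true d-index.
    """
    # Deduplicate: two equal transactions can never sit in the same antichain.
    distinct = {frozenset(t) for t in transactions}
    lengths = sorted((len(t) for t in distinct), reverse=True)

    d = 0
    for rank, length in enumerate(lengths, start=1):
        if length >= rank:
            d = rank  # the `rank` longest transactions all have length >= rank
        else:
            break
    return d


if __name__ == "__main__":
    dataset = [{1, 2, 3}, {1, 2, 3}, {2, 3, 4}, {1, 2}, {5}]
    print(d_index_upper_bound(dataset))
```

Because any valid antichain of d transactions of length at least d is also a set of d distinct transactions of that length, the value returned here is never smaller than the d-index itself.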
Reinforcement Learning with Human Feedback for Realistic Traffic Simulation
In light of the challenges and costs of real-world testing, autonomous
vehicle developers often rely on testing in simulation for the creation of
reliable systems. A key element of effective simulation is the incorporation of
realistic traffic models that align with human knowledge, an aspect that has
proven challenging due to the need to balance realism and diversity. This work
aims to address this by developing a framework that employs reinforcement
learning with human preference (RLHF) to enhance the realism of existing
traffic models. This study also identifies two main challenges: capturing the
nuances of human preferences on realism and the unification of diverse traffic
simulation models. To tackle these issues, we propose using human feedback for
alignment and employ RLHF due to its sample efficiency. We also introduce the
first dataset for realism alignment in traffic modeling to support such
research. Our framework, named TrafficRLHF, demonstrates its proficiency in
generating realistic traffic scenarios that are well-aligned with human
preferences, as corroborated by comprehensive evaluations on the nuScenes
dataset.
Comment: 9 pages, 4 figures
Anytime Discovery of a Diverse Set of Patterns with Monte Carlo Tree Search
The discovery of patterns that accurately discriminate one class label from another remains a challenging data mining task. Subgroup discovery (SD) is one of the frameworks that makes it possible to elicit such interesting patterns from labeled data. A question remains fairly open: how to select an accurate heuristic search technique when exhaustive enumeration of the pattern space is infeasible? Existing approaches make use of beam search, sampling, and genetic algorithms to discover a pattern set that is non-redundant and of high quality w.r.t. a pattern quality measure. We argue that such approaches produce pattern sets that lack diversity: only a few patterns of high quality, and different enough, are discovered. Our main contribution is then to formally define pattern mining as a game and to solve it with Monte Carlo tree search (MCTS). It can be seen as an exhaustive search guided by random simulations which can be stopped early (limited budget) by virtue of its best-first search property. We show through a comprehensive set of experiments how MCTS enables the anytime discovery of a diverse pattern set of high quality. It outperforms other approaches when dealing with a large pattern search space and for different quality measures. Thanks to its genericity, our MCTS approach can be used for SD but also for many other pattern mining tasks.
Analyzing Granger causality in climate data with time series classification methods
Attribution studies in climate science aim to scientifically ascertain the influence of climatic variations on natural or anthropogenic factors. Many of those studies adopt the concept of Granger causality to infer statistical cause-effect relationships, while utilizing traditional autoregressive models. In this article, we investigate the potential of state-of-the-art time series classification techniques to enhance causal inference in climate science. We conduct a comparative experimental study of different types of algorithms on a large test suite that comprises a unique collection of datasets from the area of climate-vegetation dynamics. The results indicate that specialized time series classification methods are able to improve existing inference procedures. Substantial differences are observed among the methods that were tested.
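For orientation, the traditional autoregressive approach to Granger causality mentioned in the abstract above can be sketched as a comparison of two least-squares fits: one predicting a series from its own lags only, and one that also uses lags of the candidate cause. This is a minimal illustration under assumptions (a linear model, a fixed lag order `p`, and the hypothetical function names below), not the procedure evaluated in the article.

```python
import numpy as np


def lagged(series: np.ndarray, p: int) -> np.ndarray:
    """Columns series[t-1], ..., series[t-p] for t = p, ..., n-1."""
    n = len(series)
    return np.column_stack([series[p - k : n - k] for k in range(1, p + 1)])


def granger_f_statistic(x, y, p: int = 2) -> float:
    """F statistic for 'x Granger-causes y' with lag order p (linear AR sketch)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    target = y[p:]
    intercept = np.ones((len(target), 1))

    # Restricted model: y explained by its own past only.
    restricted = np.hstack([intercept, lagged(y, p)])
    # Full model: additionally uses the past of x.
    full = np.hstack([restricted, lagged(x, p)])

    def rss(design: np.ndarray) -> float:
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        return float(resid @ resid)

    rss_restricted, rss_full = rss(restricted), rss(full)
    dof = len(target) - full.shape[1]  # residual degrees of freedom
    return ((rss_restricted - rss_full) / p) / (rss_full / dof)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=500)
    y = np.zeros(500)
    y[1:] = 0.8 * x[:-1] + 0.1 * rng.normal(size=499)  # y driven by past x
    print(granger_f_statistic(x, y), granger_f_statistic(y, x))
```

A large F statistic indicates that past values of `x` reduce the prediction error for `y` beyond what the past of `y` alone achieves; in the synthetic example the statistic for the direction x to y should be far larger than for the reverse direction.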