Auto-tuning Distributed Stream Processing Systems using Reinforcement Learning
Fine-tuning distributed systems is considered a craft, relying
on intuition and experience. This becomes even more challenging when the
systems need to react in near real time, as streaming engines have to do to
maintain pre-agreed service quality metrics. In this article, we present an
automated approach that builds on a combination of supervised and reinforcement
learning methods to recommend the most appropriate lever configurations based
on previous load. With this, streaming engines can be automatically tuned
without requiring a human to determine the right way and proper time to deploy
them. This opens the door to new configurations that are not being applied
today since the complexity of managing these systems has surpassed the
abilities of human experts. We show how reinforcement learning systems can find
substantially better configurations in less time than their human counterparts
and adapt to changing workloads.
Complaint-driven Training Data Debugging for Query 2.0
As the need for machine learning (ML) increases rapidly across all industry
sectors, there is significant interest among commercial database providers in
supporting "Query 2.0", which integrates model inference into SQL queries.
Debugging Query 2.0 is very challenging since an unexpected query result may be
caused by bugs in the training data (e.g., wrong labels, corrupted features).
In response, we propose Rain, a complaint-driven training data debugging
system. Rain allows users to specify complaints over the query's intermediate
or final output, and aims to return a minimum set of training examples so that
if they were removed, the complaints would be resolved. To the best of our
knowledge, we are the first to study this problem. A naive solution requires
retraining an exponential number of ML models. We propose two novel heuristic
approaches based on influence functions which both require linear retraining
steps. We provide an in-depth analytical and empirical analysis of the two
approaches and conduct extensive experiments to evaluate their effectiveness
using four real-world datasets. Results show that Rain achieves the highest
recall@k among all the baselines while still returning results interactively.
Comment: Proceedings of the 2020 ACM SIGMOD International Conference on
Management of Data