EFQ: Why-Not Answer Polynomials in Action
One important issue in modern database applications is supporting the user with efficient tools to debug and fix queries, because such tasks are demanding in both time and skill. One particular problem, known as the Why-Not question, focuses on the reasons for tuples missing from query results. The EFQ platform demonstrated here has been designed in this context to efficiently leverage Why-Not answer polynomials, a novel approach that provides the user with complete explanations to Why-Not questions and allows for automatic, relevant query refinements.
PigReuse: A Reuse-based Optimizer for Pig Latin
Pig Latin is a popular language widely used for parallel processing of massive data sets. Currently, subexpressions occurring repeatedly in Pig Latin scripts are executed as many times as they appear, and the current Pig Latin optimizer does not identify reuse opportunities. We present a novel optimization approach aimed at identifying and reusing repeated subexpressions in Pig Latin scripts. Our optimization algorithm, named PigReuse, operates on a particular algebraic representation of Pig Latin scripts. PigReuse identifies subexpression merging opportunities, selects the best ones to execute based on a cost function, and reuses their results as needed in order to compute exactly the same output as the original scripts. Our experiments demonstrate the effectiveness of our approach.
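The abstract does not spell out PigReuse's algebraic representation or cost function, but the core reuse idea can be illustrated with a minimal sketch: hash each operator subtree to a canonical key so that structurally identical subexpressions are detected and evaluated only once. The `Node` class, the `dedup` helper, and the toy LOAD/FILTER scripts below are illustrative assumptions, not the paper's actual algorithm.

```python
class Node:
    """A tiny dataflow operator: an op name, parameters, and input nodes."""
    def __init__(self, op, params=(), inputs=()):
        self.op, self.params, self.inputs = op, tuple(params), tuple(inputs)

    def key(self):
        # Canonical key: identical keys mean structurally identical subexpressions.
        return (self.op, self.params, tuple(i.key() for i in self.inputs))

def dedup(roots):
    """Merge structurally identical subexpressions so each is evaluated once."""
    seen = {}
    def visit(node):
        k = node.key()
        if k in seen:                       # reuse the node built earlier
            return seen[k]
        node.inputs = tuple(visit(i) for i in node.inputs)
        seen[k] = node
        return node
    return [visit(r) for r in roots], len(seen)

# Two scripts that both LOAD and FILTER the same data, then aggregate differently.
load = lambda: Node("LOAD", ("logs.csv",))
shared = lambda: Node("FILTER", ("status == 200",), (load(),))
script_a = Node("GROUP", ("user",), (shared(),))
script_b = Node("DISTINCT", (), (shared(),))
merged, distinct_ops = dedup([script_a, script_b])
assert merged[0].inputs[0] is merged[1].inputs[0]   # the FILTER node is now shared
```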
Combining Programming-by-Example with Transformation Discovery from large Databases
Data transformation discovery is one of the most tedious tasks in data preparation. In particular, the generation of transformation programs for semantic transformations is tricky because additional sources for look-up operations are necessary. Current systems for semantic transformation discovery face two major problems: either they follow a program synthesis approach that only scales to a small set of input tables, or they rely on extracting transformation functions from large corpora, which requires identifying exact transformations in those resources and is prone to noisy data. In this paper, we combine both approaches to benefit from large corpora as well as from the sophistication of program synthesis. To do so, we devise an ensemble of retrieval and pruning strategies that extracts the most relevant tables for a given transformation task. The extracted resources can then be processed by a program synthesis engine to generate more accurate transformation results than the state of the art.
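As a rough illustration of the retrieval-and-pruning idea (the paper's actual strategy ensemble is not detailed in this abstract), the sketch below ranks candidate look-up tables by how many of the user's input/output examples they cover and prunes the rest before handing them to a synthesis engine; all names, thresholds, and the toy corpus are assumptions.

```python
def coverage(table, examples):
    """Fraction of (input, output) example pairs that some row of the table maps."""
    index = {}
    for row in table:                       # build a value -> rows lookup over all cells
        for cell in row:
            index.setdefault(cell, []).append(row)
    hits = sum(1 for src, dst in examples
               if any(dst in row for row in index.get(src, [])))
    return hits / len(examples)

def retrieve_and_prune(corpus, examples, k=5, min_cov=0.6):
    """Keep the top-k corpus tables whose example coverage exceeds a threshold."""
    scored = [(coverage(t, examples), t) for t in corpus]
    scored = [(s, t) for s, t in scored if s >= min_cov]
    return [t for s, t in sorted(scored, key=lambda x: -x[0])[:k]]

# Toy corpus: a country->capital table and an unrelated table.
corpus = [
    [("Germany", "Berlin"), ("France", "Paris"), ("Spain", "Madrid")],
    [("A", "1"), ("B", "2")],
]
examples = [("Germany", "Berlin"), ("France", "Paris")]
candidates = retrieve_and_prune(corpus, examples)   # only the capital table survives
```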
Silentium! Run-Analyse-Eradicate the Noise out of the DB/OS Stack
When multiple tenants compete for resources, database performance tends to suffer. Yet there are scenarios where guaranteed sub-millisecond latencies are crucial, such as in real-time data processing, on IoT devices, or when operating in safety-critical environments. In this paper, we study how to make query latencies deterministic in the face of noise (whether caused by other tenants or by unrelated operating system tasks). We perform controlled experiments with an in-memory database engine in a multi-tenant setting, where we successively eradicate noisy interference from within the system software stack, to the point where the engine runs close to bare metal on the underlying hardware. We show that we can achieve query latencies comparable to those of the database engine running as the sole tenant, without noticeably impacting the workload of competing tenants. We discuss these results in the context of ongoing efforts to build custom operating systems for database workloads, and point out that for certain use cases the margin for improvement is rather narrow. In fact, for scenarios like ours, existing operating systems might just be good enough, provided that they are expertly configured. We then critically discuss these findings in the light of a broader family of database systems (e.g., including disk-based engines) and how to extend the approach of this paper accordingly.
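One concrete example of the kind of "expert configuration" alluded to above is pinning a latency-critical worker to a dedicated core that has been shielded from other tenants (e.g., via the Linux isolcpus boot parameter) and then inspecting tail-latency jitter. The snippet below is a minimal, Linux-only sketch of that idea, not the paper's experimental setup; the probe workload and the choice of core 3 are assumptions.

```python
import os
import time

def pin_to_core(core):
    """Restrict this process to a single CPU core (Linux only)."""
    os.sched_setaffinity(0, {core})

def measure_latencies(work, runs=10_000):
    """Time a small unit of work repeatedly and report p50/p99 in microseconds."""
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter_ns()
        work()
        samples.append((time.perf_counter_ns() - t0) / 1_000)
    samples.sort()
    return samples[len(samples) // 2], samples[int(len(samples) * 0.99)]

if __name__ == "__main__":
    pin_to_core(3)                       # assumes core 3 was isolated from other tenants
    probe = lambda: sum(range(1_000))    # stand-in for a short in-memory query
    p50, p99 = measure_latencies(probe)
    print(f"p50={p50:.1f}us  p99={p99:.1f}us")
```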
From Plate to Prevention: A Dietary Nutrient-aided Platform for Health Promotion in Singapore
Singapore has been striving to improve the provision of healthcare services to its people. In the course of these efforts, the government has taken note of deficiencies in regulating and supervising people's nutrient intake, which is identified as a contributing factor in the development of chronic diseases; consequently, this issue has garnered significant attention. In this paper, we share our experience in addressing this issue and obtaining medical-grade nutrient intake information to benefit Singaporeans in various ways. To this end, we develop the FoodSG platform to incubate diverse healthcare-oriented applications as a service in Singapore, taking into account their shared requirements. We further recognize the importance of localized food datasets and systematically clean and curate a localized Singaporean food dataset, FoodSG-233. To overcome the recognition-performance hurdle posed by Singapore's multifarious food dishes, we propose to integrate supervised contrastive learning into our food recognition model FoodSG-SCL for its intrinsic capability to mine hard positive/negative samples and thereby boost accuracy. Through a comprehensive evaluation, we present performance results of the proposed model and insights on food-related healthcare applications. The FoodSG-233 dataset has been released at https://foodlg.comp.nus.edu.sg/.
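The abstract does not give FoodSG-SCL's exact formulation, so the following is a minimal sketch of the standard supervised contrastive (SupCon) objective it builds on: each anchor is pulled toward same-class embeddings and pushed away from all others, so hard positives and negatives dominate the gradient. The function name, temperature, and tensor shapes are illustrative, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss over a batch of projection embeddings.

    embeddings: (N, D) outputs of the projection head
    labels:     (N,) integer class labels (here: food dish ids)
    """
    z = F.normalize(embeddings, dim=1)                  # unit-norm features
    sim = z @ z.T / temperature                         # pairwise similarities
    n = z.size(0)
    not_self = ~torch.eye(n, dtype=torch.bool, device=z.device)
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & not_self
    sim = sim.masked_fill(~not_self, float("-inf"))     # drop self-similarity
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_count = pos.sum(dim=1)
    valid = pos_count > 0                               # anchors with >= 1 positive
    sum_log_prob_pos = log_prob.masked_fill(~pos, 0.0).sum(dim=1)
    mean_log_prob_pos = sum_log_prob_pos[valid] / pos_count[valid]
    return -mean_log_prob_pos.mean()

# Usage: embeddings come from the encoder + projection head, labels are dish classes.
emb = torch.randn(8, 128)
lbl = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
loss = supcon_loss(emb, lbl)
```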
it - Information Technology: Special Issue on Data Integration
Wondering why data are missing from query results? Ask Conseil Why-Not
In analyzing and debugging data transformations, or more specifically relational queries, a subproblem is to understand why some data are not part of the query result. This problem has recently been addressed from different perspectives for various fragments of relational queries. The different perspectives yield different, yet complementary, explanations of such missing answers. This paper first aims at unifying the different approaches by defining a new type of explanation, called hybrid explanation, that encompasses the variety of previously defined types of explanations. This solution goes beyond simply forming the union of explanations produced by different algorithms and is shown to explain a larger set of missing answers. Second, we present Conseil, an algorithm to generate hybrid explanations. Conseil is also the first algorithm to handle non-monotonic queries. Experiments on efficiency and explanation quality show that Conseil is comparable to and even outperforms previous algorithms.
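To make the two explanation flavors that hybrid explanations unify more concrete, here is a toy Why-Not scenario for a single select-join query: an instance-based explanation points at missing or incompatible source tuples, while a query-based explanation points at the operator that pruned them. The relations, query, and `why_not` helper are invented for illustration; the actual Conseil algorithm handles far more general (including non-monotonic) queries.

```python
# Toy Why-Not scenario: why is ("Alice", "Paris") missing from
#   SELECT e.name, c.city FROM emp e JOIN city c ON e.dept = c.dept WHERE e.salary > 50
emp = [{"name": "Alice", "dept": "R&D", "salary": 40}]
city = [{"dept": "Sales", "city": "Paris"}]

def why_not(missing_name, missing_city):
    explanations = []
    e = next((r for r in emp if r["name"] == missing_name), None)
    c = next((r for r in city if r["city"] == missing_city), None)
    # Instance-based explanation: which source tuples are missing altogether?
    if e is None:
        explanations.append(f"no emp tuple for {missing_name}")
    if c is None:
        explanations.append(f"no city tuple for {missing_city}")
    # Query-based explanation: which operator rejected the existing tuples?
    if e and c and e["dept"] != c["dept"]:
        explanations.append("join condition e.dept = c.dept pruned the pair")
    if e and e["salary"] <= 50:
        explanations.append("selection e.salary > 50 pruned the emp tuple")
    return explanations

print(why_not("Alice", "Paris"))
# ['join condition e.dept = c.dept pruned the pair',
#  'selection e.salary > 50 pruned the emp tuple']
```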