ANSWERING WHY-NOT QUESTIONS ON REVERSE SKYLINE QUERIES OVER INCOMPLETE DATA
      Recently, query-based preferences have received considerable attention from researchers and data users. One of the most popular preference-based queries is the skyline query, which returns the subset of superior records that are not dominated by any other record. The reverse skyline query was developed as an extension of the skyline query: it identifies the query points whose skyline results include a given record. Furthermore, data-oriented IT development requires scientists to be able to process data in all conditions. In the real world, multidimensional data are often incomplete because of damage, loss, or privacy constraints. To increase the usability of such data sets, this study addresses one of the problems in processing reverse skyline queries over incomplete data, namely the "why-not" problem. The proposed solution consists of advice and refinement steps so that a query point that does not initially include an incomplete record in its result can be adjusted so that the record becomes part of the result. The study further discusses the dominance relationship between incomplete records, along with the solution to the problem, and reports performance evaluations measuring its efficiency and effectiveness.
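The dominance relation described above can be illustrated with a minimal sketch. The function names and sample data are our own, and the handling of incomplete records (comparing only the dimensions present in both) is one common convention, not necessarily the one used in this study:

```python
# Skyline dominance sketch: smaller is better in every dimension.
# None marks a missing value in an incomplete record.

def dominates(a, b):
    """a dominates b if a is no worse in all shared dimensions and
    strictly better in at least one; only dimensions present in both
    records are compared."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return False
    return all(x <= y for x, y in shared) and any(x < y for x, y in shared)

def skyline(records):
    """Records not dominated by any other record."""
    return [r for r in records
            if not any(dominates(o, r) for o in records if o is not r)]

hotels = [(50, 2.0), (60, 1.0), (70, 3.0)]  # (price, distance)
print(skyline(hotels))  # (70, 3.0) is dominated by (50, 2.0)
```

A why-not question then asks why a given record, for example an incomplete one, is absent from this result, and what refinement would bring it in.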
Explainable and Resource-Efficient Stream Processing Through Provenance and Scheduling
In our era of big data, information is captured at unprecedented volumes and velocities, with technologies such as Cyber-Physical Systems making quick decisions based on the processing of streaming, unbounded datasets. In such scenarios, it can be beneficial to process the data in an online manner, using the stream processing paradigm implemented by Stream Processing Engines (SPEs). While SPEs enable high-throughput, low-latency analysis, they face challenges connected to evolving deployment scenarios, such as the increasing use of heterogeneous, resource-constrained edge devices together with cloud resources, and growing user expectations for usability, control, and resource-efficiency on par with features provided by traditional databases. This thesis tackles open challenges in making stream processing more user-friendly, customizable, and resource-efficient. The first part outlines our work, providing high-level background information, descriptions of the research problems, and our contributions. The second part presents our three state-of-the-art frameworks for explainable data streaming using data provenance, which can help users of streaming queries to identify important data points, explain unexpected behaviors, and aid query understanding and debugging. (A) GeneaLog provides backward provenance, allowing users to identify the inputs that contributed to the generation of each output of a streaming query. (B) Ananke is the first framework to provide a duplicate-free graph of live forward provenance, enabling easy bidirectional tracing of input-output relationships in streaming queries and identifying data points that have finished contributing to results. (C) Erebus is the first framework that allows users to define expectations about the results of a streaming query, validating whether these expectations are met or otherwise providing explanations in the form of why-not provenance.
The third part presents techniques for execution efficiency through custom scheduling, introducing our state-of-the-art scheduling frameworks that control resource allocation and achieve user-defined performance goals. (D) Haren is an SPE-agnostic user-level scheduler that can efficiently enforce user-defined scheduling policies. (E) Lachesis is a standalone scheduling middleware that requires no changes to SPEs but, instead, directly guides the scheduling decisions of the underlying Operating System. Our extensive evaluations using real-world SPEs and workloads show that our work significantly improves over the state-of-the-art while introducing only small performance overheads.
Enhanced Inversion of Schema Evolution with Provenance
Long-term data-driven studies have become indispensable in many areas of
science. Often, the formats, structures and semantics of the data change
over time: the data sets evolve. Studies spanning several decades in
particular therefore have to cope with changing database schemas. The
evolution of these databases eventually leads to a large number of
schemas, which have to be stored and managed at considerable cost and
effort. In the interest of reproducibility of research data, however,
each database version must be reconstructable with little effort, so
that a previously published result can be validated and reproduced at
any time.
Nevertheless, in many cases such an evolution cannot be fully
reconstructed. This article classifies the 15 most frequently used
schema modification operators and defines the associated inverse for
each operation. To avoid information loss, it furthermore defines which
additional provenance information has to be stored. We define four
classes dealing with dangling tuples, duplicates and
provenance-invariant operators; each class is presented by one
representative.
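As a minimal illustration of the idea (our own sketch, not the article's formalism), consider a DROP COLUMN operator: on its own it is only quasi-invertible, since the dropped values are lost, but storing them as provenance keyed by tuple id makes the inverse exact:

```python
# Illustrative quasi-inverse made exact via provenance.
# Tables are modeled as {tuple_id: {column: value}} dictionaries.

def drop_column(table, col):
    """Schema modification: remove a column, keeping the dropped
    values as provenance so the operation can be undone."""
    provenance = {tid: row.pop(col) for tid, row in table.items()}
    return table, provenance

def inverse_drop_column(table, col, provenance):
    """Exact inverse: restore the column from stored provenance."""
    for tid, row in table.items():
        row[col] = provenance[tid]
    return table

t = {1: {"name": "a", "age": 30}, 2: {"name": "b", "age": 40}}
t2, prov = drop_column(t, "age")
restored = inverse_drop_column(t2, "age", prov)
print(restored)  # the original table is reconstructed
```

Without the stored provenance, only a sub-database (here, the `name` column) would be recoverable.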
By using and extending the theory of schema mappings and their inverses
for queries, data analysis, why-provenance, and schema evolution, we are
able to combine data analysis applications with provenance under
evolving database structures, enabling the reproducibility of scientific
results over longer periods of time. While most of the inverses of
schema mappings used for analysis or evolution are not exact but only
quasi-inverses, adding provenance information enables us to reconstruct
a sub-database of the research data that is sufficient to guarantee
reproducibility.
Explaining Missing Results when Processing Hierarchical Data in Spark
Several algorithms exist that help developers debug a database query. These works answer why certain data are missing from the result set of a query, or why certain unexpected data appear in it (the why-not question). For query languages that support hierarchical data, however, only few such works exist so far.
This thesis investigates the particularities of why-not questions over hierarchical data. It considers which specific questions can be posed in this setting and how they can be answered appropriately. A concrete algorithm is also designed and implemented for Python; using an example, it is examined whether the algorithm is efficient and effective enough to answer why-not questions.
How and Why is An Answer (Still) Correct? Maintaining Provenance in Dynamic Knowledge Graphs
Knowledge graphs (KGs) have increasingly become the backbone of many critical
knowledge-centric applications. Most large-scale KGs used in practice are
automatically constructed based on an ensemble of extraction techniques applied
over diverse data sources. Therefore, it is important to establish the
provenance of results for a query to determine how these were computed.
Provenance is shown to be useful for assigning confidence scores to the
results, for debugging the KG generation itself, and for providing answer
explanations. In many such applications, certain queries are registered as
standing queries since their answers are needed often. However, KGs keep
continuously changing due to reasons such as changes in the source data,
improvements to the extraction techniques, refinement/enrichment of
information, and so on. This brings us to the issue of efficiently maintaining
the provenance polynomials of complex graph pattern queries for dynamic and
large KGs instead of having to recompute them from scratch each time the KG is
updated. Addressing these issues, we present HUKA which uses provenance
polynomials for tracking the derivation of query results over knowledge graphs
by encoding the edges involved in generating the answer. More importantly, HUKA
also maintains these provenance polynomials in the face of updates---insertions
as well as deletions of facts---to the underlying KG. Experimental results over
large real-world KGs such as YAGO and DBpedia with various benchmark SPARQL
query workloads reveal that HUKA can be almost 50 times faster than
existing systems for provenance computation on dynamic KGs.
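The core idea of a provenance polynomial can be sketched in a few lines. This is our own minimal illustration of the semiring view (a polynomial as a sum of monomials, each monomial the set of KG edges jointly deriving an answer), not HUKA's actual data structures:

```python
# Two alternative derivations of the same query answer:
# via edges {e1, e2} or via edges {e1, e3}.
polynomial = {frozenset({"e1", "e2"}), frozenset({"e1", "e3"})}

def after_deletion(poly, deleted_edge):
    """Deleting an edge invalidates every monomial (derivation)
    that uses it; the answer survives while any monomial remains."""
    return {m for m in poly if deleted_edge not in m}

p = after_deletion(polynomial, "e2")
print(len(p) > 0)   # True: the answer still holds via {e1, e3}
p = after_deletion(p, "e3")
print(len(p) > 0)   # False: no derivation left, the answer disappears
```

Maintaining such polynomials incrementally, rather than recomputing them from scratch on every KG update, is precisely the problem the abstract describes.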
Considering User Intention in Differential Graph Queries
Empty answers are a major problem when processing pattern matching queries in graph databases. In particular, there can be multiple reasons why a query fails. To support users in such situations, differential queries can be used that deliver the missing parts of a graph query. Multiple heuristics have been proposed for differential queries, which reduce the search space. Although they are successful in increasing performance, they can discard query subgraphs relevant to a user. To address this issue, the authors extend the concept of differential queries and introduce top-k differential queries, which calculate a ranking based on users’ preferences and significantly support the users’ understanding of the queried database management system. A user assigns relevance weights to elements of a graph query that steer the search and are used for the ranking. In this paper the authors propose different strategies for the selection of relevance weights and their propagation. As a result, the search is modelled along the most relevant paths. The authors evaluate their solution and both strategies on the DBpedia data graph.
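The ranking step can be sketched as follows. This is a hedged illustration of the general idea only; the weights, element names, and scoring function are hypothetical, not the authors' strategies:

```python
import heapq

# User-assigned relevance weights for query elements (illustrative).
weights = {"actor": 3.0, "directed": 1.0, "award": 2.0}

def top_k_differential(candidates, weights, k):
    """Rank candidate partial matches (each a set of covered query
    elements) by the total relevance weight they cover; return top-k."""
    scored = [(sum(weights[e] for e in c), c) for c in candidates]
    return heapq.nlargest(k, scored, key=lambda s: s[0])

cands = [{"actor"}, {"actor", "award"}, {"directed"}]
print(top_k_differential(cands, weights, 2))
# highest-weighted partial matches come first
```

The ranking thus surfaces the failed query subgraphs a user cares about most, instead of discarding them during search-space pruning.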
EFQ: Why-Not Answer Polynomials in Action
One important issue in modern database applications is supporting the user with efficient tools to debug and fix queries, because such tasks are both time- and skill-demanding. One particular problem, known as the Why-Not question, focuses on the reasons for tuples missing from query results. The EFQ platform demonstrated here has been designed in this context to efficiently leverage Why-Not Answer polynomials, a novel approach that provides the user with complete explanations to Why-Not questions and allows for automatic, relevant query refinements.
Provenance Tools
The importance of provenance has grown for all kinds of sciences over recent years. During research on data provenance, several tools have been developed to use provenance in a practical way. We chose seven of those tools and exhaustively tested five of them: Trio, ORCHESTRA, Perm, GProM, and ProvSQL. In this article, we first introduce the basics of data provenance, especially where-, why-, and how-provenance. After that, we present the results of our tool tests.
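Why-provenance, one of the notions the article introduces, can be sketched in a few lines. This is our own toy illustration of witness sets for a join, not output from any of the tested tools:

```python
# Why-provenance for a join: each output tuple carries the sets of
# input tuple ids (witnesses) that derive it.

r = [("r1", "alice", "db"), ("r2", "bob", "os")]    # (id, student, course)
s = [("s1", "db", "smith"), ("s2", "db", "jones")]  # (id, course, teacher)

def join_with_why(r, s):
    out = {}
    for rid, student, course in r:
        for sid, course2, teacher in s:
            if course == course2:
                # Collect every witness set that produces this output.
                out.setdefault((student, teacher), set()).add(
                    frozenset({rid, sid}))
    return out

result = join_with_why(r, s)
print(result[("alice", "smith")])  # {frozenset({'r1', 's1'})}
```

Where-provenance would instead point at the source cells a value was copied from, and how-provenance (polynomials) would additionally record how the witnesses combine.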