    Toward relevant answers to queries on incomplete databases

    Incomplete and uncertain information is ubiquitous in database management applications. However, the techniques specifically developed to handle incomplete data are not sufficient: even the evaluation of SQL queries on databases containing NULL values remains a challenge after 40 years. There is no consensus on what an answer to a query on an incomplete database should be, and the existing notions often have limited applicability. One of the most prevalent techniques in the literature is based on finding answers that are certainly true, independently of how missing values are interpreted. However, this notion has yielded several conflicting formal definitions of certain answers. Based on the fact that incomplete data can be enriched with additional knowledge, we designed a notion able to unify and explain the different definitions of certain answers. Moreover, this knowledge-preserving notion of certain answers provides the first well-founded definition of certain answers for the relational bag data model and value-inventing queries, addressing some key limitations of previous approaches. However, it does not provide any guarantee about the relevance of the answers it captures. To understand what relevant answers to queries on incomplete databases would be, we designed and conducted a survey on the everyday usage of NULL values among database users. One finding of this socio-technical study is that even when users agree on the possible interpretations of NULL values, they may not agree on what a satisfactory query answer is. Therefore, to be relevant, query evaluation on incomplete databases must account for users' tasks and preferences. We model users' preferences and tasks with the notion of regret. The regret function captures the task-dependent loss a user endures when they treat one database as ground truth instead of another. Thanks to this notion, we designed the first framework able to provide a score accounting for the risk associated with query answers. It allows us to define the risk-minimizing answers to queries on incomplete databases. We show that for some regret functions, risk-minimizing answers coincide with certain answers. Moreover, as the notion is more flexible, it can capture more nuanced answers and more interpretations of incompleteness. A different approach to improving the relevance of an answer is to explain its provenance. We propose to partition the incompleteness into sources and measure their respective contributions to the risk of an answer. As a first milestone, we study several models to predict the evolution of the risk when a source of incompleteness is cleaned. We implemented the framework, and it exhibits promising results on relational databases and queries with aggregate and grouping operations. Indeed, the model allows us to infer the risk reduction obtained by cleaning an attribute. Finally, by taking a game-theoretic approach, the model can explain answers based on the contribution of each attribute to the risk.
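
    The abstract does not spell out the formal definitions; the display below is only a plausible sketch of the regret-based risk score it describes, with the set of possible worlds of the incomplete database D, a prior over that set, and an answer-level loss induced by the regret function all introduced here for illustration rather than taken from the thesis:

        \[
            \mathrm{risk}_Q(A) \;=\; \mathbb{E}_{d' \sim \mu}\bigl[\rho\bigl(A, Q(d')\bigr)\bigr],
            \quad d' \in \mathrm{poss}(D),
            \qquad
            \mathrm{ans}_{\mathrm{risk}}(Q, D) \;=\; \arg\min_{A}\, \mathrm{risk}_Q(A).
        \]

    Under this reading, a loss that only penalizes returning a tuple absent from the true answer Q(d') drives the risk-minimizing answer toward the intersection of the answers over all possible worlds, i.e. toward the classical certain answers, which is consistent with the coincidence claimed above for particular regret functions.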

    Modern data analytics in the cloud era

    Cloud computing has been the groundbreaking technology of the last decade. The ease of use of the managed environment, in combination with a nearly unlimited amount of resources and a pay-per-use price model, enables fast and cost-efficient project realization for a broad range of users. Cloud computing also changes the way software is designed, deployed and used. This thesis focuses on database systems deployed in the cloud environment. We identify three major interaction points of the database engine with the environment that show changed requirements compared to traditional on-premise data warehouse solutions. First, software is deployed on elastic resources. Consequently, systems should support elasticity in order to match workload requirements and be cost-effective. We present an elastic scaling mechanism for distributed database engines, combined with a partition manager that provides load balancing while minimizing partition reassignments in the case of elastic scaling. Furthermore, we introduce a buffer pre-heating strategy that mitigates the cold start after scaling and yields an immediate performance benefit from the newly added resources. Second, cloud-based systems are accessible and available from nearly everywhere. Consequently, data is frequently ingested from numerous endpoints, which differs from the bulk loads or ETL pipelines of a traditional data warehouse solution. Many users do not define database constraints in order to avoid transaction aborts due to conflicts or to speed up data ingestion. To mitigate this issue, we introduce the concept of PatchIndexes, which allow the definition of approximate constraints. PatchIndexes maintain exceptions to constraints, make them usable in query optimization and execution, and offer efficient update support. The concept can be applied to arbitrary constraints, and we provide examples of approximate uniqueness and approximate sorting constraints. Moreover, we show how PatchIndexes can be exploited to define advanced constraints like an approximate multi-key partitioning, which offers robust query performance over workloads with different partition key requirements. Third, data-centric workloads have changed over the last decade. Besides traditional SQL workloads for business intelligence, data science workloads are of significant importance nowadays. In these cases the database system often acts only as a data provider, while the computational effort takes place in dedicated data science or machine learning (ML) environments. As this workflow has several drawbacks, we pursue the goal of pushing advanced analytics towards the database engine and introduce the Grizzly framework as a DataFrame-to-SQL transpiler. Based on this, we identify user-defined functions (UDFs) and ML inference as important tasks that would benefit from a deeper engine integration, and we investigate and evaluate approaches for in-database execution of Python UDFs and in-database ML inference.
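
    The thesis summary above does not include code; the snippet below is only a rough, hypothetical illustration of the PatchIndex idea for an approximate uniqueness constraint: the constraint is assumed to hold on all rows except those tracked in an explicit exception set. The class and method names are invented for this sketch and do not reflect the actual implementation.

        class PatchIndex:
            """Toy model of an approximate uniqueness constraint on one column:
            uniqueness is assumed to hold everywhere except for the row ids kept
            in an explicit exception set (the "patch")."""

            def __init__(self, rows, column):
                self.column = column
                self.exceptions = set()        # ids of rows violating uniqueness
                seen = set()
                for rid, row in enumerate(rows):
                    key = row[column]
                    if key in seen:
                        self.exceptions.add(rid)   # patch only the violating row
                    else:
                        seen.add(key)

            def clean_part(self, rows):
                """Rows on which the optimizer may rely on strict uniqueness."""
                return [r for rid, r in enumerate(rows) if rid not in self.exceptions]

            def patch_part(self, rows):
                """Exception rows that a query plan must handle separately."""
                return [r for rid, r in enumerate(rows) if rid in self.exceptions]

        rows = [{"id": 1}, {"id": 2}, {"id": 1}]   # an almost-unique "id" column
        pi = PatchIndex(rows, "id")
        print(len(pi.clean_part(rows)), len(pi.patch_part(rows)))   # 2 1

    A query that needs exact duplicate handling could then run a fast plan on the clean part and a fallback plan on the (hopefully small) patch part, mirroring the split described in the abstract.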

    Comprehending queries over finite maps

    SQL Nulls and Two-Valued Logic

    Querying Incomplete Data: Complexity and Tractability via Datalog and First-Order Rewritings

    To answer database queries over incomplete data, the gold standard is finding certain answers: those that are true regardless of how incomplete data is interpreted. Such answers can be found efficiently for conjunctive queries and their unions, even in the presence of constraints. With negation added, however, the problem becomes intractable. We concentrate on the complexity of certain answers under constraints, and on efficiently answering queries outside the usual classes of (unions of) conjunctive queries by means of rewritings into Datalog and first-order queries. We first notice that there are three different ways in which query answering can be cast as a decision problem. We complete the existing picture and provide precise complexity bounds on all versions of the decision problem, for certain and best answers. We then study a well-behaved class of queries that extends unions of conjunctive queries with a mild form of negation. We show that for them, certain answers can be expressed in Datalog with negation, even in the presence of functional dependencies, thus making them tractable in data complexity. We show that in general Datalog cannot be replaced by first-order logic, but without constraints such a rewriting can be done in first-order. The paper is under consideration in Theory and Practice of Logic Programming (TPLP).
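
    The paper itself works through Datalog and first-order rewritings; purely as a baseline illustration of the semantics it targets, the Python sketch below computes certain answers naively, by evaluating a query over every completion of the NULLs drawn from a small active domain and intersecting the results. The helper names (completions, certain_answers) and the toy data are invented for this example, and the exponential enumeration is exactly what the rewriting techniques studied in the paper avoid.

        from itertools import product

        NULL = None   # a SQL-style null: each occurrence is interpreted independently

        def completions(table, domain):
            """Yield every complete version of `table`, filling each NULL with a
            constant from `domain`."""
            slots = [(i, j) for i, row in enumerate(table)
                            for j, v in enumerate(row) if v is NULL]
            for choice in product(domain, repeat=len(slots)):
                filled = [list(row) for row in table]
                for (i, j), c in zip(slots, choice):
                    filled[i][j] = c
                yield [tuple(row) for row in filled]

        def certain_answers(query, table, domain):
            """Tuples returned by `query` in every completion of `table`."""
            result = None
            for world in completions(table, domain):
                answers = set(query(world))
                result = answers if result is None else result & answers
            return result or set()

        # Toy query: employees working in the R&D department.
        emp = [("ann", "R&D"), ("bob", NULL)]
        q = lambda t: {name for name, dept in t if dept == "R&D"}
        print(certain_answers(q, emp, domain={"R&D", "Sales"}))   # {'ann'}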

    Querying Incomplete Numerical Data: Between Certain and Possible Answers

    Technologies and Applications for Big Data Value

    This open access book explores cutting-edge solutions and best practices for big data and data-driven AI applications in the data-driven economy. It provides the reader with a basis for understanding how technical issues can be overcome to offer real-world solutions to major industrial areas. The book starts with an introductory chapter that provides an overview of the book by positioning the following chapters in terms of their contributions to the technology frameworks which are key elements of the Big Data Value Public-Private Partnership and the upcoming Partnership on AI, Data and Robotics. The remainder of the book is arranged in two parts. The first part, "Technologies and Methods", contains horizontal contributions of technologies and methods that enable data value chains to be applied in any sector. The second part, "Processes and Applications", details experience reports and lessons from using big data and data-driven approaches in processes and applications. Its chapters are co-authored with industry experts and cover domains including health, law, finance, retail, manufacturing, mobility, and smart cities. Contributions emanate from the Big Data Value Public-Private Partnership and the Big Data Value Association, which have acted as the European data community's nucleus to bring together businesses with leading researchers to harness the value of data to benefit society, business, science, and industry. The book is of interest to two primary audiences: first, undergraduate and postgraduate students and researchers in various fields, including big data, data science, data engineering, and machine learning and AI; second, practitioners and industry experts engaged in data-driven systems and software design and deployment projects who are interested in employing these advanced methods to address real-world problems.

    Principles of Query Visualization

    Query Visualization (QV) is the problem of transforming a given query into a graphical representation that helps humans understand its meaning. This task is notably different from designing a Visual Query Language (VQL) that helps a user compose a query. This article discusses the principles of relational query visualization and its potential for simplifying user interactions with relational data. (20 pages, 12 figures; preprint for the IEEE Data Engineering Bulletin.)
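
    As a small, hypothetical illustration of the kind of transformation QV performs (not the article's own method), the sketch below turns an already-parsed conjunctive query into a join graph whose nodes are the referenced tables and whose edges are the join predicates; the resulting structure could then be drawn with any graph-rendering tool. The input format and function name are assumptions made for this example.

        # Hypothetical parsed form of:
        #   SELECT * FROM emp e, dept d WHERE e.dept_id = d.id
        parsed = {
            "tables": {"e": "emp", "d": "dept"},
            "joins": [("e.dept_id", "d.id")],
        }

        def query_graph(parsed):
            """Return (nodes, edges) of a simple join graph for visualization."""
            nodes = dict(parsed["tables"])                # alias -> table name
            edges = [(lhs.split(".")[0], rhs.split(".")[0], f"{lhs} = {rhs}")
                     for lhs, rhs in parsed["joins"]]
            return nodes, edges

        nodes, edges = query_graph(parsed)
        print(nodes)   # {'e': 'emp', 'd': 'dept'}
        print(edges)   # [('e', 'd', 'e.dept_id = d.id')]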

    Databases. Practicum

    The practicum is designed to acquaint students with the methodology of using the SQL language and the PostgreSQL database server as supporting teaching material for the laboratory works of the "Databases" discipline. Starting with the general ideas and terminology of databases, the study guide covers entity-relationship information modeling, the relational model as the core of the SQL language, and the syntax of SQL commands and their practical use in the PostgreSQL environment. Working with the PostgreSQL server through the pgAdmin4 GUI tool, including creating and modifying database objects and managing transactions, is another significant part of the practicum. The study guide is intended for full-time students of speciality 121 "Software Engineering" at the Faculty of Applied Mathematics of Igor Sikorsky Kyiv Polytechnic Institute.
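
    The practicum itself works through pgAdmin4 and plain SQL; purely as a complementary sketch (not part of the study guide), the snippet below shows the same PostgreSQL workflow, creating a table, inserting a row, and managing the transaction, through the psycopg2 driver from Python. The connection parameters are placeholders for a local server.

        import psycopg2

        # Placeholder connection settings; adjust to the local PostgreSQL server.
        conn = psycopg2.connect(dbname="practicum", user="student", password="secret")
        try:
            with conn.cursor() as cur:
                cur.execute("""
                    CREATE TABLE IF NOT EXISTS student (
                        id   serial PRIMARY KEY,
                        name text NOT NULL
                    )
                """)
                cur.execute("INSERT INTO student (name) VALUES (%s)", ("Taras",))
            conn.commit()       # end the transaction successfully
        except Exception:
            conn.rollback()     # undo every statement of the failed transaction
            raise
        finally:
            conn.close()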

    Acta Cybernetica: Volume 25, Number 3
