    A Rule-Based Approach to Analyzing Database Schema Objects with Datalog

    Database schema elements such as tables, views, triggers and functions are typically defined with many interrelationships. To help database users understand a given schema, a rule-based approach for analyzing these dependencies is proposed using Datalog expressions. We show that many interesting properties of schema elements can be systematically determined this way. The expressiveness of the proposed analysis is illustrated with the problem of computing induced functional dependencies for derived relations. The propagation of functional dependencies plays an important role in data integration and query optimization but is undecidable in general. Nevertheless, our rule-based analysis covers all relational operators as well as linear recursive expressions in a systematic way, showing the depth of analysis our proposal makes possible. The analysis of functional dependencies is well integrated into a uniform approach to analyzing dependencies between schema elements in general. Comment: Pre-proceedings paper presented at the 27th International Symposium on Logic-Based Program Synthesis and Transformation (LOPSTR 2017), Namur, Belgium, 10-12 October 2017 (arXiv:1708.07854)
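As a rough illustration of the kind of rule-based dependency analysis described in this abstract, the following Python sketch evaluates two Datalog-style rules bottom-up until a fixpoint is reached. The rule names (`direct`, `depends`) and the schema object names are hypothetical and are not taken from the paper; the paper's actual rule set is considerably richer.

```python
def transitive_dependencies(direct):
    """Naive bottom-up evaluation of the two rules

        depends(X, Y) :- direct(X, Y).
        depends(X, Z) :- direct(X, Y), depends(Y, Z).

    `direct` is a set of (object, object_it_references) pairs.
    """
    depends = set(direct)
    while True:
        # Apply the recursive rule once to everything derived so far.
        derived = {(x, z)
                   for (x, y) in direct
                   for (y2, z) in depends
                   if y == y2}
        new_facts = derived - depends
        if not new_facts:          # fixpoint: nothing new can be derived
            return depends
        depends |= new_facts


# Hypothetical schema: a view over two tables, a view over that view,
# and a trigger that references the second view.
direct = {
    ("v_sales", "orders"),
    ("v_sales", "customers"),
    ("v_top", "v_sales"),
    ("trg_audit", "v_top"),
}
closure = transitive_dependencies(direct)
```

With these facts, `closure` contains pairs such as `("trg_audit", "orders")`, which answers the question of which objects are affected when a base table changes.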

    High Level Efficiency in Database Languages

    The subject of this Ph.D. thesis is the design and implementation of database languages. The thesis consists of five articles: [1] Joan F. Boyar and Kim S. Larsen. Efficient Rebalancing of Chromatic Search Trees. In O. Nurmi and E. Ukkonen, eds., LNCS 621: Algorithm Theory -- SWAT'92, pp. 151-164. Springer-Verlag, 1992. [2] Kim S. Larsen. On Aggregation and Computation on Domain Values. PB-414, Computer Science Department, Aarhus University, 1992. [3] Kim S. Larsen. Strategies for Expression Evaluation Using Sort-Merge Algorithms. PB-415, Computer Science Department, Aarhus University, 1992. [4] Kim S. Larsen and Michael I. Schwartzbach. Injectivity of Unary Queries With Computation on Domain Values. Computer Science Department, Aarhus University, 1992. Revised version of PB-311. [5] Kim S. Larsen, Michael I. Schwartzbach and Erik M. Schmidt. A New Formalism for Relational Algebra. IPL, 41(3):163-168, 1992. and this survey paper. In [5], a new query language design is proposed. The expressive power of the language is determined in [2] and all reasonable extensions are considered. In [3, 4], we focus on the optimization issue of avoiding unnecessary sorting of relations. The results in these papers are directly applicable to any algebra-based query language. In addition to the query language part, a database system also has to offer update facilities. The theory of standard tuple based updates is quite well developed in the sequential case. In [1], we discuss a new concurrent implementation of balanced search trees for that purpose. This survey paper describes the results of the papers which form the thesis, and relates these results to each other and to the area in a broader sense than is customary in the introductions of individual papers. The paper is intended to be read in combination with the papers on which it is based

    Computational Complexity And Algorithms For Dirty Data Evaluation And Repairing

    In this dissertation, we study the problem of evaluating and repairing dirty data in relational databases. Dirty data is usually inconsistent, inaccurate, incomplete and stale. Existing methods and theories describe consistency using integrity constraints, such as data dependencies. However, integrity constraints are good at detecting inconsistency but not at evaluating its degree, and they cannot guide data repairing. This dissertation first studies the computational complexity of, and algorithms for, database inconsistency evaluation. We define the minimum tuple deletion and use it to evaluate database inconsistency. For this minimum tuple deletion problem, we study the relationship between the size of the rule set and the computational complexity. We show that it is NP-hard to approximate the minimum tuple deletion within 17/16 when three functional dependencies involving four attributes are given. A near-optimal approximation algorithm for computing the minimum tuple deletion is proposed with a ratio of 2 − 1/2r, where r is the number of given functional dependencies. To guide data repairing, this dissertation also investigates a repairing method based on query feedback and formally studies two decision problems, the functional dependency restricted deletion and insertion propagation problems, which correspond to deletion and insertion feedback. A comprehensive analysis of both the combined and the data complexity of these cases is provided by considering different relational operators and feedback types. We identify the intractable and tractable cases to picture the complexity hierarchy of these problems, and provide efficient algorithms for the tractable cases. 
Two improvements are proposed: one focuses on finding the minimum vertex cover in the conflict graph to improve the upper bound of the tuple deletion problem, and the other is a better dichotomy for the deletion and insertion propagation problems in the absence of functional dependencies, considering data, combined and parameterized complexity respectively
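
The link between tuple deletion and vertex cover in the conflict graph can be sketched in a few lines of Python. This is not the dissertation's algorithm (whose ratio is stated above); it is the textbook matching-based 2-approximation for vertex cover, shown only to make the reduction concrete. The relation, attribute names and functional dependencies are made-up examples.

```python
from itertools import combinations


def conflict_graph(rows, fds):
    """Edges (i, j) between tuples that jointly violate some FD lhs -> rhs."""
    edges = set()
    for (i, r), (j, s) in combinations(enumerate(rows), 2):
        for lhs, rhs in fds:
            same_lhs = all(r[a] == s[a] for a in lhs)
            diff_rhs = any(r[b] != s[b] for b in rhs)
            if same_lhs and diff_rhs:
                edges.add((i, j))
                break
    return edges


def matching_vertex_cover(edges):
    """Textbook 2-approximation: both endpoints of a greedy maximal matching."""
    cover = set()
    for u, v in sorted(edges):
        if u not in cover and v not in cover:
            cover.update((u, v))
    return cover


# Hypothetical data: zip -> city and city -> state.
rows = [
    {"zip": "10001", "city": "NYC", "state": "NY"},
    {"zip": "10001", "city": "Boston", "state": "MA"},
    {"zip": "10001", "city": "NYC", "state": "NY"},
    {"zip": "02101", "city": "Boston", "state": "MA"},
]
fds = [(("zip",), ("city",)), (("city",), ("state",))]

edges = conflict_graph(rows, fds)
to_delete = matching_vertex_cover(edges)
repaired = [r for i, r in enumerate(rows) if i not in to_delete]
```

Deleting the tuples in `to_delete` removes every conflict, and the number of deleted tuples is at most twice the minimum. In this example the optimum deletes only the second tuple, while the sketch deletes two.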

    Data bases and data base systems related to NASA's aerospace program. A bibliography with indexes

    This bibliography lists 1778 reports, articles, and other documents introduced into the NASA scientific and technical information system, 1975 through 1980

    Management of Data and Collaboration for Business Processes

    A business process (BP) is a collection of activities and services assembled together to accomplish a business goal. Business process management (BPM) refers to the management and support for a collection of inter-related business processes, which has been playing an essential role in all enterprises. Business practitioners today face enormous difficulties in managing data for BPs because the data for BP execution is scattered across enterprise databases, auxiliary data stores managed by the BPM systems, and even file systems (e.g., definitions of BP models). Moreover, current data and business process modeling approaches leave associations between persistent data in databases and data in BPs to the implementation level with little abstraction. Implementing business logic that involves data access from and to databases often demands high development effort. In the current study, we conceptualize the data used in BPs by capturing all information needed for a BP throughout its execution in a “universal artifact”. The conceptualization provides a foundation for the separation of BP execution and BP data. With the new framework, data analysis can be carried out without knowing the logic of BPs, and modifications of the BP logic can be applied directly without understanding the data structure. Even though universal artifacts provide convenient data access for processes, the data is still stored in the underlying database, and the relationship between data in artifacts and data in the database is still undefined. In general, a way to link the data of these two data sources is needed. We propose a data mapping language aiming to bridge BP data and the enterprise database, so that BP designers only need to focus on business data instead of how to manipulate data by accessing the database. 
We formulate syntactic conditions on the specified mapping so that updates to the database or the BP data can be properly propagated. In the database area, mapping a database to a view has been widely studied. In recent years, data exchange has extended the notion of database views to a target database (i.e., multiple views) by using a set of conjunctive queries called “tuple generating dependencies” (tgds). Tgds form a language that is easy to understand and specify, expressive, and decidable for a wide range of properties, which makes it ideal as a mapping language. Naturally, if both the enterprise database and the artifacts are represented as relational databases, we can take advantage of data exchange technology to bridge them by using tgds as well. Therefore, we revisit the mapping and update propagation problem under the relational setting. In addition to data management for a single BP, it is equally essential to understand how messages and data should be exchanged among multiple collaborative BPs. With the introduction of artifacts, data is explicitly modeled and can be used in a collaborative setting. Unfortunately, today’s BP collaboration languages (either orchestration or choreography) do not emphasize how data evolves during execution. Moreover, the existing languages always assume each participant type has a single participant instance. Therefore, a declarative language is introduced to specify the collaboration among BPs, taking data and multiple instances into account. The language adopts a subset of linear temporal logic (LTL) as constraints to restrict the behavior of the collaborative BPs. As a follow-up study, we focus on the satisfiability problem of the declarative BP collaboration language, i.e., whether a given specification as a set of constraints allows at least one finite execution. Naturally, if a specification excludes every possible execution, it should be considered an undesirable design. 
Therefore, we consider different combinations of the constraint types, and for each combination, syntactic conditions are provided to decide whether the given constraints are satisfiable. The syntactic conditions automatically lead to polynomial testing methods (compared with the PSPACE-complete complexity of general LTL satisfiability testing)

    Extending dependencies for improving data quality

    This doctoral thesis presents the results of my work on extending dependencies for improving data quality, both in a centralized environment with a single database and in a data exchange and integration environment with multiple databases. The first part of the thesis proposes five classes of data dependencies, referred to as CINDs, eCFDs, CFDcs, CFDps and CINDps, to capture data inconsistencies commonly found in practice in a centralized environment. For each class of these dependencies, we investigate two central problems: the satisfiability problem and the implication problem. The satisfiability problem is to determine, given a set Σ of dependencies defined on a database schema R, whether or not there exists a nonempty database D of R that satisfies Σ. The implication problem is to determine whether or not a set Σ of dependencies defined on a database schema R entails another dependency φ on R, that is, whether each database D of R that satisfies Σ must satisfy φ as well. These problems are important for the validation and optimization of data-cleaning processes. We establish complexity results of the satisfiability problem and the implication problem for all these five classes of dependencies, both in the absence of finite-domain attributes and in the general setting with finite-domain attributes. Moreover, SQL-based techniques are developed to detect data inconsistencies for each class of the proposed dependencies, which can be easily implemented on top of current database management systems. The second part of the thesis studies three important topics for data cleaning in a data exchange and integration environment with multiple databases. One is the dependency propagation problem, which is to determine, given a view defined on data sources and a set of dependencies on the sources, whether another dependency is guaranteed to hold on the view. 
We investigate dependency propagation for views defined in various fragments of relational algebra, for conditional functional dependencies (CFDs) [FGJK08] as view dependencies, and for source dependencies given as either CFDs or traditional functional dependencies (FDs). We establish lower and upper bounds, all matching, ranging from PTIME to undecidable. These not only provide the first results for CFD propagation, but also extend the classical work on FD propagation by giving new complexity bounds in the setting with finite domains. We finally provide the first algorithm for computing a minimal cover of all CFDs propagated via SPC views. The algorithm has the same complexity as one of the most efficient algorithms for computing a cover of FDs propagated via a projection view, despite the increased expressive power of CFDs and SPC views. Another one is matching records from unreliable data sources. A class of matching dependencies (MDs) is introduced for specifying the semantics of unreliable data. As opposed to static constraints for schema design such as FDs, MDs are developed for record matching, and are defined in terms of similarity metrics and a dynamic semantics. We identify a special case of MDs, referred to as relative candidate keys (RCKs), to determine what attributes to compare and how to compare them when matching records across possibly different relations. We also propose a mechanism for inferring MDs with a sound and complete system, a departure from traditional implication analysis, such that when we cannot match records by comparing attributes that contain errors, we may still find matches by using other, more reliable attributes. We finally provide a quadratic time algorithm for inferring MDs, and an effective algorithm for deducing quality RCKs from a given set of MDs. 
The last one is finding certain fixes for data monitoring [CGGM03, SMO07], which is to find and correct errors in a tuple when it is created, either entered manually or generated by some process. That is, we want to ensure that a tuple t is clean before it is used, to prevent errors introduced by adding t. As noted by [SMO07], it is far less costly to correct a tuple at the point of entry than fixing it afterward. Data repairing based on integrity constraints may not find certain fixes that are absolutely correct, and worse, may introduce new errors when repairing the data. We propose a method for finding certain fixes, based on master data, a notion of certain regions, and a class of editing rules. A certain region is a set of attributes that are assured correct by the users. Given a certain region and master data, editing rules tell us what attributes to fix and how to update them. We show how the method can be used in data monitoring and enrichment. We develop techniques for reasoning about editing rules, to decide whether they lead to a unique fix and whether they are able to fix all the attributes in a tuple, relative to master data and a certain region. We also provide an algorithm to identify minimal certain regions, such that a certain fix is warranted by editing rules and master data as long as one of the regions is correct
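
To make the notion of a conditional functional dependency (CFD) used throughout this abstract more concrete, the sketch below checks a relation against a single CFD written as an embedded FD plus one pattern tuple, following the commonly used definition in which `_` is a wildcard. It is an in-memory illustration only, not the SQL-based detection technique developed in the thesis, and the relation and attribute names are invented.

```python
from itertools import combinations_with_replacement

WILDCARD = "_"


def matches(value, pattern):
    """A value matches a pattern entry if the entry is a wildcard or equal."""
    return pattern == WILDCARD or value == pattern


def cfd_violations(rows, lhs, rhs, pattern):
    """Index pairs (i, j), i <= j, violating the CFD (lhs -> rhs, pattern).

    A pair violates the CFD if both tuples agree on lhs and match the
    pattern there, but do not agree on rhs or do not match the pattern
    on rhs. Pairs with i == j capture single-tuple violations, which can
    only arise from constants on the right-hand side.
    """
    violations = []
    indexed = list(enumerate(rows))
    for (i, r), (j, s) in combinations_with_replacement(indexed, 2):
        if not all(r[a] == s[a] and matches(r[a], pattern[a]) for a in lhs):
            continue
        if not all(r[b] == s[b] and matches(r[b], pattern[b]) for b in rhs):
            violations.append((i, j))
    return violations


rows = [
    {"country": "UK", "zip": "EH4", "city": "Edinburgh"},
    {"country": "UK", "zip": "EH4", "city": "London"},
    {"country": "US", "zip": "EH4", "city": "Springfield"},
    {"country": "US", "zip": "EH4", "city": "Shelbyville"},
]

# "Within the UK, zip determines city" (variable pattern on the right).
uk_zip = {"country": "UK", "zip": WILDCARD, "city": WILDCARD}
pair_violations = cfd_violations(rows, ("country", "zip"), ("city",), uk_zip)

# "In the UK, zip EH4 means Edinburgh" (constant pattern on the right).
eh4 = {"country": "UK", "zip": "EH4", "city": "Edinburgh"}
const_violations = cfd_violations(rows, ("country", "zip"), ("city",), eh4)
```

The two US tuples disagree on city for the same zip, but they are not reported because the pattern restricts the dependency to UK tuples. That conditioning is what distinguishes a CFD from a traditional FD.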

    Calculating Constraints on Relational Expressions

    A desirable feature of a database management system is the ability to support many views of the database via several user models. In order to provide this support while allowing the user to believe that his/her view and data model are the only ones, the database system must have a number of facilities. One of the most important of these is a mechanism to tell when view constraints will be satisfied given that the underlying database constraints are satisfied so that the user always sees what is expected. This paper deals with a particular instance of this problem where the constraints are functional dependencies and the views are created through relational algebra expressions. The problem immediately reduces to the problem of calculating all valid functional dependencies (and other constraints) on a relational algebra expression over relations in the base schema. The problem is undecidable in general but we give a sound and complete algorithm when set difference is omitted from relational algebra
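
For the simplest relational algebra operator, projection, the functional dependencies that hold on the view can be computed with the standard attribute-closure construction. The Python sketch below illustrates only that special case; it is not the paper's algorithm, which handles general expressions without set difference. It enumerates all subsets of the view attributes and is therefore exponential in their number. The attribute names are placeholders.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure of `attrs` under FDs given as (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def project_fds(fds, view_attrs):
    """Non-trivial FDs X -> Y that hold on the projection onto view_attrs.

    Returns a dict mapping each left-hand side X (a sorted tuple) to the
    set of view attributes outside X that X determines.
    """
    view = sorted(view_attrs)
    projected = {}
    for size in range(1, len(view) + 1):
        for lhs in combinations(view, size):
            rhs = (closure(lhs, fds) & set(view)) - set(lhs)
            if rhs:
                projected[lhs] = rhs
    return projected


# Base relation R(A, B, C) with A -> B and B -> C; the view projects onto {A, C}.
base_fds = [(("A",), ("B",)), (("B",), ("C",))]
view_fds = project_fds(base_fds, {"A", "C"})
```

Here the view inherits A -> C even though neither base dependency mentions only view attributes, which is the kind of induced constraint the abstract refers to.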