
    Strongly possible functional dependencies for SQL

    Missing data is a large-scale challenge for research: it reduces statistical power and may introduce selection bias. The two main approaches to handling missing values are to ignore (remove) them or to impute (fill in) new values; this paper takes the second approach. Possible worlds and possible and certain keys were introduced by Köhler et al. and by Levene et al. Köhler and Link introduced certain functional dependencies (c-FDs) as a natural complement to Lien's class of possible functional dependencies (p-FDs). Weak and strong functional dependencies were studied by Levene and Loizou. In a preceding paper we introduced the intermediate concept of strongly possible worlds, obtained by imputing only values that already occur in the table. This gives rise to strongly possible keys (spKeys) and strongly possible functional dependencies (spFDs). We give a polynomial-time algorithm to verify a single spKey and show that verifying an arbitrary collection of spKeys is NP-complete in general. We give a graph-theoretical characterization of the validity of a given spFD X →sp Y and show that verifying a single spFD is NP-complete in general; we then identify cases in which a single spFD can be verified in polynomial time. As a step toward an axiomatization of spFDs, the inference rules known for weak and strong functional dependencies are checked, and appropriate weakenings of those that are not sound for spFDs are listed. The interaction of spFDs with spKeys and certain keys is studied. Furthermore, a graph-theoretical characterization of implication between singleton-attribute spFDs is given.
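    To make the strongly possible key semantics concrete, here is a minimal brute-force Python sketch (all names are illustrative; this is not the paper's polynomial algorithm, which avoids enumerating projections): a set of attributes is an spKey if the nulls can be imputed with values already occurring in the respective columns so that all rows obtain pairwise distinct key projections, which amounts to finding a matching that covers every row in a bipartite graph between rows and candidate projections.

```python
from itertools import product

NULL = None  # stand-in for a missing value

def visible_domain(table, attr):
    """Values actually occurring in column attr (its visible domain)."""
    return {row[attr] for row in table if row[attr] is not NULL}

def is_sp_key(table, key_attrs):
    """Brute-force test of the spKey semantics: can the NULLs be imputed
    with visible-domain values so that all rows get pairwise distinct
    projections on key_attrs?"""
    domains = [sorted(visible_domain(table, a)) for a in key_attrs]
    candidates = list(product(*domains))  # all imputable key projections

    def compatible(row, cand):
        # a candidate projection fits a row if it agrees on the non-NULLs
        return all(row[a] is NULL or row[a] == v
                   for a, v in zip(key_attrs, cand))

    edges = [[j for j, c in enumerate(candidates) if compatible(row, c)]
             for row in table]

    match = {}  # candidate index -> row index it is assigned to

    def augment(i, seen):
        # Kuhn's augmenting-path step for bipartite matching
        for j in edges[i]:
            if j not in seen:
                seen.add(j)
                if j not in match or augment(match[j], seen):
                    match[j] = i
                    return True
        return False

    # spKey holds iff every row can be assigned a distinct projection
    return all(augment(i, set()) for i in range(len(table)))

t = [{"A": 1, "B": NULL}, {"A": 2, "B": 2}]
print(is_sp_key(t, ["A", "B"]))  # True: imputing the NULL as 2 works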

    Dmodel and Dalgebra : a data model and algebra for office documents

    This dissertation presents a data model (called D_model) and an algebra (called D_algebra) for office documents. The data model adopts a very natural view of modeling office documents. Documents are grouped into classes; each class is characterized by a frame template, which describes the properties (or attributes) for the class of documents. A frame template is instantiated by providing it with values to form a frame instance, which becomes the synopsis of the document of the class associated with the frame template. Different frame instances can be grouped into a folder. Therefore, a folder is a set of frame instances which need not be over the same frame template. The D_model is a dual model which describes documents using two hierarchies: a document type hierarchy, which depicts the structural organization of the documents, and a folder organization, which represents the user's real-world document filing system. The document type hierarchy exploits structural commonalities between frame templates. Such a hierarchy helps classify various documents. The folder organization mimics the user's real-world document filing system and provides the user with an intuitively clear view of the filing system. This facilitates document retrieval activities. The D_algebra includes a family of operators which together comprise the fundamental query language for the D_model. The algebra provides operators that can be applied to folders which contain frame instances of different types. It has more expressive power than the relational algebra. It extends the classical relational algebra by associating attributes with types, and supporting attribute inheritance. Aggregate operators which can be applied to different frame instances in a folder are also provided. The proposed algebra is used as a sound basis to express the semantics of a high-level query language for a document processing system, called TEXPROS. In the model, frame instances can represent incomplete information. Null values of the form "value at present unknown" are used to denote missing information in some fields of the incomplete frame instances. This dissertation provides a proof-theoretic characterization of the data model and defines the semantics of the null values within the proof-theoretic paradigm.
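    As a rough illustration of the model (class and function names here are invented for exposition and are not TEXPROS's actual API), the following Python sketch shows frame templates, frame instances with a "value at present unknown" null, and a folder holding instances over different templates:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FrameTemplate:
    # Describes the attributes shared by one class of documents.
    name: str
    attributes: frozenset

@dataclass
class FrameInstance:
    # A filled-in template; None plays "value at present unknown".
    template: FrameTemplate
    values: dict

def select(folder, predicate):
    """Selection over a heterogeneous folder: the frame instances may
    be over different templates, unlike tuples of a single relation."""
    return [fi for fi in folder if predicate(fi)]

letter = FrameTemplate("Letter", frozenset({"sender", "date"}))
memo = FrameTemplate("Memo", frozenset({"author", "date"}))
folder = [
    FrameInstance(letter, {"sender": "Ann", "date": "1999-05-01"}),
    FrameInstance(memo, {"author": "Bob", "date": None}),  # unknown date
]
# Bob's memo is filtered out: its date is "at present unknown".
dated = select(folder, lambda fi: fi.values.get("date") is not None)
print([fi.template.name for fi in dated])  # ['Letter']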

    Disjunctively incomplete information in relational databases: modeling and related issues

    In this dissertation, issues related to information incompleteness in relational databases are explored. The dissertation can be divided into two parts. The first part extends the relational natural join operator and the update operations of insertion and deletion to I-tables, an extended relational model representing inclusively indefinite and maybe information, in a semantically correct manner. Naive algorithms for computing natural joins on I-tables require an exponential number of pair-up operations and a number of block accesses proportional to the size of the I-tables, due to the combinatorial nature of natural joins on I-tables; the problem thus becomes intractable for large I-tables. This dissertation proposes an algorithm for computing natural joins under the extended model which reduces the number of pair-up operations to linear complexity in general, and to polynomial complexity in the worst case, with respect to the size of the I-tables. In addition, this algorithm reduces the number of block accesses to linear complexity with respect to the size of the I-tables.

    The second part concerns the modeling aspect of incomplete databases. An extended relational model, called E-table, is proposed. E-tables are capable of representing exclusively disjunctive information, that is, disjunctions of the form P₁ ‖ P₂ ‖ … ‖ Pₙ, where ‖ denotes a generalized logical exclusive-or indicating that exactly one of the Pᵢ can be true. The information content of an E-table is precisely defined, and the relational operators of selection, projection, difference, union, intersection, and Cartesian product are extended to E-tables in a semantically correct manner. Conditions under which redundancies can arise due to the presence of exclusively disjunctive information are characterized, and a procedure for resolving such redundancies is presented.

    Finally, the dissertation concludes with a discussion of directions for further research in the area of incomplete information modeling. In particular, a sketch of a relational model, the IE-table (Inclusive and Exclusive table), for representing both inclusively and exclusively disjunctive information is provided.
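    The following Python sketch illustrates, under an assumed representation (a cell as a frozenset of mutually exclusive alternatives), how a selection operator can be extended to exclusively disjunctive values, separating rows that surely qualify from those that only maybe qualify. It illustrates the idea only and is not the dissertation's formal definition:

```python
# Each cell of a toy E-table is a frozenset of mutually exclusive
# alternatives: exactly one of them is the true value.
def e_select(table, attr, value):
    """Extended selection sigma_{attr = value}: a row surely qualifies
    when its disjunction is exactly {value}; it only maybe-qualifies
    when value is one of several exclusive alternatives."""
    sure, maybe = [], []
    for row in table:
        if row[attr] == frozenset({value}):
            sure.append(row)
        elif value in row[attr]:
            maybe.append(row)
    return sure, maybe

emp = [
    {"name": frozenset({"Ann"}), "dept": frozenset({"R&D", "Sales"})},
    {"name": frozenset({"Bob"}), "dept": frozenset({"R&D"})},
]
sure, maybe = e_select(emp, "dept", "R&D")
print(len(sure), len(maybe))  # 1 1 -- Bob surely, Ann only maybe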

    Acta Cybernetica: Volume 25, Number 3.


    On the Discovery of Semantically Meaningful SQL Constraints from Armstrong Samples: Foundations, Implementation, and Evaluation

    A database is said to be C-Armstrong for a finite set Σ of data dependencies in a class C if the database satisfies all data dependencies in Σ and violates all data dependencies in C that are not implied by Σ. Therefore, Armstrong databases are concise, user-friendly representations of abstract data dependencies that can be used to judge, justify, convey, and test the understanding of database design choices. Indeed, an Armstrong database satisfies exactly those data dependencies that are considered meaningful by the current design choice Σ. Structural and computational properties of Armstrong databases have been deeply investigated in Codd’s Turing Award winning relational model of data. Armstrong databases have been incorporated in approaches towards relational database design. They have also been found useful for the elicitation of requirements, the semantic sampling of existing databases, and the specification of schema mappings. This research establishes a toolbox of Armstrong databases for SQL data. This is challenging, as SQL data can contain null marker occurrences in columns declared NULL and may contain duplicate rows. Thus, the existing theory of Armstrong databases applies only to idealized instances of SQL data, that is, instances without null marker occurrences and without duplicate rows. For the thesis, two popular interpretations of null markers are considered: the "no information" interpretation used in SQL, and the "exists but unknown" interpretation by Codd. Furthermore, the study is limited to the popular class C of functional dependencies. However, the presence of duplicate rows means that the class of uniqueness constraints is no longer subsumed by the class of functional dependencies, in contrast to the relational model of data. As a first contribution, a provably correct algorithm is developed that computes Armstrong databases for an arbitrarily given finite set of uniqueness constraints and functional dependencies. This contribution is based on axiomatic, algorithmic, and logical characterizations of the associated implication problem that are also established in this thesis. While the problem of deciding whether a given database is Armstrong for a given set of such constraints is precisely exponential, our algorithm computes an Armstrong database with a number of rows that is at most quadratic in the number of rows of a minimum-sized Armstrong database. As a second contribution, the algorithms are implemented in the form of a design tool. Users of the tool can therefore inspect Armstrong databases to analyze their current design choice Σ. Intuitively, Armstrong databases are useful for the acquisition of semantically meaningful constraints if users can recognize the actual meaningfulness of constraints that they incorrectly perceived as meaningless before inspecting an Armstrong database. As a final contribution, measures are introduced that formalize the term "useful", and it is shown by detailed experiments that Armstrong tables, as computed by the tool, are indeed useful. In summary, this research establishes a toolbox of Armstrong databases that can be applied by database designers to concisely visualize constraints on SQL data. Such support can lead to database designs that guarantee efficient data management in practice.
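    The following Python sketch illustrates the underlying Armstrong property for the functional-dependency fragment in the idealized relational case (no nulls, no duplicates): a table is Armstrong for Σ exactly when it satisfies an FD X → A if and only if Σ implies it. The brute-force check below is exponential in the number of attributes, in line with the decision problem's complexity quoted above; the thesis's computation algorithm and its SQL-specific handling of null markers and duplicate rows are not modeled here.

```python
from itertools import chain, combinations

def closure(X, fds):
    """Attribute-set closure of X under FDs (lhs, rhs): the standard
    fixpoint algorithm used to decide implication."""
    X = set(X)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= X and not rhs <= X:
                X |= rhs
                changed = True
    return X

def satisfies(rows, lhs, rhs):
    """rows satisfy lhs -> rhs if agreeing on lhs forces agreeing on rhs."""
    for r in rows:
        for s in rows:
            if all(r[a] == s[a] for a in lhs) and any(r[a] != s[a] for a in rhs):
                return False
    return True

def is_armstrong(rows, attrs, sigma):
    """Brute-force Armstrong test: the table must satisfy exactly the
    FDs implied by sigma, checked for every X -> A over attrs."""
    subsets = chain.from_iterable(combinations(attrs, k)
                                  for k in range(len(attrs) + 1))
    for X in subsets:
        cx = closure(X, sigma)
        for a in attrs:
            if satisfies(rows, set(X), {a}) != (a in cx):
                return False
    return True

# This table satisfies A -> B, violates B -> A and the trivial-free
# FDs with empty left-hand side, so it is Armstrong for {A -> B}.
rows = [{"A": 0, "B": 0}, {"A": 1, "B": 0}, {"A": 2, "B": 1}]
print(is_armstrong(rows, ["A", "B"], [({"A"}, {"B"})]))  # True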

    Comparaison et évolution de schémas XML

    XML has become the de facto format for data exchange. We aim at establishing a multi-system environment in which local original systems work in harmony with a global integrated system that is a conservative evolution of the local ones. Data exchange is possible in both directions, allowing activities on both levels. For this purpose, we need a schema mapping whose role is to ensure schema evolution and to guide the construction of a document translator, allowing automatic adaptation of data with respect to type evolution. We propose a set of tools to help deal with XML database evolution. These tools are used: (i) to compute a mapping capable of obtaining a global schema which is a conservative extension of the original local schemas, and to adapt XML documents; (ii) to compute the set of integrity constraints for the global system on the basis of the local ones; (iii) to compare the XML types of two systems in order to replace one system by another; (iv) to correct a new document that is invalid with respect to a system's XML schema, so that it can be added to the system. Experimental results on synthetic and real data are discussed, showing the efficiency of our methods in many situations.
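    As a small illustration of the conservative-evolution idea (file names are placeholders, and the thesis's own mapping and correction algorithms are not reproduced here), one can check with lxml that a document valid under a local schema remains valid under the evolved global schema:

```python
from lxml import etree

# Parse the local schema and the evolved global schema (placeholders).
local_schema = etree.XMLSchema(etree.parse("local.xsd"))
global_schema = etree.XMLSchema(etree.parse("global.xsd"))

doc = etree.parse("document.xml")
if local_schema.validate(doc) and not global_schema.validate(doc):
    # If this ever triggers, the global schema is not a conservative
    # extension of the local one for this document.
    print(global_schema.error_log)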