60 research outputs found

    Representation Independent Analytics Over Structured Data

    Database analytics algorithms leverage quantifiable structural properties of the data to predict interesting concepts and relationships. The same information, however, can be represented using many different structures, and the structural properties observed over particular representations do not necessarily hold for alternative structures. Thus, there is no guarantee that current database analytics algorithms will still provide the correct insights, no matter what structures are chosen to organize the database. Because these algorithms tend to be highly effective over some choices of structure, such as that of the databases used to validate them, but not so effective with others, database analytics has largely remained the province of experts who can find the desired forms for these algorithms. We argue that in order to make database analytics usable, we should use or develop algorithms that are effective over a wide range of choices of structural organizations. We introduce the notion of representation independence, study its fundamental properties for a wide range of data analytics algorithms, and empirically analyze the amount of representation independence of some popular database analytics algorithms. Our results indicate that most algorithms are not generally representation independent, and they identify the characteristics of more representation-independent heuristics under certain representational shifts.

    Schema Independent Relational Learning

    Learning novel concepts and relations from relational databases is an important problem with many applications in database systems and machine learning. Relational learning algorithms learn the definition of a new relation in terms of existing relations in the database. Nevertheless, the same data set may be represented under different schemas for various reasons, such as efficiency, data quality, and usability. Unfortunately, the output of current relational learning algorithms tends to vary quite substantially over the choice of schema, both in terms of learning accuracy and efficiency. This variation complicates their off-the-shelf application. In this paper, we introduce and formalize the property of schema independence of relational learning algorithms, and study both the theoretical and empirical dependence of existing algorithms on the common class of (de)composition schema transformations. We study both sample-based learning algorithms, which learn from sets of labeled examples, and query-based algorithms, which learn by asking queries to an oracle. We prove that current relational learning algorithms are generally not schema independent. For query-based learning algorithms, we show that the (de)composition transformations influence their query complexity. We propose Castor, a sample-based relational learning algorithm that achieves schema independence by leveraging data dependencies. We support the theoretical results with an empirical study that demonstrates the schema dependence/independence of several algorithms on existing benchmark and real-world datasets under (de)compositions.
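To make the idea of a (de)composition concrete, here is a minimal sketch that is not taken from the paper. It assumes a hypothetical `student(id, name, advisor)` relation in which `id` is a key, splits it into two relations by projection, and recomposes it with a natural join. The relation, attribute names, and helper functions are all illustrative assumptions.

```python
def project(rows, attrs, keep):
    """Project rows onto the attributes in `keep`, removing duplicates."""
    idx = [attrs.index(a) for a in keep]
    return {tuple(row[i] for i in idx) for row in rows}


def natural_join(left, left_attrs, right, right_attrs):
    """Natural join of two relations on the attributes they share."""
    shared = [a for a in left_attrs if a in right_attrs]
    extra = [a for a in right_attrs if a not in shared]
    joined = set()
    for l in left:
        for r in right:
            if all(l[left_attrs.index(a)] == r[right_attrs.index(a)] for a in shared):
                joined.add(l + tuple(r[right_attrs.index(a)] for a in extra))
    return joined, list(left_attrs) + extra


# Hypothetical single-relation schema; `id` is assumed to be a key.
student_attrs = ["id", "name", "advisor"]
student = {("s1", "Ann", "Prof. Lee"), ("s2", "Bob", "Prof. Kim")}

# Decomposition: two relations that both keep the key `id`.
name_attrs = ["id", "name"]
advisor_attrs = ["id", "advisor"]
student_name = project(student, student_attrs, name_attrs)
student_advisor = project(student, student_attrs, advisor_attrs)

# Composition: joining on `id` recovers the original relation.
recomposed, recomposed_attrs = natural_join(
    student_name, name_attrs, student_advisor, advisor_attrs
)
```

Both schemas hold the same information here because the join on the key is lossless. A schema-independent learner, in the sense the abstract describes, would be expected to learn equivalent definitions over either form.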

    Master of Science

    Thesis. Data quality has become a significant issue in healthcare as large preexisting databases are integrated to provide greater depth for research and process improvement. Large-scale data integration exposes and compounds data quality issues latent in source systems. Although the problems related to data quality in transactional databases have been identified and well addressed, the application of data quality constraints to large-scale data repositories has not, and it requires novel applications of traditional concepts and methodologies. Despite an abundance of data quality theory, tools, and software, there is no consensus technique available to guide developers in the identification of data integrity issues and the application of data quality rules in warehouse-type applications. Data quality measures are frequently developed on an ad hoc basis, or methods designed to assure data quality in transactional systems are loosely applied to analytic data stores. These measures are inadequate to address the complex data quality issues in large, integrated data repositories, particularly in the healthcare domain with its heterogeneous source systems. This study derives a taxonomy of data quality rules from relational database theory. It describes the development and implementation of data quality rules in the Analytic Health Repository at Intermountain Healthcare and situates the data quality rules in the taxonomy. Further, it identifies areas in which more rigorous data quality should be explored. This comparison demonstrates the superiority of a structured approach to data quality rule identification.

    The normalization of frames as a superclass of relations

    M.Sc. (Computer science). Knowledge representation suffers from certain problems, which are not a result of the inadequacies of knowledge representation schemes, but of the way in which they are used and implemented. In the first part of this dissertation we examine the relational model (as used in relational database management systems) and we examine frames (a knowledge representation scheme used in expert systems), as proposed by M. Minsky [MIN75]. We then provide our own definition of frames. In the second part, we examine similarities between the two models (the relational model and our frame model), establishing frames as a superclass of relations. We then define normalization for frames and examine how normalization might solve some of the problems we have identified. We then examine the integration of knowledge-based systems and database management systems and classify our normalization of frames as such an attempt. We conclude by examining the place of normalization within the expert system development life cycle.

    Database design: A practical methodology.


    A SQL front-end semantic data model

    SQLSDM is a front-end semantic data model to a SQL relational database management system (RDBMS). SQLSDM provides a more semantically complete RDBMS through the implementation of a Domain and Relational Integrity scheme. SQLSDM provides integrity definition functions and a sub-system to interpret SQL commands. Integrity system tables are created through the use of SQLSDM's domain definition command and SQL's CREATE TABLE command. As SQL database update commands are interpreted, SQLSDM uses these integrity tables to enforce domain and referential integrity. SQLSDM operates virtually transparently to the user and provides for greater database consistency and semantic control. Furthermore, SQLSDM is designed and engineered to be a portable front-end that may be implemented on any SQL relational database management system.
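The enforcement step described above, in which integrity tables are consulted before an update is applied, can be sketched roughly as follows. This is an illustration only and not SQLSDM's actual design: the table layouts, the names `DOMAINS`, `COLUMN_DOMAINS`, and `FOREIGN_KEYS`, and the `check_insert` and `insert` functions are all hypothetical, and an in-memory dictionary stands in for the underlying RDBMS.

```python
# Hypothetical integrity tables, standing in for the system tables that
# a front end would build from domain definitions and CREATE TABLE.
DOMAINS = {"grade": {"A", "B", "C", "D", "F"}}
COLUMN_DOMAINS = {("enrollment", "grade"): "grade"}
FOREIGN_KEYS = {("enrollment", "student_id"): ("student", "id")}

# In-memory stand-in for the underlying database.
database = {
    "student": [{"id": 1, "name": "Ann"}],
    "enrollment": [],
}


def check_insert(table, row):
    """Return a list of integrity violations for inserting `row` into `table`."""
    errors = []
    for column, value in row.items():
        # Domain integrity: the value must belong to the column's domain.
        domain = COLUMN_DOMAINS.get((table, column))
        if domain is not None and value not in DOMAINS[domain]:
            errors.append(f"{column}={value!r} is outside domain {domain!r}")
        # Referential integrity: the value must match a referenced row.
        target = FOREIGN_KEYS.get((table, column))
        if target is not None:
            ref_table, ref_column = target
            if not any(r[ref_column] == value for r in database[ref_table]):
                errors.append(
                    f"{column}={value!r} has no match in {ref_table}.{ref_column}"
                )
    return errors


def insert(table, row):
    """Apply the insert only if every integrity check passes."""
    errors = check_insert(table, row)
    if errors:
        raise ValueError("; ".join(errors))
    database[table].append(row)
```

Calling `insert("enrollment", {"student_id": 1, "grade": "A"})` succeeds, while a row that references a missing student or uses a grade outside the domain raises `ValueError` and leaves the data unchanged.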