60 research outputs found
Representation Independent Analytics Over Structured Data
Database analytics algorithms leverage quantifiable structural properties of
the data to predict interesting concepts and relationships. The same
information, however, can be represented using many different structures and
the structural properties observed over particular representations do not
necessarily hold for alternative structures. Thus, there is no guarantee that
current database analytics algorithms will still provide the correct insights,
no matter what structures are chosen to organize the database. Because these
algorithms tend to be highly effective over some choices of structure, such as
that of the databases used to validate them, but not so effective with others,
database analytics has largely remained the province of experts who can find
the desired forms for these algorithms. We argue that in order to make database
analytics usable, we should use or develop algorithms that are effective over a
wide range of choices of structural organizations. We introduce the notion of
representation independence, study its fundamental properties for a wide range
of data analytics algorithms, and empirically analyze the amount of
representation independence of some popular database analytics algorithms. Our
results indicate that most algorithms are not generally representation
independent, and we identify the characteristics of heuristics that remain
more representation independent under certain representational shifts.
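The core observation above can be sketched concretely. The following Python fragment is a hypothetical illustration (the relation and attribute names are ours, not the paper's): the same information is stored under two structural organizations, rejoins losslessly, and yet simple structural statistics differ between the two, which is why a heuristic keyed to such statistics scores the two databases differently.

```python
# Representation A: one wide relation person(name, dept, title)
wide = [
    ("ann", "cs", "prof"),
    ("bob", "cs", "student"),
]

# Representation B: a vertical decomposition into two relations sharing the key
dept = {"ann": "cs", "bob": "cs"}
title = {"ann": "prof", "bob": "student"}

# The information content is identical: B rejoins losslessly into A ...
rejoined = sorted((n, dept[n], title[n]) for n in dept)
assert rejoined == sorted(wide)

# ... yet structural statistics -- (tuples, attributes) per relation --
# differ, so a representation-dependent heuristic keyed to them would
# score the two databases differently despite identical facts.
stats_a = [(len(wide), 3)]
stats_b = [(len(dept), 2), (len(title), 2)]
print(stats_a != stats_b)   # True
```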
Schema Independent Relational Learning
Learning novel concepts and relations from relational databases is an
important problem with many applications in database systems and machine
learning. Relational learning algorithms learn the definition of a new relation
in terms of existing relations in the database. Nevertheless, the same data set
may be represented under different schemas for various reasons, such as
efficiency, data quality, and usability. Unfortunately, the output of current
relational learning algorithms tends to vary quite substantially over the
choice of schema, both in terms of learning accuracy and efficiency. This
variation complicates their off-the-shelf application. In this paper, we
introduce and formalize the property of schema independence of relational
learning algorithms, and study both the theoretical and empirical dependence of
existing algorithms on the common class of (de)composition schema
transformations. We study both sample-based learning algorithms, which learn
from sets of labeled examples, and query-based algorithms, which learn by
asking queries to an oracle. We prove that current relational learning
algorithms are generally not schema independent. For query-based learning
algorithms, we show that the (de)composition transformations influence their
query complexity. We propose Castor, a sample-based relational learning
algorithm that achieves schema independence by leveraging data dependencies. We
support the theoretical results with an empirical study that demonstrates the
schema dependence/independence of several algorithms on existing benchmark and
real-world datasets under (de)compositions.
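A toy Python sketch of the schema-independence property described above (predicate names in the spirit of relational-learning benchmarks, data invented by us): one target concept is expressed over two schemas related by a composition transformation, and a schema-independent learner should produce equivalent definitions over both.

```python
# Schema 1: advisedby(student, prof) and taughtby(course, prof)
advisedby = {("s1", "p1"), ("s2", "p2"), ("s3", "p1")}
taughtby = {("c1", "p1"), ("c2", "p2")}

# Schema 2 composes the two into supervises(student, prof, course)
supervises = {(s, p, c) for (s, p) in advisedby
              for (c, q) in taughtby if p == q}

# Target concept: pairs of distinct students sharing an advisor.
def sameadvisor_schema1():
    return {(a, b) for (a, p) in advisedby
            for (b, q) in advisedby if p == q and a != b}

def sameadvisor_schema2():
    return {(a, b) for (a, p, _) in supervises
            for (b, q, _) in supervises if p == q and a != b}

# The two rewritings of the learned definition agree on the instance, as a
# schema-independent learner would require.
assert sameadvisor_schema1() == sameadvisor_schema2()
print(sorted(sameadvisor_schema1()))   # [('s1', 's3'), ('s3', 's1')]
```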
Master of Science thesis
Data quality has become a significant issue in healthcare as large preexisting databases are integrated to provide greater depth for research and process improvement. Large-scale data integration exposes and compounds data quality issues latent in source systems. Although the problems related to data quality in transactional databases have been identified and well addressed, the application of data quality constraints to large-scale data repositories has not, and requires novel applications of traditional concepts and methodologies. Despite an abundance of data quality theory, tools, and software, there is no consensual technique available to guide developers in the identification of data integrity issues and the application of data quality rules in warehouse-type applications. Data quality measures are frequently developed on an ad hoc basis, or methods designed to assure data quality in transactional systems are loosely applied to analytic data stores. These measures are inadequate to address the complex data quality issues in large, integrated data repositories, particularly in the healthcare domain with its heterogeneous source systems. This study derives a taxonomy of data quality rules from relational database theory. It describes the development and implementation of data quality rules in the Analytic Health Repository at Intermountain Healthcare and situates those rules in the taxonomy. Further, it identifies areas in which more rigorous data quality should be explored. This comparison demonstrates the superiority of a structured approach to data quality rule identification.
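A minimal Python sketch of three rule classes that a taxonomy derived from relational database theory would contain, namely domain, key (entity), and referential integrity. The table and column names are illustrative assumptions of ours, not Intermountain Healthcare's.

```python
patients = [
    {"id": 1, "sex": "F"},
    {"id": 2, "sex": "M"},
    {"id": 2, "sex": "X"},   # duplicate key and out-of-domain value
]
labs = [
    {"patient_id": 1, "loinc": "718-7"},
    {"patient_id": 9, "loinc": "718-7"},   # dangling reference
]

def domain_violations(rows, col, allowed):
    """Rows whose value for col falls outside the declared domain."""
    return [r for r in rows if r[col] not in allowed]

def key_violations(rows, key):
    """Rows that repeat a previously seen key value."""
    seen, dups = set(), []
    for r in rows:
        if r[key] in seen:
            dups.append(r)
        seen.add(r[key])
    return dups

def ref_violations(child, fk, parent, pk):
    """Child rows whose foreign key matches no parent key."""
    keys = {r[pk] for r in parent}
    return [r for r in child if r[fk] not in keys]

print(len(domain_violations(patients, "sex", {"F", "M"})))      # 1
print(len(key_violations(patients, "id")))                      # 1
print(len(ref_violations(labs, "patient_id", patients, "id")))  # 1
```

Running such checks against an integrated repository, rather than relying on the source systems' transactional constraints, is the kind of structured approach the thesis argues for.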
The normalization of frames as a superclass of relations
M.Sc. (Computer Science). Knowledge representation suffers from certain problems that result not from inadequacies in knowledge representation schemes themselves, but from the way in which they are used and implemented. In the first part of this dissertation we examine the relational model (as used in relational database management systems) and frames (a knowledge representation scheme used in expert systems), as proposed by M. Minsky [MIN75]. We then provide our own definition of frames. In the second part, we examine similarities between the two models (the relational model and our frame model), establishing frames as a superclass of relations. We then define normalization for frames and examine how normalization might solve some of the problems we have identified. We then examine the integration of knowledge-based systems and database management systems and classify our normalization of frames as such an attempt. We conclude by examining the place of normalization within the expert system development life cycle.
A SQL front-end semantic data model
SQLSDM is a front-end semantic data model to a SQL relational database management system (RDBMS). SQLSDM provides a more semantically complete RDBMS through the implementation of a domain and referential integrity scheme. SQLSDM provides integrity definition functions and a subsystem to interpret SQL commands. Integrity system tables are created through the use of SQLSDM's domain definition command and SQL's CREATE TABLE command. As SQL database update commands are interpreted, SQLSDM uses these integrity tables to enforce domain and referential integrity. SQLSDM operates virtually transparently to the user and provides greater database consistency and semantic control. Furthermore, SQLSDM is designed and engineered as a portable front end that may be implemented on any SQL relational database management system.
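The SQLSDM idea can be sketched in a few lines of Python. This is a hypothetical illustration, not SQLSDM's actual design: a front end keeps its own integrity tables and consults them before handing an update to the underlying store; the table layouts and function names here are our invention.

```python
domains = {("emp", "age"): range(16, 100)}            # domain definitions
foreign_keys = {("emp", "dept_id"): ("dept", "id")}   # referential rules
db = {"dept": [{"id": 10}], "emp": []}                # the underlying store

def insert(table, row):
    """Check domain and referential integrity, then apply the insert."""
    for (t, col), dom in domains.items():
        if t == table and row.get(col) not in dom:
            raise ValueError(f"domain violation on {t}.{col}")
    for (t, col), (pt, pk) in foreign_keys.items():
        if t == table and row.get(col) not in {r[pk] for r in db[pt]}:
            raise ValueError(f"referential violation on {t}.{col}")
    db[table].append(row)

insert("emp", {"age": 30, "dept_id": 10})   # passes both checks
try:
    insert("emp", {"age": 30, "dept_id": 99})
except ValueError as err:
    print(err)   # referential violation on emp.dept_id
```

The interposition is transparent to the caller in the same sense the abstract describes: valid updates go through unchanged, and only violating ones are rejected before reaching the store.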