9,081 research outputs found
Efficient Discovery of Ontology Functional Dependencies
Poor data quality has become a pervasive issue due to the increasing
complexity and size of modern datasets. Constraint based data cleaning
techniques rely on integrity constraints as a benchmark to identify and correct
errors. Data values that do not satisfy the given set of constraints are
flagged as dirty, and data updates are made to re-align the data and the
constraints. However, many errors often require user input to resolve due to
domain expertise defining specific terminology and relationships. For example,
in pharmaceuticals, 'Advil' \emph{is-a} brand name for 'ibuprofen' that can be
captured in a pharmaceutical ontology. While functional dependencies (FDs) have
traditionally been used in existing data cleaning solutions to model syntactic
equivalence, they are not able to model broader relationships (e.g., is-a)
defined by an ontology. In this paper, we take a first step towards extending
the set of data quality constraints used in data cleaning by defining and
discovering \emph{Ontology Functional Dependencies} (OFDs). We lay out
theoretical and practical foundations for OFDs, including a set of sound and
complete axioms, and a linear inference procedure. We then develop effective
algorithms for discovering OFDs, and a set of optimizations that efficiently
prune the search space. Our experimental evaluation using real data show the
scalability and accuracy of our algorithms.Comment: 12 page
Learning Interpretable Rules for Multi-label Classification
Multi-label classification (MLC) is a supervised learning problem in which,
contrary to standard multiclass classification, an instance can be associated
with several class labels simultaneously. In this chapter, we advocate a
rule-based approach to multi-label classification. Rule learning algorithms are
often employed when one is not only interested in accurate predictions, but
also requires an interpretable theory that can be understood, analyzed, and
qualitatively evaluated by domain experts. Ideally, by revealing patterns and
regularities contained in the data, a rule-based theory yields new insights in
the application domain. Recently, several authors have started to investigate
how rule-based models can be used for modeling multi-label data. Discussing
this task in detail, we highlight some of the problems that make rule learning
considerably more challenging for MLC than for conventional classification.
While mainly focusing on our own previous work, we also provide a short
overview of related work in this area.Comment: Preprint version. To appear in: Explainable and Interpretable Models
in Computer Vision and Machine Learning. The Springer Series on Challenges in
Machine Learning. Springer (2018). See
http://www.ke.tu-darmstadt.de/bibtex/publications/show/3077 for further
informatio
Profiling relational data: a survey
Profiling data to determine metadata about a given dataset is an important and frequent activity of any IT professional and researcher and is necessary for various use-cases. It encompasses a vast array of methods to examine datasets and produce metadata. Among the simpler results are statistics, such as the number of null values and distinct values in a column, its data type, or the most frequent patterns of its data values. Metadata that are more difficult to compute involve multiple columns, namely correlations, unique column combinations, functional dependencies, and inclusion dependencies. Further techniques detect conditional properties of the dataset at hand. This survey provides a classification of data profiling tasks and comprehensively reviews the state of the art for each class. In addition, we review data profiling tools and systems from research and industry. We conclude with an outlook on the future of data profiling beyond traditional profiling tasks and beyond relational databases
Describing the complexity of systems: multi-variable "set complexity" and the information basis of systems biology
Context dependence is central to the description of complexity. Keying on the
pairwise definition of "set complexity" we use an information theory approach
to formulate general measures of systems complexity. We examine the properties
of multi-variable dependency starting with the concept of interaction
information. We then present a new measure for unbiased detection of
multi-variable dependency, "differential interaction information." This
quantity for two variables reduces to the pairwise "set complexity" previously
proposed as a context-dependent measure of information in biological systems.
We generalize it here to an arbitrary number of variables. Critical limiting
properties of the "differential interaction information" are key to the
generalization. This measure extends previous ideas about biological
information and provides a more sophisticated basis for study of complexity.
The properties of "differential interaction information" also suggest new
approaches to data analysis. Given a data set of system measurements
differential interaction information can provide a measure of collective
dependence, which can be represented in hypergraphs describing complex system
interaction patterns. We investigate this kind of analysis using simulated data
sets. The conjoining of a generalized set complexity measure, multi-variable
dependency analysis, and hypergraphs is our central result. While our focus is
on complex biological systems, our results are applicable to any complex
system.Comment: 44 pages, 12 figures; made revisions after peer revie
Attribute Interactions in Medical Data Analysis
There is much empirical evidence about the success of naive Bayesian classification (NBC) in medical applications of attribute-based machine learning. NBC assumes conditional independence between attributes. In classification, such classifiers sum up the pieces of class-related evidence from individual attributes, independently of other attributes. The performance, however, deteriorates significantly when the “interactions” between attributes become critical. We propose an approach to handling attribute interactions within the framework of “voting” classifiers, such as NBC. We propose an operational test for detecting interactions in learning data and a procedure that takes the detected interactions into account while learning. This approach induces a structuring of the domain of attributes, it may lead to improved classifier’s performance and may provide useful novel information for the domain expert when interpreting the results of learning. We report on its application in data analysis and model construction for the prediction of clinical outcome in hip arthroplasty
- …