8,271 research outputs found
Efficient Discovery of Ontology Functional Dependencies
Poor data quality has become a pervasive issue due to the increasing
complexity and size of modern datasets. Constraint based data cleaning
techniques rely on integrity constraints as a benchmark to identify and correct
errors. Data values that do not satisfy the given set of constraints are
flagged as dirty, and data updates are made to re-align the data and the
constraints. However, many errors often require user input to resolve due to
domain expertise defining specific terminology and relationships. For example,
in pharmaceuticals, 'Advil' is-a brand name for 'ibuprofen' that can be
captured in a pharmaceutical ontology. While functional dependencies (FDs) have
traditionally been used in existing data cleaning solutions to model syntactic
equivalence, they are not able to model broader relationships (e.g., is-a)
defined by an ontology. In this paper, we take a first step towards extending
the set of data quality constraints used in data cleaning by defining and
discovering Ontology Functional Dependencies (OFDs). We lay out
theoretical and practical foundations for OFDs, including a set of sound and
complete axioms, and a linear inference procedure. We then develop effective
algorithms for discovering OFDs, and a set of optimizations that efficiently
prune the search space. Our experimental evaluation using real data shows the
scalability and accuracy of our algorithms.
Comment: 12 pages
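The contrast between a traditional FD and an ontology-aware dependency can be illustrated with a minimal Python sketch. The relation, the toy ontology `SENSES`, and all function names are hypothetical, and the OFD check is a simplified reading of the abstract (the values in each group must share at least one ontology sense). It is not the paper's discovery algorithm.

```python
from collections import defaultdict

# Hypothetical toy ontology: each value maps to the set of senses
# (canonical concepts) that it can name.
SENSES = {
    "Advil": {"ibuprofen"},
    "ibuprofen": {"ibuprofen"},
    "Tylenol": {"acetaminophen"},
}

def groups(rows, lhs, rhs):
    """Collect the rhs values of rows, grouped by their lhs values."""
    out = defaultdict(set)
    for row in rows:
        out[tuple(row[a] for a in lhs)].add(row[rhs])
    return out

def fd_holds(rows, lhs, rhs):
    """Traditional FD: equal lhs values imply syntactically equal rhs values."""
    return all(len(vals) == 1 for vals in groups(rows, lhs, rhs).values())

def ofd_holds(rows, lhs, rhs, senses=SENSES):
    """Simplified synonym-style OFD: within each lhs group, all rhs values
    must share at least one sense. Unknown values act as their own sense."""
    for vals in groups(rows, lhs, rhs).values():
        common = set.intersection(*(set(senses.get(v, {v})) for v in vals))
        if not common:
            return False
    return True

# Hypothetical data: two spellings of the same drug for one symptom.
rows = [
    {"symptom": "headache", "drug": "Advil"},
    {"symptom": "headache", "drug": "ibuprofen"},
    {"symptom": "fever", "drug": "Tylenol"},
]
print(fd_holds(rows, ["symptom"], "drug"))   # False
print(ofd_holds(rows, ["symptom"], "drug"))  # True
```

Under the plain FD the 'Advil' and 'ibuprofen' tuples are flagged as dirty, while the ontology-aware check accepts them because both values name the same concept.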
Extending dependencies for improving data quality
This doctoral thesis presents the results of my work on extending dependencies for
improving data quality, both in a centralized environment with a single database and
in a data exchange and integration environment with multiple databases.
The first part of the thesis proposes five classes of data dependencies, referred to as
CINDs, eCFDs, CFDcs, CFDps and CINDps, to capture data inconsistencies commonly
found in practice in a centralized environment. For each class of these dependencies,
we investigate two central problems: the satisfiability problem and the implication
problem. The satisfiability problem is to determine, given a set Σ of dependencies
defined on a database schema R, whether or not there exists a nonempty database D of
R that satisfies Σ. And the implication problem is to determine whether or not a set Σ
of dependencies defined on a database schema R entails another dependency φ on R.
That is, every database D of R that satisfies Σ must satisfy φ as well. These are
important for the validation and optimization of data-cleaning processes. We establish
complexity results of the satisfiability problem and the implication problem for all
these five classes of dependencies, both in the absence of finite-domain attributes and in
the general setting with finite-domain attributes. Moreover, SQL-based techniques are
developed to detect data inconsistencies for each class of the proposed dependencies,
which can be easily implemented on top of current database management systems.
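As an illustration of inconsistency detection, here is a minimal Python sketch for a plain conditional functional dependency (CFD) with a single right-hand-side attribute. The thesis develops SQL-based techniques for its extended dependency classes; this sketch covers neither. The function names, the `"_"` wildcard convention, and the sample data are assumptions made for the example.

```python
from itertools import combinations

WILDCARD = "_"  # assumed notation for an unnamed variable in a pattern tuple

def matches(row, attrs, pattern):
    """True if row agrees with every constant of the pattern on attrs."""
    return all(pattern[a] == WILDCARD or row[a] == pattern[a] for a in attrs)

def cfd_violations(rows, lhs, rhs, pattern):
    """Return (single, pairs) for the CFD (lhs -> rhs, pattern).

    single: indices of tuples that contradict a constant on the rhs.
    pairs:  index pairs that agree on lhs but differ on rhs.
    """
    scope = [i for i, r in enumerate(rows) if matches(r, lhs, pattern)]
    single = [
        i for i in scope
        if pattern[rhs] != WILDCARD and rows[i][rhs] != pattern[rhs]
    ]
    pairs = [
        (i, j) for i, j in combinations(scope, 2)
        if all(rows[i][a] == rows[j][a] for a in lhs)
        and rows[i][rhs] != rows[j][rhs]
    ]
    return single, pairs

# Hypothetical data: within country code 44, the zip code determines the street.
rows = [
    {"CC": "44", "ZIP": "EH4 1DT", "STR": "Mayfield"},
    {"CC": "44", "ZIP": "EH4 1DT", "STR": "Crichton"},
    {"CC": "01", "ZIP": "07974", "STR": "Mtn Ave"},
    {"CC": "01", "ZIP": "07974", "STR": "Tree Ave"},
]
uk_rule = {"CC": "44", "ZIP": WILDCARD, "STR": WILDCARD}
print(cfd_violations(rows, ["CC", "ZIP"], "STR", uk_rule))  # ([], [(0, 1)])
```

The two tuples with country code 01 also disagree on the street, but they fall outside the pattern, so the dependency does not apply to them.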
The second part of the thesis studies three important topics for data cleaning in a
data exchange and integration environment with multiple databases.
One is the dependency propagation problem, which is to determine, given a view
defined on data sources and a set of dependencies on the sources, whether another
dependency is guaranteed to hold on the view. We investigate dependency propagation
for views defined in various fragments of relational algebra, conditional functional
dependencies (CFDs) [FGJK08] as view dependencies, and for source dependencies
given as either CFDs or traditional functional dependencies (FDs). And we establish
lower and upper bounds, all matching, ranging from PTIME to undecidable. These not
only provide the first results for CFD propagation, but also extend the classical work
of FD propagation by giving new complexity bounds in the presence of a setting with
finite domains. We finally provide the first algorithm for computing a minimal cover of
all CFDs propagated via SPC views. The algorithm has the same complexity as one of
the most efficient algorithms for computing a cover of FDs propagated via a projection
view, despite the increased expressive power of CFDs and SPC views.
Another one is matching records from unreliable data sources. A class of matching
dependencies (MDs) is introduced for specifying the semantics of unreliable data. As
opposed to static constraints for schema design such as FDs, MDs are developed for
record matching, and are defined in terms of similarity metrics and a dynamic semantics. We identify a special case of MDs, referred to as relative candidate keys (RCKs),
to determine what attributes to compare and how to compare them when matching
records across possibly different relations. We also propose a mechanism for inferring MDs with a sound and complete system, a departure from traditional implication
analysis, such that when we cannot match records by comparing attributes that contain
errors, we may still find matches by using other, more reliable attributes. We finally
provide a quadratic time algorithm for inferring MDs, and an effective algorithm for
deducing quality RCKs from a given set of MDs.
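The idea of a key-like rule that states which attributes to compare and how to compare them can be sketched in Python as follows. The rule, the attribute names, the sample records, and the 0.8 similarity threshold are all assumptions made for illustration; the sketch uses the standard library's `difflib` as a stand-in similarity metric and does not implement MD inference or RCK deduction.

```python
from difflib import SequenceMatcher

def similar(a, b, threshold=0.8):
    """Approximate string match; the threshold is an arbitrary assumption."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def equal(a, b):
    """Exact match."""
    return a == b

# Hypothetical rule across two relations: compare card.tel with billing.phone
# by equality, and card.ln with billing.last by similarity.
RULE = [("tel", "phone", equal), ("ln", "last", similar)]

def records_match(r1, r2, rule=RULE):
    """True if every attribute pair in the rule passes its comparison."""
    return all(compare(r1[a], r2[b]) for a, b, compare in rule)

card = {"fn": "Mark", "ln": "Clifford", "tel": "908-1111111"}
billing = [
    {"first": "M.", "last": "Cliford", "phone": "908-1111111"},
    {"first": "David", "last": "Smith", "phone": "908-2222222"},
]
matched = [b for b in billing if records_match(card, b)]
print(matched)
```

The first billing record is matched even though the first name is abbreviated and the last name is misspelled, because the rule relies on the phone number and a tolerant comparison of last names.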
The last one is finding certain fixes for data monitoring [CGGM03, SMO07], which
is to find and correct errors in a tuple when it is created, either entered manually or
generated by some process. That is, we want to ensure that a tuple t is clean before it
is used, to prevent errors introduced by adding t. As noted by [SMO07], it is far less
costly to correct a tuple at the point of entry than fixing it afterward.
Data repairing based on integrity constraints may not find certain fixes that are
absolutely correct, and worse, may introduce new errors when repairing the data. We
propose a method for finding certain fixes, based on master data, a notion of certain
regions, and a class of editing rules. A certain region is a set of attributes that are
assured correct by the users. Given a certain region and master data, editing rules tell
us what attributes to fix and how to update them. We show how the method can be used
in data monitoring and enrichment. We develop techniques for reasoning about editing
rules, to decide whether they lead to a unique fix and whether they are able to fix all
the attributes in a tuple, relative to master data and a certain region. We also provide
an algorithm to identify minimal certain regions, such that a certain fix is warranted by
editing rules and master data as long as one of the regions is correct.
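How a certain region, master data, and editing rules interact can be sketched in Python under strong simplifying assumptions: each hypothetical rule `(x, xm, b, bm)` has a single attribute on each side and no pattern, and a fix is applied only when the master data yields exactly one candidate value. The function name and the sample data are illustrative and are not taken from the thesis.

```python
def apply_editing_rules(t, region, master, rules):
    """Fix tuple t from master data, starting from a user-assured region.

    rules: list of (x, xm, b, bm). If x is validated and exactly one value
    s[bm] exists among master tuples s with s[xm] == t[x], set t[b] to that
    value and mark b as validated.
    """
    t = dict(t)                 # work on a copy
    validated = set(region)
    changed = True
    while changed:              # repeat until no rule can fire any more
        changed = False
        for x, xm, b, bm in rules:
            if x not in validated or b in validated:
                continue
            candidates = {s[bm] for s in master if s[xm] == t[x]}
            if len(candidates) == 1:   # accept unique fixes only
                t[b] = candidates.pop()
                validated.add(b)
                changed = True
    return t, validated

# Hypothetical master data and input tuple.
master = [{"zip": "EH7 4AH", "AC": "131", "city": "Edi"}]
rules = [("zip", "zip", "AC", "AC"), ("AC", "AC", "city", "city")]
dirty = {"zip": "EH7 4AH", "AC": "020", "city": "Ldn"}

fixed, validated = apply_editing_rules(dirty, {"zip"}, master, rules)
print(fixed)      # {'zip': 'EH7 4AH', 'AC': '131', 'city': 'Edi'}
print(sorted(validated))  # ['AC', 'city', 'zip']
```

Here the region `{"zip"}` is enough to validate the whole tuple, because the first rule fixes the area code and the second rule then uses the validated area code to fix the city.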
Discovering Entities in Process Execution Logs
The thesis was written in the field of process mining within the Artifact-Centric Service Interoperation (ACSI) project. The goal of the thesis was to create a method for discovering entities in process execution logs and to implement this method.
The method is implemented in Java as a plugin for ProM, a generic open-source Java framework for implementing process mining algorithms as plugins. The implementation can be divided into the following steps:
1. Integration with ProM.
2. Constructing the event type relations from the input data (log files in XES format).
3. Finding functional dependencies from relational representation of event logs. The
functional dependencies are found using an algorithm called TANE.
4. Finding the candidate keys from the functional dependencies. In case a relation has
multiple candidate keys, the user is prompted to select one as primary key.
5. Grouping together the event types that have the same primary keys and integrating
them into one entity.
6. The output is shown to the user or the entities are sent to another algorithm.
The method was tested on two event log files, both based on the example of an online CD shop. The method worked correctly for both event logs.
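Steps 3 to 5 above can be sketched in Python as follows. The thesis implements the method in Java as a ProM plugin and uses TANE for dependency discovery; this sketch replaces TANE with a brute-force dependency check and is meant only to show the data flow. The event tables, the attribute names, and the `choose` callback (standing in for the user's choice of primary key) are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def holds(rows, lhs, rhs):
    """Check the functional dependency lhs -> rhs on a list of dict rows."""
    seen = {}
    for r in rows:
        key = tuple(r[a] for a in lhs)
        if seen.setdefault(key, r[rhs]) != r[rhs]:
            return False
    return True

def candidate_keys(rows, attrs):
    """Minimal attribute sets that determine all remaining attributes."""
    keys = []
    for size in range(1, len(attrs) + 1):
        for combo in combinations(attrs, size):
            if any(set(k) <= set(combo) for k in keys):
                continue  # contains a smaller key, so it is not minimal
            if all(holds(rows, combo, a) for a in attrs if a not in combo):
                keys.append(combo)
    return keys

def group_into_entities(event_tables, choose=lambda keys: keys[0]):
    """Group event types that share the same chosen primary key."""
    entities = defaultdict(list)
    for name, rows in event_tables.items():
        attrs = sorted(rows[0])
        entities[choose(candidate_keys(rows, attrs))].append(name)
    return dict(entities)

# Hypothetical relational representation of three event types.
event_tables = {
    "order_created": [{"order_id": 1, "customer": "ann"},
                      {"order_id": 2, "customer": "ann"}],
    "order_paid": [{"order_id": 1, "amount": 10},
                   {"order_id": 2, "amount": 10}],
    "cd_added": [{"cd_id": 7, "title": "x"},
                 {"cd_id": 8, "title": "x"}],
}
entities = group_into_entities(event_tables)
print(entities)
# {('order_id',): ['order_created', 'order_paid'], ('cd_id',): ['cd_added']}
```

The brute-force search is exponential in the number of attributes, which is acceptable for a toy example but is exactly the cost that a level-wise algorithm such as TANE is designed to reduce.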
Some combinatorial algorithms connecting hypergraphs
In the relational data model, combinatorial algorithms have been constructed by many authors. The hypergraph is an important concept in combinatorial theory, and candidate keys play an essential role in the relational data model. In this paper, based on hypergraphs, we present a new combinatorial algorithm that finds all candidate keys of a given relation. Some other results related to candidate keys are also given.
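One well-known link between hypergraphs and candidate keys is that the candidate keys are exactly the minimal transversals (minimal hitting sets) of the hypergraph whose edges are the complements of the maximal non-superkeys. The Python sketch below illustrates that link by brute force over all attribute subsets. It is an assumption-laden illustration, not the algorithm of the paper, and the schema and dependencies are hypothetical.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attribute closure of attrs under a list of (lhs, rhs) set pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def subsets(universe):
    """All subsets of universe, smallest first, as frozensets."""
    items = sorted(universe)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)

def antikeys(universe, fds):
    """Maximal attribute sets that are not superkeys."""
    non_keys = [s for s in subsets(universe) if closure(s, fds) != universe]
    return [s for s in non_keys if not any(s < t for t in non_keys)]

def minimal_transversals(universe, edges):
    """Minimal sets that intersect every edge of the hypergraph."""
    hitting = [s for s in subsets(universe) if all(s & e for e in edges)]
    return [s for s in hitting if not any(t < s for t in hitting)]

def candidate_keys(universe, fds):
    """Candidate keys as minimal transversals of the antikey complements."""
    edges = [universe - a for a in antikeys(universe, fds)]
    return minimal_transversals(universe, edges)

# Hypothetical schema R(A, B, C) with AB -> C and C -> A.
R = frozenset("ABC")
fds = [(frozenset("AB"), frozenset("C")), (frozenset("C"), frozenset("A"))]
keys = candidate_keys(R, fds)
print([sorted(k) for k in keys])  # [['A', 'B'], ['B', 'C']]
```

In this example the maximal non-superkeys are {B} and {A, C}, so the hypergraph has the edges {A, C} and {B}, and every key must contain B together with A or C.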