The Bases of Association Rules of High Confidence
We develop a new approach for the distributed computation of association rules
of high confidence in a binary table. It is derived from the D-basis algorithm
of K. Adaricheva and J.B. Nation (TCS 2017), which is applied to multiple
sub-tables of the given table, each obtained by removing several rows at a time.
The resulting sets of rules are then aggregated using the same approach by which
the D-basis is retrieved from a larger set of implications. This allows us to
obtain a basis of association rules of high confidence, which can be used for
ranking all attributes of the table with respect to a given fixed attribute
using the relevance parameter introduced in K. Adaricheva et al. (Proceedings
of ICFCA-2015). This paper focuses on the technical implementation of the new
algorithm. Some test results on transaction data and medical data are reported.
Comment: Presented at DTMN, Sydney, Australia, July 28, 201
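To make the aggregation step concrete, here is a minimal Python sketch of the idea, not the published implementation: rules mined from row-deleted sub-tables (via a hypothetical mine_implications callback standing in for the D-basis extraction) are pooled, and only rules whose confidence on the full table clears a threshold are kept.

```python
from itertools import combinations

def confidence(table, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent in a binary table.

    `table` is a list of rows, each a set of attribute names.
    Returns 1.0 by convention when the antecedent never occurs.
    """
    support_ant = [row for row in table if antecedent <= row]
    if not support_ant:
        return 1.0
    return sum(1 for row in support_ant if consequent <= row) / len(support_ant)

def rules_from_subtables(table, mine_implications, drop=2, threshold=0.9):
    """Aggregate high-confidence rules from sub-tables of `table`.

    Each sub-table removes `drop` rows at a time; `mine_implications` is a
    hypothetical callback standing in for the D-basis extraction on one
    sub-table, returning (antecedent, consequent) tuples of attribute tuples.
    """
    candidates = set()
    n = len(table)
    for removed in combinations(range(n), drop):
        sub = [row for i, row in enumerate(table) if i not in removed]
        candidates |= mine_implications(sub)
    # Keep only rules whose confidence on the *full* table is high enough.
    return {(a, c) for (a, c) in candidates
            if confidence(table, set(a), set(c)) >= threshold}
```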
Query Rewriting and Optimization for Ontological Databases
Ontological queries are evaluated against a knowledge base consisting of an
extensional database and an ontology (i.e., a set of logical assertions and
constraints which derive new intensional knowledge from the extensional
database), rather than directly on the extensional database. The evaluation and
optimization of such queries is an intriguing new problem for database
research. In this paper, we discuss two important aspects of this problem:
query rewriting and query optimization. Query rewriting consists of the
compilation of an ontological query into an equivalent first-order query
against the underlying extensional database. We present a novel query rewriting
algorithm for rather general types of ontological constraints which is
well-suited for practical implementations. In particular, we show how a
conjunctive query against a knowledge base, expressed using linear and sticky
existential rules, that is, members of the recently introduced Datalog+/-
family of ontology languages, can be compiled into a union of conjunctive
queries (UCQ) against the underlying database. Ontological query optimization,
in this context, attempts to improve this rewriting process so as to produce
small and cost-effective UCQ rewritings for an input query.
Comment: arXiv admin note: text overlap with arXiv:1312.5914 by other authors
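As a rough illustration of what rewriting into a UCQ means, the following Python sketch unfolds query atoms against rule heads. It is deliberately much simpler than Datalog+/- rewriting: the rules here are plain linear Datalog rules over unary predicates with no existential variables, and all predicate names are invented for the example.

```python
def rewrite_ucq(query, rules, max_depth=3):
    """Rewrite a conjunctive query into a union of conjunctive queries (UCQ)
    by repeatedly replacing an atom with a rule body that derives it.

    `query` is a set of predicate names (all over one shared variable);
    `rules` maps a head predicate to the body predicates that derive it.
    """
    ucq = {frozenset(query)}
    frontier = set(ucq)
    for _ in range(max_depth):
        new = set()
        for cq in frontier:
            for atom in cq:
                for body in rules.get(atom, []):
                    new.add((cq - {atom}) | {body})
        new -= ucq
        if not new:
            break  # fixpoint reached: no new conjunctive queries
        ucq |= new
        frontier = new
    return ucq

# Example rules: professor(X) -> employee(X), employee(X) -> person(X).
rules = {"person": ["employee"], "employee": ["professor"]}
print(rewrite_ucq({"person"}, rules))
# Three CQs: asking for person, employee, or professor covers all answers.
```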
Discovery of the D-basis in binary tables based on hypergraph dualization
Discovery of (strong) association rules, or implications, is an important
task in data management, and it finds application in artificial intelligence,
data mining and the semantic web. We introduce a novel approach
for the discovery of a specific set of implications, called the D-basis, that
provides a representation for a reduced binary table, based on the structure of
its Galois lattice. At the core of the method are the D-relation defined in
the lattice theory framework, and the hypergraph dualization algorithm that
allows us to effectively produce the set of transversals for a given Sperner
hypergraph. The latter algorithm, first developed by specialists from the
Rutgers Center for Operations Research, has already found numerous applications
in solving optimization problems in database theory, artificial intelligence
and game theory. One application of the method is the analysis of gene
expression data related to a particular phenotypic variable, and some initial
testing is done for the data provided by the University of Hawaii Cancer Center.
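To illustrate what hypergraph dualization computes, here is a brute-force Python sketch that enumerates the minimal transversals of a small Sperner hypergraph. It is exponential and purely didactic; the paper relies on the far more efficient dualization algorithm developed at the Rutgers Center for Operations Research.

```python
from itertools import combinations

def minimal_transversals(edges):
    """Enumerate all minimal transversals of a hypergraph by brute force.

    `edges` is a list of sets, assumed to form a Sperner family (no edge
    contains another). A transversal hits every edge; it is minimal if no
    proper subset is also a transversal.
    """
    vertices = sorted(set().union(*edges))
    hits = lambda t: all(t & e for e in edges)
    transversals = []
    for k in range(1, len(vertices) + 1):
        for combo in combinations(vertices, k):
            t = set(combo)
            # Since hitting is monotone, checking the maximal proper
            # subsets (remove one vertex) suffices for minimality.
            if hits(t) and not any(hits(t - {v}) for v in t):
                transversals.append(t)
    return transversals

# The dual of {{1,2},{2,3}} is {{2},{1,3}}.
print(minimal_transversals([{1, 2}, {2, 3}]))
```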
Towards Next Generation Sequential and Parallel SAT Solvers
This thesis focuses on improving SAT solving technology. The improvements concern two major subjects: sequential SAT solving and parallel SAT solving.
To better understand sequential SAT algorithms, the abstract reduction system Generic CDCL is introduced. With Generic CDCL, the soundness of solving techniques can be modeled. Next, the conflict-driven clause learning algorithm is extended with three techniques, local look-ahead, local probing and all-UIP learning, that allow more global reasoning during search. These techniques improve the performance of the sequential SAT solver Riss. Then, the formula simplification techniques bounded variable addition, covered literal elimination and an advanced cardinality constraint extraction are introduced. With these techniques, the reasoning of the overall SAT solving tool chain becomes stronger than plain resolution. When these three techniques are applied in the formula simplification tool Coprocessor before Riss is used to solve a formula, the performance improves further.
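For orientation, the following Python sketch shows the plain DPLL skeleton, unit propagation plus branching, that CDCL solvers such as Riss extend with clause learning, look-ahead and probing. It is a didactic baseline, not the thesis's algorithm.

```python
def unit_propagate(clauses, assignment):
    """Repeatedly assign literals forced by unit clauses.

    Clauses are lists of non-zero ints (DIMACS-style literals); `assignment`
    is a frozenset of literals. Returns the (clauses, extended assignment)
    pair, or (None, None) if some clause is falsified.
    """
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            if any(l in assignment for l in clause):
                continue  # clause already satisfied
            unassigned = [l for l in clause if -l not in assignment]
            if not unassigned:
                return None, None  # conflict: every literal falsified
            if len(unassigned) == 1:
                assignment = assignment | {unassigned[0]}
                changed = True
    return clauses, assignment

def dpll(clauses, assignment=frozenset()):
    """Plain DPLL search: propagate, then branch on an unassigned variable."""
    clauses, assignment = unit_propagate(clauses, assignment)
    if clauses is None:
        return None
    free = {abs(l) for c in clauses for l in c} - {abs(l) for l in assignment}
    if not free:
        return assignment  # every clause satisfied
    v = min(free)
    for lit in (v, -v):
        result = dpll(clauses, assignment | {lit})
        if result is not None:
            return result
    return None

# Prints a satisfying assignment for (x1 v x2) & (-x1 v x2), e.g. {1, 2}.
print(dpll([[1, 2], [-1, 2]]))
```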
Due to the increasing number of cores in CPUs, the scalable parallel SAT solving approach iterative partitioning has been implemented in Pcasso for the multi-core architecture. Related work on parallel SAT solving has been studied to extract the main ideas that can improve Pcasso. Besides parallel formula simplification with bounded variable elimination, the major extension is the extended clause sharing level based clause tagging, which builds the basis for conflict-driven node killing. The latter allows unsatisfiable search space partitions to be identified earlier. Another improvement is to combine scattering and look-ahead as a superior search space partitioning function. In combination with Coprocessor, the introduced extensions increase the performance of the parallel solver Pcasso. The implemented system turns out to be scalable for the multi-core architecture. Hence, iterative partitioning is interesting for future parallel SAT solvers.
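A simplified flavor of the scattering idea can be sketched as follows: partition i assumes one decision literal plus the negations of the literals chosen for the earlier partitions, so the partitions are disjoint and jointly cover the search space. This Python sketch uses single-literal cubes only and ignores Pcasso's look-ahead-based choice of decision literals.

```python
def scatter_partitions(clauses, decision_literals):
    """Split a CNF formula into disjoint search-space partitions.

    Each partition is the original clause set plus unit-clause assumptions;
    the final partition negates all chosen literals, so together the
    partitions are exhaustive and can be solved independently.
    """
    partitions = []
    negated_prefix = []
    for lit in decision_literals:
        partitions.append(clauses + [[l] for l in negated_prefix] + [[lit]])
        negated_prefix.append(-lit)
    partitions.append(clauses + [[l] for l in negated_prefix])
    return partitions

parts = scatter_partitions([[1, 2], [-1, 3]], [1, 2])
# Three partitions assuming {x1}, {-x1, x2} and {-x1, -x2} respectively.
```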
The implemented solvers participated in international SAT competitions. In 2013 and 2014, Pcasso showed good performance. Riss in combination with Coprocessor won several first, second and third prizes, including two Kurt Gödel medals. Hence, the introduced algorithms improved modern SAT solving technology.
Classification algorithms for Big Data with applications in the urban security domain
A classification algorithm is a versatile tool that can serve as a predictor of
the future or as an analytical tool for understanding the past. Several
obstacles prevent classification from scaling to large Volume, Velocity,
Variety or Value. The aim
of this thesis is to scale distributed classification algorithms beyond current limits,
assess the state-of-practice of Big Data machine learning frameworks and validate
the effectiveness of a data science process in improving urban safety.
We found that massive datasets with a number of large-domain categorical
features pose a difficult challenge for existing classification algorithms.
We propose associative
classification as a possible answer, and develop several novel techniques to distribute
the training of an associative classifier among parallel workers and improve the final
quality of the model. The experiments, run on a real large-scale dataset with more
than 4 billion records, confirmed the quality of the approach.
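To make the notion of associative classification concrete, here is a single-machine Python sketch that mines class association rules by support and confidence and ranks them as a classifier would. The thesis's contribution is precisely to distribute this training step across parallel workers, which the sketch does not attempt.

```python
from itertools import combinations
from collections import Counter

def mine_class_rules(transactions, labels, min_support=0.2, min_conf=0.7,
                     max_len=2):
    """Mine class association rules {items} -> label, the building blocks
    of an associative classifier.

    `transactions` is a list of item sets, `labels` the class of each one.
    """
    n = len(transactions)
    itemset_count, rule_count = Counter(), Counter()
    for items, label in zip(transactions, labels):
        for k in range(1, max_len + 1):
            for itemset in combinations(sorted(items), k):
                itemset_count[itemset] += 1
                rule_count[(itemset, label)] += 1
    rules = []
    for (itemset, label), cnt in rule_count.items():
        support = cnt / n
        conf = cnt / itemset_count[itemset]
        if support >= min_support and conf >= min_conf:
            rules.append((itemset, label, support, conf))
    # An associative classifier typically ranks rules by confidence, then
    # support, and classifies a record with the first matching rule.
    return sorted(rules, key=lambda r: (-r[3], -r[2]))

rules = mine_class_rules(
    [{"rain", "night"}, {"rain", "day"}, {"sun", "day"}],
    ["umbrella", "umbrella", "none"],
)
```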
To assess the state-of-practice of Big Data machine learning frameworks and
streamline the process of integration and fine-tuning of the building blocks, we
developed a generic, self-tuning tool to extract knowledge from network traffic
measurements. The result is a system that offers human-readable models of the data
with minimal user intervention, validated by experiments on large collections of
real-world passive network measurements.
A good portion of this dissertation is dedicated to the study of a data science
process to improve urban safety. First, we shed some light on the feasibility of a
system to monitor social messages from a city for emergency relief. We then propose
a methodology to mine temporal patterns in social issues, like crimes. Finally,
we propose a system to integrate the findings of Data Science on the citizenry’s
perception of safety and communicate its results to decision makers in a timely
manner. We applied and tested the system in a real Smart City scenario, set in Turin,
Italy.
Parallelization of formal concept analysis algorithms
Formal Concept Analysis provides the mathematical notation for representing
concepts and concept hierarchies, making use of order and lattice theory. It
has been used in numerous applications, including software engineering,
linguistics, sociology, information sciences, information technology, genetics,
biology and engineering. The algorithms derived from Kuznetsov's CbO were found
in several research papers to provide the most efficient means of computing
formal concepts. In this thesis, key enhancements to the original CbO algorithm
are discussed in detail. The effects of these key features are presented both
in isolation and in combination. Eight variations of the CbO algorithm,
highlighting the key features, were compared on a level playing field by
presenting them in the same notation and implementing them from that notation
in the same way. The three main enhancements considered are partial closure
with incremental closure of intents, inherited canonicity test failures, and a
combined depth-first and breadth-first search. The algorithms were implemented
in an unoptimized way to focus the comparison on the algorithms themselves
rather than on any efficiencies provided by code optimization.
One of the findings was that there is a significant performance improvement
when partial closure with incremental closure of intents is used in isolation.
However, there is no significant performance improvement when the combined
depth-first and breadth-first search or the inherited canonicity test failure
feature is used in isolation. The inherited canonicity test failure needs to be
combined with the combined depth-first and breadth-first search to obtain a
performance increase. Combining all three enhancements gave the best
performance.
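For reference, the following Python sketch shows a basic Close-by-One search with the standard canonicity test. The enhancements discussed above (partial closure with incremental closure of intents, inherited canonicity test failures, combined depth-first and breadth-first search) are optimizations of exactly this skeleton and are not included here.

```python
def compute_concepts(context):
    """Enumerate all formal concepts of a binary context with plain CbO.

    `context` maps each object to its set of attributes. The canonicity
    test guarantees that every concept is generated exactly once.
    """
    attributes = sorted({a for attrs in context.values() for a in attrs})

    def intent_of(extent):
        if not extent:
            return set(attributes)  # closure of the empty extent
        return set.intersection(*(context[g] for g in extent))

    concepts = []

    def cbo(extent, intent, min_attr_index):
        concepts.append((frozenset(extent), frozenset(intent)))
        for j in range(min_attr_index, len(attributes)):
            a = attributes[j]
            if a in intent:
                continue
            new_extent = {g for g in extent if a in context[g]}
            new_intent = intent_of(new_extent)
            # Canonicity test: reject if the closure added an attribute
            # that precedes `a` in the fixed order (already generated).
            if all(b in intent for b in new_intent
                   if attributes.index(b) < j):
                cbo(new_extent, new_intent, j + 1)

    cbo(set(context), intent_of(set(context)), 0)
    return concepts

ctx = {"g1": {"a", "b"}, "g2": {"b", "c"}, "g3": {"a", "c"}}
for extent, intent in compute_concepts(ctx):
    print(sorted(extent), sorted(intent))
```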
The main contribution of the thesis is four new parallel In-Close3 algorithms.
The shared-memory algorithms Direct Parallel In-Close3 and Queue Parallel
In-Close3, together with the Distributed Memory In-Close3 algorithm, showed
significant potential. The shared-memory algorithms were implemented using
OpenMP and the distributed-memory algorithm was implemented using MPI. All
implementations were validated and showed scalability. Experiments were carried
out to test the features of the parallel algorithms and their implementations,
using the UK national supercomputer ARCHER and Colfax Clusters. The thesis
presents the key parallelization strategies used and reports experimental
results of the parallelization.