4,044 research outputs found
An Efficient Genetic Algorithm for Discovering Diverse-Frequent Patterns
Working with exhaustive search on large dataset is infeasible for several
reasons. Recently, developed techniques that made pattern set mining feasible
by a general solver with long execution time that supports heuristic search and
are limited to small datasets only. In this paper, we investigate an approach
which aims to find diverse set of patterns using genetic algorithm to mine
diverse frequent patterns. We propose a fast heuristic search algorithm that
outperforms state-of-the-art methods on a standard set of benchmarks and
capable to produce satisfactory results within a short period of time. Our
proposed algorithm uses a relative encoding scheme for the patterns and an
effective twin removal technique to ensure diversity throughout the search.Comment: 2015 International Conference on Electrical Engineering and
Information Communication Technology (ICEEICT
Flexible constrained sampling with guarantees for pattern mining
Pattern sampling has been proposed as a potential solution to the infamous
pattern explosion. Instead of enumerating all patterns that satisfy the
constraints, individual patterns are sampled proportional to a given quality
measure. Several sampling algorithms have been proposed, but each of them has
its limitations when it comes to 1) flexibility in terms of quality measures
and constraints that can be used, and/or 2) guarantees with respect to sampling
accuracy. We therefore present Flexics, the first flexible pattern sampler that
supports a broad class of quality measures and constraints, while providing
strong guarantees regarding sampling accuracy. To achieve this, we leverage the
perspective on pattern mining as a constraint satisfaction problem and build
upon the latest advances in sampling solutions in SAT as well as existing
pattern mining algorithms. Furthermore, the proposed algorithm is applicable to
a variety of pattern languages, which allows us to introduce and tackle the
novel task of sampling sets of patterns. We introduce and empirically evaluate
two variants of Flexics: 1) a generic variant that addresses the well-known
itemset sampling task and the novel pattern set sampling task as well as a wide
range of expressive constraints within these tasks, and 2) a specialized
variant that exploits existing frequent itemset techniques to achieve
substantial speed-ups. Experiments show that Flexics is both accurate and
efficient, making it a useful tool for pattern-based data exploration.Comment: Accepted for publication in Data Mining & Knowledge Discovery journal
(ECML/PKDD 2017 journal track
Representation Independent Analytics Over Structured Data
Database analytics algorithms leverage quantifiable structural properties of
the data to predict interesting concepts and relationships. The same
information, however, can be represented using many different structures and
the structural properties observed over particular representations do not
necessarily hold for alternative structures. Thus, there is no guarantee that
current database analytics algorithms will still provide the correct insights,
no matter what structures are chosen to organize the database. Because these
algorithms tend to be highly effective over some choices of structure, such as
that of the databases used to validate them, but not so effective with others,
database analytics has largely remained the province of experts who can find
the desired forms for these algorithms. We argue that in order to make database
analytics usable, we should use or develop algorithms that are effective over a
wide range of choices of structural organizations. We introduce the notion of
representation independence, study its fundamental properties for a wide range
of data analytics algorithms, and empirically analyze the amount of
representation independence of some popular database analytics algorithms. Our
results indicate that most algorithms are not generally representation
independent and find the characteristics of more representation independent
heuristics under certain representational shifts
Heuristic Approaches for Generating Local Process Models through Log Projections
Local Process Model (LPM) discovery is focused on the mining of a set of
process models where each model describes the behavior represented in the event
log only partially, i.e. subsets of possible events are taken into account to
create so-called local process models. Often such smaller models provide
valuable insights into the behavior of the process, especially when no adequate
and comprehensible single overall process model exists that is able to describe
the traces of the process from start to end. The practical application of LPM
discovery is however hindered by computational issues in the case of logs with
many activities (problems may already occur when there are more than 17 unique
activities). In this paper, we explore three heuristics to discover subsets of
activities that lead to useful log projections with the goal of speeding up LPM
discovery considerably while still finding high-quality LPMs. We found that a
Markov clustering approach to create projection sets results in the largest
improvement of execution time, with discovered LPMs still being better than
with the use of randomly generated activity sets of the same size. Another
heuristic, based on log entropy, yields a more moderate speedup, but enables
the discovery of higher quality LPMs. The third heuristic, based on the
relative information gain, shows unstable performance: for some data sets the
speedup and LPM quality are higher than with the log entropy based method,
while for other data sets there is no speedup at all.Comment: paper accepted and to appear in the proceedings of the IEEE Symposium
on Computational Intelligence and Data Mining (CIDM), special session on
Process Mining, part of the Symposium Series on Computational Intelligence
(SSCI
Mining local staircase patterns in noisy data
Most traditional biclustering algorithms identify biclusters with no or little overlap. In this paper, we introduce the problem of identifying staircases of biclusters. Such staircases may be indicative for causal relationships between columns and can not easily be identified by existing biclustering algorithms. Our formalization relies on a scoring function based on the Minimum Description Length principle. Furthermore, we propose a first algorithm for identifying staircase biclusters, based on a combination of local search and constraint programming. Experiments show that the approach is promising
Cooperation between expert knowledge and data mining discovered knowledge: Lessons learned
Expert systems are built from knowledge traditionally elicited from the human expert. It is precisely knowledge elicitation from the expert that is the bottleneck in expert system construction. On the other hand, a data mining system, which automatically extracts knowledge, needs expert guidance on the successive decisions to be made in each of the system phases. In this context, expert knowledge and data mining discovered knowledge can cooperate, maximizing their individual capabilities: data mining discovered knowledge can be used as a complementary source of knowledge for the expert system, whereas expert knowledge can be used to guide the data mining process. This article summarizes different examples of systems where there is cooperation between expert knowledge and data mining discovered knowledge and reports our experience of such cooperation gathered from a medical diagnosis project called Intelligent Interpretation of Isokinetics Data, which we developed. From that experience, a series of lessons were learned throughout project development. Some of these lessons are generally applicable and others pertain exclusively to certain project types
- …