Search CORE

4,044 research outputs found

An Efficient Genetic Algorithm for Discovering Diverse-Frequent Patterns

Author: Alam Hasib Ul
Khatun Shanjida
Shatabda Swakkhar
Publication venue
Publication date: 19/07/2015
Field of study

Working with exhaustive search on large dataset is infeasible for several reasons. Recently, developed techniques that made pattern set mining feasible by a general solver with long execution time that supports heuristic search and are limited to small datasets only. In this paper, we investigate an approach which aims to find diverse set of patterns using genetic algorithm to mine diverse frequent patterns. We propose a fast heuristic search algorithm that outperforms state-of-the-art methods on a standard set of benchmarks and capable to produce satisfactory results within a short period of time. Our proposed algorithm uses a relative encoding scheme for the patterns and an effective twin removal technique to ensure diversity throughout the search.Comment: 2015 International Conference on Electrical Engineering and Information Communication Technology (ICEEICT

arXiv.org e-Print Archive

Crossref

Flexible constrained sampling with guarantees for pattern mining

Author: A Giacometti
A Zimmermann
C Bucilă
CP Gomes
F Bonchi
Luc De Raedt
M Berlingerio
M Boley
MA Hasan
Matthijs van Leeuwen
S Ermon
S Nijssen
T Calders
T Guns
T Guns
Vladimir Dzyuba
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2017
Field of study

Pattern sampling has been proposed as a potential solution to the infamous pattern explosion. Instead of enumerating all patterns that satisfy the constraints, individual patterns are sampled proportional to a given quality measure. Several sampling algorithms have been proposed, but each of them has its limitations when it comes to 1) flexibility in terms of quality measures and constraints that can be used, and/or 2) guarantees with respect to sampling accuracy. We therefore present Flexics, the first flexible pattern sampler that supports a broad class of quality measures and constraints, while providing strong guarantees regarding sampling accuracy. To achieve this, we leverage the perspective on pattern mining as a constraint satisfaction problem and build upon the latest advances in sampling solutions in SAT as well as existing pattern mining algorithms. Furthermore, the proposed algorithm is applicable to a variety of pattern languages, which allows us to introduce and tackle the novel task of sampling sets of patterns. We introduce and empirically evaluate two variants of Flexics: 1) a generic variant that addresses the well-known itemset sampling task and the novel pattern set sampling task as well as a wide range of expressive constraints within these tasks, and 2) a specialized variant that exploits existing frequent itemset techniques to achieve substantial speed-ups. Experiments show that Flexics is both accurate and efficient, making it a useful tool for pattern-based data exploration.Comment: Accepted for publication in Data Mining & Knowledge Discovery journal (ECML/PKDD 2017 journal track

arXiv.org e-Print Archive

Crossref

Leiden University Scholary Publications

Representation Independent Analytics Over Structured Data

Author: Chodpathumwan Yodsawalai
Fern Alan
Picado Jose
Sun Yizhou
Termehchy Arash
Publication venue
Publication date: 08/09/2014
Field of study

Database analytics algorithms leverage quantifiable structural properties of the data to predict interesting concepts and relationships. The same information, however, can be represented using many different structures and the structural properties observed over particular representations do not necessarily hold for alternative structures. Thus, there is no guarantee that current database analytics algorithms will still provide the correct insights, no matter what structures are chosen to organize the database. Because these algorithms tend to be highly effective over some choices of structure, such as that of the databases used to validate them, but not so effective with others, database analytics has largely remained the province of experts who can find the desired forms for these algorithms. We argue that in order to make database analytics usable, we should use or develop algorithms that are effective over a wide range of choices of structural organizations. We introduce the notion of representation independence, study its fundamental properties for a wide range of data analytics algorithms, and empirically analyze the amount of representation independence of some popular database analytics algorithms. Our results indicate that most algorithms are not generally representation independent and find the characteristics of more representation independent heuristics under certain representational shifts

arXiv.org e-Print Archive

CiteSeerX

Heuristic Approaches for Generating Local Process Models through Log Projections

Author: Haakma Reinder
Sidorova Natalia
Tax Niek
van der Aalst Wil M. P.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 10/10/2016
Field of study

Local Process Model (LPM) discovery is focused on the mining of a set of process models where each model describes the behavior represented in the event log only partially, i.e. subsets of possible events are taken into account to create so-called local process models. Often such smaller models provide valuable insights into the behavior of the process, especially when no adequate and comprehensible single overall process model exists that is able to describe the traces of the process from start to end. The practical application of LPM discovery is however hindered by computational issues in the case of logs with many activities (problems may already occur when there are more than 17 unique activities). In this paper, we explore three heuristics to discover subsets of activities that lead to useful log projections with the goal of speeding up LPM discovery considerably while still finding high-quality LPMs. We found that a Markov clustering approach to create projection sets results in the largest improvement of execution time, with discovered LPMs still being better than with the use of randomly generated activity sets of the same size. Another heuristic, based on log entropy, yields a more moderate speedup, but enables the discovery of higher quality LPMs. The third heuristic, based on the relative information gain, shows unstable performance: for some data sets the speedup and LPM quality are higher than with the log entropy based method, while for other data sets there is no speedup at all.Comment: paper accepted and to appear in the proceedings of the IEEE Symposium on Computational Intelligence and Data Mining (CIDM), special session on Process Mining, part of the Symposium Series on Computational Intelligence (SSCI

arXiv.org e-Print Archive

Crossref

Mining local staircase patterns in noisy data

Author: De Raedt Luc
Fierro Ana Carolina
Guns Tias
International workshop on Co-Clustering and Applications
Le Van Thanh
Marchal Kathleen
Nijssen Siegfried
van Leeuwen Matthijs
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2012
Field of study

Most traditional biclustering algorithms identify biclusters with no or little overlap. In this paper, we introduce the problem of identifying staircases of biclusters. Such staircases may be indicative for causal relationships between columns and can not easily be identified by existing biclustering algorithms. Our formalization relies on a scoring function based on the Minimum Description Length principle. Furthermore, we propose a first algorithm for identifying staircase biclusters, based on a combination of local search and constraint programming. Experiments show that the approach is promising

Crossref

Ghent University Academic Bibliography

DIAL UCLouvain

Cooperation between expert knowledge and data mining discovered knowledge: Lessons learned

Author: Adomavicius
Alonso
Alonso
Alonso
Aurora Pérez
Bernstein
Cooke
Fernando Alonso
Gleeson
Hong
Huang
Juan P. Valente
Kopanas
Kusiak
Lama
Larose
Lima
Loïc Martínez
Mejía-Lavalle
Needleman
Wang
Publication venue: 'Elsevier BV'
Publication date: 01/01/2012
Field of study

Expert systems are built from knowledge traditionally elicited from the human expert. It is precisely knowledge elicitation from the expert that is the bottleneck in expert system construction. On the other hand, a data mining system, which automatically extracts knowledge, needs expert guidance on the successive decisions to be made in each of the system phases. In this context, expert knowledge and data mining discovered knowledge can cooperate, maximizing their individual capabilities: data mining discovered knowledge can be used as a complementary source of knowledge for the expert system, whereas expert knowledge can be used to guide the data mining process. This article summarizes different examples of systems where there is cooperation between expert knowledge and data mining discovered knowledge and reports our experience of such cooperation gathered from a medical diagnosis project called Intelligent Interpretation of Isokinetics Data, which we developed. From that experience, a series of lessons were learned throughout project development. Some of these lessons are generally applicable and others pertain exclusively to certain project types

Crossref

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Archivo Digital UPM