4 research outputs found

The DESQ framework for declarative and scalable frequent sequence mining

Authors: Beedkar Kaustubh, Gemulla Rainer, Renz-Wieland Alexander
Publication venue: Gesellschaft für Informatik
Publication date: 01/01/2019

DESQ is a general-purpose framework for declarative and scalable frequent sequence mining. Applications express their specific sequence mining tasks using a simple yet powerful pattern expression language, and DESQ's computation engine automatically executes the mining task in an efficient and scalable way. In this paper, we give a brief overview of DESQ and its components.

MAnnheim DOCument Server

Methods for frequent sequence mining with subsequence constraints

Author: Beedkar Kaustubh
Publication date: 01/01/2016

In this thesis, we study scalable and general-purpose methods for mining frequent sequences that satisfy a given subsequence constraint. Frequent sequence mining is a fundamental task in data mining and has many real-life applications, such as information extraction, market-basket analysis, web usage mining, and session analysis. Depending on the underlying application, we are generally interested in discovering certain frequent sequences, which are described using subsequence constraints. Many tools and algorithms exist for this task; however, they are not sufficiently scalable to deal with the large amounts of data that may arise in applications, and they are generally not extensible across a range of applications. We propose scalable, distributed sequence mining algorithms that target MapReduce. Our work builds on MG-FSM, which is a distributed framework for frequent sequence mining. We propose novel algorithms that improve and extend the basic MG-FSM framework to efficiently support traditional subsequence constraints that arise in applications. Additionally, we show that many subsequence constraints (including and beyond the traditional ones considered in the literature) can be unified in a single framework. A unified treatment allows researchers to study many types of subsequence constraints jointly (instead of each one individually) and helps to improve the usability of pattern mining systems for practitioners. To this end, we propose a general-purpose framework that provides a set of simple and intuitive "pattern expressions", which can describe any subsequence constraint of interest, and we explore algorithms for efficiently mining frequent subsequences under such general constraints. Our experimental study on real-world datasets indicates that our proposed algorithms are scalable and effective across a wide range of applications.

MAnnheim DOCument Server

A unified framework for frequent sequence mining with subsequence constraints

Authors: Beedkar Kaustubh, Gemulla Rainer, Mertens Wim
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/01/2019

Frequent sequence mining methods often make use of constraints to control which subsequences should be mined. A variety of such subsequence constraints has been studied in the literature, including length, gap, span, regular-expression, and hierarchy constraints. In this article, we show that many subsequence constraints (including and beyond those considered in the literature) can be unified in a single framework. A unified treatment allows researchers to study many types of subsequence constraints jointly (instead of each one individually) and helps to improve the usability of pattern mining systems for practitioners. In more detail, we propose a set of simple and intuitive "pattern expressions" to describe subsequence constraints and explore algorithms for efficiently mining frequent subsequences under such general constraints. Our algorithms translate pattern expressions to succinct finite-state transducers, which we use as the computational model, and simulate these transducers in a way suitable for frequent sequence mining. Our experimental study on real-world datasets indicates that our algorithms, although more general, are efficient and, when used for sequence mining with prior constraints studied in the literature, competitive with (and in some cases superior to) state-of-the-art specialized methods.

MAnnheim DOCument Server
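
To make the gap and length constraints named in the abstract above concrete, here is a minimal Python sketch of frequent sequence mining under those two constraints. It is an illustration only: it uses naive enumeration, not the pattern expressions or finite-state transducers of the listed works, and it is not the DESQ or MG-FSM API. The function names and the parameters min_support, max_gap, and max_len are assumptions made up for this example, and max_len is assumed to be at least 1.

```python
from collections import Counter


def constrained_subsequences(seq, max_gap, max_len):
    """Return the set of subsequences of seq (as tuples) with 1..max_len items,
    skipping at most max_gap input items between consecutive picks."""
    found = set()

    def extend(prefix, last_pos):
        found.add(prefix)
        if len(prefix) == max_len:
            return  # length constraint reached
        # Gap constraint: the next item may lie at most max_gap + 1 positions ahead.
        stop = min(len(seq), last_pos + max_gap + 2)
        for nxt in range(last_pos + 1, stop):
            extend(prefix + (seq[nxt],), nxt)

    for start in range(len(seq)):
        extend((seq[start],), start)
    return found


def mine_frequent_sequences(database, min_support, max_gap, max_len):
    """Count, for each constrained subsequence, how many input sequences
    contain it, and keep those found in at least min_support of them."""
    support = Counter()
    for seq in database:
        # A set per input sequence, so each sequence counts once per pattern.
        support.update(constrained_subsequences(seq, max_gap, max_len))
    return {pat: n for pat, n in support.items() if n >= min_support}


# Hypothetical toy database of three input sequences.
database = [["a", "b", "c"], ["a", "c"], ["a", "x", "y", "c"]]
result = mine_frequent_sequences(database, min_support=2, max_gap=1, max_len=2)
```

With max_gap=1, the pattern ("a", "c") is supported by the first two input sequences but not by the third, where two items separate "a" and "c". Tightening the constraint to max_gap=0 leaves it supported by the second sequence only, so it is no longer frequent.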