245 research outputs found
A Subsequence Interleaving Model for Sequential Pattern Mining
Recent sequential pattern mining methods have used the minimum description
length (MDL) principle to define an encoding scheme which describes an
algorithm for mining the most compressing patterns in a database. We present a
novel subsequence interleaving model based on a probabilistic model of the
sequence database, which allows us to search for the most compressing set of
patterns without designing a specific encoding scheme. Our proposed algorithm
is able to efficiently mine the most relevant sequential patterns and rank them
using an associated measure of interestingness. The efficient inference in our
model is a direct result of our use of a structural expectation-maximization
framework, in which the expectation-step takes the form of a submodular
optimization problem subject to a coverage constraint. We show on both
synthetic and real world datasets that our model mines a set of sequential
patterns with low spuriousness and redundancy, high interpretability and
usefulness in real-world applications. Furthermore, we demonstrate that the
quality of the patterns from our approach is comparable to, if not better than,
existing state-of-the-art sequential pattern mining algorithms.
Comment: 10 pages in KDD 2016: Proceedings of the 22nd ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining
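As a rough illustration of what a submodular optimization problem under a coverage constraint looks like, the sketch below runs the classic greedy heuristic for weighted set cover. It is not the paper's expectation-step: the function name `greedy_cover`, the candidate patterns, and their costs are all hypothetical, chosen only to show the greedy pattern of picking the cheapest cover per newly covered element.

```python
import math


def greedy_cover(universe, candidates):
    """Greedy weighted set cover (illustrative sketch, not the paper's algorithm).

    candidates maps a name to a (cost, covered_elements) pair. At each step the
    candidate with the lowest cost per newly covered element is selected.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best, best_ratio = None, math.inf
        for name, (cost, covered) in candidates.items():
            gain = len(covered & uncovered)
            if gain == 0:
                continue  # this candidate adds nothing new
            ratio = cost / gain
            if ratio < best_ratio:
                best, best_ratio = name, ratio
        if best is None:
            raise ValueError("candidates cannot cover the universe")
        chosen.append(best)
        uncovered -= candidates[best][1]
    return chosen


# Hypothetical example: positions 0..4 of one sequence, covered either by
# multi-item patterns or by single items. Costs are made-up numbers.
candidates = {
    "ab": (1.0, {0, 1}),
    "cde": (1.2, {2, 3, 4}),
    "a": (0.9, {0}),
    "b": (0.9, {1}),
    "c": (0.9, {2}),
    "d": (0.9, {3}),
    "e": (0.9, {4}),
}
cover = greedy_cover(range(5), candidates)
```

With these made-up costs the two multi-item patterns are cheaper per covered position than the single items, so the greedy pass selects them and stops once every position is covered.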
A Constraint Programming Approach for Mining Sequential Patterns in a Sequence Database
Constraint-based pattern discovery is at the core of numerous data mining
tasks. Patterns are extracted with respect to a given set of constraints
(frequency, closedness, size, etc.). In the context of sequential pattern
mining, a large number of dedicated techniques have been developed for solving
particular classes of constraints. The aim of this paper is to investigate the
use of Constraint Programming (CP) to model and mine sequential patterns in a
sequence database. Our CP approach offers a natural way to combine, within a
single framework, a large set of constraints coming from various origins.
Experiments show the feasibility and the interest of our approach.
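To make the kind of constraints mentioned above concrete, here is a minimal sketch that checks a frequency constraint and a size constraint on a candidate pattern in plain Python. It is not a constraint programming model and is not taken from the paper; a CP solver would state these conditions declaratively and search for all patterns satisfying them. The helper names and the toy database are assumptions made for illustration.

```python
def is_subsequence(pattern, sequence):
    """True if the items of pattern occur in sequence in order (gaps allowed)."""
    remaining = iter(sequence)
    # `item in remaining` advances the iterator, so order is respected.
    return all(item in remaining for item in pattern)


def support(pattern, database):
    """Number of sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database)


def satisfies(pattern, database, min_support, max_size):
    """Check a size constraint and a frequency constraint together."""
    return len(pattern) <= max_size and support(pattern, database) >= min_support


# Hypothetical toy database: each string is one sequence of single-letter items.
database = ["abcb", "acb", "bca", "ab"]
ab_support = support("ab", database)
cb_support = support("cb", database)
ab_ok = satisfies("ab", database, min_support=3, max_size=2)
cb_ok = satisfies("cb", database, min_support=3, max_size=2)
```

Here the pattern `ab` occurs in three of the four sequences and passes both constraints, while `cb` occurs in only two and fails the frequency constraint.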
The Minimum Description Length Principle for Pattern Mining: A Survey
This is about the Minimum Description Length (MDL) principle applied to
pattern mining. The length of this description is kept to the minimum.
Mining patterns is a core task in data analysis and, beyond issues of
efficient enumeration, the selection of patterns constitutes a major challenge.
The MDL principle, a model selection method grounded in information theory, has
been applied to pattern mining with the aim to obtain compact high-quality sets
of patterns. After giving an outline of relevant concepts from information
theory and coding, as well as of work on the theory behind the MDL and similar
principles, we review MDL-based methods for mining various types of data and
patterns. Finally, we open a discussion on some issues regarding these methods,
and highlight currently active related data analysis problems.
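The idea of selecting the pattern set that compresses the data best can be sketched with a two-part code: the bits needed to describe the model plus the bits needed to describe the data given the model. The example below is a simplified illustration, not an encoding from the survey. It assumes the data has already been split into tokens, that tokens receive Shannon-optimal code lengths based on their usage, and that each pattern in the code table is spelled out with a uniform code over the alphabet; all function names are hypothetical.

```python
import math
from collections import Counter


def data_bits(tokens):
    """Bits to encode the tokens with code lengths -log2(usage / total)."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum(c * math.log2(c / total) for c in counts.values())


def model_bits(code_table, alphabet_size):
    """Bits to spell out every pattern, assuming a uniform code per symbol."""
    return sum(len(p) for p in code_table) * math.log2(alphabet_size)


def total_bits(tokens, alphabet_size):
    """Two-part description length: model cost plus data cost."""
    return model_bits(set(tokens), alphabet_size) + data_bits(tokens)


# The string "abababab" over the alphabet {a, b}, tokenized in two ways.
singles = ["a", "b"] * 4        # code table {a, b}
with_pattern = ["ab"] * 4       # code table {ab}
bits_singles = total_bits(singles, alphabet_size=2)
bits_pattern = total_bits(with_pattern, alphabet_size=2)
```

Under these assumptions the code table containing the single pattern `ab` yields a shorter total description than the table of single items, which is the sense in which a pattern set is preferred when it compresses the data better.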
Identifying and Disentangling Interleaved Activities of Daily Living from Sensor Data
Activity discovery (AD) refers to the unsupervised extraction of structured activity data from a stream of sensor readings in a real-world or virtual environment. Activity discovery is part of the broader topic of activity recognition, which has potential uses in fields as varied as social work and elder care, psychology and intrusion detection. Since activity recognition datasets are both hard to come by and very time-consuming to label, the development of reliable activity discovery systems could be of significant utility to researchers and developers working in the field, as well as to the wider machine learning community.
This thesis focuses on the investigation of activity discovery systems that can deal with interleaving, which refers to the phenomenon of continuous switching between multiple high-level activities over a short period of time. This is a common characteristic of the real-world datastreams that activity discovery systems have to deal with, but it is one that is unfortunately often left unaddressed in the existing literature.
As part of the research presented in this thesis, the fact that activities exist at multiple levels of abstraction is highlighted. A single activity is often a constituent element of a larger, more complex activity, and in turn has constituents of its own that are activities. Thus this investigation necessarily considers activity discovery systems that can find these hierarchies.
The primary contribution of this thesis is the development and evaluation of an activity discovery system that is capable of identifying interleaved activities in sequential data. Starting from a baseline system implemented using a topic model, novel approaches are proposed making use of modern language models taken from the field of natural language processing, before moving on to more advanced language modelling that can handle complex, interleaved data. As well as the identification of activities, the thesis also proposes the abstraction of activities into larger, more complex activities. This allows for the construction of hierarchies of activities that more closely reflect the complex inherent structure of activities present in real-world datasets compared to other approaches.
The thesis also discusses a number of important issues relating to the evaluation of activity discovery systems, and examines how existing evaluation metrics may at times be misleading. This includes highlighting the existence of differing abstraction issues in activity discovery evaluation, and suggestions for how this problem can be mitigated. Finally, alternative evaluation metrics are investigated.
Naturally, this dissertation does not fully solve the problem of activity discovery, and work remains to be done. However, a number of the most pressing issues that affect real-world activity discovery systems are tackled head-on, and it is shown that useful progress can indeed be made on them. This work aims to benefit systems that are as “clean slate” as possible, and hence incorporate no domain-specific knowledge. This is perhaps somewhat of an artificial handicap to impose in this problem domain, but it does have the advantage of making this work applicable to as broad a range of domains as possible.
- …