Using Answer Set Programming for pattern mining
Serial pattern mining consists of extracting the frequent sequential patterns
from a single sequence of itemsets. This paper explores the ability of a
declarative language, Answer Set Programming (ASP), to solve this task
efficiently. We propose several ASP implementations of the frequent sequential
pattern mining task: a non-incremental and an incremental resolution. The
results show that the incremental resolution is more efficient than the
non-incremental one, but both ASP programs are less efficient than dedicated
algorithms. Nonetheless, this approach can be seen as a first step toward a
generic framework for sequential pattern mining with constraints. Comment: Intelligence Artificielle Fondamentale (2014)
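The underlying task, mining frequent serial patterns from a single sequence of itemsets, can be illustrated with a small sketch. This is plain Python rather than ASP, and the window-based counting and restriction to length-2 patterns are simplifying assumptions, not the paper's encoding:

```python
def frequent_serial_pairs(sequence, window, minsup):
    """Count ordered item pairs (a, b) where a occurs in an itemset
    strictly before an itemset containing b, within `window` positions,
    and keep the pairs whose count reaches `minsup`."""
    counts = {}
    for i, itemset in enumerate(sequence):
        # Look ahead at most `window - 1` itemsets for the second item.
        for j in range(i + 1, min(i + window, len(sequence))):
            for a in itemset:
                for b in sequence[j]:
                    counts[(a, b)] = counts.get((a, b), 0) + 1
    return {p: c for p, c in counts.items() if c >= minsup}
```

For example, `frequent_serial_pairs([{'a'}, {'b'}, {'a'}, {'b'}], window=2, minsup=2)` keeps only the pattern `('a', 'b')`, which occurs twice.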
Mining top-k regular episodes from sensor streams
The monitoring of human activities plays an important role in health-care applications and is of interest to the data mining community. Existing approaches address the recognition of activities occurring in sensor data streams. However, regular behaviors have not been studied. We therefore introduce TKRES, a new approach to discover the top-k most regular episodes from sensor streams. The top-k approach allows us to control the size of the output, thus preventing overwhelming result analysis for the supervisor. TKRES is based on the use of a simple top-k list and a k-tree structure for maintaining the top-k episodes and their occurrence information. We also investigate and report the performance of TKRES on two real-life smart home datasets.
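The top-k list idea can be sketched with a bounded min-heap that keeps the k most regular episodes seen so far. The function name and scoring are hypothetical; TKRES's actual structures, including the k-tree, are not reproduced here:

```python
import heapq

def update_top_k(top_k, episode, regularity, k):
    """Maintain a top-k list of (regularity, episode) pairs; a higher
    regularity score wins. A min-heap keeps the weakest entry at the
    root so it can be evicted in O(log k)."""
    if len(top_k) < k:
        heapq.heappush(top_k, (regularity, episode))
    elif regularity > top_k[0][0]:
        heapq.heapreplace(top_k, (regularity, episode))
    return top_k
```

Feeding episodes `A..D` with scores 1, 5, 3, 4 and k=2 leaves the list holding the two highest-scored episodes, `B` and `D`.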
Mining Weighted Frequent Closed Episodes over Multiple Sequences
Frequent episode discovery is introduced to mine useful and interesting temporal patterns from sequential data. Existing episode mining methods have mainly focused on mining a single long sequence consisting of events with time constraints. However, there can be multiple sequences of different importance, as the persons or entities associated with each sequence can be of different importance. Aiming to mine episodes in multiple sequences of different importance, we first define a new kind of episode, the weighted frequent closed episode, which takes sequence importance, episode distribution, and occurrence frequency into account together. Secondly, to facilitate the mining of such new episodes, we present a new concept called maximal duration serial episodes, which cuts a whole sequence into multiple maximal episodes using duration constraints, and we discuss its properties for episode shrinking. Finally, based on these theoretical properties, we propose a two-phase approach to efficiently mine these new episodes. In Phase I, we adopt a level-wise episode shrinking framework to discover the candidate frequent closed episodes with the same prefixes, and in Phase II, we match the candidates with different prefixes to find the frequent closed episodes. Experiments on simulated and real datasets demonstrate that the proposed episode mining strategy has good mining effectiveness and efficiency.
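The duration-based cutting step can be sketched as follows, assuming a timestamped event sequence and interpreting the duration constraint as a maximum gap between consecutive events. This is a simplification of the paper's maximal duration serial episodes, not its exact definition:

```python
def split_by_duration(events, max_gap):
    """Cut a timestamped event sequence into maximal segments in which
    consecutive events are at most `max_gap` time units apart.
    `events` is a list of (timestamp, event) pairs in time order."""
    segments, current = [], []
    last_t = None
    for t, e in events:
        if last_t is not None and t - last_t > max_gap:
            # Gap too large: close the current segment and start a new one.
            segments.append(current)
            current = []
        current.append((t, e))
        last_t = t
    if current:
        segments.append(current)
    return segments
```

With a gap limit of 3, the sequence `[(1,'a'), (2,'b'), (10,'c'), (11,'d')]` is cut into two segments at the jump from time 2 to time 10.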
Mining Positional Data Streams
We study frequent pattern mining from positional data streams. Existing approaches require discretised data to identify atomic events and are not applicable in our continuous setting. We propose an efficient trajectory-based preprocessing step to identify similar movements and a distributed pattern mining algorithm to identify frequent trajectories. We empirically evaluate all parts of the processing pipeline.
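One simple way to test whether two movements are similar is a pointwise distance with a threshold. This is a sketch under the assumption of equal-length, time-aligned trajectories; the paper's actual preprocessing is not specified here:

```python
import math

def trajectory_distance(t1, t2):
    """Mean pointwise Euclidean distance between two equal-length
    trajectories, each a list of (x, y) positions."""
    assert len(t1) == len(t2)
    return sum(math.dist(p, q) for p, q in zip(t1, t2)) / len(t1)

def similar(t1, t2, eps):
    """Two trajectories are 'similar movements' if their mean
    pointwise distance is within the threshold eps (an assumption)."""
    return trajectory_distance(t1, t2) <= eps
```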
Code Clone Discovery Based on Concolic Analysis
Software is often large, complicated and expensive to build and maintain. Redundant
code can make these applications even more costly and difficult to maintain. Duplicated
code is often introduced into these systems for a variety of reasons, including
developer churn, deficient developer comprehension of the application, and a
lack of adherence to proper development practices.
Code redundancy has several adverse effects on a software application including an
increased size of the codebase and inconsistent developer changes due to elevated
program comprehension needs. A code clone is defined as multiple code fragments that
produce similar results when given the same input. Four types of clones are
generally recognized, ranging from the simple type-1 and type-2 clones to the
more complicated type-3 and type-4 clones. Numerous clone detection mechanisms
are able to
identify the simpler types of code clone candidates, but far fewer claim the ability to find
the more difficult type-3 clones. Before CCCD, MeCC and FCD were the only clone
detection techniques capable of finding type-4 clones. A drawback of MeCC is the
excessive time required to detect clones and the likely exploration of an unreasonably
large number of possible paths. FCD requires extensive amounts of random data and a
significant period of time in order to discover clones.
This dissertation presents a new process for discovering code clones known as Concolic
Code Clone Discovery (CCCD). This technique discovers code clone candidates based on
the functionality of the application, not its syntactical nature. This means that things like
naming conventions and comments in the source code have no effect on the proposed
clone detection process. CCCD finds clones by first performing concolic analysis on the
targeted source code. Concolic analysis combines concrete and symbolic execution in
order to traverse all possible paths of the targeted program. These paths are represented
by the generated concolic output. A diff tool is then used to determine if the concolic
output for a method is identical to the output produced for another method. Duplicated
output is indicative of a code clone.
CCCD was validated against several open source applications along with clones of all
four types as defined by previous research. The results demonstrate that CCCD was able
to detect all types of clone candidates with a high level of accuracy.
In the future, CCCD will be used to examine how software developers work with type-3
and type-4 clones. CCCD will also be applied to various areas of security research,
including intrusion detection mechanisms.
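The final matching step, flagging methods whose concolic output is identical, can be sketched as follows. The input format is hypothetical, and where CCCD uses a diff tool, this sketch hashes normalized output strings instead:

```python
import hashlib
from collections import defaultdict

def clone_candidates(concolic_outputs):
    """Group methods whose normalized concolic output is identical;
    any group with more than one method is a clone candidate set.
    `concolic_outputs` maps method names to their concolic output text."""
    groups = defaultdict(list)
    for method, output in concolic_outputs.items():
        # Normalize trailing whitespace, then hash so comparison is O(1).
        digest = hashlib.sha256(output.strip().encode()).hexdigest()
        groups[digest].append(method)
    return [sorted(g) for g in groups.values() if len(g) > 1]
```

Two methods whose paths produce the same output text (up to surrounding whitespace) land in the same group and are reported as clone candidates.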
Mining Closed Episodes with Simultaneous Events
Sequential pattern discovery is a well-studied field in data mining. Episodes are sequential patterns describing events that often occur in the vicinity of each other. Episodes can impose restrictions on the order of the events, which makes them a versatile technique for describing complex patterns in the sequence. Most of the research on episodes deals with special cases such as serial, parallel, and injective episodes, while discovering general episodes is understudied. In this paper we extend the definition of an episode in order to be able to represent cases where events often occur simultaneously. We present an efficient and novel miner for discovering frequent and closed general episodes. Such a task presents unique challenges. Firstly, we cannot define closure based on frequency. We solve this by computing a more conservative closure that we use to reduce the search space and discover the closed episodes as a postprocessing step. Secondly, episodes are traditionally represented as directed acyclic graphs. We argue that this representation has drawbacks leading to redundancy in the output. We solve these drawbacks by defining a subset relationship in such a way that allows us to remove the redundant episodes. We demonstrate the efficiency of our algorithm and the need for using closed episodes empirically on synthetic and real-world datasets.
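The postprocessing idea, keeping a pattern only if no strict superpattern has the same frequency, can be sketched on itemsets for simplicity. General episodes require the paper's graph-based subset relation instead of the plain subset test used here:

```python
def closed_patterns(freq):
    """Keep only closed patterns: those with no strict superpattern of
    equal frequency. `freq` maps frozenset patterns to their counts."""
    closed = []
    for p, c in freq.items():
        # p is closed unless some strict superset q has the same count.
        if not any(p < q and c == cq for q, cq in freq.items()):
            closed.append(p)
    return closed
```

For example, with counts {a}: 3, {a, b}: 3, {b}: 4, the pattern {a} is absorbed by {a, b} (same count), so only {a, b} and {b} are closed.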