Temporal data mining for root-cause analysis of machine faults in automotive assembly lines
Engine assembly is a complex and heavily automated distributed-control
process, with large amounts of fault data logged every day. We describe an
application of temporal data mining for analyzing fault logs in an engine
assembly plant. The frequent episode discovery framework is a model-free method
that can be used to deduce (temporal) correlations among events from the logs
in an efficient manner. In addition to being theoretically elegant and
computationally efficient, frequent episodes are also easy to interpret in the
form of actionable recommendations. Incorporation of domain-specific information
is critical to successful application of the method for analyzing fault logs in
the manufacturing domain. We show how domain-specific knowledge can be
incorporated using heuristic rules that act as pre-filters and post-filters to
frequent episode discovery. The system described here is currently being used
in one of the engine assembly plants of General Motors and is planned for
adaptation in other plants. To the best of our knowledge, this paper presents
the first real, large-scale application of temporal data mining in the
manufacturing domain. We believe that the ideas presented in this paper can
help practitioners engineer tools for analysis in other similar or related
application domains as well.
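As a rough illustration of what counting a frequent episode in a fault log can look like, the sketch below counts non-overlapped occurrences of a serial episode (an ordered tuple of event types) with a greedy left-to-right scan. The abstract does not say which frequency definition the deployed system uses, so the counting rule, the function name `count_non_overlapped`, and the event codes are all assumptions made for illustration.

```python
def count_non_overlapped(events, episode):
    """Count non-overlapped occurrences of a serial episode.

    events  -- event types in time order (hypothetical fault codes)
    episode -- non-empty tuple of event types that must occur in this order

    Assumption: occurrences are counted greedily and must not overlap,
    which is one frequency definition used in frequent-episode mining.
    """
    if not episode:
        raise ValueError("episode must contain at least one event type")
    count = 0
    pos = 0  # index of the next episode event we are waiting for
    for ev in events:
        if ev == episode[pos]:
            pos += 1
            if pos == len(episode):  # a full occurrence was completed
                count += 1
                pos = 0              # start looking for the next one
    return count


# Made-up log for illustration only.
log = ["A", "B", "A", "C", "B", "A", "B"]
print(count_non_overlapped(log, ("A", "B")))
```

An episode would then be reported as frequent when its count meets a user-chosen threshold; the pre-filters and post-filters mentioned in the abstract would sit before and after this counting step.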
Inferring Neuronal Network Connectivity using Time-constrained Episodes
Discovering frequent episodes in event sequences is an interesting data
mining task. In this paper, we argue that this framework is very effective for
analyzing multi-neuronal spike train data. Analyzing spike train data is an
important problem in neuroscience, though no data mining approaches have been
reported for it. Motivated by this application, we introduce different
temporal constraints on the occurrences of episodes. We present algorithms for
discovering frequent episodes under temporal constraints. Through simulations,
we show that our method is very effective for analyzing spike train data for
unearthing underlying connectivity patterns.
Comment: 9 pages. See also http://neural-code.cs.vt.edu
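One possible reading of a temporal constraint on episode occurrences is an inter-event delay window: each event must follow the previous one by at least `low` and at most `high` time units. The sketch below counts occurrences of a serial episode under that assumption. The constraint form, the greedy non-overlapped counting, and every name in it are illustrative assumptions, not the algorithms from the paper.

```python
def _extend(spikes, episode, low, high, idx, k):
    """spikes[idx] matches episode[k]; try to match the rest of the episode.

    Returns the list of spike indices forming an occurrence, or None.
    """
    if k == len(episode) - 1:
        return [idx]
    t = spikes[idx][0]
    for j in range(idx + 1, len(spikes)):
        dt = spikes[j][0] - t
        if dt > high:          # spikes are time-sorted, so stop early
            break
        if dt >= low and spikes[j][1] == episode[k + 1]:
            rest = _extend(spikes, episode, low, high, j, k + 1)
            if rest is not None:
                return [idx] + rest
    return None


def count_constrained(spikes, episode, low, high):
    """Greedily count non-overlapped occurrences of a serial episode whose
    consecutive events are separated by a delay in [low, high].

    spikes -- list of (time, neuron_label) tuples sorted by time
    """
    count = 0
    i = 0
    while i < len(spikes):
        occ = None
        if spikes[i][1] == episode[0]:
            occ = _extend(spikes, episode, low, high, i, 0)
        if occ is None:
            i += 1
        else:
            count += 1
            i = occ[-1] + 1    # continue after the end of this occurrence
    return count


# Made-up spike train for illustration only.
train = [(0, "A"), (2, "B"), (3, "A"), (10, "B"), (11, "A"), (13, "B"), (15, "C")]
print(count_constrained(train, ("A", "B"), low=1, high=4))
```

A high count for an episode such as A then B within a short delay window would be read as a hint that neuron A drives neuron B, which is the kind of connectivity pattern the abstract refers to.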
Utility Mining Across Multi-Dimensional Sequences
Knowledge extraction from databases is a fundamental task in the database and
data mining community, and it has been applied to a wide range of real-world
applications and situations. Different from the support-based mining models,
the utility-oriented mining framework integrates the utility theory to provide
more informative and useful patterns. Time-dependent sequence data is commonly
seen in real life. Sequence data has been widely utilized in many applications,
such as analyzing sequential user behavior on the Web, influence maximization,
route planning, and targeted marketing. Unfortunately, all the existing
algorithms lose sight of the fact that the processed data not only contain rich
features (e.g., occurrence quantity, risk, profit, etc.), but also may be associated
with multi-dimensional auxiliary information, e.g., a transaction sequence can be
associated with purchaser profile information. In this paper, we first
formulate the problem of utility mining across multi-dimensional sequences, and
propose a novel framework named MDUS to extract Multi-Dimensional
Utility-oriented Sequential useful patterns. Two algorithms, named MDUS_EM and
MDUS_SD respectively, are presented to address the formulated problem. The former
algorithm is based on database transformation, and the latter uses
pattern joins and a search method to identify desired patterns across
multi-dimensional sequences. Extensive experiments are carried out on five
real-life datasets and one synthetic dataset to show that the proposed
algorithms can effectively and efficiently discover the useful knowledge from
multi-dimensional sequential databases. Moreover, the MDUS framework can
provide better insight, and it is more adaptable to real-life situations than
existing models.
Comment: Under review in IEEE TKDE, 14 page
Summarizing Event Sequences with Serial Episodes: A Statistical Model and an Application
In this paper we address the problem of discovering a small set of frequent
serial episodes from sequential data so as to adequately characterize or
summarize the data. We discuss an algorithm based on the Minimum Description
Length (MDL) principle; it is a slight modification of an earlier method
called CSC-2. We present a novel generative model for sequence data
containing prominent pairs of serial episodes and, using this, provide some
statistical justification for the algorithm. We believe this is the first
instance of such a statistical justification for an MDL-based algorithm for
summarizing event sequence data. We then present a novel application of this
data mining algorithm in text classification. By considering text documents as
temporal sequences of words, the data mining algorithm can find a set of
characteristic episodes for all the training data as a whole. The words that
are part of these characteristic episodes could then be considered the only
relevant words for the dictionary thus resulting in a considerably reduced
feature vector dimension. We show, through simulation experiments using
benchmark data sets, that the discovered frequent episodes can be used to
achieve more than four-fold reduction in dictionary size without losing any
classification accuracy.
Comment: 12 pages. Under review for IEEE TKD
Relationship-aware sequential pattern mining
Relationship-aware sequential pattern mining is the problem of mining
frequent patterns in sequences in which the events of a sequence are mutually
related by one or more concepts from some respective hierarchical taxonomies,
based on the type of the events. Additionally, the events themselves are
described with a certain number of taxonomical concepts. We present RaSP, an
algorithm that is able to mine relationship-aware patterns over such sequences;
RaSP follows a two-stage approach. In the first stage it mines for frequent
type patterns and all their occurrences within the different sequences.
In the second stage it performs hierarchical mining where for each frequent
type pattern and its occurrences it mines for more specific frequent patterns
in the lower levels of the taxonomies. We test RaSP on a real-world medical
application that provided the inspiration for its development, in which we
mine for frequent patterns of medical behavior in the antibiotic treatment of
microbes, and show that it has very good computational performance given the
complexity of the relationship-aware sequential pattern mining problem.
Fast Utility Mining on Complex Sequences
High-utility sequential pattern mining is an emerging topic in the field of
Knowledge Discovery in Databases. It consists of discovering subsequences
having a high utility (importance) in sequences, referred to as high-utility
sequential patterns (HUSPs). HUSPs can be applied to many real-life
applications, such as market basket analysis, E-commerce recommendation,
click-stream analysis and scenic route planning. For example, in economics and
targeted marketing, understanding the economic behavior of consumers, such as
finding credible and reliable information on product profitability, is quite
challenging. Several algorithms have been proposed to address this problem by
efficiently mining utility-based useful sequential patterns. Nevertheless, the
performance of these algorithms can be unsatisfying in terms of runtime and
memory usage due to the combinatorial explosion of the search space for low
utility thresholds and large databases. Hence, this paper proposes a more
efficient algorithm for the task of high-utility sequential pattern mining,
called HUSP-ULL. It utilizes a lexicographic sequence (LS)-tree and a
utility-linked (UL)-list structure to discover HUSPs quickly. Furthermore, two
pruning strategies are introduced in HUSP-ULL to obtain tight upper-bounds on
the utility of candidate sequences, and reduce the search space by pruning
unpromising candidates early. Substantial experiments on both real-life and
synthetic datasets show that the proposed algorithm can effectively and
efficiently discover the complete set of HUSPs and outperforms the
state-of-the-art algorithms.
Comment: Under review in IEEE TKDE, 15 page
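To make the notion of a high-utility sequential pattern concrete, the sketch below computes the utility of a pattern by brute-force dynamic programming. It assumes the definition commonly used in the HUSP literature: the utility of a pattern in one sequence is the maximum, over all its occurrences, of the summed quantity times unit profit, and its utility in a database is the sum over sequences. It is not HUSP-ULL and uses neither the LS-tree nor the UL-list; patterns are restricted to one item per element, and all names and numbers are made up for illustration.

```python
NEG_INF = float("-inf")


def pattern_utility(sequence, pattern, profit):
    """Maximum utility of `pattern` over all its occurrences in `sequence`.

    sequence -- list of itemsets, each a dict mapping item -> quantity
    pattern  -- list of items, one per element, to match in order
    profit   -- dict mapping item -> unit profit
    Returns 0 if the pattern does not occur.
    """
    # best[k] = best utility of matching the first k pattern items so far
    best = [0] + [NEG_INF] * len(pattern)
    for itemset in sequence:
        # Go right-to-left so one itemset matches at most one pattern item.
        for k in range(len(pattern), 0, -1):
            item = pattern[k - 1]
            if item in itemset and best[k - 1] != NEG_INF:
                gain = itemset[item] * profit[item]
                best[k] = max(best[k], best[k - 1] + gain)
    return 0 if best[-1] == NEG_INF else best[-1]


def database_utility(database, pattern, profit):
    """Sum of the per-sequence utilities of `pattern`."""
    return sum(pattern_utility(seq, pattern, profit) for seq in database)


# Made-up data for illustration only.
profit = {"a": 2, "b": 5, "c": 1}
db = [
    [{"a": 1, "b": 2}, {"a": 3}, {"b": 1, "c": 4}],
    [{"b": 1}, {"a": 2}],
]
min_util = 10  # hypothetical minimum utility threshold
u = database_utility(db, ["a", "b"], profit)
print(u, u >= min_util)
```

Evaluating every candidate this way is what becomes expensive for low thresholds and large databases, which is where upper bounds and early pruning of the kind the abstract describes come in.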
ProUM: Projection-based Utility Mining on Sequence Data
Utility is an important concept in economics. A variety of applications
consider utility in real-life situations, which has led to the emergence of
utility-oriented mining (also called utility mining) over the past decade.
Utility mining has attracted a great amount of attention, but most of the
existing studies have been developed to deal with itemset-based data.
Time-ordered sequence data is more commonly seen in real-world situations,
which is different from itemset-based data. Because they are time-consuming and
require a large amount of memory, current utility mining algorithms still
have limitations when dealing with sequence data. In addition, the mining
efficiency of utility mining on sequence data still needs to be improved,
especially for long sequences or when there is a low minimum utility threshold.
In this paper, we propose an efficient Projection-based Utility Mining (ProUM)
approach to discover high-utility sequential patterns from sequence data. The
utility-array structure is designed to store the necessary information about
sequence order and utility. ProUM can significantly improve the mining
efficiency by utilizing the projection technique in generating the utility-array,
and it effectively reduces the memory consumption. Furthermore, a new upper
bound named sequence extension utility is proposed and several pruning
strategies are further applied to improve the efficiency of ProUM. By taking
utility theory into account, the derived high-utility sequential patterns have
more insightful and interesting information than other kinds of patterns.
Experimental results showed that the proposed ProUM algorithm significantly
outperformed the state-of-the-art algorithms in terms of execution time, memory
usage, and scalability.
Comment: Elsevier Information Science, 17 pages, 4 figure
Privacy Preserving Utility Mining: A Survey
In the big data era, collected data usually contain rich information and
hidden knowledge. Utility-oriented pattern mining and analytics have shown a
powerful ability to explore these ubiquitous data, which may be collected from
various fields and applications, such as market basket analysis, retail,
click-stream analysis, medical analysis, and bioinformatics. However, analysis
of these data with sensitive private information raises privacy concerns. To
achieve a better trade-off between utility maximization and privacy preservation,
Privacy-Preserving Utility Mining (PPUM) has become a critical issue in recent
years. In this paper, we provide a comprehensive overview of PPUM. We first
present the background of utility mining, privacy-preserving data mining and
PPUM, then introduce the related preliminaries and problem formulation of PPUM,
as well as some key evaluation criteria for PPUM. In particular, we present and
discuss the current state-of-the-art PPUM algorithms, as well as their
advantages and deficiencies in detail. Finally, we highlight and discuss some
technical challenges and open directions for future research on PPUM.
Comment: 2018 IEEE International Conference on Big Data, 10 page
Discovering Predictive Event Sequences in Criminal Careers
In this work, we consider the problem of predicting criminal behavior and propose a method for discovering predictive patterns in criminal histories. Quantitative criminal career analysis typically involves clustering individuals according to the frequency of a particular event type over time, using cluster membership as a basis for comparison. We demonstrate the effectiveness of hazard pattern mining for the discovery of relationships between different types of events that may occur in criminal careers. Hazard pattern mining is an extension of event sequence mining, with the additional restriction that each event in the pattern is the first subsequent event of the specified type. This restriction facilitates the application of established time-based measures such as those used in survival analysis. We evaluate hazard patterns using a relative risk model and an accelerated failure time model. The results show that hazard patterns can reliably capture unexpected relationships between events of different types.
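The "first subsequent event" restriction can be sketched as follows, under one plausible reading of the abstract: starting from each event of the first type, the next pattern event must be the first later event of the required type, and so on along the pattern. The function name, the data layout, and the event labels are assumptions for illustration; the paper's actual mining procedure may differ.

```python
def hazard_occurrences(history, pattern):
    """Find occurrences of `pattern` in which every event after the first is
    the first subsequent event of its required type.

    history -- list of (time, event_type) tuples sorted by time
    pattern -- non-empty sequence of event types
    Returns a list of tuples holding the event times of each occurrence.
    """
    occurrences = []
    for i, (t, typ) in enumerate(history):
        if typ != pattern[0]:
            continue
        times = [t]
        idx = i
        for wanted in pattern[1:]:
            # Index of the first later event of the wanted type, if any.
            nxt = next(
                (j for j in range(idx + 1, len(history))
                 if history[j][1] == wanted),
                None,
            )
            if nxt is None:
                times = None  # pattern cannot be completed from this start
                break
            times.append(history[nxt][0])
            idx = nxt
        if times is not None:
            occurrences.append(tuple(times))
    return occurrences


# Made-up history with generic event labels, for illustration only.
history = [(1, "X"), (3, "Y"), (4, "X"), (9, "Y"), (12, "X")]
for occ in hazard_occurrences(history, ("X", "Y")):
    print(occ, "time to next Y:", occ[1] - occ[0])
```

Because each start event leads to at most one occurrence, the gap between consecutive pattern events is a well-defined time-to-event value, which is what lets survival-style measures be applied.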