Discovering Compressing Serial Episodes from Event Sequences
Most pattern mining methods output a very large number of frequent patterns
and isolating a small but relevant subset is a challenging problem of current
interest in frequent pattern mining. In this paper we consider discovery of a
small set of relevant frequent episodes from data sequences. We make use of the
Minimum Description Length principle to formulate the problem of selecting a
subset of episodes. Using an interesting class of serial episodes with
inter-event constraints and a novel encoding scheme for data using such
episodes, we present algorithms for discovering a small set of episodes that
achieve good data compression. Using an example of the data streams obtained
from distributed sensors in a composable coupled conveyor system, we show that
our method is very effective in unearthing highly relevant episodes and that
our scheme also achieves good data compression.
Comment: 27 pages 3 figur
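As a rough illustration of what a serial episode with inter-event constraints might look like in practice, the following Python sketch enumerates occurrences of an ordered list of event types in a timestamped sequence, requiring each consecutive pair of events to be at most `max_gap` time units apart. The function name `episode_occurrences`, the `(event type, timestamp)` representation, and the single shared `max_gap` bound are assumptions made for illustration; the abstract specifies neither the exact constraint form nor the encoding scheme, and this sketch does not attempt the MDL-based selection step.

```python
from typing import List, Tuple

Event = Tuple[str, int]  # (event type, timestamp); hypothetical representation


def episode_occurrences(
    events: List[Event], episode: List[str], max_gap: int
) -> List[Tuple[int, ...]]:
    """Enumerate occurrences of a serial episode as tuples of timestamps.

    Consecutive events of an occurrence must be strictly increasing in time
    and at most `max_gap` apart (an assumed form of inter-event constraint).
    """
    ordered = sorted(events, key=lambda e: e[1])  # scan in time order
    found: List[Tuple[int, ...]] = []

    def extend(pos: int, k: int, times: List[int]) -> None:
        if k == len(episode):
            found.append(tuple(times))
            return
        for i in range(pos, len(ordered)):
            kind, t = ordered[i]
            if times:
                gap = t - times[-1]
                if gap <= 0:
                    continue  # same timestamp as the previous event: skip
                if gap > max_gap:
                    break  # events are time-sorted, so later ones are further away
            if kind == episode[k]:
                extend(i + 1, k + 1, times + [t])

    extend(0, 0, [])
    return found


if __name__ == "__main__":
    stream = [("A", 1), ("B", 2), ("A", 5), ("B", 9), ("C", 10)]
    print(episode_occurrences(stream, ["A", "B", "C"], max_gap=4))
```

Exhaustive enumeration like this is only practical for short sequences; it is meant to make the notion of a gap-constrained serial episode concrete, not to reproduce the paper's algorithms.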
A Chronological Edge-Driven Approach to Temporal Subgraph Isomorphism
Many real world networks are considered temporal networks, in which the
chronological ordering of the edges has importance to the meaning of the data.
Performing temporal subgraph matching on such graphs requires the edges in the
subgraphs to match the order of the temporal graph motif we are searching for.
Previous methods for solving this rely on the use of static subgraph matching
to find potential matches first, before filtering them based on edge order to
find the true temporal matches. We present a new algorithm for temporal
subgraph isomorphism that performs the subgraph matching directly on the
chronologically sorted edges. By restricting our search to only the subgraphs
with chronologically correct edges, we can improve the performance of the
algorithm significantly. We present experimental timing results to show
significant performance improvements on publicly available datasets for a
number of different temporal query graph motifs with four or more nodes. We
also demonstrate a practical example of how temporal subgraph isomorphism can
produce more meaningful results than traditional static subgraph searches.
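The idea of matching directly on chronologically sorted edges can be sketched in Python as follows. This is a minimal backtracking illustration, not the algorithm from the paper: the names `temporal_matches` and `bind`, the `(source, target, timestamp)` edge format, and the requirement of strictly increasing timestamps are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str, int]      # (source, target, timestamp); assumed format
MotifEdge = Tuple[str, str]      # (source variable, target variable)


def bind(mapping: Dict[str, str], var: str, node: str) -> bool:
    """Bind a motif variable to a graph node, keeping the mapping injective."""
    if var in mapping:
        return mapping[var] == node
    if node in mapping.values():
        return False
    mapping[var] = node
    return True


def temporal_matches(edges: List[Edge], motif: List[MotifEdge]) -> List[List[Edge]]:
    """Return edge sequences that match the motif edges in chronological order."""
    ordered = sorted(edges, key=lambda e: e[2])  # search only in time order
    results: List[List[Edge]] = []

    def extend(start: int, k: int, mapping: Dict[str, str], chosen: List[Edge]) -> None:
        if k == len(motif):
            results.append(list(chosen))
            return
        a, b = motif[k]
        for i in range(start, len(ordered)):
            u, v, t = ordered[i]
            # Consecutive motif edges must have strictly increasing timestamps.
            if chosen and t <= chosen[-1][2]:
                continue
            new_map = dict(mapping)
            if not bind(new_map, a, u) or not bind(new_map, b, v):
                continue
            chosen.append(ordered[i])
            extend(i + 1, k + 1, new_map, chosen)
            chosen.pop()

    extend(0, 0, {}, [])
    return results


if __name__ == "__main__":
    graph = [("A", "B", 1), ("B", "C", 2), ("C", "A", 3), ("B", "C", 0)]
    triangle = [("x", "y"), ("y", "z"), ("z", "x")]
    print(temporal_matches(graph, triangle))
```

In the example graph, the edge `("B", "C", 0)` would complete a triangle in a static search, but it is never combined with the other two edges here because it precedes the edge from `A` to `B` in time.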
Spatio-Temporal Data Mining: A Survey of Problems and Methods
Large volumes of spatio-temporal data are increasingly collected and studied
in diverse domains, including climate science, social sciences, neuroscience,
epidemiology, transportation, mobile health, and Earth sciences.
Spatio-temporal data differs from the relational data for which computational
approaches have been developed in the data mining community for multiple decades, in
that both spatial and temporal attributes are available in addition to the
actual measurements/attributes. The presence of these attributes introduces
additional challenges that need to be dealt with. Approaches for mining
spatio-temporal data have been studied for over a decade in the data mining
community. In this article we present a broad survey of this relatively young
field of spatio-temporal data mining. We discuss different types of
spatio-temporal data and the relevant data mining questions that arise in the
context of analyzing each of these datasets. Based on the nature of the data
mining problem studied, we classify literature on spatio-temporal data mining
into six major categories: clustering, predictive learning, change detection,
frequent pattern mining, anomaly detection, and relationship mining. We discuss
the various forms of spatio-temporal data mining problems in each of these
categories.
Comment: Accepted for publication at ACM Computing Survey
Spatio-temporal Video Parsing for Abnormality Detection
Abnormality detection in video poses particular challenges due to the
infinite size of the class of all irregular objects and behaviors. Thus no (or
by far not enough) abnormal training samples are available and we need to find
abnormalities in test data without actually knowing what they are.
Nevertheless, the prevailing concept of the field is to directly search for
individual abnormal local patches or image regions independently of one another. To
address this problem, we propose a method for joint detection of abnormalities
in videos by spatio-temporal video parsing. The goal of video parsing is to
find a set of indispensable normal spatio-temporal object hypotheses that
jointly explain all the foreground of a video, while, at the same time, being
supported by normal training samples. Consequently, we avoid a direct detection
of abnormalities and discover them indirectly as those hypotheses which are
needed for covering the foreground without finding an explanation for
themselves by normal samples. Abnormalities are localized by MAP inference in a
graphical model and we solve it efficiently by formulating it as a convex
optimization problem. We experimentally evaluate our approach on several
challenging benchmark sets, improving over the state-of-the-art on all standard
benchmarks both in terms of abnormality classification and localization.
Comment: 15 pages, 12 figures, 3 table
HYPA: Efficient Detection of Path Anomalies in Time Series Data on Networks
The unsupervised detection of anomalies in time series data has important
applications in user behavioral modeling, fraud detection, and cybersecurity.
Anomaly detection has, in fact, been extensively studied in categorical
sequences. However, we often have access to time series data that represent
paths through networks. Examples include transaction sequences in financial
networks, click streams of users in networks of cross-referenced documents, or
travel itineraries in transportation networks. To reliably detect anomalies, we
must account for the fact that such data contain a large number of independent
observations of paths constrained by a graph topology. Moreover, the
heterogeneity of real systems rules out frequency-based anomaly detection
techniques, which do not account for highly skewed edge and degree statistics.
To address this problem, we introduce HYPA, a novel framework for the
unsupervised detection of anomalies in large corpora of variable-length
temporal paths in a graph. HYPA provides an efficient analytical method to
detect paths with anomalous frequencies that result from nodes being traversed
in unexpected chronological order.
Comment: 11 pages with 8 figures and supplementary material. To appear at SIAM
Data Mining (SDM 2020
Mining Top-k Sequential Patterns in Database Graphs: A New Challenging Problem and a Sampling-based Approach
In many real world networks, a vertex is usually associated with a
transaction database that comprehensively describes the behaviour of the
vertex. A typical example is the social network, where the behaviour of every
user is depicted by a transaction database that stores his daily posted
contents. A transaction database is a set of transactions, where a transaction
is a set of items. Every path of the network is a sequence of vertices that
induces multiple sequences of transactions. The sequences of transactions
induced by all of the paths in the network form an extremely large sequence
database. Finding frequent sequential patterns from such a sequence database
discovers interesting subsequences that frequently appear in many paths of the
network. However, it is a challenging task, since the sequence database induced
by a database graph is too large to be explicitly induced and stored. In this
paper, we propose the novel notion of database graph, which naturally models a
wide spectrum of real world networks by associating each vertex with a
transaction database. Our goal is to find the top-k frequent sequential
patterns in the sequence database induced from a database graph. We prove that
this problem is #P-hard. To tackle this problem, we propose an efficient
two-step sampling algorithm that approximates the top-k frequent sequential
patterns with provable quality guarantee. Extensive experimental results on
synthetic and real-world data sets demonstrate the effectiveness and efficiency
of our method.
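To make the size problem concrete, the sketch below enumerates the transaction sequences induced by one path, under the assumed reading that a path induces every sequence obtained by picking one transaction from each vertex's database in path order. The function name `induced_sequences` and the dictionary layout are hypothetical; the abstract does not give a formal definition, so this is an interpretation rather than the authors' construction.

```python
from itertools import product
from typing import Dict, FrozenSet, List, Tuple

Transaction = FrozenSet[str]                   # a transaction is a set of items
Database = List[Transaction]                   # a transaction database
Sequence = Tuple[Transaction, ...]             # one transaction per path vertex


def induced_sequences(path: List[str], databases: Dict[str, Database]) -> List[Sequence]:
    """List the transaction sequences induced by `path` (assumed definition:
    choose one transaction from each vertex's database, in path order)."""
    return list(product(*(databases[v] for v in path)))


if __name__ == "__main__":
    dbs = {
        "u": [frozenset({"a"}), frozenset({"b", "c"})],
        "v": [frozenset({"d"})],
        "w": [frozenset({"e"}), frozenset({"f"}), frozenset({"g"})],
    }
    seqs = induced_sequences(["u", "v", "w"], dbs)
    print(len(seqs))  # product of the database sizes along the path
```

Under this reading, the number of induced sequences for a single path is the product of the database sizes along it, which grows exponentially with path length. That is consistent with the abstract's remark that the full sequence database cannot be materialized and motivates a sampling approach.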
A sampling framework for counting temporal motifs
Pattern counting in graphs is fundamental to network science tasks, and there
are many scalable methods for approximating counts of small patterns, often
called motifs, in large graphs. However, modern graph datasets now contain
richer structure, and incorporating temporal information in particular has
become a critical part of network analysis. Temporal motifs, which are
generalizations of small subgraph patterns that incorporate temporal ordering
on edges, are an emerging part of the network analysis toolbox. However, there
are no algorithms for fast estimation of temporal motif counts; moreover, we
show that even counting simple temporal star motifs is NP-complete. Thus, there
is a need for fast and approximate algorithms. Here, we present the first
frequency estimation algorithms for counting temporal motifs. More
specifically, we develop a sampling framework that sits as a layer on top of
existing exact counting algorithms and enables fast and accurate
memory-efficient estimates of temporal motif counts. Our results show that we
can achieve one to two orders of magnitude speedups with minimal and
controllable loss in accuracy on a number of datasets.
Comment: 9 pages, 4 figure
A Neural Network Approach to Joint Modeling Social Networks and Mobile Trajectories
The accelerated growth of mobile trajectories in location-based services
brings valuable data resources to understand users' moving behaviors. Apart
from recording the trajectory data, another major characteristic of these
location-based services is that they also allow users to connect with whomever
they like. A combination of social networking and location-based services is
called a location-based social network (LBSN). As shown in previous works,
locations that are frequently visited by socially-related persons tend to be
correlated, which indicates the close association between social connections
and trajectory behaviors of users in LBSNs. In order to better analyze and mine
LBSN data, we present a novel neural network model which can jointly model both
social networks and mobile trajectories. Specifically, our model consists of two
components: the construction of social networks and the generation of mobile
trajectories. We first adopt a network embedding method for the construction of
social networks: a network representation can be derived for a user. The key
to our model lies in the component of generating mobile trajectories. We have
considered four factors that influence the generation process of mobile
trajectories, namely user visit preference, influence of friends, short-term
sequential contexts and long-term sequential contexts. To characterize the last
two contexts, we employ the RNN and GRU models to capture the sequential
relatedness in mobile trajectories at different levels, i.e., short term or
long term. Finally, the two components are tied by sharing the user network
representations. Experimental results on two important applications demonstrate
the effectiveness of our model. In particular, the improvement over baselines is
more significant when either network structure or trajectory data is sparse.
Comment: Accepted by ACM TOI
A Survey of Parallel Sequential Pattern Mining
With the growing popularity of shared resources, large volumes of complex
data of different types are collected automatically. Traditional data mining
algorithms generally have problems and challenges including huge memory cost,
low processing speed, and inadequate hard disk space. As a fundamental task of
data mining, sequential pattern mining (SPM) is used in a wide variety of
real-life applications. However, it is more complex and challenging than other
pattern mining tasks, i.e., frequent itemset mining and association rule
mining, and also suffers from the above challenges when handling the
large-scale data. To solve these problems, mining sequential patterns in a
parallel or distributed computing environment has emerged as an important issue
with many applications. In this paper, we provide an in-depth survey of the
current status of parallel sequential pattern mining (PSPM),
including a detailed categorization of traditional serial SPM approaches and
state-of-the-art parallel SPM. We review the related work of parallel
sequential pattern mining in detail, including partition-based algorithms for
PSPM, Apriori-based PSPM, pattern growth based PSPM, and hybrid algorithms for
PSPM, and provide a detailed description (i.e., characteristics, advantages,
disadvantages and summarization) of these parallel approaches. Some
advanced topics for PSPM, including parallel quantitative / weighted / utility
sequential pattern mining, PSPM from uncertain data and stream data, hardware
acceleration for PSPM, are further reviewed in detail. We also review
some well-known open-source software for PSPM. Finally, we summarize
some challenges and opportunities of PSPM in the big data era.
Comment: Accepted by ACM Trans. on Knowl. Discov. Data, 33 page
Mining Maximal Dynamic Spatial Co-Location Patterns
A spatial co-location pattern represents a subset of spatial features whose
instances are prevalently located together in a geographic space. Although many
algorithms for mining spatial co-location patterns have been proposed, there are
still some problems: 1) they miss some meaningful patterns (e.g.,
{Ganoderma_lucidumnew, maple_treedead} and {water_hyacinthnew(increase),
algaedead(decrease)}), and get the wrong conclusion that the instances of two
or more features increase/decrease (i.e., new/dead) in the same/approximate
proportion, which has no effect on prevalent patterns. 2) Because the number of
prevalent spatial co-location patterns is very large, existing methods mine
them very inefficiently.
Therefore, first, we propose the concept of dynamic spatial co-location pattern
that can reflect the dynamic relationships among spatial features. Second, we
mine a small number of prevalent maximal dynamic spatial co-location patterns,
from which all prevalent dynamic spatial co-location patterns can be derived;
this improves the efficiency of obtaining all prevalent dynamic spatial
co-location patterns. Third, we propose an algorithm for mining prevalent maximal dynamic
spatial co-location patterns and two pruning strategies. Finally, the
effectiveness and efficiency of the proposed method and of the pruning
strategies are verified by extensive experiments over real/synthetic datasets.
Comment: 10 pages, 7 figure