A Review of Subsequence Time Series Clustering

Author: Saeed Aghabozorgi
Seyedjamal Zolhavarieh
Ying Wah Teh
Publication venue: 'Hindawi Limited'
Publication date: 01/01/2014

Clustering of subsequence time series remains an open issue in time series clustering. Subsequence time series clustering is used in different fields, such as e-commerce, outlier detection, speech recognition, biological systems, DNA recognition, and text mining. One of the useful fields in the domain of subsequence time series clustering is pattern recognition. To improve this field, a sequence of time series data is used. This paper reviews some definitions and backgrounds related to subsequence time series clustering. The reviewed literature is divided into three periods: preproof, interproof, and postproof. Moreover, various state-of-the-art approaches in performing subsequence time series clustering are discussed under each of these categories. The strengths and weaknesses of the employed methods are evaluated as potential issues for future studies.

Crossref

Directory of Open Access Journals

PubMed Central

Exploring Decomposition for Solving Pattern Mining Problems

Author: Djenouri Youcef
Lin Jerry Chun-Wei
Nørvåg Kjetil
Ramampiaro Heri
Yu Philip S.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2021

This article introduces a highly efficient pattern mining technique called Clustering-based Pattern Mining (CBPM). This technique discovers relevant patterns by studying the correlation between transactions in the transaction database based on clustering techniques. The set of transactions is first clustered, such that highly correlated transactions are grouped together. Next, we derive the relevant patterns by applying a pattern mining algorithm to each cluster. We present two different pattern mining algorithms, one applying an approximation-based strategy and another based on an exact strategy. The approximation-based strategy takes into account only the clusters, whereas the exact strategy takes into account both clusters and shared items between clusters. To boost the performance of the CBPM, a GPU-based implementation is investigated. To evaluate the CBPM framework, we perform extensive experiments on several pattern mining problems. The results from the experimental evaluation show that the CBPM provides a reduction in both the runtime and memory usage. Also, CBPM based on the approximate strategy provides good accuracy, demonstrating its effectiveness and feasibility. Our GPU implementation achieves significant speedup of up to 552× on a single GPU using big transaction databases.
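The cluster-then-mine idea in this abstract can be illustrated with a minimal sketch. It is not the authors' CBPM implementation: the greedy Jaccard clustering, the brute-force itemset enumeration, and every name and parameter below (`cluster_transactions`, `frequent_itemsets`, `mine_per_cluster`, `threshold`, `min_support`, `max_size`) are assumptions chosen for illustration. It mirrors only the approximation-based strategy, which looks at each cluster in isolation and ignores items shared between clusters.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two item sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_transactions(transactions, threshold=0.5):
    """Greedy one-pass clustering (an assumed stand-in for the paper's clustering step).

    A transaction joins the first cluster whose seed is similar enough,
    otherwise it starts a new cluster.
    """
    clusters = []  # list of (seed transaction, member list)
    for t in transactions:
        for seed, members in clusters:
            if jaccard(seed, t) >= threshold:
                members.append(t)
                break
        else:
            clusters.append((t, [t]))
    return [members for _, members in clusters]


def frequent_itemsets(transactions, min_support, max_size=3):
    """Brute-force frequent itemset mining, only suitable for tiny inputs."""
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, max_size + 1):
        for candidate in combinations(items, k):
            itemset = set(candidate)
            count = sum(1 for t in transactions if itemset <= t)
            if count >= min_support:
                result[frozenset(candidate)] = count
    return result


def mine_per_cluster(transactions, threshold=0.5, min_support=2):
    """Mine each cluster independently and merge the per-cluster counts."""
    patterns = {}
    for members in cluster_transactions(transactions, threshold):
        for itemset, count in frequent_itemsets(members, min_support).items():
            patterns[itemset] = patterns.get(itemset, 0) + count
    return patterns


transactions = [{"a", "b", "c"}, {"a", "b"}, {"x", "y"}, {"x", "y", "z"}]
patterns = mine_per_cluster(transactions)
```

Because each cluster is mined on its own, a pattern whose occurrences are spread across clusters can be missed; according to the abstract, the exact strategy addresses this by also considering items shared between clusters.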

SINTEF Open

Data Summarizations for Scalable, Robust and Privacy-Aware Learning in High Dimensions

Author: Manousakas Dionysios
Publication venue: University of Cambridge
Publication date: 30/10/2021

The advent of large-scale datasets has offered unprecedented amounts of information for building statistically powerful machines, but, at the same time, also introduced a remarkable computational challenge: how can we efficiently process massive data? This thesis presents a suite of data reduction methods that make learning algorithms scale on large datasets, via extracting a succinct model-specific representation that summarizes the full data collection—a coreset. Our frameworks support by design datasets of arbitrary dimensionality, and can be used for general purpose Bayesian inference under real-world constraints, including privacy preservation and robustness to outliers, encompassing diverse uncertainty-aware data analysis tasks, such as density estimation, classification and regression. We motivate the necessity for novel data reduction techniques in the first place by developing a reidentification attack on coarsened representations of private behavioural data. Analysing longitudinal records of human mobility, we detect privacy-revealing structural patterns, that remain preserved in reduced graph representations of individuals’ information with manageable size. These unique patterns enable mounting linkage attacks via structural similarity computations on longitudinal mobility traces, revealing an overlooked, yet existing, privacy threat. We then propose a scalable variational inference scheme for approximating posteriors on large datasets via learnable weighted pseudodata, termed pseudocoresets. We show that the use of pseudodata enables overcoming the constraints on minimum summary size for given approximation quality, that are imposed on all existing Bayesian coreset constructions due to data dimensionality. 
Moreover, it allows us to develop a scheme for pseudocoresets-based summarization that satisfies the standard framework of differential privacy by construction; in this way, we can release reduced size privacy-preserving representations for sensitive datasets that are amenable to arbitrary post-processing. Subsequently, we consider summarizations for large-scale Bayesian inference in scenarios when observed datapoints depart from the statistical assumptions of our model. Using robust divergences, we develop a method for constructing coresets resilient to model misspecification. Crucially, this method is able to automatically discard outliers from the generated data summaries. Thus we deliver robustified scalable representations for inference, that are suitable for applications involving contaminated and unreliable data sources. We demonstrate the performance of proposed summarization techniques on multiple parametric statistical models, and diverse simulated and real-world datasets, from music genre features to hospital readmission records, considering a wide range of data dimensionalities.
Nokia Bell Labs, Lundgren Fund, Darwin College, University of Cambridge Department of Computer Science & Technology, University of Cambridg

Apollo (Cambridge)

Skipping-Based Collaborative Recommendations inspired from Statistical Language Modeling

Author: Anne Boyer
Armelle Brun
Geoffray Bonnin
Publication venue: 'IntechOpen'
Publication date: 01/03/2010

Due to the almost unlimited resource space on the Web, efficient search engines and recommender systems have become a key element for users to find resources corresponding to their needs. Recommender systems aim at helping users in this task by providing them with pertinent resources according to their context and their profiles, by applying various techniques such as statistical and knowledge discovery algorithms. One of the most successful approaches is Collaborative Filtering, which consists in considering user ratings to provide recommendations, without considering the content of the resources; however, the ratings are the only criterion taken into account to provide the recommendations, although including other criteria should enhance their accuracy. One such criterion is the context, which can be geographical, meteorological, social, etc. In this chapter we focus on the temporal context, more specifically on the order in which the resources were consulted. The appropriateness of considering the order is domain dependent: for instance, it seems of little help in domains such as online movie stores, in which user transactions are barely sequential; however, it is especially appropriate for domains such as Web navigation, which has a sequential structure. We propose to follow this direction for this domain, the challenge being to find a sequential model with low enough complexity that still provides better accuracy. We first put forward similarities between Web navigation and natural language, and propose to adapt statistical language models to Web navigation to compute recommendations. Second, we propose a new model inspired by the n-gram skipping model.
This model has several advantages: (1) it has both a low time and a low space complexity while providing full coverage, (2) it is able to handle parallel navigations and noise, (3) it is able to perform recommendations in an anytime framework, (4) weighting schemes are used to reduce the importance of distant resources. Third, we provide a comparison of this SLM-inspired model to the state of the art in terms of features, complexity, accuracy and robustness and present experimental results. Tests are performed on a browsing dataset extracted from Intranet logs provided by a French bank. Results show that the use of exponential decay weighting schemes when taking into account non-contiguous resources highly improves the accuracy, and that the anytime configuration is able to provide a satisfying trade-off between an even lower computation time and a good accuracy while conserving a good coverage.
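As a rough illustration of skipping with exponential decay, the sketch below counts which resource follows which within a short window and down-weights pairs that are further apart. It is an assumed simplification, not the chapter's model: the function names (`train_skip_model`, `recommend`), the `window` and `decay` parameters, the normalisation, and the toy sessions are all hypothetical.

```python
from collections import defaultdict


def train_skip_model(sessions, window=3, decay=0.5):
    """Count (previous resource -> next resource) pairs, allowing skips.

    A pair that is d steps apart contributes decay ** (d - 1), so contiguous
    pairs weigh 1.0 and more distant pairs weigh exponentially less.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for session in sessions:
        for i, target in enumerate(session):
            for d in range(1, window + 1):
                if i - d < 0:
                    break
                counts[session[i - d]][target] += decay ** (d - 1)
    return counts


def recommend(counts, history, window=3, decay=0.5, top_n=3):
    """Score candidates from the last `window` resources of the history."""
    scores = defaultdict(float)
    recent = history[-window:]
    for d, item in enumerate(reversed(recent), start=1):
        followers = counts.get(item, {})
        total = sum(followers.values())
        if total == 0:
            continue
        for candidate, weight in followers.items():
            # Normalised follower weight, decayed by distance in the history.
            scores[candidate] += decay ** (d - 1) * weight / total
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


sessions = [
    ["home", "news", "sports"],
    ["home", "ads", "news", "sports"],  # "ads" acts as noise between home and news
]
model = train_skip_model(sessions)
suggestions = recommend(model, ["home", "news"])
```

Because non-contiguous pairs are counted, the noisy second session still reinforces the link from "home" to "news", which is the kind of robustness to noise the abstract attributes to skipping.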

IntechOpen

Crossref

INRIA a CCSD electronic archive server

BicPAMS: software for biological data analysis with pattern-based biclustering

Author: A Ben-Dor
A Rosenwald
A Serin
A Tanay
AP Gasch
AP Lee
D Szklarczyk
Francisco L. Ferreira
G Getz
J Han
JLY Koh
K Eren
K Sim
M Charrad
MC Teixeira
MV Kuleshov
NR Mabroukeh
R Henriques
R Martinez
R Santamaría
Rui Henriques
S Barkow
S Hochreiter
Sara C. Madeira
SC Madeira
W Lee
Y Okada
Publication venue: 'Springer Science and Business Media LLC'
Publication date

Crossref

Online Spectral Clustering on Network Streams

Author: Jia Yi
Publication venue: 'Paleontological Institute at The University of Kansas'
Publication date: 01/01/2012

Graphs are an extremely useful representation of a wide variety of practical systems in data analysis. Recently, with the fast accumulation of stream data from various types of networks, significant research interest has arisen in spectral clustering for network streams (or evolving networks). Compared with the general spectral clustering problem, the data analysis of this new type of problem may have additional requirements, such as short processing time, scalability in distributed computing environments, and temporal variation tracking. However, designing a spectral clustering method that satisfies these requirements takes non-trivial effort. There are three major challenges for the new algorithm design. The first challenge is online clustering computation. Most of the existing spectral methods on evolving networks are off-line methods, using standard eigensystem solvers such as the Lanczos method. These methods need to recompute solutions from scratch at each time point. The second challenge is the parallelization of algorithms. Parallelizing such algorithms is non-trivial since standard eigensolvers are iterative algorithms and the number of iterations cannot be predetermined. The third challenge is the very limited existing work. In addition, there exist multiple limitations in the existing method, such as computational inefficiency on large similarity changes, the lack of a sound theoretical basis, and the lack of an effective way to handle accumulated approximation errors and large data variations over time. In this thesis, we proposed a new online spectral graph clustering approach with a family of three novel spectrum approximation algorithms. Our algorithms incrementally update the eigenpairs in an online manner to improve the computational performance. Our approaches outperformed the existing method in computational efficiency and scalability while retaining competitive or even better clustering accuracy.
We derived our spectrum approximation techniques GEPT and EEPT through formal theoretical analysis. The well-established matrix perturbation theory forms a solid theoretical foundation for our online clustering method. We equipped our clustering method with a new metric to track accumulated approximation errors and measure the short-term temporal variation. The metric not only provides a balance between computational efficiency and clustering accuracy, but also offers a useful tool to adapt the online algorithm to the condition of unexpected drastic noise. In addition, we discussed our preliminary work on approximate graph mining with evolutionary process, non-stationary Bayesian Network structure learning from non-stationary time series data, and Bayesian Network structure learning with text priors imposed by non-parametric hierarchical topic modeling.
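To give a feel for what incrementally updating eigenpairs can look like, here is a minimal sketch of the textbook first-order perturbation update for a symmetric matrix with distinct eigenvalues. It is not the thesis's GEPT or EEPT algorithm, whose details the abstract does not give; the function name `update_eigenpairs`, the plain-list matrix format, and the 2×2 example are assumptions for illustration.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def update_eigenpairs(eigvals, eigvecs, delta):
    """First-order update of the eigenpairs of symmetric A after A -> A + delta.

    Assumes orthonormal eigenvectors and distinct eigenvalues:
      lambda_i' ~ lambda_i + v_i^T delta v_i
      v_i'      ~ v_i + sum_{j != i} (v_j^T delta v_i) / (lambda_i - lambda_j) * v_j
    """
    n = len(eigvals)
    new_vals, new_vecs = [], []
    for i in range(n):
        dv = matvec(delta, eigvecs[i])
        new_vals.append(eigvals[i] + dot(eigvecs[i], dv))
        vec = list(eigvecs[i])
        for j in range(n):
            if j == i:
                continue
            coeff = dot(eigvecs[j], dv) / (eigvals[i] - eigvals[j])
            vec = [x + coeff * y for x, y in zip(vec, eigvecs[j])]
        norm = math.sqrt(dot(vec, vec))
        new_vecs.append([x / norm for x in vec])  # renormalise to unit length
    return new_vals, new_vecs


# A = diag(2, 1) has eigenpairs (2, [1, 0]) and (1, [0, 1]).
delta = [[0.1, 0.05], [0.05, -0.1]]  # small symmetric change to A
vals, vecs = update_eigenpairs([2.0, 1.0], [[1.0, 0.0], [0.0, 1.0]], delta)
```

The update avoids a fresh eigendecomposition, but its error grows with the size of `delta` and with repeated application, which is consistent with the abstract's emphasis on tracking accumulated approximation errors.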

KU ScholarWorks

Mining Predictive Patterns and Extension to Multivariate Temporal Data

Author: Batal Iyad
Publication venue
Publication date: 01/01/2012

An important goal of knowledge discovery is the search for patterns in the data that can help explain its underlying structure. To be practically useful, the discovered patterns should be novel (unexpected) and easy for humans to understand. In this thesis, we study the problem of mining patterns (defining subpopulations of data instances) that are important for predicting and explaining a specific outcome variable. An example is the task of identifying groups of patients that respond better to a certain treatment than the rest of the patients. We propose and present efficient methods for mining predictive patterns for both atemporal and temporal (time series) data. Our first method relies on frequent pattern mining to explore the search space. It applies a novel evaluation technique for extracting a small set of frequent patterns that are highly predictive and have low redundancy. We show the benefits of this method on several synthetic and public datasets. Our temporal pattern mining method works on complex multivariate temporal data, such as electronic health records, for the event detection task. It first converts time series into time-interval sequences of temporal abstractions and then mines temporal patterns backwards in time, starting from patterns related to the most recent observations. We show the benefits of our temporal pattern mining method on two real-world clinical tasks.
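The first step described here, converting a time series into time-interval sequences of temporal abstractions, can be sketched as follows. This is an assumed simplification, not the thesis's method: the three-state value abstraction (low "L", normal "N", high "H"), the thresholds, the function names, and the sample readings are all hypothetical.

```python
def abstract_value(value, low, high):
    """Map a numeric reading to a coarse state: L (low), N (normal), H (high)."""
    if value < low:
        return "L"
    if value > high:
        return "H"
    return "N"


def to_intervals(series, low, high):
    """Merge consecutive readings that share a state into (state, start, end) intervals.

    `series` is a list of (time, value) pairs sorted by time.
    """
    intervals = []
    for time, value in series:
        state = abstract_value(value, low, high)
        if intervals and intervals[-1][0] == state:
            # Same state as the previous reading: extend the current interval.
            intervals[-1] = (state, intervals[-1][1], time)
        else:
            intervals.append((state, time, time))
    return intervals


readings = [(0, 5.0), (1, 5.5), (2, 8.2), (3, 8.9), (4, 6.0)]
intervals = to_intervals(readings, low=4.0, high=7.0)
most_recent_first = intervals[::-1]  # order in which a backward search would visit them
```

Reversing the interval list reflects the backward direction mentioned in the abstract, where mining starts from patterns related to the most recent observations.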

CiteSeerX

D-Scholarship@Pitt