Towards Distributed Convoy Pattern Mining
Mining movement data to reveal interesting behavioral patterns has gained
attention in recent years. One such pattern is the convoy pattern which
consists of at least m objects moving together for at least k consecutive time
instants where m and k are user-defined parameters. Existing algorithms for
detecting convoy patterns, however, do not scale to real-life dataset sizes.
A distributed algorithm for convoy mining is therefore needed. In this
paper, we discuss the problem of convoy mining and analyze different data
partitioning strategies to pave the way for a generic distributed convoy
pattern mining algorithm.
Comment: SIGSPATIAL'15 November 03-06, 2015, Bellevue, WA, US
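The convoy definition in the abstract above can be checked with a short Python sketch. It is an illustration only, not the algorithm from the paper: it assumes that "moving together" means belonging to the same cluster at a time instant, and that the clusters for each instant have already been computed by some density-based method. The names `longest_together_run`, `is_convoy`, and `clusters_per_step` are hypothetical.

```python
def longest_together_run(group, clusters_per_step):
    """Length of the longest run of consecutive time instants in which
    every object in `group` falls inside one common cluster.

    clusters_per_step: one entry per time instant, each a list of sets
    of object ids (assumed to be precomputed clusters).
    """
    group = set(group)
    best = run = 0
    for clusters in clusters_per_step:
        if any(group <= set(c) for c in clusters):
            run += 1
            best = max(best, run)
        else:
            run = 0  # the group split up, so the run is broken
    return best


def is_convoy(group, clusters_per_step, m, k):
    """True if `group` has at least m objects that stay together
    for at least k consecutive time instants."""
    return (len(set(group)) >= m
            and longest_together_run(group, clusters_per_step) >= k)


# Hypothetical toy data: four time instants, three objects.
steps = [
    [{"a", "b", "c"}],
    [{"a", "b"}, {"c"}],
    [{"a", "b", "c"}],
    [{"a", "b", "c"}],
]
```

With this toy data, `is_convoy({"a", "b", "c"}, steps, m=3, k=2)` holds because the three objects share a cluster in the last two instants, while the same call with `k=3` does not.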
Fast Clustering Using a Grid-Based Underlying Density Function Approximation
Clustering is an unsupervised machine learning task that seeks to partition a set of data into smaller groupings, referred to as "clusters", where items within the same cluster are somehow alike, while differing from those in other clusters. There are many different algorithms for clustering, but many of them are overly complex and scale poorly with larger data sets. In this paper, a new algorithm for clustering is proposed to solve some of these issues. Density-based clustering algorithms use a concept called the "underlying density function", which is a conceptual higher-dimension function that describes the possible results from the continuous data set that our input data is just a discrete sample of. The algorithm proposed in this paper seeks to use this concept by creating a piecewise approximation of the underlying density function, and then merging points towards local density maxima from this higher-dimensioned space. First, the data space is divided into a grid-based structure and the density of each grid is calculated. Second, each of these "grid-squares" determines the densest space in its local area. Finally, the grid squares are merged together in the direction of their local density maximum, ultimately merging with one of the density maxima that form the root of a cluster. The experimental results show significant time improvements over standard algorithms such as DBSCAN with no accuracy penalty. Furthermore, the algorithm is also suitable for use with parallel and distributed systems, as an implementation with Apache Spark showed proper parallel scaling with low data set sizes required to overtake the serial implementation.
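The three steps described in the abstract above (grid densities, densest local neighbour, merging towards a local maximum) can be sketched roughly as follows. This is a simplified illustration and not the authors' implementation: it assumes that a cell's density is its point count, that the "local area" is the cell plus its immediately adjacent cells, and that ties are broken by cell coordinates. The names `grid_cluster` and `cell_size` are hypothetical.

```python
from collections import Counter
from itertools import product


def grid_cluster(points, cell_size):
    """Label each point with the grid cell that is the root (local
    density maximum) of its cluster."""
    # Step 1: map points to grid cells and count points per cell.
    cell_of = [tuple(int(c // cell_size) for c in p) for p in points]
    density = Counter(cell_of)

    # Step 2: each occupied cell picks the densest occupied cell among
    # itself and its adjacent cells. Comparing (density, cell) tuples
    # gives a strict total order, so the pointers cannot form a cycle.
    def densest_neighbor(cell):
        best = cell
        for offset in product((-1, 0, 1), repeat=len(cell)):
            nb = tuple(a + b for a, b in zip(cell, offset))
            if nb in density and (density[nb], nb) > (density[best], best):
                best = nb
        return best

    parent = {cell: densest_neighbor(cell) for cell in density}

    # Step 3: follow the pointers until a cell points to itself.
    def root(cell):
        while parent[cell] != cell:
            cell = parent[cell]
        return cell

    return [root(c) for c in cell_of]


# Hypothetical toy data: two well-separated groups of 2-D points.
pts = [(0.1, 0.1), (0.2, 0.3), (0.4, 0.2), (1.2, 0.3),
       (10.1, 10.2), (10.3, 10.4), (11.1, 10.2)]
labels = grid_cluster(pts, cell_size=1.0)
```

The sketch assigns every point to a cluster and has no notion of noise, so results depend strongly on the chosen `cell_size`.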
Distributed mining of convoys in large scale datasets
The tremendous increase in the use of mobile devices equipped with GPS and other location sensors has generated a huge amount of movement data. In recent years, mining this data to understand the collective mobility behavior of humans, animals and other objects has become popular. Numerous mobility patterns and corresponding mining algorithms have been proposed, each representing a specific movement behavior. The convoy pattern is one such pattern, which can be used to find groups of people moving together in public transport or to prevent traffic jams. A convoy is a set of at least m objects moving together for at least k consecutive time stamps, where m and k are user-defined parameters. Existing algorithms for detecting convoy patterns do not scale to real-life dataset sizes. In this paper, we therefore propose a generic distributed convoy pattern mining algorithm called DCM and show how such an algorithm can be implemented using the MapReduce framework. We present a cost model for DCM and a detailed theoretical analysis backed by experimental results. We show the effect of partition size on the performance of DCM. The results from our experiments on different datasets and hardware setups show that our distributed algorithm is scalable in terms of data size and number of nodes, and more efficient than any existing sequential or distributed convoy pattern mining algorithm, showing speed-ups of up to 16 times over SPARE, the state-of-the-art distributed co-movement pattern mining framework. DCM is thus able to process large datasets that SPARE cannot.
Theoretically-Efficient and Practical Parallel DBSCAN
The DBSCAN method for spatial clustering has received significant attention
due to its applicability in a variety of data analysis tasks. There are fast
sequential algorithms for DBSCAN in Euclidean space that take O(n log n) work
for two dimensions, sub-quadratic work for three or more dimensions, and can be
computed approximately in linear work for any constant number of dimensions.
However, existing parallel DBSCAN algorithms require quadratic work in the
worst case, making them inefficient for large datasets. This paper bridges the
gap between theory and practice of parallel DBSCAN by presenting new parallel
algorithms for Euclidean exact DBSCAN and approximate DBSCAN that match the
work bounds of their sequential counterparts, and are highly parallel
(polylogarithmic depth). We present implementations of our algorithms along
with optimizations that improve their practical performance. We perform a
comprehensive experimental evaluation of our algorithms on a variety of
datasets and parameter settings. Our experiments on a 36-core machine with
hyper-threading show that we outperform existing parallel DBSCAN
implementations by up to several orders of magnitude, and achieve speedups of
up to 33x over the best sequential algorithms.
Exploring Decomposition for Solving Pattern Mining Problems
This article introduces a highly efficient pattern mining technique called Clustering-based Pattern Mining (CBPM). This technique discovers relevant patterns by studying the correlation between transactions in the transaction database based on clustering techniques. The set of transactions is first clustered, such that highly correlated transactions are grouped together. Next, we derive the relevant patterns by applying a pattern mining algorithm to each cluster. We present two different pattern mining algorithms, one applying an approximation-based strategy and another based on an exact strategy. The approximation-based strategy takes into account only the clusters, whereas the exact strategy takes into account both clusters and shared items between clusters. To boost the performance of the CBPM, a GPU-based implementation is investigated. To evaluate the CBPM framework, we perform extensive experiments on several pattern mining problems. The results from the experimental evaluation show that the CBPM provides a reduction in both the runtime and memory usage. Also, CBPM based on the approximate strategy provides good accuracy, demonstrating its effectiveness and feasibility. Our GPU implementation achieves significant speedup of up to 552× on a single GPU using big transaction databases.
Prescription Based Recommender System for Diabetic Patients Using Efficient Map Reduce
The healthcare sector has been unable to leverage knowledge gained through data insights because of manual processes and legacy record-keeping methods. Outdated methods for maintaining healthcare records have not proven sufficient for treating chronic diseases like diabetes. Data analysis methods such as the Recommendation System (RS) can serve as a boon for treating diabetes. An RS leverages predictive analysis and provides clinicians with the information needed to determine treatments for patients. This paper proposes a prescription-based Health Recommender System (HRS), which aids in recommending treatments by learning from the treatments prescribed to other patients diagnosed with diabetes. An advanced Density-Based Spatial Clustering of Applications with Noise (DBSCAN) clustering method is also proposed to cluster the data for deriving recommendations, using the winnowing algorithm as a similarity measure. The data are processed in parallel using MapReduce to increase the efficiency and scalability of the clustering process for effective treatment of diabetes. This paper shows how MapReduce can increase the efficiency and scalability of the HRS using clustering.