
    Towards Distributed Convoy Pattern Mining

    Mining movement data to reveal interesting behavioral patterns has gained attention in recent years. One such pattern is the convoy pattern, which consists of at least m objects moving together for at least k consecutive time instants, where m and k are user-defined parameters. Existing algorithms for detecting convoy patterns, however, do not scale to real-life dataset sizes. Therefore, a distributed algorithm for convoy mining is inevitable. In this paper, we discuss the problem of convoy mining and analyze different data partitioning strategies to pave the way for a generic distributed convoy pattern mining algorithm. (Comment: SIGSPATIAL'15, November 03-06, 2015, Bellevue, WA, US)
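    To make the m/k definition above concrete, here is a minimal sketch (not the paper's algorithm) that checks whether a given set of objects forms a convoy, assuming per-timestamp spatial clusters have already been produced by some external step:

        def is_convoy(objects, clusters_per_time, m, k):
            # objects: set of object IDs; clusters_per_time: list ordered by time,
            # one list of clusters (sets of object IDs) per timestamp.
            if len(objects) < m:
                return False
            run = best = 0
            for clusters in clusters_per_time:
                # the group is "together" if some cluster contains all of them
                together = any(objects <= cluster for cluster in clusters)
                run = run + 1 if together else 0
                best = max(best, run)
            return best >= k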

    Dynamic load balancing in parallel KD-tree k-means

    One of the most influential and popular data mining methods is the k-Means algorithm for cluster analysis. Techniques for improving the efficiency of k-Means have been largely explored in two main directions. The amount of computation can be significantly reduced by adopting geometrical constraints and an efficient data structure, notably a multidimensional binary search tree (KD-Tree). These techniques reduce the number of distance computations the algorithm performs at each iteration. A second direction is parallel processing, where data and computation loads are distributed over many processing nodes. However, little work has been done to provide a parallel formulation of the efficient sequential techniques based on KD-Trees. Such approaches are expected to have an irregular distribution of computation load and can suffer from load imbalance. This issue has so far limited the adoption of these efficient k-Means variants in parallel computing environments. In this work, we provide a parallel formulation of the KD-Tree based k-Means algorithm for distributed memory systems and address its load balancing issue. Three solutions have been developed and tested: two approaches are based on a static partitioning of the data set, and a third solution incorporates a dynamic load balancing policy.
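    The parallel formulation described above reduces, on each node, a local partition of the data to small per-centroid partial sums that are then merged globally. A minimal sketch of that reduction pattern (plain NumPy over raw points, not the paper's KD-Tree implementation) is:

        import numpy as np

        def local_kmeans_step(points, centroids):
            # Assignment pass over one worker's local partition: return per-centroid
            # (vector sum, count) so only small partials cross the network.
            dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
            labels = dists.argmin(axis=1)
            k, d = centroids.shape
            sums = np.zeros((k, d))
            counts = np.zeros(k, dtype=int)
            for j in range(k):
                mask = labels == j
                sums[j] = points[mask].sum(axis=0)
                counts[j] = mask.sum()
            return sums, counts

        def merge_and_update(partials):
            # Merge the partials from all workers (e.g. via an allreduce) and
            # recompute the centroids for the next iteration.
            total_sums = sum(p[0] for p in partials)
            total_counts = sum(p[1] for p in partials)
            return total_sums / np.maximum(total_counts, 1)[:, None]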

    Scalability of efficient parallel K-Means

    Clustering is defined as the grouping of similar items in a set, and is an important process within the field of data mining. As the amount of data for various applications continues to increase, in terms of both its size and its dimensionality, it is necessary to have efficient clustering methods. A popular clustering algorithm is K-Means, which adopts a greedy approach to produce a set of K clusters with associated centres of mass, and uses a squared error distortion measure to determine convergence. Methods for improving the efficiency of K-Means have been largely explored in two main directions. The amount of computation can be significantly reduced by adopting a more efficient data structure, notably a multi-dimensional binary search tree (KD-Tree), to store either centroids or data points. A second direction is parallel processing, where data and computation loads are distributed over many processing nodes. However, little work has been done to provide a parallel formulation of the efficient sequential techniques based on KD-Trees. Such approaches are expected to have an irregular distribution of computation load and can suffer from load imbalance. This issue has so far limited the adoption of these efficient K-Means techniques in parallel computational environments. In this work, we provide a parallel formulation for the KD-Tree based K-Means algorithm and address its load balancing issues.
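    The squared error distortion mentioned above is simply the sum of squared point-to-centroid distances; a small illustrative sketch of that measure and a relative-improvement stopping test (the tolerance value is an assumption, not taken from the paper) is:

        import numpy as np

        def distortion(points, centroids, labels):
            # Sum of squared distances of each point to its assigned centroid.
            return float(((points - centroids[labels]) ** 2).sum())

        def converged(prev_distortion, curr_distortion, tol=1e-4):
            # Stop iterating once the relative improvement drops below tol.
            if prev_distortion is None:
                return False
            return (prev_distortion - curr_distortion) <= tol * prev_distortion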

    Fault tolerant decentralised K-Means clustering for asynchronous large-scale networks

    The K-Means algorithm for cluster analysis is one of the most influential and popular data mining methods. Its straightforward parallel formulation is well suited for distributed memory systems with reliable interconnection networks, such as massively parallel processors and clusters of workstations. However, in large-scale geographically distributed systems the straightforward parallel algorithm can be rendered useless by a single communication failure or high latency in communication paths. The lack of scalable and fault tolerant global communication and synchronisation methods in large-scale systems has hindered the adoption of the K-Means algorithm for applications in large networked systems such as wireless sensor networks, peer-to-peer systems and mobile ad hoc networks. This work proposes a fully distributed K-Means algorithm (EpidemicK-Means) which does not require global communication and is intrinsically fault tolerant. The proposed distributed K-Means algorithm provides a clustering solution which can approximate the solution of an ideal centralised algorithm over the aggregated data as closely as desired. A comparative performance analysis is carried out against state-of-the-art sampling methods and shows that the proposed method overcomes the limitations of the sampling-based approaches for skewed cluster distributions. The experimental analysis confirms that the proposed algorithm is very accurate and fault tolerant under unreliable network conditions (message loss and node failures) and is suitable for asynchronous networks of very large and extreme scale.
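    The fault-tolerance argument rests on epidemic (gossip) aggregation: each node keeps per-cluster partial sums and repeatedly averages them with random peers, so every node converges towards the same centroid update a centralised K-Means would compute, without any global synchronisation. A simplified pairwise-averaging sketch (scalar partial sums for brevity; not the paper's exact protocol, and omitting the local assignment step) is:

        import random

        def gossip_rounds(nodes, num_rounds):
            # Each node is a dict with per-cluster lists "sums" and "counts".
            # Repeated pairwise averaging drives every node towards the global
            # averages; a lost message or failed peer just means one skipped
            # exchange, not a failed run.
            for _ in range(num_rounds):
                for a in nodes:
                    b = random.choice(nodes)
                    for j in range(len(a["sums"])):
                        s = (a["sums"][j] + b["sums"][j]) / 2.0
                        c = (a["counts"][j] + b["counts"][j]) / 2.0
                        a["sums"][j] = b["sums"][j] = s
                        a["counts"][j] = b["counts"][j] = c
            # Estimated centroid for each cluster, as seen by each node.
            return [[n["sums"][j] / n["counts"][j] if n["counts"][j] else 0.0
                     for j in range(len(n["sums"]))] for n in nodes]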

    Scalable storage for a DBMS using transparent distribution

    Scalable Distributed Data Structures (SDDSs) provide a self-managing and self-organizing data storage of potentially unbounded size. This stands in contrast to common distribution schemes deployed in conventional distributed DBMS. SDDSs, however, have mostly been used in synthetic scenarios to investigate their properties. In this paper we concentrate on the integration of the LH* SDDS into our efficient and extensible DBMS, called Monet. We show that this merge permits processing very large sets of distributed data. In our implementation we extended the relational algebra interpreter in such a way that access to data, whether it is distributed or locally stored, is transparent to the user. The on-the-fly optimization of operations, heavily used in Monet, to deploy different strategies and scenarios inside the primary operators associated with an SDDS adds self-adaptiveness to the query system; it dynamically adapts itself to unforeseen situations. We illustrate the performance efficiency by experiments on a network of workstations. The transparent integration of SDDSs opens new perspectives for very large self-managing database systems.
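    For context, LH*-style SDDSs let any client compute a bucket address from a key using only two small state values (the hash level and the split pointer), which is what makes the structure self-organizing without a central directory. A rough sketch of that addressing rule under the usual linear-hashing assumptions (illustrative only, not Monet-specific code):

        def lh_star_address(key, level, split_pointer, initial_buckets=1):
            # Linear-hashing address computation: buckets below the split pointer
            # have already been split, so they are addressed with the next level.
            h = lambda i: key % (initial_buckets * (2 ** i))
            a = h(level)
            if a < split_pointer:
                a = h(level + 1)
            return a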

    Designing SSI clusters with hierarchical checkpointing and single I/O space

    Adopting a new hierarchical checkpointing architecture, the authors develop a single I/O address space for building highly available clusters of computers. They propose a systematic approach to achieving a single system image by integrating existing middleware support with the newly developed features.

    An Application Perspective on High-Performance Computing and Communications

    We review possible and probable industrial applications of HPCC, focusing on the software and hardware issues. Thirty-three separate categories are illustrated by detailed descriptions of five areas: computational chemistry; Monte Carlo methods from physics to economics; manufacturing and computational fluid dynamics; command and control, or crisis management; and multimedia services to client computers and set-top boxes. The hardware varies from tightly-coupled parallel supercomputers to heterogeneous distributed systems. The software models range from HPF and data parallelism to distributed information systems and object/data flow parallelism on the Web. We find that in each case it is reasonably clear that HPCC works in principle, and postulate that this knowledge can be used in a new generation of software infrastructure based on the WebWindows approach, as discussed in an accompanying paper.

    Distributed mining of convoys in large scale datasets

    The tremendous increase in the use of mobile devices equipped with GPS and other location sensors has resulted in the generation of a huge amount of movement data. In recent years, mining this data to understand the collective mobility behavior of humans, animals and other objects has become popular. Numerous mobility patterns and their mining algorithms have been proposed, each representing a specific movement behavior. The convoy pattern is one such pattern, which can be used to find groups of people moving together in public transport or to prevent traffic jams. A convoy is a set of at least m objects moving together for at least k consecutive time stamps, where m and k are user-defined parameters. Existing algorithms for detecting convoy patterns do not scale to real-life dataset sizes. Therefore, in this paper, we propose a generic distributed convoy pattern mining algorithm called DCM and show how such an algorithm can be implemented using the MapReduce framework. We present a cost model for DCM and a detailed theoretical analysis backed by experimental results. We show the effect of partition size on the performance of DCM. The results from our experiments on different datasets and hardware setups show that our distributed algorithm is scalable in terms of data size and number of nodes, and more efficient than any existing sequential as well as distributed convoy pattern mining algorithm, showing speed-ups of up to 16 times over SPARE, the state-of-the-art distributed co-movement pattern mining framework. DCM is thus able to process large datasets which SPARE is unable to.
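    As a rough illustration of the MapReduce layout described above (the temporal partitioning key and all names below are assumptions, not DCM's actual design), the map phase keys each trajectory point by a partition and the reduce phase hands each partition to a sequential convoy miner; convoys that cross partition boundaries still need a final merge step, which is where the partition size affects performance:

        from collections import defaultdict

        def map_phase(records, window):
            # records: (object_id, t, x, y) tuples; key each point by its time
            # window so that every partition can be mined independently.
            for obj, t, x, y in records:
                yield t // window, (obj, t, x, y)

        def reduce_phase(keyed_points):
            # Group points by partition key; each partition would then be fed to a
            # sequential convoy miner (e.g. the is_convoy-style check sketched
            # earlier), with partial convoys at window boundaries merged afterwards.
            partitions = defaultdict(list)
            for key, point in keyed_points:
                partitions[key].append(point)
            return partitions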