7 research outputs found

    Performance evaluation of a distributed clustering approach for spatial datasets

    The analysis of big data requires powerful, scalable, and accurate data analytics techniques, which traditional data mining and machine learning methods do not provide as a whole. Therefore, new data analytics frameworks are needed to deal with big data challenges such as the volume, velocity, veracity, and variety of the data. Distributed data mining constitutes a promising approach for big data sets, as they are usually produced in distributed locations, and processing them at their local sites significantly reduces response times, communication, etc. In this paper, we study the performance of a distributed clustering approach called Dynamic Distributed Clustering (DDC). DDC has the ability to remotely generate clusters and then aggregate them using an efficient aggregation algorithm. The technique is developed for spatial datasets. We evaluated DDC using two types of communication (synchronous and asynchronous) and tested it under various load distributions. The experimental results show that the approach achieves super-linear speed-up, scales up very well, and can take advantage of recent programming models, such as MapReduce, as its results are not affected by the type of communication.

    Parallel and distributed clustering framework for big spatial data mining

    Clustering techniques are very attractive for identifying and extracting patterns of interest from datasets. However, their application to very large spatial datasets presents numerous challenges, such as high dimensionality, heterogeneity, and the high complexity of some algorithms. Distributed clustering techniques constitute a very good response to the Big Data challenges (e.g., Volume, Variety, Veracity, and Velocity). In this paper, we developed and implemented a Dynamic Parallel and Distributed Clustering (DPDC) approach that can analyse Big Data within a reasonable response time and produce accurate results by using existing computing and storage infrastructure, such as cloud computing. The DPDC approach consists of two phases. The first phase is fully parallel and generates local clusters; the second phase aggregates the local results to obtain global clusters. The aggregation phase is designed in such a way that the final clusters are compact and accurate while the overall process is efficient in time and memory allocation. DPDC was thoroughly tested and compared to the well-known clustering algorithms BIRCH and CURE. The results show that the approach not only produces high-quality results but also scales up very well by taking advantage of the Hadoop MapReduce paradigm or any distributed system.
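
The two-phase structure described in this abstract (parallel local clustering, then aggregation into global clusters) can be illustrated with a minimal sketch. This is not the DPDC algorithm itself: the distance-threshold grouping used in both phases, the parameter `eps`, and the function names `local_clusters`, `aggregate`, and `two_phase_clustering` are all assumptions chosen only to keep the example short and self-contained.

```python
from concurrent.futures import ThreadPoolExecutor
from math import dist


def local_clusters(points, eps):
    """Phase 1 (run per partition): chain together points within eps of each other.

    Assumption: a simple distance-threshold grouping stands in for whatever
    local clustering algorithm a real system would use.
    """
    clusters = []
    for p in points:
        near, far = [], []
        for c in clusters:
            (near if any(dist(p, q) <= eps for q in c) else far).append(c)
        # The new point joins, and thereby merges, every cluster it touches.
        clusters = far + [[p] + [q for c in near for q in c]]
    return clusters


def aggregate(cluster_lists, eps):
    """Phase 2: merge local clusters that lie within eps of each other.

    Assumption: the merge criterion reuses the phase-1 threshold; the paper's
    aggregation algorithm is not specified in the abstract.
    """
    merged = []
    for cluster in (c for clusters in cluster_lists for c in clusters):
        near, far = [], []
        for g in merged:
            touching = any(dist(p, q) <= eps for p in cluster for q in g)
            (near if touching else far).append(g)
        merged = far + [cluster + [q for g in near for q in g]]
    return merged


def two_phase_clustering(partitions, eps):
    """Cluster each partition independently, then aggregate the local results."""
    with ThreadPoolExecutor() as pool:
        local = list(pool.map(lambda part: local_clusters(part, eps), partitions))
    return aggregate(local, eps)
```

In this sketch a cluster that straddles two partitions is only recovered in the aggregation step, which is the reason a second phase is needed at all. A thread pool is used purely for brevity; a distributed deployment would run phase 1 on separate nodes.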

    Identification of Change in a Dynamic Dot Pattern and its use in the Maintenance of Footprints

    Examples of spatio-temporal data that can be represented as sets of points (called dot patterns) are pervasive in many applications, for example when tracking herds of migrating animals, ships in busy shipping channels, and crowds of people in everyday life. The use of this type of data extends beyond the standard remit of Geographic Information Science (GISc), as classification and optimisation problems can often be visualised in the same manner. A common task within these fields is the assignment of a region (called a footprint) that is representative of the underlying pattern. The ways in which this footprint can be generated have been the subject of much research, and many algorithms have been produced. Much of this research has focused on dot patterns and footprints as static entities; however, in many of the applications the data is prone to change. This thesis proposes that the footprint need not necessarily be updated each time the dot pattern changes, and that the footprint can remain an appropriate representation of the pattern if the amount of change is slight. To ascertain when to update the footprint and when to leave it as it is, this thesis introduces the concept of change identifiers as simple measures of change between two dot patterns. Underlying the change identifiers is an in-depth examination of the data inherent in the dot pattern and the creation of descriptors that represent this data. The experimentation performed in this thesis shows that change identifiers are able to distinguish between different types of change across dot patterns from different sources. In doing so, the change identifiers reduce the number of footprint updates while maintaining a measurably good representation of the dot pattern.
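
The idea of updating a footprint only when a change identifier signals enough change can be sketched as follows. Everything here is a hypothetical illustration, not the thesis's method: the footprint is simplified to an axis-aligned bounding box, the change identifier `centroid_shift` (centroid displacement relative to the old pattern's bounding-box diagonal) is an invented example of a "simple measure of change", and the `threshold` value is arbitrary.

```python
from math import hypot


def bounding_box(points):
    """Stand-in footprint (assumption): the axis-aligned bounding box."""
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))


def centroid(points):
    """Simple descriptor of a dot pattern: its mean position."""
    xs, ys = zip(*points)
    return (sum(xs) / len(points), sum(ys) / len(points))


def centroid_shift(old, new):
    """Hypothetical change identifier: centroid displacement between two
    patterns, divided by the diagonal of the old pattern's bounding box."""
    (ox, oy), (nx, ny) = centroid(old), centroid(new)
    x0, y0, x1, y1 = bounding_box(old)
    diagonal = hypot(x1 - x0, y1 - y0) or 1.0  # avoid dividing by zero
    return hypot(nx - ox, ny - oy) / diagonal


class FootprintTracker:
    """Recompute the footprint only when the change identifier exceeds a threshold."""

    def __init__(self, points, threshold=0.1):
        self.threshold = threshold
        self.reference = list(points)  # pattern the current footprint was built from
        self.footprint = bounding_box(points)
        self.updates = 0

    def observe(self, points):
        """Take a new snapshot of the pattern and return the footprint in force."""
        if centroid_shift(self.reference, points) > self.threshold:
            self.reference = list(points)
            self.footprint = bounding_box(points)
            self.updates += 1
        return self.footprint
```

Comparing each snapshot against the pattern the footprint was last built from, rather than against the previous snapshot, means slow drift accumulates and eventually triggers an update.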

    Visual analytics of geo-related multidimensional data

    In recent years, both the volume and the availability of urban data related to various social issues, such as real estate, crime, and population, have been increasing rapidly. Analysing such urban data can help the government make evidence-based decisions leading to better-informed policies; citizens can also benefit in many scenarios, such as home-seeking. However, the analytic design process can be challenging since (i) urban data often has multiple attributes (e.g., the distance to a supermarket, the distance to work, and the school zone in real estate data) that are highly related to geography; and (ii) users might have various analysis/exploration tasks that are hard to define (e.g., different home-buyers might have different requirements for housing properties, and many of them might not know what they want before they understand the local real estate market). In this thesis, we use visual analytics techniques to study such geo-related multidimensional urban data and answer the following research questions. For the first research question, we propose a visual analytics framework/system for geo-related multidimensional data. Since visual analytics and visualization designs are highly domain-specific, we use the real estate domain as an example to study the problem. Specifically, we first propose a problem abstraction to satisfy the requirements of users (e.g., home buyers, investors). Second, we collect, integrate, and clean the last ten years of real estate sales records in Australia, as well as their location-related education, facility, and transportation profiles, to generate a real multidimensional data repository. Third, we propose an interactive visual analytic procedure to help less informed users gradually learn about the local real estate market, after which users exploit this learned knowledge to specify their personalized requirements in property seeking.
    Fourth, we propose a series of designs to visualize properties/suburbs in different dimensions and at different granularities. Finally, we implement a system prototype for public access (http://115.146.89.158) and present case studies based on real-world datasets and real scenarios to demonstrate the usefulness and effectiveness of our system. Our second research question extends the first and studies the scalability problem of supporting cluster-based visualization for large-scale geo-related multidimensional data. In particular, we first propose a design space for cluster-based geographic visualization. To calculate the geographic boundary of each cluster, we propose a concave hull algorithm that avoids complex shapes, large empty areas inside the boundary, and overlaps among different clusters. Supported by the concave hull algorithm, we design a cluster-based data structure named ConcaveCubes to efficiently support interactive responses to users' visual exploration of large-scale geo-related multidimensional data. Finally, we build a demo system (http://115.146.89.158/ConcaveCubes) to demonstrate the cluster-based geographic visualization, present extensive experiments using real-world datasets, and compare ConcaveCubes with state-of-the-art cube-based structures to verify its efficiency and effectiveness. The last research question studies the visual analytics of urban areas of interest (AOIs), where we visualize geographic points that satisfy the user query as a limited number of regions (AOIs) instead of a large number of individual points (POIs). After proposing a design space for AOI visualization, we design a parameter-free footprint method named AOI-shapes to effectively capture the region of an AOI based on the POIs that satisfy the user query and those that do not. We also propose two incremental methods that generate the AOI-shapes by reusing previous calculations when users update their AOI query.
    Finally, we implement an online demo (http://www.aoishapes.com) and conduct extensive experiments to demonstrate the efficiency and effectiveness of the proposed AOI-shapes.