
    An Effective Approach to Predicting Large Dataset in Spatial Data Mining Area

    Due to the enormous quantities of spatial satellite images, telecommunication images, health-related tools and similar sources, it is often impractical for users to examine spatial data (S) in detail. Large datasets are common and pervasive across many application areas, and discovering or predicting patterns in them is vital. This research focused on developing new methods, models and techniques for accomplishing advanced spatial data mining (ASDM) tasks. The algorithms were designed to challenge state-of-the-art data technologies, and they were tested on both randomly generated and real-world data. Two main approaches were adopted to achieve the objectives: (1) identifying the actual data types (DTs), data structures and spatial content of a given dataset (to make the model versatile and robust), and (2) integrating these data types into an appropriate database management system (DBMS) framework for easy management and manipulation. Together, these approaches help discover the general and varying types of patterns within any given dataset, whether non-spatial, spatial or temporal (spatial data are always influenced by temporal agents).

    An iterative system development methodology was adopted in this study, as a strategy to cope with the irregularity that often exists within spatial datasets. Some of the challenges encountered in this work, which are also current challenges facing spatial data mining, include: (a) the time complexity of availing useful data for analysis, (b) the time complexity of loading data into storage and (c) the difficulty of discovering spatial, non-spatial and temporal correlations between different data objects. Despite these challenges, spatial data mining can benefit from several opportunities, including cloud computing, Spark technology, parallelisation and bulk-loading methods. Techniques and application areas of spatial data mining (SDM) were identified, and their strengths and limitations were documented.

    Finally, new methods and algorithms for mining very large datasets of spatial or non-spatial bias were created. The proposed models/systems are documented in the following sections: (a) a new technique for parallel indexing of large datasets (PaX-DBSCAN); (b) new techniques for clustering (X-DBSCAN) in a learning process; (c) a new technique for detecting human skin in an image; (d) a new technique for finding a face in an image; and (e) a novel technique for managing large spatial and non-spatial datasets (aX-tree). The most prominent of our methods is the new structure used in (c) above, the packed maintained k-dimensional tree (Pmkd-tree), for fast spatial indexing and querying. This structure is a combination system that integrates all the proposed algorithms into one solid, standard, useful and high-quality system. The intention of the final algorithm (system) is to combine the entire set of initially proposed algorithms into one strong, generic, effective tool for predicting large datasets in the SDM area, capable of finding patterns that exist among spatial or non-spatial objects in a DBMS. In addition to the Pmkd-tree, we also implemented a novel spatial structure, the packed quad-tree (Pquad-Tree), to balance and speed up the performance of the regular quad-tree. Our systems have so far shown efficiency in terms of performance, storage and speed.
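    As a point of reference for the clustering contributions above, the sketch below runs the standard DBSCAN algorithm (via scikit-learn) that PaX-DBSCAN and X-DBSCAN extend; the parallel-indexing and learning-process extensions themselves are not reproduced here, and the data is synthetic.

```python
# Baseline DBSCAN clustering on synthetic 2-D spatial points.
# PaX-DBSCAN / X-DBSCAN extend this base algorithm; the thesis's
# extensions are not reproduced here.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(42)
# Two dense blobs plus sparse noise, standing in for spatial objects.
points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(100, 2)),
    rng.normal(loc=(5, 5), scale=0.3, size=(100, 2)),
    rng.uniform(low=-2, high=7, size=(20, 2)),   # background noise
])

# eps: neighbourhood radius; min_samples: density threshold.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(points)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"clusters found: {n_clusters}, noise points: {np.sum(labels == -1)}")
```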
    The final systems (Pmkd-tree and Pquad-Tree) are generic, flexible, robust, light and stable. They are explicit spatial models for analysing any given problem and for predicting objects as spatially distributed events, using basic SDM algorithms. They can be applied to pattern matching, image processing, computer vision, bioinformatics, information retrieval, machine learning (classification and clustering) and many other computational tasks.
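    Since the Pmkd-tree builds on the classic k-dimensional tree, the following minimal sketch shows the textbook structure underlying it: a balanced k-d tree built by median splits, with an axis-aligned range query of the kind used for spatial indexing. The packing and maintenance strategies of the Pmkd-tree itself are not reproduced; this is the baseline structure it improves upon.

```python
# A plain k-d tree with a rectangular range query -- the textbook
# structure that Pmkd-tree packs and maintains (its packing and
# maintenance strategies are not reproduced here).
from __future__ import annotations
from dataclasses import dataclass
from typing import Optional, List, Tuple

Point = Tuple[float, float]

@dataclass
class Node:
    point: Point
    axis: int                         # 0 = x, 1 = y
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def build(points: List[Point], depth: int = 0) -> Optional[Node]:
    if not points:
        return None
    axis = depth % 2
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2            # median split keeps the tree balanced
    return Node(points[mid], axis,
                build(points[:mid], depth + 1),
                build(points[mid + 1:], depth + 1))

def range_query(node: Optional[Node], lo: Point, hi: Point,
                out: List[Point]) -> None:
    """Collect all points inside the axis-aligned box [lo, hi]."""
    if node is None:
        return
    x, a = node.point, node.axis
    if all(lo[d] <= x[d] <= hi[d] for d in (0, 1)):
        out.append(x)
    if lo[a] <= x[a]:                 # box overlaps the left half-space
        range_query(node.left, lo, hi, out)
    if x[a] <= hi[a]:                 # box overlaps the right half-space
        range_query(node.right, lo, hi, out)

tree = build([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
hits: List[Point] = []
range_query(tree, (3, 1), (8, 5), hits)
print(hits)                           # points with 3<=x<=8 and 1<=y<=5
```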

    Mining complex data in highly streaming environments

    Data is growing at a rapid rate because of advanced hardware and software technologies and platforms such as e-health systems, sensor networks, and social media. One of the challenging problems is storing, processing and transferring this big data in an efficient and effective way. One solution to these challenges is to construct synopses by means of data summarization techniques. Motivated by the fact that without summarization, processing, analyzing and communicating this vast amount of data is inefficient, this thesis introduces new summarization frameworks with the main goals of reducing communication costs and accelerating data mining processes in different application scenarios. Specifically, we study the following big data summarization techniques: (i) dimensionality reduction, (ii) clustering, and (iii) histograms, considering their importance and wide use in various areas and domains. We propose three frameworks using these summarization techniques to cover three aspects of big data ("Volume", "Velocity" and "Variety") in centralized and decentralized platforms. We use dimensionality reduction techniques for summarizing large 2D arrays, and clustering and histograms for processing multiple data streams.

    Given the importance and rapid growth of emerging e-health applications such as tele-radiology and tele-medicine, which require fast, low-cost, and often lossless access to massive amounts of medical images and data over band-limited channels, our first framework summarizes streams of large-volume medical images (e.g. X-rays) for the purpose of compression. Significant amounts of correlation and redundancy exist across different medical images. These can be extracted and used as a data summary to achieve better compression, and consequently lower storage and communication overheads on the network. We propose a novel memory-assisted compression framework as a learning-based universal coding scheme, which can complement any existing algorithm to further eliminate redundancies/similarities across images. This approach is motivated by the fact that, often in medical applications, massive amounts of correlated images from the same family are available as training data for learning the dependencies and deriving appropriate reference or synopsis models. The models can then be used for compression of any new image from the same family. In particular, dimensionality reduction techniques such as Principal Component Analysis (PCA) and Non-negative Matrix Factorization (NMF) are applied to a set of images from the training data to form the required reference models. The proposed memory-assisted compression allows each image to be processed independently of other images, and hence allows individual image access and transmission.

    In the second part of our work, we investigate the problem of summarizing distributed multidimensional data streams using clustering. We devise a distributed clustering framework, DistClusTree, that extends the centralized ClusTree approach. The main difficulty in distributed clustering is balancing communication cost against clustering quality; DistClusTree tackles this by combining spatial index summaries with online tracking for efficient local and global incremental clustering. We demonstrate through extensive experiments the efficacy of the framework in terms of communication costs and approximate clustering quality.
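    A minimal sketch of the memory-assisted idea in the first framework: a PCA reference model is learned from a family of correlated images, and a new image from the same family is then represented by a handful of PCA coefficients plus a low-energy residual. The synthetic data, component count and residual handling here are illustrative assumptions, not the thesis's exact pipeline.

```python
# Memory-assisted compression, sketched: learn a PCA reference model
# from a family of correlated images, then encode a new image as a few
# PCA coefficients plus a (highly compressible) residual. Shapes and
# the residual step are illustrative assumptions.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
h, w, n_train = 32, 32, 200

# Synthetic "family" of correlated images: shared low-rank structure + noise.
basis = rng.normal(size=(8, h * w))
train = rng.normal(size=(n_train, 8)) @ basis \
        + 0.05 * rng.normal(size=(n_train, h * w))

model = PCA(n_components=8).fit(train)        # the shared reference model

new_image = rng.normal(size=(1, 8)) @ basis \
            + 0.05 * rng.normal(size=(1, h * w))
coeffs = model.transform(new_image)           # tiny summary: 8 numbers
residual = new_image - model.inverse_transform(coeffs)

# The residual carries far less energy than the raw image, so any
# standard coder compresses it much better than the original pixels.
print(f"residual energy / image energy: "
      f"{np.linalg.norm(residual) / np.linalg.norm(new_image):.3f}")
```

    Note that, as the abstract states, each new image is encoded independently given the shared model, which is what preserves individual image access and transmission.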
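    For the second framework, the sketch below shows the kind of additive micro-cluster summary (count, linear sum, sum of squares) that ClusTree-style stream clustering maintains and that a distributed scheme can merge at a coordinator at negligible communication cost. DistClusTree's spatial index summaries and online tracking are not reproduced; the mergeable summary is only the assumed building block.

```python
# Mergeable "clustering feature" summary used by ClusTree-family stream
# clustering: (count, linear sum, sum of squares). Local sites ship these
# tiny summaries instead of raw points; a coordinator merges them
# additively. DistClusTree's index and tracking layers are not shown.
import numpy as np

class MicroCluster:
    def __init__(self, dim: int):
        self.n = 0
        self.ls = np.zeros(dim)       # linear sum of points
        self.ss = 0.0                 # sum of squared norms

    def insert(self, x: np.ndarray) -> None:
        self.n += 1
        self.ls += x
        self.ss += float(x @ x)

    def merge(self, other: "MicroCluster") -> None:
        # Summaries are additive, so merging loses no information.
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def centroid(self) -> np.ndarray:
        return self.ls / self.n

rng = np.random.default_rng(1)
site_a, site_b = MicroCluster(2), MicroCluster(2)
for x in rng.normal(loc=(1, 1), scale=0.2, size=(500, 2)):
    site_a.insert(x)
for x in rng.normal(loc=(1.1, 0.9), scale=0.2, size=(500, 2)):
    site_b.insert(x)

site_a.merge(site_b)   # coordinator-side merge: O(dim) bytes moved, not 1000 points
print("global centroid:", site_a.centroid())
```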
    In the last part, we use a multidimensional index structure to merge distributed summaries into a centralized histogram, another widely used summarization technique, with application to approximate range-query answering. We propose the index-based Distributed Mergeable Summaries (iDMS) framework, based on kd-trees, which addresses these challenges with data-generative models: Gaussian mixture models (GMMs) and a generative adversarial network (GAN). iDMS maintains a global approximate kd-tree at a central site via GMMs or GANs upon new arrivals of streaming data at local sites. Experimental results validate the effectiveness and efficiency of iDMS against baseline distributed settings in terms of approximation error and communication costs.
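    A minimal sketch of the generative-model side of iDMS, under the assumption that a local site fits a GMM to its recent points and transmits only the model parameters, from which the central site regenerates a synthetic sample to answer range queries approximately. The global kd-tree maintenance and the GAN variant are not reproduced here.

```python
# Generative-model summary, sketched: a local site fits a GMM to its
# streaming points and ships only the model parameters; the central site
# samples from the GMM to refresh its approximate view (in the thesis,
# a global kd-tree) and answer range queries approximately.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(7)
local_points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(1000, 2)),
    rng.normal(loc=(4, 4), scale=0.8, size=(1000, 2)),
])

# Local site: transmit roughly K*(1 + d + d^2) parameters, not 2000 points.
gmm = GaussianMixture(n_components=2, random_state=0).fit(local_points)

# Central site: regenerate a synthetic sample from the received model.
synthetic, _ = gmm.sample(2000)

def range_count(pts: np.ndarray, lo, hi) -> int:
    """Count points inside the axis-aligned box [lo, hi]."""
    return int(np.all((pts >= lo) & (pts <= hi), axis=1).sum())

box = ((-1, -1), (1, 1))
print("exact count:      ", range_count(local_points, *box))
print("approximate count:", range_count(synthetic, *box))
```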