17 research outputs found

    Representative Points and Cluster Attributes Based Incremental Sequence Clustering Algorithm

    To improve the execution time and clustering quality of sequence clustering on large-scale dynamic datasets, a novel algorithm, RPCAISC (Representative Points and Cluster Attributes Based Incremental Sequence Clustering), is presented. The paper defines a density factor: any primary representative point whose density factor is below a prescribed threshold is deleted directly, and new representative points can be reselected from the non-representative points. The representative points of each cluster are modeled using the K-nearest neighbor method. The paper also defines the relevant degree (RD) between clusters, which is computed by considering the correlations of objects both within a cluster and between different clusters; the RD then determines whether two clusters need to merge. The cluster attributes of the initial clustering are retained throughout this process. By calculating the matching degree between an incremental sequence and the existing cluster attributes, dynamic sequence clustering can be achieved. Theoretical analysis and experimental results show that RPCAISC achieves a higher rate of correct clustering results and better execution efficiency.
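The pruning step described in this abstract can be sketched as follows. The abstract does not give the formula for the density factor, so this sketch assumes one (the inverse of the mean distance to the k nearest neighbors) and uses 2-D points with Euclidean distance as a stand-in for sequences. The function names `knn_density_factor` and `prune_representatives` are hypothetical and are not taken from the paper.

```python
import math


def knn_density_factor(point, others, k):
    """Assumed density factor: inverse of the mean distance to the k nearest neighbors.

    This definition is an illustration only; the paper's own formula is not
    given in the abstract.
    """
    nearest = sorted(math.dist(point, other) for other in others)[:k]
    if not nearest:
        return 0.0  # an isolated point has no neighbors to be dense with
    mean_distance = sum(nearest) / len(nearest)
    return math.inf if mean_distance == 0 else 1.0 / mean_distance


def prune_representatives(representatives, members, k, threshold):
    """Keep only representative points whose density factor reaches the threshold."""
    kept = []
    for rep in representatives:
        others = [m for m in members if m != rep]
        if knn_density_factor(rep, others, k) >= threshold:
            kept.append(rep)
    return kept


# Hypothetical data: four tightly grouped points and one far-away point.
members = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
representatives = [(0, 0), (10, 10)]
kept = prune_representatives(representatives, members, k=2, threshold=0.5)
```

With these made-up values the far-away representative is dropped because its neighbors are distant, which is the behavior the abstract attributes to low-density representative points.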

    Hot Zone Identification: Analyzing Effects of Data Sampling On Spam Clustering

    Email is the most common and comparatively the most efficient means of exchanging information in today's world. However, given the widespread use of email in all sectors, it has been a target of spammers since the beginning. Filtering spam emails has now led to critical actions such as forensic activities based on mining spam email. The data mine for spam emails at the University of Alabama at Birmingham is considered to be one of the most prominent resources for mining and identifying spam sources. It is a widely researched repository used by researchers from different global organizations. The usual process of mining the spam data involves going through every email in the data mine and clustering the emails based on their different attributes. However, given the size of the data mine, it takes an exceptionally long time to execute the clustering mechanism each time. In this paper, we illustrate sampling as an efficient tool for data reduction that preserves the information within the clusters, which allows spam forensic experts to quickly and effectively identify the ‘hot zone’ from the spam campaigns. We provide a detailed comparative analysis of the quality of the clusters after sampling, the overall distribution of clusters on the spam data, and timing measurements for our sampling approach. Additionally, we present different strategies that allowed us to optimize the sampling process using data preprocessing and the database engine's computational resources, thus improving the performance of the clustering process.

    Hot Zone Identification: Analyzing Effects of Data Sampling on SPAM Clustering

    Email is the most common and comparatively the most efficient means of exchanging information in today's world. However, given the widespread use of email in all sectors, it has been a target of spammers since the beginning. Filtering spam emails has now led to critical actions such as forensic activities based on mining spam email. The data mine for spam emails at the University of Alabama at Birmingham is considered to be one of the most prominent resources for mining and identifying spam sources. It is a widely researched repository used by researchers from different global organizations. The usual process of mining the spam data involves going through every email in the data mine and clustering the emails based on their different attributes. However, given the size of the data mine, it takes an exceptionally long time to execute the clustering mechanism each time. In this paper, we illustrate sampling as an efficient tool for data reduction that preserves the information within the clusters, which allows spam forensic experts to quickly and effectively identify the ‘hot zone’ from the spam campaigns. We provide a detailed comparative analysis of the quality of the clusters after sampling, the overall distribution of clusters on the spam data, and timing measurements for our sampling approach. Additionally, we present different strategies that allowed us to optimize the sampling process using data preprocessing and the database engine's computational resources, thus improving the performance of the clustering process. Keywords: Clustering, Data mining, Monte-Carlo Sampler, Sampling, Spam, Step Sequence Sampler, Stepping Random Sampler, Hot Zon
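One way to check whether a sample preserves the distribution of clusters is to compare each cluster's share in the full data with its share in the sample. The sketch below uses plain uniform random sampling from Python's standard library, not the samplers named in the keywords, and all names (`cluster_shares`, `sample_labels`, `max_share_drift`) and data are hypothetical.

```python
import random
from collections import Counter


def cluster_shares(labels):
    """Return the fraction of records that falls into each cluster label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def sample_labels(labels, fraction, seed=0):
    """Draw a uniform random sample without replacement (fixed seed for repeatability)."""
    rng = random.Random(seed)
    size = max(1, round(len(labels) * fraction))
    return rng.sample(labels, size)


def max_share_drift(full, sampled):
    """Largest absolute change in any cluster's share between full data and sample."""
    full_shares = cluster_shares(full)
    sample_shares = cluster_shares(sampled)
    return max(
        abs(share - sample_shares.get(label, 0.0))
        for label, share in full_shares.items()
    )


# Hypothetical cluster labels for 1000 emails across three campaigns.
labels = ["campaign_a"] * 700 + ["campaign_b"] * 200 + ["campaign_c"] * 100
sampled = sample_labels(labels, fraction=0.1)
drift = max_share_drift(labels, sampled)
```

A small drift suggests that the dominant clusters remain identifiable in the sample while only a tenth of the records need to be processed.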

    Mrdbscan: An efficient parallel density-based clustering algorithm using mapreduce

    Data clustering is an important data mining technology that plays a crucial role in numerous scientific applications. However, it is challenging because real-world datasets have been growing rapidly to extra-large scale. Meanwhile, MapReduce is a desirable parallel programming platform that is widely applied in many data processing fields. In this paper, we propose an efficient parallel density-based clustering algorithm and implement it with a four-stage MapReduce paradigm. Furthermore, we adopt a quick partitioning strategy for large-scale non-indexed data. We study the metric for merging bordering partitions and optimize it. Finally, we evaluate our work on real large-scale datasets using the Hadoop platform. Results show that our approach achieves good speedup and scaleup.
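The merge step between bordering partitions can be illustrated with a union-find structure. This is a simplified sketch, not the paper's merge metric: it assumes each partition has already been clustered locally and that any point appearing in two partitions links its two local clusters (a real density-based merge would typically also require the shared point to be a core point). The function names and data are hypothetical.

```python
def find(parent, key):
    """Return the root of key's set, compressing the path along the way."""
    while parent[key] != key:
        parent[key] = parent[parent[key]]
        key = parent[key]
    return key


def merge_local_clusters(local_labels):
    """Merge per-partition clusters that share a border point.

    local_labels holds one dict per partition, mapping point id to local
    cluster id. Returns a dict mapping point id to a global cluster key.
    """
    parent = {}
    first_seen = {}  # point id -> first (partition, cluster) key it appeared in
    for part, labels in enumerate(local_labels):
        for point, cluster in labels.items():
            key = (part, cluster)
            parent.setdefault(key, key)
            if point in first_seen:
                root_a = find(parent, first_seen[point])
                root_b = find(parent, key)
                if root_a != root_b:
                    parent[root_b] = root_a  # union the two local clusters
            else:
                first_seen[point] = key
    merged = {}
    for part, labels in enumerate(local_labels):
        for point, cluster in labels.items():
            merged[point] = find(parent, (part, cluster))
    return merged


# Hypothetical input: point 2 lies in the overlap of partitions 0 and 1.
partitions = [
    {1: 0, 2: 0, 3: 1},
    {2: 0, 4: 0, 5: 1},
]
merged = merge_local_clusters(partitions)
```

Here points 1, 2, and 4 end up in one global cluster because point 2 bridges the two partitions, while points 3 and 5 keep their own clusters.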

    Performance evaluation of a distributed clustering approach for spatial datasets

    The analysis of big data requires powerful, scalable, and accurate data analytics techniques that traditional data mining and machine learning do not provide as a whole. Therefore, new data analytics frameworks are needed to deal with big data challenges such as the volume, velocity, veracity, and variety of the data. Distributed data mining is a promising approach for big data sets, as they are usually produced in distributed locations, and processing them at their local sites significantly reduces response times, communication, and so on. In this paper, we study the performance of a distributed clustering approach called Dynamic Distributed Clustering (DDC). DDC can generate clusters remotely and then aggregate them using an efficient aggregation algorithm. The technique is developed for spatial datasets. We evaluated DDC using two types of communication (synchronous and asynchronous) and tested it under various load distributions. The experimental results show that the approach achieves super-linear speed-up, scales up very well, and can take advantage of recent programming models, such as the MapReduce model, because its results are not affected by the type of communication.
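For readers unfamiliar with the term, super-linear speed-up means that the speed-up on p workers exceeds p, so parallel efficiency is above 1. The short sketch below uses the standard definitions (speed-up as serial time over parallel time, efficiency as speed-up over worker count); the timings are invented for illustration and are not results from the paper.

```python
def speedup(t_serial, t_parallel):
    """Classic speed-up: serial run time divided by parallel run time."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, workers):
    """Parallel efficiency: speed-up per worker."""
    return speedup(t_serial, t_parallel) / workers


def is_super_linear(t_serial, t_parallel, workers):
    """Speed-up is super-linear when it exceeds the number of workers."""
    return speedup(t_serial, t_parallel) > workers


# Hypothetical timings in seconds: 100 s on one worker, 20 s on four workers.
s = speedup(100.0, 20.0)
e = efficiency(100.0, 20.0, workers=4)
super_linear = is_super_linear(100.0, 20.0, workers=4)
```

In this made-up case four workers give a five-fold speed-up, which is the kind of result that local processing and smaller per-site working sets can produce.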

    Towards Optimal Execution of Density-based Clustering on Heterogeneous Hardware

    Data clustering is an important and highly utilized data mining technique in various application domains. With ever-increasing data volumes in the era of big data, the efficient execution of clustering algorithms is a fundamental prerequisite for gaining understanding and acquiring novel, previously unknown knowledge from data. To achieve efficient execution, clustering algorithms have to be re-engineered to fully exploit the available hardware capabilities. Shared-memory multiprocessor systems like graphics processing units (GPUs) provide extremely high parallelism combined with high-bandwidth transfer at low cost. The availability of such computing units increases with upcoming processors, where a common CPU and various computing units, like GPUs, are tightly coupled using a unified shared memory hierarchy. In this paper, we consider density-based clustering for such heterogeneous systems. In particular, we optimize the configuration of CUDA-DClust, a density-based clustering algorithm for GPUs, and show that our configuration approach enables an efficient and deterministic execution. Our configuration approach is based on data as well as hardware properties, so that we are able to adjust the algorithm execution in both directions. In our evaluation, we show the applicability of our approach and present open challenges that have to be solved next.

    Novelty Detection And Cluster Analysis In Time Series Data Using Variational Autoencoder Feature Maps

    The identification of atypical events and anomalies in complex data systems is an essential yet challenging task. The dynamic nature of these systems produces huge volumes of data that are often heterogeneous, and the failure to account for this will impede the detection of anomalies. Time series data encompass these issues, and their high-dimensional nature intensifies these challenges. This research presents a framework for the identification of anomalies in temporal data. A comparative analysis of centroid-, density-, and neural-network-based clustering techniques was performed and their scalability was assessed. This facilitated the development of a new algorithm called the Variational Autoencoder Feature Map (VAEFM), an ensemble method based on Kohonen’s Self-Organizing Maps (SOM) and Variational Autoencoders. The VAEFM is an unsupervised learning algorithm that models the distribution of temporal data without making a priori assumptions. It incorporates principles of novelty detection to enhance the representational capacity of SOM neurons, which improves their ability to generalize to novel data. The VAEFM technique was demonstrated on a dataset of accumulated aircraft sensor recordings to detect atypical events that transpired in the approach phase of flight. This is a proactive means of accident prevention and is therefore advantageous to the aviation industry. Furthermore, accumulated aircraft data present big data challenges, which require scalable analytical solutions. The results indicated that VAEFM successfully identified temporal dependencies in the flight data and produced several clusters and outliers. It analyzed over 2500 flights in under 5 minutes and identified 12 clusters, two of which contained stabilized approaches. The remainder comprised aborted approaches, excessively high/fast descent patterns, and other contributory factors for unstabilized approaches. Outliers were detected that revealed oscillations in aircraft trajectories, some of which would have a lower detection rate using traditional flight safety analytical techniques. The results further indicated that VAEFM facilitates large-scale analysis; its scaling efficiency was demonstrated on a High Performance Computing System by using an increased number of processors, where it achieved an average speedup of 70%.

    Intelligent Tourist Routes

    Most people like to travel, and Porto was voted the most interesting European city to visit in 2019. With great potential for attracting visitors, Porto offers endless tourist route options. Recent research shows that an efficient travel operator should not only take into account the user's needs and constraints, but also allow some degree of free exploration of the city, adapting the offer to the user's preferences. An overall picture of the context is a good starting point for a memorable trip. This dissertation aims to develop an intelligent system capable of maximizing visitor satisfaction by creating dynamic, personalized routes based on users' preferences and interests. These will be assessed directly through modern segmentation and profile discovery techniques, and indirectly through the scores that users give to sets of photographs (standard and 360-degree) of the points of interest. Along the route, the user can give feedback on the suggested points of interest in order to enhance the system's learning.