1,370 research outputs found

    Parallel Hierarchical Affinity Propagation with MapReduce

    The accelerated evolution and explosion of the Internet and social media are generating voluminous quantities of data (on zettabyte scales). Paramount amongst the desires to manipulate and extract actionable intelligence from vast big data volumes is the need for scalable, performance-conscious analytics algorithms. To directly address this need, we propose a novel MapReduce implementation of the exemplar-based clustering algorithm known as Affinity Propagation. Our parallelization strategy extends to the multilevel Hierarchical Affinity Propagation algorithm and enables tiered aggregation of unstructured data with minimal free parameters, in principle requiring only a similarity measure between data points. We detail the linear run-time complexity of our approach, overcoming the limiting quadratic complexity of the original algorithm. Experimental validation of our clustering methodology on a variety of synthetic and real data sets (e.g. images and point data) demonstrates our competitiveness against other state-of-the-art MapReduce clustering techniques.

    Efficient classification using parallel and scalable compressed model and its application on intrusion detection

    To achieve highly efficient classification in intrusion detection, this paper proposes a compressed model that combines horizontal compression with vertical compression. OneR is used as horizontal compression for attribute reduction, and affinity propagation is employed as vertical compression to select a small set of representative exemplars from large training data. To compress larger volumes of training data scalably, a MapReduce-based parallelization approach is implemented and evaluated for each step of the model compression process described above, after which common but efficient classification methods can be used directly. An experimental study on two publicly available intrusion detection datasets, KDD99 and CMDC2012, demonstrates that classification using the proposed compressed model can speed up the detection procedure by up to 184 times, at the cost of an accuracy difference of less than 1% on average.
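The horizontal compression step above relies on OneR. As a rough illustration only, the sketch below ranks nominal attributes by the training error of a one-rule classifier, so that the weakest attributes could be dropped. The function names, the toy records, and the ranking criterion are assumptions made for this example; the abstract does not say how the paper applies OneR in detail.

```python
from collections import Counter, defaultdict


def one_r_error(rows, labels, attribute):
    """Training error of the one-rule classifier built on a single attribute."""
    by_value = defaultdict(Counter)
    for row, label in zip(rows, labels):
        by_value[row[attribute]][label] += 1
    # Each attribute value predicts its majority class; all other records are errors.
    correct = sum(counts.most_common(1)[0][1] for counts in by_value.values())
    return 1 - correct / len(rows)


def rank_attributes(rows, labels):
    """Attribute indices ordered from most to least predictive under OneR."""
    return sorted(range(len(rows[0])), key=lambda a: one_r_error(rows, labels, a))


# Hypothetical toy records: (protocol, traffic volume) with a class label each.
rows = [("tcp", "low"), ("tcp", "high"), ("udp", "low"), ("udp", "high")]
labels = ["normal", "attack", "normal", "attack"]
print(rank_attributes(rows, labels))
```

In this toy data the second attribute separates the classes perfectly, so it ranks first; keeping only the top-ranked attributes is one simple way to reduce the attribute set.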

    Incremental Affinity Propagation based on Cluster Consolidation and Stratification

    Modern data mining applications require incremental clustering over dynamic datasets, tracing temporal changes in the resulting clusters. In this paper, we propose A-Posteriori affinity Propagation (APP), an incremental extension of Affinity Propagation (AP) based on cluster consolidation and cluster stratification to achieve faithfulness and forgetfulness. APP enforces incremental clustering where i) newly arriving objects are dynamically consolidated into previous clusters without the need to re-execute clustering over the entire dataset of objects, and ii) a faithful sequence of clustering results is produced and maintained over time, while obsolete clusters can be forgotten through decremental learning functionalities. Four popular labeled datasets are used to test the performance of APP with respect to benchmark clustering performances obtained by conventional AP and Incremental Affinity Propagation based on Nearest neighbor Assignment (IAPNA) algorithms. Experimental results show that APP achieves comparable clustering performance while enforcing scalability at the same time.

    Evaluation of the Clustering Performance of Affinity Propagation Algorithm Considering the Influence of Preference Parameter and Damping Factor

    The identification of significant underlying data patterns such as image composition and spatial arrangements is fundamental in remote sensing tasks. Therefore, the development of an effective approach for information extraction is crucial to achieve this goal. The affinity propagation (AP) algorithm is a powerful recent technique able to handle unusual data containing both categorical and numerical attributes. However, AP has some limitations related to the choice of the initial preference parameter, the occurrence of oscillations and the processing of large data sets. This paper evaluates the clustering performance of the AP algorithm, taking into account the influence of the preference parameter and the damping factor. The study was conducted considering the AP algorithm, the adaptive AP and the partition AP. According to the experiments, the choice of preference and damping greatly influences the quality and the final number of clusters.
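To make the roles of the two parameters concrete, the following is a minimal pure-Python sketch of the standard AP message-passing updates (responsibilities and availabilities, following Frey and Dueck, 2007). It is not the code evaluated in the paper above: the function names, the 1-D toy points, the negative squared distance similarity and the fixed iteration count are all assumptions for illustration. The preference sits on the diagonal of the similarity matrix, and the damping factor blends each new message with its previous value.

```python
def similarity_matrix(points, preference):
    """Negative squared distance between 1-D points, preference on the diagonal."""
    n = len(points)
    return [[preference if i == k else -(points[i] - points[k]) ** 2
             for k in range(n)] for i in range(n)]


def ap_step(S, R, A, damping):
    """One damped round of responsibility and availability updates (in place)."""
    n = len(S)
    for i in range(n):
        scores = [A[i][k] + S[i][k] for k in range(n)]
        for k in range(n):
            # Strongest competing candidate exemplar for point i, excluding k.
            best_other = max(scores[j] for j in range(n) if j != k)
            R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - best_other)
    for k in range(n):
        positive = [max(0.0, R[i][k]) for i in range(n)]
        total = sum(positive) - positive[k]  # positive support from points other than k
        for i in range(n):
            if i == k:
                target = total
            else:
                target = min(0.0, R[k][k] + total - positive[i])
            A[i][k] = damping * A[i][k] + (1 - damping) * target


def affinity_propagation(points, preference, damping=0.5, iterations=200):
    """Return, for each point, the index of its exemplar ([] if none emerged)."""
    n = len(points)
    S = similarity_matrix(points, preference)
    R = [[0.0] * n for _ in range(n)]
    A = [[0.0] * n for _ in range(n)]
    for _ in range(iterations):
        ap_step(S, R, A, damping)
    exemplars = [k for k in range(n) if R[k][k] + A[k][k] > 0]
    if not exemplars:
        return []
    return [i if i in exemplars else max(exemplars, key=lambda k: S[i][k])
            for i in range(n)]


points = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
print(affinity_propagation(points, preference=0.0))  # every point is its own exemplar
print(affinity_propagation(points, preference=-10.0, damping=0.5))
```

With a preference at least as large as every pairwise similarity, each point prefers to represent itself. Lower preferences generally yield fewer exemplars, and damping values closer to 1 slow the updates, which is the usual remedy when the messages oscillate.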

    Extracting user spatio-temporal profiles from location based social networks

    Research report. Location Based Social Networks (LBSN) like Twitter or Instagram are a good source of data on users' spatio-temporal behavior. These social networks provide a low-rate sampling of users' location information over long intervals of time that can be used to discover complex behaviors, including mobility profiles, points of interest or unusual events. This information is important for different domains like mobility route planning, tourist recommendation systems or city planning. Other approaches have used the data from LBSN to categorize areas of a city depending on the categories of the places that people visit or to discover user behavioral patterns from their visits. The aim of this paper is to analyze how the spatio-temporal behavior of a large number of users in a well-delimited geographical area can be segmented into different profiles. These behavioral profiles are obtained by means of clustering algorithms that show the different behaviors that people have when living in and visiting a city. The data analyzed was obtained from the public data feeds of Twitter and Instagram inside the area of the city of Barcelona for a period of several months. The analysis of these data shows that this kind of algorithm can be successfully applied to data from any city (or any general area) to discover useful profiles that can be described in terms of the city's singular places and areas and their temporal relationships. These profiles can be used as a basis for making decisions in different application domains, especially those related to mobility inside and outside a city. Preprint.

    Unsupervised learning for long-term autonomy

    This thesis investigates methods to enable a robot to build and maintain an environment model in an automatic manner. Such capabilities are especially important in long-term autonomy, where robots operate for extended periods of time without human intervention. In such scenarios we can no longer assume that the environment and the models will remain static. Rather, changes are expected and the robot needs to adapt to the new, unseen circumstances automatically. The approach described in this thesis is based on clustering the robot’s sensing information. This provides a compact representation of the data which can be updated as more information becomes available. The work builds on affinity propagation (Frey and Dueck, 2007), a recent clustering method which obtains high quality clusters while only requiring similarities between pairs of points and, importantly, selecting the number of clusters automatically. This is essential for real autonomy as we typically do not know “a priori” how many clusters best represent the data. The contributions of this thesis are threefold. First, a self-supervised method capable of learning a visual appearance model in long-term autonomy settings is presented. Second, affinity propagation is extended to handle multiple sensor modalities, often occurring in robotics, in a principled way. Third, a method for joint clustering and outlier selection is proposed which selects a user-defined number of outliers while clustering the data. This is solved using an extension of affinity propagation as well as a Lagrangian duality approach which provides guarantees on the optimality of the solution.

    Clustering Approaches for Multi-source Entity Resolution

    Entity Resolution (ER) or deduplication aims at identifying entities, such as specific customer or product descriptions, in one or several data sources that refer to the same real-world entity. ER is of key importance for improving data quality and has a crucial role in data integration and querying. The previous generation of ER approaches focuses on integrating records from two relational databases or performing deduplication within a single database. Nevertheless, in the era of Big Data the number of available data sources is increasing rapidly. Therefore, large-scale data mining or querying systems need to integrate data obtained from numerous sources. For example, in online digital libraries or E-Shops, publications or products are incorporated from a large number of archives or suppliers across the world or within a specified region or country to provide a unified view for the user. This process requires data consolidation from numerous heterogeneous data sources, which are mostly evolving. As the number of sources grows, data heterogeneity and velocity, as well as the variance in data quality, increase. Therefore, multi-source ER, i.e. finding matching entities in an arbitrary number of sources, is a challenging task. Previous efforts for matching and clustering entities between multiple sources (> 2) mostly treated all sources as a single source. This approach precludes the use of metadata or provenance information for enhancing the integration quality and leads to poor results because it ignores differences in quality between sources. The conventional ER pipeline consists of blocking, pair-wise matching of entities, and classification. In order to meet the new needs and requirements, holistic clustering approaches that are capable of scaling to many data sources are needed.
The holistic clustering-based ER should further overcome the restriction of pairwise linking of entities by making the process capable of grouping entities from multiple sources into clusters. The clustering step aims at removing false links while adding missing true links across sources. Additionally, incremental clustering and repairing approaches need to be developed to cope with the ever-increasing number of sources and new incoming entities. To this end, we developed novel clustering and repairing schemes for multi-source entity resolution. The approaches are capable of grouping entities from multiple clean (duplicate-free) sources, as well as handling data from an arbitrary combination of clean and dirty sources. The multi-source clustering schemes developed specifically for multi-source ER can obtain superior results compared to general purpose clustering algorithms. Additionally, we developed incremental clustering and repairing methods in order to handle the evolving sources. The proposed incremental approaches are capable of incorporating new sources as well as new entities from existing sources. The more sophisticated approach is able to repair previously determined clusters, and consequently yields improved quality and a reduced dependency on the insert order of the new entities. To ensure scalability, parallel variants of all approaches are implemented on top of the Apache Flink framework, which is a distributed processing engine. The proposed methods have been integrated in a new end-to-end ER tool named FAMER (FAst Multi-source Entity Resolution system). The FAMER framework comprises Linking and Clustering components encompassing both batch and incremental ER functionalities. The output of the Linking component is recorded as a similarity graph where each vertex represents an entity and each edge maintains the similarity relationship between two entities. Such a similarity graph is the input of the Clustering component.
The comprehensive comparative evaluations overall show that the proposed clustering and repairing approaches for both batch and incremental ER achieve high quality while maintaining scalability.
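The similarity graph described above can be clustered in many ways. As a point of reference only, the sketch below applies connected components to a thresholded similarity graph, which is a generic baseline and not one of the clustering or repairing schemes proposed in the thesis. The function name, the edge format `(entity_a, entity_b, similarity)`, the entity identifiers and the threshold value are assumptions for this example.

```python
def connected_components(edges, threshold):
    """Group entities linked by edges whose similarity is at least the threshold."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, similarity in edges:
        root_a, root_b = find(a), find(b)  # also registers both entities
        if similarity >= threshold and root_a != root_b:
            parent[root_a] = root_b

    clusters = {}
    for entity in parent:
        clusters.setdefault(find(entity), set()).add(entity)
    return sorted(sorted(cluster) for cluster in clusters.values())


# Hypothetical links between records from three sources (a, b, c).
edges = [
    ("a1", "b1", 0.90),
    ("b1", "c1", 0.80),
    ("a2", "b2", 0.95),
    ("a1", "b2", 0.30),  # weak link, dropped by the threshold
]
print(connected_components(edges, threshold=0.7))
```

A baseline like this merges everything reachable through strong links, so a single false link can fuse two clusters. That weakness is one motivation for the dedicated clustering and repair steps the abstract describes.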

    Semisupervised Clustering for Networks Based on Fast Affinity Propagation

    Most existing clustering algorithms for networks are unsupervised and so cannot use a small amount of prior knowledge to improve clustering quality. We propose a semisupervised clustering algorithm for networks based on fast affinity propagation (SCAN-FAP), which is essentially a kind of similarity metric learning method. Firstly, we define a new constraint similarity measure integrating the structural information and the pairwise constraints, which reflects the effective similarities between nodes in networks. Then, taking the constraint similarities as input, we propose a fast affinity propagation algorithm which keeps the advantages of the original affinity propagation algorithm while increasing the time efficiency by passing messages only between certain nodes. Finally, through extensive experimental studies, we demonstrate that the proposed algorithm can take full advantage of the prior knowledge and improve the clustering quality significantly. Furthermore, our algorithm outperforms some of the state-of-the-art approaches.

    Adaptive semi-supervised affinity propagation clustering algorithm based on structural similarity

    In view of the unsatisfactory clustering performance of the affinity propagation (AP) clustering algorithm when dealing with data sets of complex structures, an adaptive semi-supervised affinity propagation clustering algorithm based on structural similarity (SAAP-SS) is proposed in this paper. First, a novel structural similarity is proposed by solving a non-linear, low-rank representation problem. Then we perform affinity propagation on the basis of adjusting the similarity matrix by utilizing the known pairwise constraints. Finally, the idea of fireworks explosion is introduced into the process of the algorithm. By adaptively searching the preference space bi-directionally, the algorithm’s global and local searching abilities are balanced in order to find the optimal clustering structure. The results of the experiments with both synthetic and real data sets show performance improvements of the proposed algorithm compared with AP, FEO-SAP and K-means methods.