223,932 research outputs found

    Distributed Document Clustering and Cluster Summarization in Peer-to-Peer Environments

    Get PDF
    This thesis addresses difficult challenges in distributed document clustering and cluster summarization. Mining large document collections poses many challenges, one of which is the extraction of topics or summaries from documents for the purpose of interpretation of clustering results. Another important challenge, which is caused by new trends in distributed repositories and peer-to-peer computing, is that document data is becoming more distributed. We introduce a solution for interpreting document clusters using keyphrase extraction from multiple documents simultaneously. We also introduce two solutions for the problem of distributed document clustering in peer-to-peer environments, each satisfying a different goal: maximizing local clustering quality through collaboration, and maximizing global clustering quality through cooperation. The keyphrase extraction algorithm efficiently extracts and scores candidate keyphrases from a document cluster. The algorithm is called CorePhrase and is based on modeling document collections as a graph upon which we can leverage graph mining to extract frequent and significant phrases, which are used to label the clusters. Results show that CorePhrase can extract keyphrases relevant to documents in a cluster with very high accuracy. Although this algorithm can be used to summarize centralized clusters, it is specifically employed within distributed clustering to both boost distributed clustering accuracy, and to provide summaries for distributed clusters. The first method for distributed document clustering is called collaborative peer-to-peer document clustering, which models nodes in a peer-to-peer network as collaborative nodes with the goal of improving the quality of individual local clustering solutions. This is achieved through the exchange of local cluster summaries between peers, followed by recommendation of documents to be merged into remote clusters. Results on large sets of distributed document collections show that: (i) such collaboration technique achieves significant improvement in the final clustering of individual nodes; (ii) networks with larger number of nodes generally achieve greater improvements in clustering after collaboration relative to the initial clustering before collaboration, while on the other hand they tend to achieve lower absolute clustering quality than networks with fewer number of nodes; and (iii) as more overlap of the data is introduced across the nodes, collaboration tends to have little effect on improving clustering quality. The second method for distributed document clustering is called hierarchically-distributed document clustering. Unlike the collaborative model, this model aims at producing one clustering solution across the whole network. It specifically addresses scalability of network size, and consequently the distributed clustering complexity, by modeling the distributed clustering problem as a hierarchy of node neighborhoods. Summarization of the global distributed clusters is achieved through a distributed version of the CorePhrase algorithm. Results on large document sets show that: (i) distributed clustering accuracy is not affected by increasing the number of nodes for networks of single level; (ii) we can achieve decent speedup by making the hierarchy taller, but on the expense of clustering quality which degrades as we go up the hierarchy; (iii) in networks that grow arbitrarily, data gets more fragmented across neighborhoods causing poor centroid generation, thus suggesting we should not increase the number of nodes in the network beyond a certain level without increasing the data set size; and (iv) distributed cluster summarization can produce accurate summaries similar to those produced by centralized summarization. The proposed algorithms offer high degree of flexibility, scalability, and interpretability of large distributed document collections. Achieving the same results using current methodologies require centralization of the data first, which is sometimes not feasible

    Cluster Ensembles for Big Data Mining Problems

    Get PDF
    Mining big data involves several problems and new challenges, in addition to the huge volume of information. One the one hand, these data generally come from autonomous and decentralized sources, thus its dimensionality is heterogeneous and diverse, and generally involves privacy issues. On the other hand, algorithms for mining data such as clustering methods, have particular characteristics that make them useful for different types of data mining problems. Due to the huge amount of information, the task of choosing a single clustering approach becomes even more difficult. For instance, k-means, a very popular algorithm, always assumes spherical clusters in data; hierarchical approaches can be used when there is interest in finding this type of structure; expectationmaximization iteratively adjusts the parameters of a statistical model to fit the observed data. Moreover, all these methods work properly only with relatively small data sets. Large-volume data often make their application unfeasible, not to mention if data come from autonomous sources that are constantly growing and evolving. In the last years, a new clustering approach has emerged, called consensus clustering or cluster ensembles. Instead of running a single algorithm, this approach produces, at first, a set of data partitions (ensemble) by employing different clustering techniques on the same original data set. Then, this ensemble is processed by a consensus function, which produces a single consensus partition that outperforms individual solutions in the input ensemble. This approach has been successfully employed for distributed data mining, what makes it very interesting and applicable in the big data context. Although many techniques have been proposed for large data sets, most of them mainly focus on making individual components more efficient, instead of improving the whole consensus approach for the case of big data.Sociedad Argentina de Informática e Investigación Operativa (SADIO

    Cluster Ensembles for Big Data Mining Problems

    Get PDF
    Mining big data involves several problems and new challenges, in addition to the huge volume of information. One the one hand, these data generally come from autonomous and decentralized sources, thus its dimensionality is heterogeneous and diverse, and generally involves privacy issues. On the other hand, algorithms for mining data such as clustering methods, have particular characteristics that make them useful for different types of data mining problems. Due to the huge amount of information, the task of choosing a single clustering approach becomes even more difficult. For instance, k-means, a very popular algorithm, always assumes spherical clusters in data; hierarchical approaches can be used when there is interest in finding this type of structure; expectationmaximization iteratively adjusts the parameters of a statistical model to fit the observed data. Moreover, all these methods work properly only with relatively small data sets. Large-volume data often make their application unfeasible, not to mention if data come from autonomous sources that are constantly growing and evolving. In the last years, a new clustering approach has emerged, called consensus clustering or cluster ensembles. Instead of running a single algorithm, this approach produces, at first, a set of data partitions (ensemble) by employing different clustering techniques on the same original data set. Then, this ensemble is processed by a consensus function, which produces a single consensus partition that outperforms individual solutions in the input ensemble. This approach has been successfully employed for distributed data mining, what makes it very interesting and applicable in the big data context. Although many techniques have been proposed for large data sets, most of them mainly focus on making individual components more efficient, instead of improving the whole consensus approach for the case of big data.Sociedad Argentina de Informática e Investigación Operativa (SADIO

    Online Spectral Clustering on Network Streams

    Get PDF
    Graph is an extremely useful representation of a wide variety of practical systems in data analysis. Recently, with the fast accumulation of stream data from various type of networks, significant research interests have arisen on spectral clustering for network streams (or evolving networks). Compared with the general spectral clustering problem, the data analysis of this new type of problems may have additional requirements, such as short processing time, scalability in distributed computing environments, and temporal variation tracking. However, to design a spectral clustering method to satisfy these requirements certainly presents non-trivial efforts. There are three major challenges for the new algorithm design. The first challenge is online clustering computation. Most of the existing spectral methods on evolving networks are off-line methods, using standard eigensystem solvers such as the Lanczos method. It needs to recompute solutions from scratch at each time point. The second challenge is the parallelization of algorithms. To parallelize such algorithms is non-trivial since standard eigen solvers are iterative algorithms and the number of iterations can not be predetermined. The third challenge is the very limited existing work. In addition, there exists multiple limitations in the existing method, such as computational inefficiency on large similarity changes, the lack of sound theoretical basis, and the lack of effective way to handle accumulated approximate errors and large data variations over time. In this thesis, we proposed a new online spectral graph clustering approach with a family of three novel spectrum approximation algorithms. Our algorithms incrementally update the eigenpairs in an online manner to improve the computational performance. Our approaches outperformed the existing method in computational efficiency and scalability while retaining competitive or even better clustering accuracy. We derived our spectrum approximation techniques GEPT and EEPT through formal theoretical analysis. The well established matrix perturbation theory forms a solid theoretic foundation for our online clustering method. We facilitated our clustering method with a new metric to track accumulated approximation errors and measure the short-term temporal variation. The metric not only provides a balance between computational efficiency and clustering accuracy, but also offers a useful tool to adapt the online algorithm to the condition of unexpected drastic noise. In addition, we discussed our preliminary work on approximate graph mining with evolutionary process, non-stationary Bayesian Network structure learning from non-stationary time series data, and Bayesian Network structure learning with text priors imposed by non-parametric hierarchical topic modeling

    Cooperative Clustering Model and Its Applications

    Get PDF
    Data clustering plays an important role in many disciplines, including data mining, machine learning, bioinformatics, pattern recognition, and other fields, where there is a need to learn the inherent grouping structure of data in an unsupervised manner. There are many clustering approaches proposed in the literature with different quality/complexity tradeoffs. Each clustering algorithm works on its domain space with no optimum solution to all datasets of different properties, sizes, structures, and distributions. Challenges in data clustering include, identifying proper number of clusters, scalability of the clustering approach, robustness to noise, tackling distributed datasets, and handling clusters of different configurations. This thesis addresses some of these challenges through cooperation between multiple clustering approaches. We introduce a Cooperative Clustering (CC) model that involves multiple clustering techniques; the goal of the cooperative model is to increase the homogeneity of objects within clusters through cooperation by developing two data structures, cooperative contingency graph and histogram representation of pair-wise similarities. The two data structures are designed to find the matching sub-clusters between different clusterings and to obtain the final set of cooperative clusters through a merging process. Obtaining the co-occurred objects from the different clusterings enables the cooperative model to group objects based on a multiple agreement between the invoked clustering techniques. In addition, merging this set of sub-clusters using histograms poses a new trend of grouping objects into more homogenous clusters. The cooperative model is consistent, reusable, and scalable in terms of the number of the adopted clustering approaches. In order to deal with noisy data, a novel Cooperative Clustering Outliers Detection (CCOD) algorithm is implemented through the implication of the cooperation methodology for better detection of outliers in data. The new detection approach is designed in four phases, (1) Global non-cooperative Clustering, (2) Cooperative Clustering, (3) Possible outlier’s Detection, and finally (4) Candidate Outliers Detection. The detection of outliers is established in a bottom-up scenario. The thesis also addresses cooperative clustering in distributed Peer-to-Peer (P2P) networks. Mining large and inherently distributed datasets poses many challenges, one of which is the extraction of a global model as a global summary of the clustering solutions generated from all nodes for the purpose of interpreting the clustering quality of the distributed dataset as if it was located at one node. We developed distributed cooperative model and architecture that work on a two-tier super-peer P2P network. The model is called Distributed Cooperative Clustering in Super-peer P2P Networks (DCCP2P). This model aims at producing one clustering solution across the whole network. It specifically addresses scalability of network size, and consequently the distributed clustering complexity, by modeling the distributed clustering problem as two layers of peer neighborhoods and super-peers. Summarization of the global distributed clusters is achieved through a distributed version of the cooperative clustering model. Three clustering algorithms, k-means (KM), Bisecting k-means (BKM) and Partitioning Around Medoids (PAM) are invoked in the cooperative model. Results on various gene expression and text documents datasets with different properties, configurations and different degree of outliers reveal that: (i) the cooperative clustering model achieves significant improvement in the quality of the clustering solutions compared to that of the non-cooperative individual approaches; (ii) the cooperative detection algorithm discovers the nonconforming objects in data with better accuracy than the contemporary approaches, and (iii) the distributed cooperative model attains the same quality or even better as the centralized approach and achieves decent speedup by increasing number of nodes. The distributed model offers high degree of flexibility, scalability, and interpretability of large distributed repositories. Achieving the same results using current methodologies requires polling the data first to one center location, which is sometimes not feasible

    New Algorithm for Clustering Distributed Data Using k-Means

    Get PDF
    The internet era and high speed networks have ushered in the capabilities to have ready access to large amounts of geographically distributed data. Individuals, businesses, and governments recognize the value of this available resource to those who can transform the data into information. These databases, though valuable as individual entities, become significantly more valuable when they function as parts of a federated database and their data can be aggregated for collective mining or computations. This requires new algorithms to shift their focus from working with single databases to efficiently working with federated databases. In this paper, we propose a new decomposable version of the popular k-means clustering algorithm that works in this desired manner with a set of networked databases. We show that it is possible to perform global computation in a reasonably secure manner for either horizontally or vertically distributed databases. The computation is completed by only exchanging a few local summaries among the databases. An empirical and analytical validation of our results is also presented

    New member candidates of Upper Scorpius from Gaia DR1

    Full text link
    Context. Selecting a cluster in proper motion space is an established method for identifying members of a star forming region. The first data release from Gaia (DR1) provides an extremely large and precise stellar catalogue, which when combined with the Tycho-2 catalogue gives the 2.5 million parallaxes and proper motions contained within the Tycho-Gaia Astrometric Solution (TGAS). Aims. We aim to identify new member candidates of the nearby Upper Scorpius subgroup of the Scorpius-Centaurus Complex within the TGAS catalogue. In doing so, we also aim to validate the use of the DBSCAN clustering algorithm on spatial and kinematic data as a robust member selection method. Methods. We constructed a method for member selection using a density-based clustering algorithm (DBSCAN) applied over proper motion and distance. We then applied this method to Upper Scorpius, and evaluated the results and performance of the method. Results. We identified 167 member candidates of Upper Scorpius, of which 78 are new, distributed within a 10∘^{\circ} radius from its core. These member candidates have a mean distance of 145.6 ±\pm 7.5 pc, and a mean proper motion of (-11.4, -23.5) ±\pm (0.7, 0.4) mas/yr. These values are consistent with measured distances and proper motions of previously identified bona-fide members of the Upper Scorpius association.Comment: 8 pages, 9 figures and 1 table. Accepted for publication in Astronomy and Astrophysic

    A novel clustering methodology based on modularity optimisation for detecting authorship affinities in Shakespearean era plays

    Full text link
    © 2016 Naeni et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. In this study we propose a novel, unsupervised clustering methodology for analyzing large datasets. This new, efficient methodology converts the general clustering problem into the community detection problem in graph by using the Jensen-Shannon distance, a dissimilarity measure originating in Information Theory. Moreover, we use graph theoretic concepts for the generation and analysis of proximity graphs. Our methodology is based on a newly proposed memetic algorithm (iMA-Net) for discovering clusters of data elements by maximizing the modularity function in proximity graphs of literary works. To test the effectiveness of this general methodology, we apply it to a text corpus dataset, which contains frequencies of approximately 55,114 unique words across all 168 written in the Shakespearean era (16th and 17th centuries), to analyze and detect clusters of similar plays. Experimental results and comparison with state-of-the-art clustering methods demonstrate the remarkable performance of our new method for identifying high quality clusters which reflect the commonalities in the literary style of the plays

    TPICDS: A Two-Phase Parallel Approach for Incremental Clustering of Data Streams

    Get PDF
    Parallel and distributed solutions are essential for clustering data streams due to the large volumes of data. This paper first examines a direct adaptation of a recently developed prototype-based algorithm into three existing parallel frameworks. Based on the evaluation of performance, the paper then presents a customised pipeline framework that combines incremental and twophase learning into a balanced approach that dynamically allocates the available processing resources. This new framework is evaluated on a collection of synthetic datasets. The experimental results reveal that the framework not only produces correct final clusters on the one hand, but also significantly improves the clustering efficienc
    • …
    corecore