
    Parallel Hierarchical Affinity Propagation with MapReduce

    The accelerated evolution and explosion of the Internet and social media are generating voluminous quantities of data (on zettabyte scales). Paramount among the requirements for manipulating and extracting actionable intelligence from such big data volumes is the need for scalable, performance-conscious analytics algorithms. To address this need directly, we propose a novel MapReduce implementation of the exemplar-based clustering algorithm known as Affinity Propagation. Our parallelization strategy extends to the multilevel Hierarchical Affinity Propagation algorithm and enables tiered aggregation of unstructured data with minimal free parameters, in principle requiring only a similarity measure between data points. We detail the linear run-time complexity of our approach, overcoming the limiting quadratic complexity of the original algorithm. Experimental validation of our clustering methodology on a variety of synthetic and real data sets (e.g. images and point data) demonstrates our competitiveness against other state-of-the-art MapReduce clustering techniques.
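    To illustrate the message-passing core that such a MapReduce scheme parallelizes, here is a minimal serial sketch of Affinity Propagation in plain Python. This is our own illustrative implementation, not the paper's code; the similarity matrix `S` carries the exemplar preferences on its diagonal, and the function and parameter names are assumptions.

```python
def affinity_propagation(S, max_iter=200, damping=0.5):
    """Serial sketch of Affinity Propagation message passing.

    S is an n x n similarity list-of-lists; S[i][i] holds the
    "preference" of point i to serve as an exemplar.
    """
    n = len(S)
    R = [[0.0] * n for _ in range(n)]  # responsibilities r(i, k)
    A = [[0.0] * n for _ in range(n)]  # availabilities  a(i, k)
    for _ in range(max_iter):
        # r(i,k) <- s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
        for i in range(n):
            vals = [A[i][k] + S[i][k] for k in range(n)]
            best = max(range(n), key=lambda k: vals[k])
            second = max(v for k, v in enumerate(vals) if k != best)
            for k in range(n):
                m = second if k == best else vals[best]
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - m)
        # a(i,k) <- min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        # a(k,k) <- sum_{i' != k} max(0, r(i',k))
        for k in range(n):
            pos = [max(0.0, R[i][k]) for i in range(n)]
            pos_sum = sum(pos) - pos[k]
            for i in range(n):
                new = pos_sum if i == k else min(0.0, R[k][k] + pos_sum - pos[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * new
    # each point picks the k maximizing a(i,k) + r(i,k) as its exemplar
    return [max(range(n), key=lambda k: A[i][k] + R[i][k]) for i in range(n)]
```

    With similarities s(i,k) = -(x_i - x_k)^2 and the median off-diagonal similarity as a shared preference, two well-separated groups of points each collapse onto a single exemplar; the per-row message updates are what a MapReduce mapper/reducer pair would distribute.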

    Developed Clustering Algorithms for Engineering Applications: A Review

    Clustering algorithms play a pivotal role in the field of engineering, offering valuable insights into complex datasets. This review paper explores the landscape of developed clustering algorithms with a focus on their applications in engineering. The introduction provides context for the significance of clustering algorithms, setting the stage for an in-depth exploration. The overview section delineates fundamental clustering concepts and elucidates the workings of these algorithms. Categorization of clustering algorithms into partitional, hierarchical, and density-based forms lays the groundwork for a comprehensive discussion. The core of the paper delves into an extensive review of clustering algorithms tailored for engineering applications. Each algorithm is scrutinized in dedicated subsections, unraveling their specific contributions, applications, and advantages. A comparative analysis assesses the performance of these algorithms, delineating their strengths and limitations. Trends and advancements in the realm of clustering algorithms for engineering applications are thoroughly examined. The review concludes with a reflection on the challenges faced by existing clustering algorithms and proposes avenues for future research. This paper aims to provide a valuable resource for researchers, engineers, and practitioners, guiding them in the selection and application of clustering algorithms for diverse engineering scenarios.
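    As a concrete instance of the partitional family named above, here is a minimal sketch of Lloyd's k-means in plain Python (illustrative only; the function signature and the chosen initialization are our own assumptions, not taken from the review):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal Lloyd's k-means over points given as equal-length tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from k distinct points

    def nearest(p):
        # index of the centroid with the smallest squared Euclidean distance
        return min(range(k),
                   key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))

    for _ in range(iters):
        # assignment step: each point joins its nearest centroid's cluster
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p)].append(p)
        # update step: move each centroid to its cluster's mean
        for j, cl in enumerate(clusters):
            if cl:
                centroids[j] = tuple(sum(c) / len(cl) for c in zip(*cl))
    return [nearest(p) for p in points], centroids
```

    Hierarchical methods would instead merge or split clusters by linkage distance, and density-based methods (e.g. DBSCAN) would grow clusters from dense neighborhoods; the alternation of assignment and update steps shown here is what distinguishes the partitional family.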

    Analyzing Clustered Latent Dirichlet Allocation

    Dynamic Topic Models (DTM) are a way to extract time-variant information from a collection of documents. The only available implementation of this is slow, taking days to process a corpus of 533,588 documents. In order to see how topics - both their key words and their proportional size in all documents - change over time, we analyze Clustered Latent Dirichlet Allocation (CLDA) as an alternative to DTM. This algorithm is based on existing parallel components, using Latent Dirichlet Allocation (LDA) to extract topics at local times, and k-means clustering to combine topics from different time periods. This method is two orders of magnitude faster than DTM, and allows for more freedom of experiment design. Results show that most topics generated by this algorithm are similar to those generated by DTM at both the local and global level using the Jaccard index and Sørensen-Dice coefficient, and that this method's perplexity compares favorably to DTM. We also explore tradeoffs in CLDA method parameters.
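    The two set-overlap measures used to compare CLDA topics with DTM topics can be computed directly from each topic's keyword set; a minimal sketch (function names and the example keyword sets below are our own, not drawn from the paper's data):

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| between two keyword sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0

def sorensen_dice(a, b):
    """Sørensen-Dice coefficient 2|A ∩ B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 1.0
```

    For instance, comparing a hypothetical DTM topic {"gene", "dna", "protein", "cell"} against a CLDA topic {"gene", "dna", "rna", "cell"} gives a Jaccard index of 3/5 = 0.6 and a Dice coefficient of 6/8 = 0.75; Dice always scores overlap at least as high as Jaccard.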

    Regression-clustering for Improved Accuracy and Training Cost with Molecular-Orbital-Based Machine Learning

    Machine learning (ML) in the representation of molecular-orbital-based (MOB) features has been shown to be an accurate and transferable approach to the prediction of post-Hartree-Fock correlation energies. Previous applications of MOB-ML employed Gaussian Process Regression (GPR), which provides good prediction accuracy with small training sets; however, the cost of GPR training scales cubically with the amount of data and becomes a computational bottleneck for large training sets. In the current work, we address this problem by introducing a clustering/regression/classification implementation of MOB-ML. In a first step, regression clustering (RC) is used to partition the training data to best fit an ensemble of linear regression (LR) models; in a second step, each cluster is regressed independently, using either LR or GPR; and in a third step, a random forest classifier (RFC) is trained for the prediction of cluster assignments based on MOB feature values. Upon inspection, RC is found to recapitulate chemically intuitive groupings of the frontier molecular orbitals, and the combined RC/LR/RFC and RC/GPR/RFC implementations of MOB-ML are found to provide good prediction accuracy with greatly reduced wall-clock training times. For a dataset of thermalized geometries of 7211 organic molecules of up to seven heavy atoms, both implementations reach chemical accuracy (1 kcal/mol error) with only 300 training molecules, while providing 35000-fold and 4500-fold reductions in the wall-clock training time, respectively, compared to MOB-ML without clustering. The resulting models are also demonstrated to retain transferability for the prediction of large-molecule energies with only small-molecule training data. Finally, it is shown that capping the number of training datapoints per cluster leads to further improvements in prediction accuracy with negligible increases in wall-clock training time.
    Comment: 31 pages, 10 figures, with an S
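    The regression-clustering step can be illustrated with a simplified one-dimensional sketch: alternate between assigning each point to the line that fits it best and refitting each line on its assigned points. This is our own toy version under stated assumptions (plain least-squares lines in place of MOB features and GPR, and chunk-based seeding as an assumed initialization), not the paper's implementation.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b in one dimension."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx if sxx else 0.0
    return a, my - a * mx

def regression_clustering(xs, ys, k=2, iters=10):
    """Partition (x, y) pairs so that k lines jointly fit them well."""
    n = len(xs)
    # seed each model on a small chunk spread through the data
    models = []
    for j in range(k):
        s = j * n // k
        models.append(fit_line(xs[s:s + 2], ys[s:s + 2]))
    labels = [0] * n
    for _ in range(iters):
        # assignment: each point goes to the line with smallest squared residual
        labels = [min(range(k),
                      key=lambda j: (y - (models[j][0] * x + models[j][1])) ** 2)
                  for x, y in zip(xs, ys)]
        # refit each cluster's line on its assigned points
        for j in range(k):
            pts = [(x, y) for x, y, l in zip(xs, ys, labels) if l == j]
            if len(pts) >= 2:
                models[j] = fit_line([p[0] for p in pts], [p[1] for p in pts])
    return labels, models
```

    On data drawn from two crossing lines, the alternation recovers both the partition and the line parameters; in the paper's pipeline this partition would then feed the per-cluster regressors and the random forest classifier.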