12,137 research outputs found

    Scalable Distributed Approximation of Internal Measures for Clustering Evaluation

    Full text link
    The most widely used internal measure for clustering evaluation is the silhouette coefficient, whose naive computation requires a quadratic number of distance calculations, which is clearly unfeasible for massive datasets. Surprisingly, there are no known general methods to efficiently approximate the silhouette coefficient of a clustering with rigorously provable high accuracy. In this paper, we present the first scalable algorithm to compute such a rigorous approximation for the evaluation of clusterings based on any metric distances. Our algorithm hinges on a Probability Proportional to Size (PPS) sampling scheme, and, for any fixed ε,δ∈(0,1)\varepsilon, \delta \in (0,1), it approximates the silhouette coefficient within a mere additive error O(ε)O(\varepsilon) with probability 1−δ1-\delta, using a very small number of distance calculations. We also prove that the algorithm can be adapted to obtain rigorous approximations of other internal measures of clustering quality, such as cohesion and separation. Importantly, we provide a distributed implementation of the algorithm using the MapReduce model, which runs in constant rounds and requires only sublinear local space at each worker, which makes our estimation approach applicable to big data scenarios. We perform an extensive experimental evaluation of our silhouette approximation algorithm, comparing its performance to a number of baseline heuristics on real and synthetic datasets. The experiments provide evidence that, unlike other heuristics, our estimation strategy not only provides tight theoretical guarantees but is also able to return highly accurate estimations while running in a fraction of the time required by the exact computation, and that its distributed implementation is highly scalable, thus enabling the computation of internal measures for very large datasets for which the exact computation is prohibitive.Comment: 16 pages, 4 tables, 1 figur

    Hierarchically Clustered Representation Learning

    Full text link
    The joint optimization of representation learning and clustering in the embedding space has experienced a breakthrough in recent years. In spite of the advance, clustering with representation learning has been limited to flat-level categories, which often involves cohesive clustering with a focus on instance relations. To overcome the limitations of flat clustering, we introduce hierarchically-clustered representation learning (HCRL), which simultaneously optimizes representation learning and hierarchical clustering in the embedding space. Compared with a few prior works, HCRL firstly attempts to consider a generation of deep embeddings from every component of the hierarchy, not just leaf components. In addition to obtaining hierarchically clustered embeddings, we can reconstruct data by the various abstraction levels, infer the intrinsic hierarchical structure, and learn the level-proportion features. We conducted evaluations with image and text domains, and our quantitative analyses showed competent likelihoods and the best accuracies compared with the baselines.Comment: 10 pages, 7 figures, Under review as a conference pape

    Scalable aggregation predictive analytics: a query-driven machine learning approach

    Get PDF
    We introduce a predictive modeling solution that provides high quality predictive analytics over aggregation queries in Big Data environments. Our predictive methodology is generally applicable in environments in which large-scale data owners may or may not restrict access to their data and allow only aggregation operators like COUNT to be executed over their data. In this context, our methodology is based on historical queries and their answers to accurately predict ad-hoc queries’ answers. We focus on the widely used set-cardinality, i.e., COUNT, aggregation query, as COUNT is a fundamental operator for both internal data system optimizations and for aggregation-oriented data exploration and predictive analytics. We contribute a novel, query-driven Machine Learning (ML) model whose goals are to: (i) learn the query-answer space from past issued queries, (ii) associate the query space with local linear regression & associative function estimators, (iii) define query similarity, and (iv) predict the cardinality of the answer set of unseen incoming queries, referred to the Set Cardinality Prediction (SCP) problem. Our ML model incorporates incremental ML algorithms for ensuring high quality prediction results. The significance of contribution lies in that it (i) is the only query-driven solution applicable over general Big Data environments, which include restricted-access data, (ii) offers incremental learning adjusted for arriving ad-hoc queries, which is well suited for query-driven data exploration, and (iii) offers a performance (in terms of scalability, SCP accuracy, processing time, and memory requirements) that is superior to data-centric approaches. We provide a comprehensive performance evaluation of our model evaluating its sensitivity, scalability and efficiency for quality predictive analytics. In addition, we report on the development and incorporation of our ML model in Spark showing its superior performance compared to the Spark’s COUNT method

    Distributed Graph Clustering using Modularity and Map Equation

    Full text link
    We study large-scale, distributed graph clustering. Given an undirected graph, our objective is to partition the nodes into disjoint sets called clusters. A cluster should contain many internal edges while being sparsely connected to other clusters. In the context of a social network, a cluster could be a group of friends. Modularity and map equation are established formalizations of this internally-dense-externally-sparse principle. We present two versions of a simple distributed algorithm to optimize both measures. They are based on Thrill, a distributed big data processing framework that implements an extended MapReduce model. The algorithms for the two measures, DSLM-Mod and DSLM-Map, differ only slightly. Adapting them for similar quality measures is straight-forward. We conduct an extensive experimental study on real-world graphs and on synthetic benchmark graphs with up to 68 billion edges. Our algorithms are fast while detecting clusterings similar to those detected by other sequential, parallel and distributed clustering algorithms. Compared to the distributed GossipMap algorithm, DSLM-Map needs less memory, is up to an order of magnitude faster and achieves better quality.Comment: 14 pages, 3 figures; v3: Camera ready for Euro-Par 2018, more details, more results; v2: extended experiments to include comparison with competing algorithms, shortened for submission to Euro-Par 201

    Mal-Netminer: Malware Classification Approach based on Social Network Analysis of System Call Graph

    Get PDF
    As the security landscape evolves over time, where thousands of species of malicious codes are seen every day, antivirus vendors strive to detect and classify malware families for efficient and effective responses against malware campaigns. To enrich this effort, and by capitalizing on ideas from the social network analysis domain, we build a tool that can help classify malware families using features driven from the graph structure of their system calls. To achieve that, we first construct a system call graph that consists of system calls found in the execution of the individual malware families. To explore distinguishing features of various malware species, we study social network properties as applied to the call graph, including the degree distribution, degree centrality, average distance, clustering coefficient, network density, and component ratio. We utilize features driven from those properties to build a classifier for malware families. Our experimental results show that influence-based graph metrics such as the degree centrality are effective for classifying malware, whereas the general structural metrics of malware are less effective for classifying malware. Our experiments demonstrate that the proposed system performs well in detecting and classifying malware families within each malware class with accuracy greater than 96%.Comment: Mathematical Problems in Engineering, Vol 201
    • …
    corecore