15 research outputs found

    PROPAGATE: a seed propagation framework to compute Distance-based metrics on Very Large Graphs

    We propose PROPAGATE, a fast approximation framework to estimate distance-based metrics on very large graphs, such as the (effective) diameter, the (effective) radius, or the average distance, within a small error. The framework assigns seeds to nodes and propagates them in a BFS-like fashion, computing the neighborhood sets until we obtain either the whole vertex set (for the diameter) or a given percentage of it (for the effective diameter). At each iteration, we derive compressed Boolean representations of the neighborhood sets discovered so far. The PROPAGATE framework yields two algorithms: PROPAGATE-P, which propagates all the $s$ seeds in parallel, and PROPAGATE-S, which propagates the seeds sequentially. For each node, the compressed representation of PROPAGATE-P requires $s$ bits, while that of PROPAGATE-S requires only 1 bit. Both algorithms compute the average distance, the effective diameter, the diameter, and the connectivity rate within a small error with high probability: for any $\varepsilon > 0$ and using $s = \Theta\left(\frac{\log n}{\varepsilon^2}\right)$ sample nodes, the error for the average distance is bounded by $\xi = \frac{\varepsilon \Delta}{\alpha}$, the errors for the effective diameter and the diameter are bounded by $\xi = \frac{\varepsilon}{\alpha}$, and the error for the connectivity rate is bounded by $\varepsilon$, where $\Delta$ is the diameter and $\alpha$ is a measure of the connectivity of the graph. The time complexity is $\mathcal{O}\left(m \Delta \frac{\log n}{\varepsilon^2}\right)$, where $m$ is the number of edges of the graph. The experimental results show that the PROPAGATE framework improves the current state of the art in both accuracy and speed. Moreover, we experimentally show that PROPAGATE-S is also very efficient for solving the All-Pairs Shortest Path problem in very large graphs.
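
    The per-node compressed representation is easiest to picture as a bitmask that accumulates, round by round, which seeds have reached the node. Below is a minimal Python sketch of a PROPAGATE-P-style iteration on an undirected graph given as an adjacency list; the function name, input format, and stopping rule are illustrative assumptions, not the authors' reference implementation.

```python
# Minimal sketch of a PROPAGATE-P-style iteration, assuming an undirected graph
# given as an adjacency list {node: [neighbors]}. Function name, input format and
# stopping rule are illustrative assumptions, not the authors' implementation.
import random

def propagate_p(adj, s, rng=random):
    nodes = list(adj)
    seeds = rng.sample(nodes, s)
    # One s-bit mask per node: bit i is set once seed i has reached the node.
    mask = {v: 0 for v in nodes}
    for i, seed in enumerate(seeds):
        mask[seed] |= 1 << i
    # reached[h] = number of (seed, node) pairs at distance at most h.
    reached = [sum(bin(m).count("1") for m in mask.values())]
    h = 0
    while True:
        # BFS-like round: every node absorbs the masks of its neighbors.
        new_mask = {v: mask[v] for v in nodes}
        for v in nodes:
            for u in adj[v]:
                new_mask[v] |= mask[u]
        if new_mask == mask:          # no new (seed, node) pair discovered
            break
        mask, h = new_mask, h + 1
        reached.append(sum(bin(m).count("1") for m in mask.values()))
    # h is the largest seed-to-node distance observed (a lower bound on the
    # diameter); the increments of `reached` form the sampled neighborhood
    # function, from which average distance and effective diameter are estimated.
    return h, reached
```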

    MCRapper: Monte-Carlo Rademacher Averages for Poset Families and Approximate Pattern Mining

    We present MCRapper, an algorithm for the efficient computation of Monte-Carlo Empirical Rademacher Averages (MCERA) for families of functions exhibiting poset (e.g., lattice) structure, such as those that arise in many pattern mining tasks. The MCERA allows us to compute upper bounds to the maximum deviation of sample means from their expectations, so it can be used both to find statistically significant functions (i.e., patterns) when the available data is seen as a sample from an unknown distribution, and to approximate collections of high-expectation functions (e.g., frequent patterns) when the available data is a small sample from a large dataset. This feature is a strong improvement over previously proposed solutions, which could only achieve one of the two. MCRapper uses upper bounds to the discrepancy of the functions to efficiently explore and prune the search space, a technique borrowed from pattern mining itself. To show the practical use of MCRapper, we employ it to develop TFP-R, an algorithm for the task of True Frequent Pattern (TFP) mining. TFP-R gives guarantees on the probability of including any false positives (precision) and exhibits higher statistical power (recall) than existing methods offering the same guarantees. We evaluate MCRapper and TFP-R and show that they outperform the state of the art for their respective tasks.
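
    For orientation, the quantity at the core of the method, the $n$-trial Monte-Carlo Empirical Rademacher Average on a sample $x_1,\dots,x_m$, can be written as $\hat{R} = \frac{1}{n}\sum_{j=1}^{n} \sup_{f \in \mathcal{F}} \frac{1}{m}\sum_{i=1}^{m} \sigma_{j,i} f(x_i)$ for i.i.d. Rademacher variables $\sigma_{j,i} \in \{-1, +1\}$. The sketch below computes this quantity by brute-force enumeration over an explicit family of functions; MCRapper's actual contribution is avoiding that enumeration by pruning the poset with discrepancy upper bounds, which is not shown here.

```python
# Minimal brute-force sketch of the n-trial Monte-Carlo Empirical Rademacher
# Average (MCERA) over an explicitly enumerated family of functions.
import random

def mcera(functions, sample, n_trials, rng=random):
    m = len(sample)
    total = 0.0
    for _ in range(n_trials):
        sigma = [rng.choice((-1, 1)) for _ in range(m)]   # Rademacher variables
        # Supremum over the family of the sigma-signed empirical mean.
        total += max(sum(s * f(x) for s, x in zip(sigma, sample)) / m
                     for f in functions)
    return total / n_trials
```

    In a pattern-mining instantiation, each $f$ would be the indicator that a given pattern occurs in a transaction, so the supremum ranges over the pattern lattice rather than an explicit list.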

    KADABRA is an ADaptive Algorithm for Betweenness via Random Approximation

    We present KADABRA, a new algorithm to approximate betweenness centrality in directed and undirected graphs, which significantly outperforms all previous approaches on real-world complex networks. The efficiency of the new algorithm relies on two new theoretical contributions, of independent interest. The first contribution focuses on sampling shortest paths, a subroutine used by most algorithms that approximate betweenness centrality. We show that, on realistic random graph models, we can perform this task in time $|E|^{\frac{1}{2}+o(1)}$ with high probability, obtaining a significant speedup with respect to the $\Theta(|E|)$ worst-case performance. We experimentally show that this new technique achieves similar speedups on real-world complex networks as well. The second contribution is a new rigorous application of the adaptive sampling technique. This approach decreases the total number of shortest paths that need to be sampled to compute all betweenness centralities with a given absolute error, and it also handles more general problems, such as computing the $k$ most central nodes. Furthermore, our analysis is general, and it might be extended to other settings.
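
    The estimator that this family of algorithms builds on can be sketched in a few lines: repeatedly sample a node pair, sample one shortest path between them uniformly at random, and count how often each vertex appears as an internal node of the sampled paths. The Python sketch below assumes an undirected graph given as an adjacency list and uses a plain forward BFS for the path sampling; KADABRA's speedups come from balanced bidirectional BFS and adaptive stopping, which are not reproduced here.

```python
# Sketch of the pair-sampling betweenness estimator, assuming an undirected
# graph given as an adjacency list {node: [neighbors]}. This illustrates the
# sampling subroutine only, not KADABRA's bidirectional BFS or adaptive stopping.
import random
from collections import defaultdict, deque

def sample_shortest_path(adj, s, t, rng=random):
    """Return a uniformly sampled shortest s-t path, or None if t is unreachable."""
    dist, sigma = {s: 0}, {s: 1}          # BFS distance and shortest-path counts
    queue = deque([s])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                sigma[u] = 0
                queue.append(u)
            if dist[u] == dist[v] + 1:
                sigma[u] += sigma[v]
    if t not in dist:
        return None
    # Walk backwards from t, choosing predecessors proportionally to their counts.
    path, v = [t], t
    while v != s:
        preds = [u for u in adj[v] if dist.get(u) == dist[v] - 1]
        v = rng.choices(preds, weights=[sigma[u] for u in preds])[0]
        path.append(v)
    return path[::-1]

def approx_betweenness(adj, k, rng=random):
    nodes = list(adj)
    count = defaultdict(float)
    for _ in range(k):
        s, t = rng.sample(nodes, 2)
        p = sample_shortest_path(adj, s, t, rng)
        if p:
            for v in p[1:-1]:             # internal vertices only
                count[v] += 1.0 / k
    return count                          # count[v] estimates normalized betweenness
```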

    Amostragem para grandes volumes de dados: uma aplicação em redes complexas (Sampling for large volumes of data: an application to complex networks)

    The main objective of this work is to implement and evaluate sampling plans for algorithms that compute betweenness centrality, a measure used to identify important and influential vertices in complex networks, with the aim of improving the quality of the estimates. For the statistical evaluation of the variability of the estimates, we propose indicators that are standard in survey sampling but have not yet been used in data mining on complex networks. The techniques combined to reach these objectives and to propose a new algorithm are sampling, clustering (community detection), and parallel computing. Sampling has been widely used as a dimensionality-reduction tool in data mining problems, to streamline processes and to reduce data-storage costs. Community structure is highly correlated with the measure to be estimated, betweenness centrality. One of the factors in choosing the methods implemented in the algorithms was the possibility of using parallel or distributed computing. After reviewing the literature and evaluating the experimental results, we conclude that the proposed algorithm contributes to the state of the art in using sampling to estimate betweenness centrality in large complex networks, a challenge in the current big-data scenario, by combining several techniques that optimize knowledge extraction from data. In addition to improving the quality of the estimates, the proposed algorithm reduced processing time while preserving scalability.
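
    As one concrete reading of how sampling and community detection can be combined, the sketch below allocates source pivots proportionally to community size and accumulates Brandes-style dependencies from the sampled sources only. The community labels are assumed to come from any detection method, and the allocation rule, function names, and absence of parallelism are simplifications for illustration; this is not the thesis's algorithm.

```python
# Sketch of pivot-based betweenness estimation with community-stratified pivot
# selection on an unweighted, undirected graph given as an adjacency list.
import random
from collections import defaultdict, deque

def stratified_pivots(communities, k, rng=random):
    """communities: dict community_id -> list of nodes. Allocate ~k pivots
    proportionally to community size (at least one per community)."""
    n = sum(len(members) for members in communities.values())
    pivots = []
    for members in communities.values():
        share = max(1, round(k * len(members) / n))
        pivots += rng.sample(members, min(share, len(members)))
    return pivots

def betweenness_from_pivots(adj, pivots):
    n, k = len(adj), len(pivots)
    bc = defaultdict(float)
    for s in pivots:
        # Brandes' single-source stage: shortest-path counts, predecessors, BFS order.
        dist, sigma, preds = {s: 0}, {s: 1.0}, defaultdict(list)
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    sigma[w] = 0.0
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Dependency accumulation in reverse BFS order.
        delta = defaultdict(float)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # (n / k) * summed dependencies is the usual pivot-based estimate of
    # unnormalized betweenness; with stratified pivots it is only approximate.
    return {v: d * n / k for v, d in bc.items()}
```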