    Enumerating Top-k Quasi-Cliques

    Quasi-cliques are dense, incomplete subgraphs of a graph that generalize the notion of cliques. Enumerating quasi-cliques from a graph is a robust way to detect densely connected structures, with applications to bioinformatics and social network analysis. However, enumerating quasi-cliques in a graph is a challenging problem, even harder than the problem of enumerating cliques. We consider the enumeration of top-k degree-based quasi-cliques and make the following contributions: (1) We show that even the problem of detecting whether a given quasi-clique is maximal (i.e., not contained within another quasi-clique) is NP-hard. (2) We present a novel heuristic algorithm, KernelQC, to enumerate the k largest quasi-cliques in a graph. Our method is based on identifying kernels of extremely dense subgraphs within a graph, followed by growing subgraphs around these kernels to arrive at quasi-cliques with the required densities. (3) Experimental results show that our algorithm accurately enumerates quasi-cliques from a graph, is much faster than current state-of-the-art methods for quasi-clique enumeration (often more than three orders of magnitude faster), and can scale to larger graphs than current methods. Comment: 10 pages.
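
    To make the degree-based notion concrete, the sketch below checks whether a vertex set is a gamma-quasi-clique (every member keeps at least gamma*(|S|-1) neighbours inside the set) and greedily grows a set around a small dense kernel. It is a minimal sketch assuming networkx, with a deliberately simplified kernel choice and growth rule; it is not the authors' KernelQC implementation.

```python
# Minimal sketch, not the authors' KernelQC code: a degree-based quasi-clique
# check plus a simplified "grow around a kernel" step.
import networkx as nx

def is_gamma_quasi_clique(G, S, gamma):
    """Every vertex of S must have at least gamma * (|S| - 1) neighbours inside S."""
    S = set(S)
    if len(S) <= 1:
        return True
    need = gamma * (len(S) - 1)
    return all(sum(1 for v in G[u] if v in S) >= need for u in S)

def grow_from_kernel(G, kernel, gamma):
    """Greedily add the outside vertex with the most neighbours in the current set,
    stopping once the best candidate would violate the gamma threshold."""
    S = set(kernel)
    while True:
        frontier = {v for u in S for v in G[u]} - S
        best = max(frontier, key=lambda v: sum(1 for w in G[v] if w in S), default=None)
        if best is None or not is_gamma_quasi_clique(G, S | {best}, gamma):
            return S
        S.add(best)

# Example: grow a 0.6-quasi-clique around a triangle kernel in Zachary's karate club.
G = nx.karate_club_graph()
print(sorted(grow_from_kernel(G, kernel={0, 1, 2}, gamma=0.6)))
```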

    Network higher-order structure dismantling

    Diverse higher-order structures, foundational for supporting a network's "meta-functions", play a vital role in its structure, its functionality, and the emergence of complex dynamics. Nevertheless, the problem of dismantling them has been consistently overlooked. In this paper, we introduce the concept of dismantling higher-order structures, with the objective of not only disrupting network connectivity but also eradicating all higher-order structures in each branch, thereby ensuring thorough functional paralysis. Given the diversity and unknown specifics of higher-order structures, identifying and targeting them individually is not practical, or even feasible. Fortunately, their high internal connectivity closely associates them with k-cores, so we transform the measurement of higher-order structures into measurements on k-cores of the corresponding orders. Furthermore, we propose the Belief Propagation-guided High-order Dismantling (BPDH) algorithm, which minimizes dismantling costs while achieving maximal disruption of connectivity and higher-order structures, ultimately converting the network into a forest. BPDH reveals the explosive vulnerability of network higher-order structures, counterintuitively showing that dismantling costs decrease as structural complexity increases. Our findings offer a novel approach for dismantling malignant networks and emphasize the substantial challenges inherent in safeguarding against such malicious attacks. Comment: 14 pages, 5 figures, 2 tables.
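
    As a rough illustration of the k-core view described above (and only that; the belief-propagation guidance of BPDH is not reproduced), the sketch below repeatedly removes a high-degree node from the 2-core until the 2-core is empty, at which point the remaining graph is a forest. It assumes networkx, and the degree-based node choice is a simple stand-in heuristic.

```python
# Sketch of dismantling to a forest via the 2-core; the node-selection rule is a
# plain degree heuristic, not the paper's belief-propagation guidance.
import networkx as nx

def dismantle_to_forest(G):
    """Return the nodes removed so that the remaining graph contains no cycles."""
    G = G.copy()
    removed = []
    core = nx.k_core(G, k=2)          # every node lying on a cycle belongs to the 2-core
    while core.number_of_nodes() > 0:
        target = max(core.degree, key=lambda nd: nd[1])[0]   # highest-degree node in the core
        G.remove_node(target)
        removed.append(target)
        core = nx.k_core(G, k=2)
    assert nx.is_forest(G)
    return removed

# Example on a sparse random graph.
G = nx.erdos_renyi_graph(200, 0.03, seed=1)
print(len(dismantle_to_forest(G)), "nodes removed")
```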

    Navigating Diverse Datasets in the Face of Uncertainty

    When exploring big volumes of data, one of the challenging aspects is their diversity of origin. Multiple files that have not yet been ingested into a database system may contain information of interest to a researcher, who must curate, understand and sieve their content before being able to extract knowledge. Performance is one of the greatest difficulties in exploring these datasets. On the one hand, examining non-indexed, unprocessed files can be inefficient. On the other hand, any processing done before the data are understood introduces latency and potentially unnecessary work if the chosen schema poorly matches the data. We have surveyed the state of the art and, fortunately, there exist multiple proposed solutions for handling data in situ with good performance. Another major difficulty is matching files from multiple origins, since their schema and layout may not be compatible or properly documented. Most surveyed solutions overlook this problem, especially for numeric, uncertain data, as is typical in fields like astronomy. The main objective of our research is to assist data scientists during the exploration of unprocessed, numerical, raw data distributed across multiple files, based solely on its intrinsic distribution. In this thesis, we first introduce the concept of Equally-Distributed Dependencies (EDDs), which provides the foundations to match this kind of dataset. We then propose PresQ, a novel algorithm that finds quasi-cliques on hypergraphs based on their expected statistical properties. The probabilistic approach of PresQ can be successfully exploited to mine EDDs between diverse datasets when the underlying populations can be assumed to be the same. Finally, we propose a two-sample statistical test based on Self-Organizing Maps (SOM). This method can outperform, in terms of power, other classifier-based two-sample tests, being in some cases comparable to kernel-based methods, with the advantage of being interpretable. Both PresQ and the SOM-based statistical test can provide insights that drive serendipitous discoveries.
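
    The SOM-based two-sample test can be pictured along the following lines: train a self-organizing map on the pooled samples, histogram each sample over the map's cells, and compare the two histograms with a chi-squared homogeneity test. The sketch below assumes the third-party minisom package plus numpy/scipy, and its grid size and training budget are illustrative; it is not the thesis' exact procedure.

```python
# Hedged sketch of a SOM-based two-sample test: bin both samples by their
# winning SOM cell and compare the resulting counts with a chi-squared test.
import numpy as np
from minisom import MiniSom               # assumed dependency: pip install minisom
from scipy.stats import chi2_contingency

def som_two_sample_test(x, y, som_shape=(6, 6), iters=2000, seed=0):
    pooled = np.vstack([x, y])
    som = MiniSom(som_shape[0], som_shape[1], pooled.shape[1],
                  sigma=1.0, learning_rate=0.5, random_seed=seed)
    som.train_random(pooled, iters)

    def cell_counts(data):
        counts = np.zeros(som_shape[0] * som_shape[1])
        for row in data:
            i, j = som.winner(row)        # coordinates of the best-matching unit
            counts[i * som_shape[1] + j] += 1
        return counts

    cx, cy = cell_counts(x), cell_counts(y)
    keep = (cx + cy) > 0                  # drop empty cells so expected counts stay positive
    _, p_value, _, _ = chi2_contingency(np.vstack([cx[keep], cy[keep]]))
    return p_value

rng = np.random.default_rng(0)
print(som_two_sample_test(rng.normal(0.0, 1.0, (500, 3)), rng.normal(0.3, 1.0, (500, 3))))
```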

    Enumerating Top-k Quasi-Cliques

    This is a pre-print of the article Sanei-Mehri, Seyed-Vahid, Apurba Das, and Srikanta Tirthapura. "Enumerating Top-k Quasi-Cliques." arXiv preprint arXiv:1808.09531 (2018). Posted with permission.