Enumerating Top-k Quasi-Cliques
Quasi-cliques are dense, incomplete subgraphs of a graph that generalize the
notion of cliques. Enumerating quasi-cliques from a graph is a robust way to
detect densely connected structures, with applications in bioinformatics and
social network analysis. However, enumerating quasi-cliques is a challenging
problem, even harder than enumerating cliques. We consider the enumeration of
top-k degree-based quasi-cliques and make the following contributions:
(1) We show that even the problem of detecting whether a given quasi-clique is
maximal (i.e., not contained within another quasi-clique) is NP-hard.
(2) We present a novel heuristic algorithm, KernelQC, to enumerate the k
largest quasi-cliques in a graph. Our method is based on identifying kernels
of extremely dense subgraphs within a graph, followed by growing subgraphs
around these kernels to arrive at quasi-cliques with the required densities.
(3) Experimental results show that our algorithm accurately enumerates
quasi-cliques from a graph, is much faster than current state-of-the-art
methods for quasi-clique enumeration (often by more than three orders of
magnitude), and scales to larger graphs than current methods.
Comment: 10 pages
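The kernel-then-grow strategy described above can be sketched in a few lines. This is a minimal illustration, assuming the common degree-based definition (every vertex of a γ-quasi-clique S has at least γ(|S|−1) neighbours inside S); the function names, the greedy growth rule, and the toy graph are hypothetical and do not reproduce the authors' KernelQC implementation.

```python
# Hedged sketch of a kernel-then-grow heuristic for degree-based
# quasi-cliques; NOT the authors' KernelQC algorithm.

def is_quasi_clique(adj, S, gamma):
    """Check the degree-based condition: every vertex in S has at
    least gamma * (|S| - 1) neighbours inside S."""
    S = set(S)
    if len(S) <= 1:
        return True
    need = gamma * (len(S) - 1)
    return all(len(adj[v] & S) >= need for v in S)

def grow_from_kernel(adj, kernel, gamma):
    """Starting from a dense kernel, greedily absorb neighbouring
    vertices as long as the density condition keeps holding."""
    S = set(kernel)
    improved = True
    while improved:
        improved = False
        frontier = set().union(*(adj[v] for v in S)) - S
        # Try candidates with the most connections into S first.
        for u in sorted(frontier, key=lambda u: -len(adj[u] & S)):
            if is_quasi_clique(adj, S | {u}, gamma):
                S.add(u)
                improved = True
    return S

# Toy graph: a triangle kernel {0, 1, 2} plus vertex 3 attached to 0 and 1.
adj = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1}, 3: {0, 1}}
S = grow_from_kernel(adj, {0, 1, 2}, gamma=0.6)  # absorbs vertex 3
```

The kernel here is hand-picked; in a full pipeline it would come from a dense-subgraph detection step, which is the part the paper's contribution actually concerns.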
Network higher-order structure dismantling
Diverse higher-order structures, foundational for supporting a network's
"meta-functions", play a vital role in its structure, functionality, and the
emergence of complex dynamics. Nevertheless, the problem of dismantling them
has been consistently overlooked. In this paper, we introduce the concept of
dismantling higher-order structures, with the objective of not only disrupting
network connectivity but also eradicating all higher-order structures in each
branch, thereby ensuring thorough functional paralysis. Given the diversity and
unknown specifics of higher-order structures, identifying and targeting them
individually is neither practical nor feasible. Fortunately, their high
internal connectivity closely associates them with k-cores. We therefore
transform the measurement of higher-order structures into measurements on
k-cores of the corresponding orders. Furthermore, we propose the Belief
Propagation-guided High-order Dismantling (BPDH) algorithm, which minimizes
dismantling costs while maximally disrupting connectivity and higher-order
structures, ultimately converting the network into a forest. BPDH reveals the
explosive vulnerability of network higher-order structures: counterintuitively,
dismantling costs decrease as structural complexity increases. Our findings
offer a novel approach to dismantling malignant networks and highlight the
substantial challenges in safeguarding against such malicious attacks.
Comment: 14 pages, 5 figures, 2 tables
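Since the abstract reduces higher-order structure measurement to k-cores, the k-core itself is the key primitive: the maximal subgraph in which every vertex has at least k neighbours, computed by repeatedly peeling low-degree vertices. The sketch below is a plain-Python illustration of that standard peeling step only; the function name and toy graph are made up, and the belief-propagation guidance that distinguishes BPDH is not shown.

```python
# Standard k-core peeling sketch (illustrative only; BPDH's
# belief-propagation cost minimization is not reproduced here).

def k_core(adj, k):
    """Return the vertex set of the k-core of an undirected graph,
    given as a dict mapping each vertex to its set of neighbours."""
    alive = set(adj)
    degree = {v: len(adj[v]) for v in alive}
    queue = [v for v in alive if degree[v] < k]
    while queue:
        v = queue.pop()
        if v not in alive:
            continue  # already peeled via another path
        alive.remove(v)
        for u in adj[v] & alive:
            degree[u] -= 1
            if degree[u] < k:
                queue.append(u)
    return alive

# A triangle {0, 1, 2} with a pendant vertex 3: the 2-core strips vertex 3,
# and no 3-core exists at all.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
core2 = k_core(adj, 2)
```

Peeling is cascading: removing one vertex can push its neighbours below the threshold, which is exactly the mechanism behind the "explosive" behaviour the paper exploits.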
Navigating Diverse Datasets in the Face of Uncertainty
When exploring big volumes of data, one of the challenging aspects is their diversity
of origin. Multiple files that have not yet been ingested into a database system may
contain information of interest to a researcher, who must curate, understand, and sieve
their content before being able to extract knowledge.
Performance is one of the greatest difficulties in exploring these datasets. On the
one hand, examining non-indexed, unprocessed files can be inefficient. On the other
hand, any processing done before the data is understood introduces latency and
potentially unnecessary work if the chosen schema matches the data poorly. We have
surveyed the state of the art and, fortunately, there exist multiple proposed solutions
for handling data in situ efficiently.
Another major difficulty is matching files from multiple origins, since their schemas
and layouts may not be compatible or properly documented. Most surveyed solutions
overlook this problem, especially for numeric, uncertain data, as is typical in fields
like astronomy.
The main objective of our research is to assist data scientists during the exploration
of unprocessed, numerical, raw data distributed across multiple files, based solely on
its intrinsic distribution.
In this thesis, we first introduce the concept of Equally-Distributed Dependencies
(EDDs), which provides the foundations to match this kind of dataset. We propose
PresQ, a novel algorithm that finds quasi-cliques on hypergraphs based on their
expected statistical properties. The probabilistic approach of PresQ can be
successfully exploited to mine EDDs between diverse datasets when the underlying
populations can be assumed to be the same.
Finally, we propose a two-sample statistical test based on Self-Organizing Maps
(SOM). This method can outperform, in terms of power, other classifier-based
two-sample tests, and is in some cases comparable to kernel-based methods, with the
advantage of being interpretable.
Both PresQ and the SOM-based statistical test can provide insights that drive
serendipitous discoveries.
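The SOM-based test can be pictured as quantize-then-compare: both samples are mapped onto a shared discrete codebook (in the thesis, the cells of a trained Self-Organizing Map) and the two occupancy histograms are compared. The sketch below is a hedged illustration of that general idea only, using a fixed, hand-picked set of prototype points instead of a trained SOM; all names are hypothetical, and under the null hypothesis of a shared distribution the statistic is approximately chi-squared distributed with (number of cells − 1) degrees of freedom.

```python
# Quantize-then-compare two-sample sketch: a simplified stand-in for a
# SOM-based test, with fixed prototypes instead of a trained map.

def nearest_prototype(x, prototypes):
    """Index of the prototype closest to point x (squared Euclidean)."""
    return min(range(len(prototypes)),
               key=lambda i: sum((a - b) ** 2
                                 for a, b in zip(x, prototypes[i])))

def chi2_two_sample(sample_a, sample_b, prototypes):
    """Chi-squared statistic comparing how two samples populate the cells."""
    k = len(prototypes)
    counts_a = [0] * k
    counts_b = [0] * k
    for x in sample_a:
        counts_a[nearest_prototype(x, prototypes)] += 1
    for x in sample_b:
        counts_b[nearest_prototype(x, prototypes)] += 1
    n_a, n_b = len(sample_a), len(sample_b)
    stat = 0.0
    for a, b in zip(counts_a, counts_b):
        if a + b == 0:
            continue  # empty cell contributes nothing
        # Expected counts under H0: both samples share one distribution.
        ea = (a + b) * n_a / (n_a + n_b)
        eb = (a + b) * n_b / (n_a + n_b)
        stat += (a - ea) ** 2 / ea + (b - eb) ** 2 / eb
    return stat

protos = [(0.0, 0.0), (1.0, 1.0)]
same = [(0.1, 0.0), (0.9, 1.0)]
stat_same = chi2_two_sample(same, same, protos)  # identical occupancy: 0.0
```

Replacing the fixed prototypes with the best-matching units of a trained SOM is what turns this plain histogram comparison into the interpretable, data-adaptive test the thesis proposes.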
Enumerating Top-k Quasi-Cliques
This is a pre-print of the article Sanei-Mehri, Seyed-Vahid, Apurba Das, and Srikanta Tirthapura. "Enumerating Top-k Quasi-Cliques." arXiv preprint arXiv:1808.09531 (2018). Posted with permission.