Navigating Diverse Datasets in the Face of Uncertainty
When exploring big volumes of data, one of the challenging aspects is their diversity
of origin. Multiple files that have not yet been ingested into a database system may
contain information of interest to a researcher, who must curate, understand and sieve
their content before being able to extract knowledge.
Performance is one of the greatest difficulties in exploring these datasets. On the
one hand, examining non-indexed, unprocessed files can be inefficient. On the other
hand, any processing done before the data are understood introduces latency and potentially
unnecessary work if the chosen schema matches the data poorly. We have surveyed the
state of the art and, fortunately, multiple solutions have been proposed for handling
data in situ efficiently.
Another major difficulty is matching files from multiple origins since their schema
and layout may not be compatible or properly documented. Most surveyed solutions
overlook this problem, especially for numeric, uncertain data, as is typical in fields
like astronomy.
The main objective of our research is to assist data scientists during the exploration
of unprocessed, numerical, raw data distributed across multiple files, based solely on
their intrinsic distributions.
In this thesis, we first introduce the concept of Equally-Distributed Dependencies
(EDDs), which provides the foundations to match this kind of dataset. We propose PresQ,
a novel algorithm that finds quasi-cliques in hypergraphs based on their expected
statistical properties. The probabilistic approach of PresQ can be successfully
exploited to mine EDDs between diverse datasets when the underlying populations can
be assumed to be the same.
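The abstract does not spell out how equal distribution is checked; as a hedged illustration, the basic primitive (deciding whether two numeric columns from different files could share a distribution) can be sketched with a classical two-sample Kolmogorov-Smirnov test. The column names, sample sizes, and significance threshold below are illustrative assumptions, not the thesis's actual procedure.

```python
import numpy as np
from scipy.stats import ks_2samp

def equally_distributed(x, y, alpha=0.05):
    """True when the two-sample KS test cannot reject that x and y
    come from the same underlying distribution."""
    _, p_value = ks_2samp(x, y)
    return bool(p_value >= alpha)

rng = np.random.default_rng(42)
mag_a = rng.normal(20.0, 1.5, size=500)   # hypothetical magnitude column, file A
mag_b = rng.normal(20.0, 1.5, size=400)   # same population, file B
flux_c = rng.exponential(1.0, size=500)   # an unrelated attribute

print(equally_distributed(mag_a, mag_b))   # same population: usually accepted
print(equally_distributed(mag_a, flux_c))  # different population: rejected
```

A pair of columns that passes such a test for every attribute combination is a candidate match, which is the flavour of evidence an EDD-style approach aggregates.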
Finally, we propose a two-sample statistical test based on Self-Organizing Maps
(SOM). This method can outperform, in terms of power, other classifier-based
two-sample tests, in some cases being comparable to kernel-based methods, with the
advantage of being interpretable.
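The SOM-based test itself is not detailed here; as a point of comparison, a minimal member of the classifier two-sample test family it competes with can be sketched with a leave-one-out 1-nearest-neighbour classifier: if the two samples can be told apart better than chance, the distributions likely differ. This is a generic sketch, not the thesis's method.

```python
import numpy as np

def nn_two_sample_test(x, y):
    """Leave-one-out 1-NN classifier two-sample test.

    Pools the samples, labels points by origin, and measures how often a
    point's nearest neighbour carries the same label.  Under the null
    hypothesis (same distribution) the accuracy stays near chance level."""
    z = np.concatenate([x, y])
    labels = np.concatenate([np.zeros(len(x)), np.ones(len(y))])
    # pairwise squared distances; diagonal masked out for leave-one-out
    d = ((z[:, None, :] - z[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d, np.inf)
    nearest = d.argmin(axis=1)
    accuracy = (labels[nearest] == labels).mean()
    # chance level for (possibly unbalanced) pools
    n, m = len(x), len(y)
    chance = (n * (n - 1) + m * (m - 1)) / ((n + m) * (n + m - 1))
    return accuracy, chance

rng = np.random.default_rng(0)
same = nn_two_sample_test(rng.normal(size=(200, 2)),
                          rng.normal(size=(200, 2)))       # accuracy near chance
diff = nn_two_sample_test(rng.normal(size=(200, 2)),
                          rng.normal(3.0, 1.0, (200, 2)))  # accuracy well above chance
```

A full test would turn the accuracy gap into a p-value (e.g. via a permutation or binomial argument); interpretability is where a SOM-based variant can add value, since the map's units can be inspected directly.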
Both PresQ and the SOM-based statistical test can provide insights that drive
serendipitous discoveries.
Visual Exploration System for Analyzing Trends in Annual Recruitment Using Time-varying Graphs
Annual recruitment data on new graduates are manually analyzed by human
resources (HR) specialists in industry, which creates a need to evaluate
their recruitment strategies. Every year, different applicants
send in job applications to companies. The relationships between applicants'
attributes (e.g., English skill or academic credential) can be used to analyze
the changes in recruitment trends across multiple years' data. However, most
attributes are unnormalized and thus require thorough preprocessing. Such
unnormalized data hinder effective comparison of the relationships between
applicants in the early stages of data analysis. Thus, a visual exploration
system is needed to gain insight from an overview of the relationships
between applicants across multiple years. In this study, we propose the
Polarizing Attributes for Network Analysis of Correlation on Entities
Association (Panacea) visualization system. The proposed system integrates a
time-varying graph model and dynamic graph visualization for heterogeneous
tabular data. Using this system, human resource specialists can interactively
inspect the relationships between two attributes of prospective employees
across multiple years. Further, we demonstrate the usability of Panacea with
representative examples for finding hidden trends in real-world datasets and
then describe HR specialists' feedback obtained throughout Panacea's
development. The proposed Panacea system enables HR specialists to visually
explore the annual recruitment of new graduates.
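As a hedged sketch of the kind of time-varying graph such a system might build from tabular recruitment data (the record fields and attribute names below are invented for illustration and are not Panacea's actual schema): one weighted graph per year, where an edge's weight counts the applicants who share a pair of attribute values.

```python
from collections import Counter
from itertools import combinations

# Hypothetical applicant records; fields are illustrative only.
records = [
    {"year": 2021, "english": "high", "degree": "CS"},
    {"year": 2021, "english": "high", "degree": "EE"},
    {"year": 2022, "english": "low",  "degree": "CS"},
    {"year": 2022, "english": "high", "degree": "CS"},
]

def yearly_cooccurrence(records, attrs=("english", "degree")):
    """Build one weighted co-occurrence graph per year: an edge between two
    attribute values weighs how many applicants exhibit both."""
    graphs = {}
    for rec in records:
        g = graphs.setdefault(rec["year"], Counter())
        values = [(a, rec[a]) for a in attrs]
        for u, v in combinations(values, 2):
            g[(u, v)] += 1
    return graphs

graphs = yearly_cooccurrence(records)
```

Comparing the per-year graphs (edges appearing, disappearing, or changing weight) is one way to surface the recruitment-trend changes the abstract describes.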
An introduction to Graph Data Management
A graph database is a database in which the data structures for the schema
and/or instances are modeled as a (labeled) (directed) graph, or generalizations
of it, and in which querying is expressed by graph-oriented operations and type
constructors. In this article we present the basic notions of graph databases,
give a historical overview of their main developments, and study the main current
systems that implement them.
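To make "querying expressed by graph-oriented operations" concrete, here is a toy sketch (not any particular system's API) of a labeled directed graph and a reachability query over it:

```python
from collections import deque

# A tiny labeled, directed graph: node -> list of (edge_label, target)
graph = {
    "alice": [("follows", "bob"), ("authored", "post1")],
    "bob":   [("follows", "carol")],
    "carol": [("follows", "alice")],
    "post1": [],
}

def reachable(graph, start, edge_label):
    """Graph-oriented query: all nodes reachable from `start` along
    edges carrying the given label (a transitive-closure traversal)."""
    seen, frontier = set(), deque([start])
    while frontier:
        node = frontier.popleft()
        for label, target in graph.get(node, []):
            if label == edge_label and target not in seen:
                seen.add(target)
                frontier.append(target)
    return seen

print(reachable(graph, "alice", "follows"))  # the set {'bob', 'carol', 'alice'}
```

Such transitive traversals are awkward to express in plain relational SQL but are first-class operations in graph query languages.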
Attributed Stream Hypergraphs: temporal modeling of node-attributed high-order interactions
Recent advances in network science have resulted in two distinct research
directions aimed at augmenting and enhancing representations for complex
networks. The first direction, that of high-order modeling, aims to focus on
connectivity between sets of nodes rather than pairs, whereas the second one,
that of feature-rich augmentation, incorporates into a network all those
elements that are driven by information which is external to the structure,
like node properties or the flow of time. This paper proposes a novel toolbox,
that of Attributed Stream Hypergraphs (ASHs), unifying both high-order and
feature-rich elements for representing, mining, and analyzing complex networks.
Applied to social network analysis, ASHs can characterize complex social
phenomena along topological, dynamic and attributive elements. Experiments on
real-world face-to-face and online social media interactions highlight that
ASHs readily support analyses of, among others, high-order group
homophily, nodes' homophily with respect to the hyperedges in which they
participate, and time-respecting paths between hyperedges.
Comment: Submitted to "Applied Network Science"
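A minimal data-structure sketch of the idea (illustrative only; this is not the ASH library's API): hyperedges carry timestamps, nodes carry attributes, and homophily can be measured per hyperedge.

```python
# Node attributes (e.g. a categorical label) and a stream of timestamped
# hyperedges over those nodes.
node_attrs = {"a": "red", "b": "red", "c": "blue", "d": "red"}
stream = [
    (1, frozenset({"a", "b"})),        # t=1: a pairwise interaction
    (2, frozenset({"a", "c", "d"})),   # t=2: a three-way interaction
]

def hyperedge_homophily(edge, attrs):
    """Fraction of node pairs inside the hyperedge sharing an attribute value."""
    nodes = sorted(edge)
    pairs = [(u, v) for i, u in enumerate(nodes) for v in nodes[i + 1:]]
    same = sum(attrs[u] == attrs[v] for u, v in pairs)
    return same / len(pairs)

print(hyperedge_homophily(stream[0][1], node_attrs))  # both nodes red: 1.0
print(hyperedge_homophily(stream[1][1], node_attrs))  # one of three pairs matches
```

Ordering the hyperedges by timestamp is what enables the time-respecting path analyses the abstract mentions.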