VoG: Summarizing and Understanding Large Graphs
How can we succinctly describe a million-node graph with a few simple
sentences? How can we measure the "importance" of a set of discovered subgraphs
in a large graph? These are exactly the problems we focus on. Our main ideas
are to construct a "vocabulary" of subgraph-types that often occur in real
graphs (e.g., stars, cliques, chains), and from a set of subgraphs, find the
most succinct description of a graph in terms of this vocabulary. We measure
success in a well-founded way by means of the Minimum Description Length (MDL)
principle: a subgraph is included in the summary if it decreases the total
description length of the graph.
Our contributions are three-fold: (a) formulation: we provide a principled
encoding scheme to choose vocabulary subgraphs; (b) algorithm: we develop
VoG, an efficient method to minimize the description cost; and (c)
applicability: we report experimental results on multi-million-edge real
graphs, including Flickr and the Notre Dame web graph.
Comment: SIAM International Conference on Data Mining (SDM) 201
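The MDL selection rule described above (keep a subgraph only if it lowers the total description length) can be illustrated with a toy sketch. The cost model below is illustrative, not the paper's actual encoding scheme, and all names are hypothetical; it also assumes candidate subgraphs cover disjoint edge sets.

```python
import math

def bits_for_edges(num_edges, num_nodes):
    """Naive cost of listing leftover edges as pairs of node ids."""
    if num_edges == 0:
        return 0.0
    return num_edges * 2 * math.log2(max(num_nodes, 2))

def description_length(total_edges, num_nodes, summary):
    """Cost of the chosen structures plus the edges they leave unexplained."""
    explained = sum(s["edges_covered"] for s in summary)
    structure_cost = sum(s["cost"] for s in summary)
    return structure_cost + bits_for_edges(total_edges - explained, num_nodes)

def greedy_summarize(candidates, total_edges, num_nodes):
    """Keep a candidate only if it decreases the total description length."""
    summary = []
    best = description_length(total_edges, num_nodes, summary)
    for cand in sorted(candidates, key=lambda s: -s["edges_covered"]):
        trial = description_length(total_edges, num_nodes, summary + [cand])
        if trial < best:
            summary.append(cand)
            best = trial
    return summary, best
```

A large star that explains many edges cheaply is kept, while a small structure whose encoding costs more than the edges it saves is rejected.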
Reducing the loss of information through annealing text distortion
Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Granados, A.; Cebrian, M.; Camacho, D.; de Borja Rodriguez, F., "Reducing the Loss of Information through Annealing Text Distortion", IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 7, pp. 1090-1102, July 2011.
Compression distances have been widely used in knowledge discovery and data mining. They are parameter-free, widely applicable, and very effective in several domains. However, little has been done to interpret their results or to explain their behavior. In this paper, we take a step toward understanding compression distances by performing an experimental evaluation of the impact of several kinds of information distortion on compression-based text clustering. We show how progressively removing words in such a way that the complexity of a document is slowly reduced helps the compression-based text clustering and improves its accuracy. In fact, we show how the nondistorted text clustering can be improved by means of annealing text distortion. The experimental results shown in this paper are consistent across different data sets and different compression algorithms belonging to the most important compression families: Lempel-Ziv, statistical, and block-sorting.
This work was supported by the Spanish Ministry of Education and Science under projects TIN2010-19872 and TIN2010-19607.
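The compression distance behind this kind of clustering is typically the normalized compression distance (NCD). Below is a minimal sketch using zlib as the compressor, together with a much-simplified word-substitution distortion; the paper ranks words far more carefully, so the frequency-based ranking here is an assumption for illustration only.

```python
import zlib
from collections import Counter

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance with zlib standing in for C(.)."""
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

def distort(text: str, fraction: float) -> str:
    """Replace a fraction of the vocabulary (most frequent words here)
    with asterisks, keeping word lengths so document shape is preserved."""
    words = text.split()
    k = int(len(set(words)) * fraction)
    common = {w for w, _ in Counter(words).most_common(k)}
    return " ".join("*" * len(w) if w in common else w for w in words)
```

Clustering then proceeds on the pairwise NCD matrix of the (possibly distorted) documents; the annealing idea is to sweep `fraction` gradually and observe how accuracy changes.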
LeCo: Lightweight Compression via Learning Serial Correlations
Lightweight data compression is a key technique that allows column stores to
exhibit superior performance for analytical queries. Despite a comprehensive
study on dictionary-based encodings to approach Shannon's entropy, few prior
works have systematically exploited the serial correlation in a column for
compression. In this paper, we propose LeCo (i.e., Learned Compression), a
framework that uses machine learning to remove the serial redundancy in a value
sequence automatically to achieve an outstanding compression ratio and
decompression performance simultaneously. LeCo presents a general approach to
this end, making existing (ad-hoc) algorithms such as Frame-of-Reference (FOR),
Delta Encoding, and Run-Length Encoding (RLE) special cases under our
framework. Our microbenchmark with three synthetic and six real-world data sets
shows that a prototype of LeCo achieves a Pareto improvement on both
compression ratio and random access speed over the existing solutions. When
integrating LeCo into widely used applications, we observe up to a 3.9x
speedup in filter-scanning a Parquet file and a 16% increase in
RocksDB's throughput.
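The core idea, fitting a model to the value sequence and storing only small residuals, can be sketched as follows. This is a simplified single-partition linear variant; LeCo's actual partitioning and bit-packing are more involved, and every name below is hypothetical.

```python
def linear_fit(values):
    """Least-squares line v ~ a*i + b over positions i (the learned model)."""
    n = len(values)
    mx, my = (n - 1) / 2, sum(values) / n
    cov = sum((i - mx) * (v - my) for i, v in enumerate(values))
    var = sum((i - mx) ** 2 for i in range(n))
    a = cov / var if var else 0.0
    return a, my - a * mx

def compress(values):
    """Store two model parameters plus small integer residuals.
    Random access: values[i] = round(a*i + b) + residuals[i]."""
    a, b = linear_fit(values)
    residuals = [v - round(a * i + b) for i, v in enumerate(values)]
    return (a, b), residuals

def decompress(model, residuals):
    a, b = model
    return [round(a * i + b) + r for i, r in enumerate(residuals)]

def bits_per_residual(residuals):
    """Fixed bit width needed to encode the residual array."""
    return max(abs(r) for r in residuals).bit_length() + 1  # sign bit
```

Frame-of-Reference is the special case a = 0 (a constant model), and a perfectly linear column yields an all-zero residual array, the ideal input for RLE, which is how such schemes become special cases of a learned framework.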
Clustering and Community Detection in Directed Networks: A Survey
Networks (or graphs) appear as dominant structures in diverse domains,
including sociology, biology, neuroscience and computer science. In most of the
aforementioned cases graphs are directed - in the sense that there is
directionality on the edges, making the semantics of the edges non-symmetric.
An interesting feature that real networks present is the clustering or
community structure property, under which the graph topology is organized into
modules commonly called communities or clusters. The essence here is that nodes
of the same community are highly similar while on the contrary, nodes across
communities present low similarity. Revealing the underlying community
structure of directed complex networks has become a crucial and
interdisciplinary topic with a plethora of applications. Therefore, naturally
there has been a recent wealth of research on mining directed
graphs - with clustering being the primary method and tool for community
detection and evaluation. The goal of this paper is to offer an in-depth review
of the methods presented so far for clustering directed networks along with the
relevant necessary methodological background and also related applications. The
survey commences by offering a concise review of the fundamental concepts and
methodological base on which graph clustering algorithms capitalize. Then we
present the relevant work along two orthogonal classifications. The first one
is mostly concerned with the methodological principles of the clustering
algorithms, while the second approaches the methods from the viewpoint of
the properties of a good cluster in a directed network. Further, we
present methods and metrics for evaluating graph clustering results,
demonstrate interesting application domains and provide promising future
research directions.
Comment: 86 pages, 17 figures. Physics Reports Journal (To Appear)
Association Discovery in Two-View Data
Two-view datasets are datasets whose attributes are naturally split into two sets, each providing a different view on the same set of objects. We introduce the task of finding small and non-redundant sets of associations that describe how the two views are related. To achieve this, we propose a novel approach in which sets of rules are used to translate one view to the other and vice versa. Our models, dubbed translation tables, contain both unidirectional and bidirectional rules that span both views and provide lossless translation from either of the views to the opposite view. To be able to evaluate different translation tables and perform model selection, we present a score based on the Minimum Description Length (MDL) principle. Next, we introduce three TRANSLATOR algorithms to find good models according to this score. The first algorithm is parameter-free and iteratively adds the rule that improves compression most. The other two algorithms use heuristics to achieve better trade-offs between runtime and compression. The empirical evaluation on real-world data demonstrates that only modest numbers of associations are needed to characterize the two-view structure present in the data, while the obtained translation rules are easily interpretable and provide insight into the data.
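A translation rule maps an attribute set in one view to an attribute set in the other, and candidate rule tables are compared with a two-part MDL-style score. The sketch below uses made-up constant costs rather than the paper's encoding; all function names are hypothetical.

```python
def translate(row_a, rules, default=frozenset()):
    """Apply the first rule whose antecedent is contained in the view-A row."""
    for antecedent, consequent in rules:
        if antecedent <= row_a:
            return consequent
    return default

def translation_errors(pairs, rules):
    """Count view-B rows that the rules fail to reproduce from view A."""
    return sum(1 for a, b in pairs if translate(a, rules) != b)

def mdl_score(pairs, rules, rule_cost=8.0, error_cost=16.0):
    """Two-part score: cost of the rule table plus cost of fixing errors."""
    return rule_cost * len(rules) + error_cost * translation_errors(pairs, rules)
```

A greedy search in this spirit would repeatedly add the rule that lowers `mdl_score` the most, stopping when no candidate improves it.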
Compression models and tools for omics data
The ever-increasing development of high-throughput sequencing
technologies, and the huge volumes of data they generate, have
revolutionized biological research and discovery. Motivated by this, we
investigate in this thesis methods capable of providing an efficient
representation of omics data in compressed or encrypted form, and then
employ them to analyze omics data.
First and foremost, we describe a number of measures for quantifying
the information in and between omics sequences. We then present
finite-context models (FCMs), substitution-tolerant Markov models
(STMMs) and a combination of the two, specialized in modeling
biological data, for data compression and analysis.
To ease the storage of the aforementioned data deluge, we design two
lossless data compressors for genomic data and one for proteomic data.
The methods work on the basis of (a) a combination of FCMs and STMMs or
(b) that combination together with repeat models and a competitive
prediction model. Tests on various synthetic and real data show that
they outperform previously proposed methods in terms of compression
ratio.
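A finite-context model can be sketched as an adaptive order-k Markov model whose per-symbol probabilities translate directly into a code length in bits. This is a minimal illustration, not the thesis's implementation; the parameter names are assumptions.

```python
import math
from collections import defaultdict

def fcm_bits(seq: str, k: int = 3, alphabet: str = "ACGT") -> float:
    """Estimated code length (bits) of seq under an adaptive order-k
    finite-context model with Laplace (add-one) smoothing."""
    counts = defaultdict(lambda: defaultdict(int))
    bits = 0.0
    for i in range(k, len(seq)):
        ctx, sym = seq[i - k:i], seq[i]
        total = sum(counts[ctx].values())
        p = (counts[ctx][sym] + 1) / (total + len(alphabet))
        bits += -math.log2(p)        # ideal code length for this symbol
        counts[ctx][sym] += 1        # adaptive update after coding
    return bits
```

A highly repetitive sequence quickly drives the conditional probabilities toward 1 and costs far fewer bits per base than an unpredictable one, which is the signal such models exploit both for compression and for sequence analysis.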
Privacy of genomic data has recently come into focus with developments
in the field of personalized medicine. We propose a tool that represents
genomic data in a securely encrypted fashion while compacting FASTA and
FASTQ sequences by a factor of three. It employs AES encryption
accompanied by a shuffling mechanism that improves data security. The
results show it is faster than both general-purpose and special-purpose
algorithms.
Compression techniques can also be employed to analyze omics data. With
this in mind, we investigate the identification of regions unique to a
species with respect to close species, which can give insight into
evolutionary traits. For this purpose, we design two alignment-free
tools that accurately find and visualize distinct regions between two
collections of DNA or protein sequences. Tested on modern humans with
respect to Neanderthals, we found a number of regions absent in
Neanderthals that may express new functionalities associated with the
evolution of modern humans.
Finally, we investigate the identification of genomic rearrangements,
which play important roles in genetic disorders and cancer, by employing
a compression technique. For this purpose, we design a tool that
accurately localizes and visualizes small- and large-scale
rearrangements between two genomic sequences. The results of applying
the proposed tool to several synthetic and real data sets agreed with those partially reported by
wet laboratory approaches, e.g., FISH analysis.
Programa Doutoral em Engenharia Informática
Algorithms to Explore the Structure and Evolution of Biological Networks
High-throughput experimental protocols have revealed thousands of relationships amongst genes and proteins under various conditions. These putative associations are being aggressively mined to decipher the structural and functional architecture of the cell. One useful tool for exploring this data has been computational network analysis. In this thesis, we propose a collection of novel algorithms to explore the structure and evolution of large, noisy, and sparsely annotated biological networks.
We first introduce two information-theoretic algorithms to extract interesting patterns and modules embedded in large graphs. The first, graph summarization, uses the minimum description length principle to find compressible parts of the graph. The second, VI-Cut, uses the variation of information to non-parametrically find groups of topologically cohesive and similarly annotated nodes in the network. We show that both algorithms find structure in biological data that is consistent with known biological processes, protein complexes, genetic diseases, and operational taxonomic units. We also propose several algorithms to systematically generate an ensemble of near-optimal network clusterings and show how these multiple views can be used together to identify clustering dynamics that any single solution approach would miss.
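The variation of information (VI) used by VI-Cut is a standard information-theoretic distance between two clusterings; a minimal self-contained computation (not the thesis's VI-Cut algorithm itself) looks like this:

```python
import math
from collections import Counter

def variation_of_information(c1, c2):
    """VI = H(C1) + H(C2) - 2*I(C1;C2) for two clusterings given as
    equal-length lists of cluster labels over the same nodes."""
    n = len(c1)
    p1, p2 = Counter(c1), Counter(c2)
    joint = Counter(zip(c1, c2))
    h1 = -sum(c / n * math.log2(c / n) for c in p1.values())
    h2 = -sum(c / n * math.log2(c / n) for c in p2.values())
    mi = sum(c / n * math.log2((c / n) / ((p1[a] / n) * (p2[b] / n)))
             for (a, b), c in joint.items())
    return h1 + h2 - 2 * mi
```

VI is zero exactly when the two clusterings are identical up to relabeling, which makes it a convenient non-parametric criterion for comparing candidate groupings.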
To facilitate the study of ancient networks, we introduce a framework called "network archaeology" for reconstructing the node-by-node and edge-by-edge arrival history of a network. Starting with a present-day network, we apply a probabilistic growth model backwards in time to find high-likelihood previous states of the graph. This allows us to explore how interactions and modules may have evolved over time. In experiments with real-world social and biological networks, we find that our algorithms can recover significant features of ancestral networks that have long since disappeared.
Our work is motivated by the need to understand large and complex biological systems that are being revealed to us by imperfect data. As data continues to pour in, we believe that computational network analysis will continue to be an essential tool towards this end.