10,209 research outputs found
Compression techniques for extreme-scale graphs and matrices: sequential and parallel algorithms
A graph G = (V, E) is an ordered tuple where V is a non-empty set of elements called vertices (nodes), and E is a set of an unordered pair of elements called links (edges), and a time-evolving graph is a change in the states of the edges over time. With the growing popularity of social networks and the massive influx of users, it is becoming a challenging task to store the network/graph and process them as fast as possible before the property of the graph changes with the graph evolution.
Graphs or networks are a collection of entities (individuals in a social network) and their relationships (friends, followers); ways to represent a graph can help how the information could be extracted. The increase in the number of users increases the relationship the user has, which makes the graphs massive and nearly impossible to store them in friendly structures such as a matrix or an adjacency list. Therefore, an exciting area of research is storing these massive graphs with a smaller memory footprint and processing with very little extra memory.
But there is always a trade-off with time and space; to get a small memory footprint, one has to remove the redundancy rigorously, which consumes time. In the same way, when traversing these tight spaces, the time required to query also increases compared to a matrix or an adjacency list.
In this dissertation, we provide the encoding technique to store the arrays in the Compressed Sparse Row (CSR) data structure and extend the encoding to store time-evolving graphs in the form of a CSR. We also propose combinations of two structures (CSR + CBT) to store the time-evolving graphs and to improve the time and space trade-off. Encoding also enables one to access a node without decompressing the entire structure, which means that the data structure can be accessed.
We then provide four ways to store multi-dimensional data, which represents intricate relations within the social network. Once the data are stored in compressed format, it is important to provide algorithms that support the structures. One such computation which is the basis for any graph algorithm is matrix-multiplication. We now extend our work to perform value-based matrix multiplication on compressed structures. We test our algorithm on extremely large matrices, in the order of 100s of millions with various levels of sparsity. Using matrix-matrix multiplication and keeping the theme of storing the data in small spaces, we propose another way of compression is through the dimensionality reduction, which is referred to as Matrix Factorization.
Performing any of these operations on a compressed structure without decompressing would be time consuming. Therefore, in this dissertation, we introduce a parallel technique to construct the graph and also run a list of queries using the querying algorithms, such as fetching neighbor or edge existence in parallel. We also extend our work to propose parallel time-evolving differential compression of CSR using the prefix sum approach
Compressing and Performing Algorithms on Massively Large Networks
Networks are represented as a set of nodes (vertices) and the arcs (links) connecting them. Such networks can model various real-world structures such as social networks (e.g., Facebook), information networks (e.g., citation networks), technological networks (e.g., the Internet), and biological networks (e.g., gene-phenotype network). Analysis of such structures is a heavily studied area with many applications. However, in this era of big data, we find ourselves with networks so massive that the space requirements inhibit network analysis.
Since many of these networks have nodes and arcs on the order of billions to trillions, even basic data structures such as adjacency lists could cost petabytes to zettabytes of storage. Storing these networks in secondary memory would require I/O access (i.e., disk access) during analysis, thus drastically slowing analysis time. To perform analysis efficiently on such extensive data, we either need enough main memory for the data structures and algorithms, or we need to develop compressions that require much less space while still being able to answer queries efficiently.
In this dissertation, we develop several compression techniques that succinctly represent these real-world networks while still being able to efficiently query the network (e.g., check if an arc exists between two nodes). Furthermore, since many of these networks continue to grow over time, our compression techniques also support the ability to add and remove nodes and edges directly on the compressed structure. We also provide a way to compress the data quickly without any intermediate structure, thus giving minimal memory overhead. We provide detailed analysis and prove that our compression is indeed succinct (i.e., achieves the information-theoretic lower bound). Also, we empirically show that our compression rates outperform or are equal to existing compression algorithms on many benchmark datasets.
We also extend our technique to time-evolving networks. That is, we store the entire state of the network at each time frame. Studying time-evolving networks allows us to find patterns throughout the time that would not be available in regular, static network analysis. A succinct representation for time-evolving networks is arguably more important than static graphs, due to the extra dimension inflating the space requirements of basic data structures even more. Again, we manage to achieve succinctness while also providing fast encoding, minimal memory overhead during encoding, fast queries, and fast, direct modification. We also compare against several benchmarks and empirically show that we achieve compression rates better than or equal to the best performing benchmark for each dataset.
Finally, we also develop both static and time-evolving algorithms that run directly on our compressed structures. Using our static graph compression combined with our differential technique, we find that we can speed up matrix-vector multiplication by reusing previously computed products. We compare our results against a similar technique using the Webgraph Framework, and we see that not only are our base query speeds faster, but we also gain a more significant speed-up from reusing products. Then, we use our time-evolving compression to solve the earliest arrival paths problem and time-evolving transitive closure. We found that not only were we the first to run such algorithms directly on compressed data, but that our technique was particularly efficient at doing so
Multiscale Markov Decision Problems: Compression, Solution, and Transfer Learning
Many problems in sequential decision making and stochastic control often have
natural multiscale structure: sub-tasks are assembled together to accomplish
complex goals. Systematically inferring and leveraging hierarchical structure,
particularly beyond a single level of abstraction, has remained a longstanding
challenge. We describe a fast multiscale procedure for repeatedly compressing,
or homogenizing, Markov decision processes (MDPs), wherein a hierarchy of
sub-problems at different scales is automatically determined. Coarsened MDPs
are themselves independent, deterministic MDPs, and may be solved using
existing algorithms. The multiscale representation delivered by this procedure
decouples sub-tasks from each other and can lead to substantial improvements in
convergence rates both locally within sub-problems and globally across
sub-problems, yielding significant computational savings. A second fundamental
aspect of this work is that these multiscale decompositions yield new transfer
opportunities across different problems, where solutions of sub-tasks at
different levels of the hierarchy may be amenable to transfer to new problems.
Localized transfer of policies and potential operators at arbitrary scales is
emphasized. Finally, we demonstrate compression and transfer in a collection of
illustrative domains, including examples involving discrete and continuous
statespaces.Comment: 86 pages, 15 figure
The Thermodynamics of Network Coding, and an Algorithmic Refinement of the Principle of Maximum Entropy
The principle of maximum entropy (Maxent) is often used to obtain prior
probability distributions as a method to obtain a Gibbs measure under some
restriction giving the probability that a system will be in a certain state
compared to the rest of the elements in the distribution. Because classical
entropy-based Maxent collapses cases confounding all distinct degrees of
randomness and pseudo-randomness, here we take into consideration the
generative mechanism of the systems considered in the ensemble to separate
objects that may comply with the principle under some restriction and whose
entropy is maximal but may be generated recursively from those that are
actually algorithmically random offering a refinement to classical Maxent. We
take advantage of a causal algorithmic calculus to derive a thermodynamic-like
result based on how difficult it is to reprogram a computer code. Using the
distinction between computable and algorithmic randomness we quantify the cost
in information loss associated with reprogramming. To illustrate this we apply
the algorithmic refinement to Maxent on graphs and introduce a Maximal
Algorithmic Randomness Preferential Attachment (MARPA) Algorithm, a
generalisation over previous approaches. We discuss practical implications of
evaluation of network randomness. Our analysis provides insight in that the
reprogrammability asymmetry appears to originate from a non-monotonic
relationship to algorithmic probability. Our analysis motivates further
analysis of the origin and consequences of the aforementioned asymmetries,
reprogrammability, and computation.Comment: 30 page
- …