10,209 research outputs found

    Compression techniques for extreme-scale graphs and matrices: sequential and parallel algorithms

    Get PDF
    A graph G = (V, E) is an ordered pair in which V is a non-empty set of elements called vertices (nodes) and E is a set of unordered pairs of vertices called edges (links); a time-evolving graph is one whose edge states change over time. With the growing popularity of social networks and the massive influx of users, it is becoming challenging to store such graphs and to process them quickly, before their properties change as the graph evolves. Graphs or networks are collections of entities (individuals in a social network) and their relationships (friends, followers), and the choice of representation determines how easily information can be extracted. As the number of users grows, so does the number of relationships each user has, making the graphs massive and nearly impossible to store in friendly structures such as an adjacency matrix or an adjacency list. Therefore, an exciting area of research is storing these massive graphs with a smaller memory footprint and processing them with very little extra memory. But there is always a trade-off between time and space: to obtain a small memory footprint, one has to remove redundancy rigorously, which takes time, and querying these compact structures likewise takes longer than querying a matrix or an adjacency list. In this dissertation, we provide an encoding technique for storing the arrays of the Compressed Sparse Row (CSR) data structure and extend the encoding to store time-evolving graphs in CSR form. We also propose a combination of two structures (CSR + CBT) to store time-evolving graphs and improve the time-space trade-off. The encoding enables access to a node without decompressing the entire structure, so the data structure can be queried in place. We then provide four ways to store multi-dimensional data, which represent intricate relations within a social network. Once the data are stored in compressed format, it is important to provide algorithms that work on these structures. One such computation, which underlies many graph algorithms, is matrix multiplication. We extend our work to perform value-based matrix multiplication on compressed structures, and we test our algorithm on extremely large matrices, on the order of hundreds of millions, with various levels of sparsity. Building on matrix-matrix multiplication, and keeping with the theme of storing data in small spaces, we propose another form of compression through dimensionality reduction, referred to as matrix factorization. Performing any of these operations on a compressed structure without decompressing it would be time-consuming. Therefore, in this dissertation, we introduce a parallel technique to construct the graph and to run batches of queries, such as neighbor fetching and edge-existence checks, in parallel. We also extend our work to propose parallel time-evolving differential compression of the CSR using a prefix-sum approach.
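    For reference, a minimal sketch of a plain (uncompressed, unencoded) CSR adjacency structure with a neighbor query and an edge-existence check; the names and layout are illustrative and do not reproduce the dissertation's encoding or its time-evolving extension.

        # Minimal CSR sketch (illustrative; not the dissertation's encoding).
        # row_ptr[u] .. row_ptr[u+1] delimits the slice of col_idx holding u's neighbors.
        from bisect import bisect_left

        def build_csr(num_nodes, edges):
            """Build CSR arrays from a directed edge list [(u, v), ...]."""
            row_ptr = [0] * (num_nodes + 1)
            for u, _ in edges:
                row_ptr[u + 1] += 1
            for u in range(num_nodes):          # prefix sum turns counts into offsets
                row_ptr[u + 1] += row_ptr[u]
            col_idx = [0] * len(edges)
            next_slot = row_ptr[:]              # working copy of the row offsets
            for u, v in edges:
                col_idx[next_slot[u]] = v
                next_slot[u] += 1
            for u in range(num_nodes):          # sort each row to allow binary search
                col_idx[row_ptr[u]:row_ptr[u + 1]] = sorted(col_idx[row_ptr[u]:row_ptr[u + 1]])
            return row_ptr, col_idx

        def neighbors(row_ptr, col_idx, u):
            """Return the neighbor list of node u."""
            return col_idx[row_ptr[u]:row_ptr[u + 1]]

        def has_edge(row_ptr, col_idx, u, v):
            """Check edge existence with a binary search inside u's row."""
            lo, hi = row_ptr[u], row_ptr[u + 1]
            i = bisect_left(col_idx, v, lo, hi)
            return i < hi and col_idx[i] == v

    Note that the row-offset array here is produced by a prefix sum over per-node degree counts, which is also why prefix sums parallelize CSR construction well.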

    Compressing and Performing Algorithms on Massively Large Networks

    Get PDF
    Networks are represented as a set of nodes (vertices) and the arcs (links) connecting them. Such networks can model various real-world structures such as social networks (e.g., Facebook), information networks (e.g., citation networks), technological networks (e.g., the Internet), and biological networks (e.g., gene-phenotype networks). Analysis of such structures is a heavily studied area with many applications. However, in this era of big data, we find ourselves with networks so massive that their space requirements inhibit analysis. Since many of these networks have nodes and arcs on the order of billions to trillions, even basic data structures such as adjacency lists could cost petabytes to zettabytes of storage. Storing these networks in secondary memory would require I/O access (i.e., disk access) during analysis, drastically slowing it down. To perform analysis efficiently on such extensive data, we either need enough main memory for the data structures and algorithms, or we need to develop compressions that require much less space while still answering queries efficiently. In this dissertation, we develop several compression techniques that succinctly represent these real-world networks while still being able to efficiently query the network (e.g., check whether an arc exists between two nodes). Furthermore, since many of these networks continue to grow over time, our compression techniques also support adding and removing nodes and edges directly on the compressed structure. We also provide a way to compress the data quickly without any intermediate structure, giving minimal memory overhead. We provide a detailed analysis and prove that our compression is indeed succinct (i.e., achieves the information-theoretic lower bound). We also empirically show that our compression rates match or outperform those of existing compression algorithms on many benchmark datasets. We then extend our technique to time-evolving networks; that is, we store the entire state of the network at each time frame. Studying time-evolving networks allows us to find patterns over time that would not be visible in regular, static network analysis. A succinct representation is arguably even more important for time-evolving networks than for static graphs, since the extra dimension inflates the space requirements of basic data structures even further. Again, we achieve succinctness while also providing fast encoding, minimal memory overhead during encoding, fast queries, and fast, direct modification. We also compare against several benchmarks and empirically show that we achieve compression rates better than or equal to the best-performing benchmark for each dataset. Finally, we develop both static and time-evolving algorithms that run directly on our compressed structures. Using our static graph compression combined with our differential technique, we find that we can speed up matrix-vector multiplication by reusing previously computed products. We compare our results against a similar technique using the Webgraph Framework and see that not only are our base query speeds faster, but we also gain a more significant speed-up from reusing products. Then, we use our time-evolving compression to solve the earliest-arrival-paths problem and time-evolving transitive closure. We found that not only were we the first to run such algorithms directly on compressed data, but also that our technique was particularly efficient at doing so.
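    The reuse of previously computed products can be pictured with a toy example: when only a few arcs change between snapshots, the previous product can be patched rather than recomputed from scratch. The sketch below only illustrates that idea on a plain weighted edge list; it is not the dissertation's compressed-structure implementation, and all names are illustrative.

        # Toy sketch of reusing a matrix-vector product across graph snapshots
        # (illustrative only; not the compressed-structure implementation).

        def spmv(edges, x):
            """Plain sparse matrix-vector product over a weighted edge list [(u, v, w), ...]."""
            y = [0.0] * len(x)
            for u, v, w in edges:
                y[u] += w * x[v]
            return y

        def spmv_incremental(y_prev, added, removed, x):
            """Patch y = A @ x given the previous product and the edge differences."""
            y = list(y_prev)
            for u, v, w in added:       # each new arc contributes w * x[v] to row u
                y[u] += w * x[v]
            for u, v, w in removed:     # each deleted arc withdraws its contribution
                y[u] -= w * x[v]
            return y

        # Usage: y1 = spmv(edges_t1, x); y2 = spmv_incremental(y1, added, removed, x)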

    Multiscale Markov Decision Problems: Compression, Solution, and Transfer Learning

    Full text link
    Many problems in sequential decision making and stochastic control have natural multiscale structure: sub-tasks are assembled together to accomplish complex goals. Systematically inferring and leveraging hierarchical structure, particularly beyond a single level of abstraction, has remained a longstanding challenge. We describe a fast multiscale procedure for repeatedly compressing, or homogenizing, Markov decision processes (MDPs), wherein a hierarchy of sub-problems at different scales is automatically determined. Coarsened MDPs are themselves independent, deterministic MDPs, and may be solved using existing algorithms. The multiscale representation delivered by this procedure decouples sub-tasks from each other and can lead to substantial improvements in convergence rates both locally within sub-problems and globally across sub-problems, yielding significant computational savings. A second fundamental aspect of this work is that these multiscale decompositions yield new transfer opportunities across different problems, where solutions of sub-tasks at different levels of the hierarchy may be amenable to transfer to new problems. Localized transfer of policies and potential operators at arbitrary scales is emphasized. Finally, we demonstrate compression and transfer in a collection of illustrative domains, including examples involving discrete and continuous state spaces.
    Comment: 86 pages, 15 figures
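    For context, the kind of existing solver that such coarsened sub-problems could be handed to is ordinary tabular value iteration; the sketch below shows it on a small MDP and is not the paper's homogenization procedure (all names are illustrative).

        # Plain value iteration on a small tabular MDP (illustrative only;
        # not the multiscale homogenization procedure described above).

        def value_iteration(P, R, gamma=0.9, tol=1e-8):
            """P[s][a] is a list of (prob, next_state); R[s][a] is the immediate reward."""
            n_states = len(P)
            V = [0.0] * n_states
            while True:
                V_new = [
                    max(
                        R[s][a] + gamma * sum(p * V[t] for p, t in P[s][a])
                        for a in range(len(P[s]))
                    )
                    for s in range(n_states)
                ]
                if max(abs(a - b) for a, b in zip(V, V_new)) < tol:
                    return V_new
                V = V_new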

    The Thermodynamics of Network Coding, and an Algorithmic Refinement of the Principle of Maximum Entropy

    Full text link
    The principle of maximum entropy (Maxent) is often used to obtain prior probability distributions: under some restriction it yields a Gibbs measure giving the probability that a system will be in a certain state relative to the other elements of the distribution. Because classical entropy-based Maxent conflates all distinct degrees of randomness and pseudo-randomness, here we take into consideration the generative mechanism of the systems in the ensemble, separating objects that may comply with the principle under some restriction and whose entropy is maximal yet can be generated recursively from those that are actually algorithmically random, thereby offering a refinement of classical Maxent. We take advantage of a causal algorithmic calculus to derive a thermodynamic-like result based on how difficult it is to reprogram a computer code. Using the distinction between computable and algorithmic randomness, we quantify the cost in information loss associated with reprogramming. To illustrate this, we apply the algorithmic refinement of Maxent to graphs and introduce a Maximal Algorithmic Randomness Preferential Attachment (MARPA) algorithm, a generalisation of previous approaches. We discuss the practical implications of evaluating network randomness. Our analysis provides the insight that the reprogrammability asymmetry appears to originate from a non-monotonic relationship to algorithmic probability, and it motivates further analysis of the origin and consequences of these asymmetries, reprogrammability, and computation.
    Comment: 30 pages
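    As a reminder of the classical starting point being refined here: under a mean-energy constraint, the maximum-entropy distribution is the Gibbs measure p_i ∝ exp(-βE_i). The sketch below computes it for a toy list of energies; it does not implement the algorithmic (MARPA) refinement, and the names are illustrative.

        # Classical Maxent under a mean-energy constraint: the Gibbs measure
        # p_i ∝ exp(-beta * E_i) (illustrative; not the MARPA refinement).
        import math

        def gibbs_distribution(energies, beta=1.0):
            """Return the maximum-entropy distribution for the given energies."""
            weights = [math.exp(-beta * e) for e in energies]
            z = sum(weights)                     # partition function
            return [w / z for w in weights]

        print(gibbs_distribution([0.0, 1.0, 2.0], beta=0.5))
        # higher-energy states receive exponentially smaller probability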