
    GoFFish: A Sub-Graph Centric Framework for Large-Scale Graph Analytics

    Large-scale graph processing is a major research area for Big Data exploration. Vertex-centric programming models like Pregel are gaining traction due to their simple abstraction, which naturally allows for scalable execution on distributed systems. However, this approach has limitations that cause vertex-centric algorithms to under-perform, due to a poor compute-to-communication ratio and slow convergence across iterative supersteps. In this paper we introduce GoFFish, a scalable sub-graph centric framework co-designed with a distributed persistent graph storage for large-scale graph analytics on commodity clusters. We introduce a sub-graph centric programming abstraction that combines the scalability of a vertex-centric approach with the flexibility of shared-memory sub-graph computation. We map the Connected Components, SSSP, and PageRank algorithms to this model to illustrate its flexibility. Further, we empirically analyze GoFFish using several real-world graphs and demonstrate its significant performance improvement, orders of magnitude in some cases, compared to Apache Giraph, the leading open source vertex-centric implementation.
    Comment: Under review by a conference, 201
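
    The abstract does not show the programming model itself; as a rough sketch of the sub-graph centric idea, assuming a hypothetical API (none of these names are GoFFish's actual interface), a Connected Components superstep can flood labels across an entire local subgraph at once, rather than one hop per superstep as in vertex-centric models:

```python
# Sketch of a sub-graph centric Connected Components superstep.
# All names here are hypothetical illustrations, not GoFFish's actual API.
from collections import deque

def local_components(subgraph_adj, seeds):
    """Flood component labels across an entire local subgraph in ONE
    superstep; vertex-centric models move labels only one hop per superstep."""
    label = dict(seeds)               # vertex -> current component label
    frontier = deque(seeds)
    while frontier:
        v = frontier.popleft()
        for u in subgraph_adj.get(v, ()):
            if label[v] < label.get(u, float("inf")):
                label[u] = label[v]   # adopt the smaller label
                frontier.append(u)
    return label

# One partition's local subgraph; vertex 3 would be a boundary vertex
# whose remote neighbors live in other partitions.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(local_components(adj, {v: v for v in adj}))
# {0: 0, 1: 0, 2: 0, 3: 0} -- converged within a single superstep
```

    Between supersteps, only boundary-vertex labels would then be exchanged across partitions, which is where the improved compute-to-communication ratio and faster convergence claimed above would come from.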

    Gunrock: GPU Graph Analytics

    For large-scale graph analytics on the GPU, the irregularity of data access and control flow, and the complexity of programming GPUs, have presented two significant challenges to developing a programmable high-performance graph library. "Gunrock", our graph-processing system designed specifically for the GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on operations on a vertex or edge frontier. Gunrock achieves a balance between performance and expressiveness by coupling high-performance GPU computing primitives and optimization strategies with a high-level programming model that allows programmers to quickly develop new graph primitives with small code size and minimal GPU programming knowledge. We characterize the performance of various optimization strategies and evaluate Gunrock's overall performance on different GPU architectures across a wide range of graph primitives, from traversal-based and ranking algorithms to triangle counting and bipartite-graph-based algorithms. The results show that on a single GPU, Gunrock has on average at least an order of magnitude speedup over Boost and PowerGraph, comparable performance to the fastest GPU hardwired primitives and CPU shared-memory graph libraries such as Ligra and Galois, and better performance than any other GPU high-level graph library.
    Comment: 52 pages; invited paper to ACM Transactions on Parallel Computing (TOPC); an extended version of the PPoPP'16 paper "Gunrock: A High-Performance Graph Processing Library on the GPU"
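
    As a hedged illustration of the frontier-centric abstraction described above, the sketch below composes a toy BFS from advance and filter steps. These are plain-Python stand-ins with invented signatures; Gunrock's actual operators run as GPU kernels over compacted frontiers:

```python
# Toy BFS composed from frontier operators in the style Gunrock describes:
# 'advance' expands a frontier along edges, 'filter' compacts it.

def advance(adj, frontier):
    """Expand every vertex in the frontier to its neighbors."""
    return [u for v in frontier for u in adj.get(v, ())]

def filter_unvisited(candidates, visited):
    """Keep only first-time visits, marking them as visited."""
    out = []
    for u in candidates:
        if u not in visited:
            visited.add(u)
            out.append(u)
    return out

adj = {0: [1, 2], 1: [3], 2: [3], 3: []}
frontier, visited, depth = [0], {0}, 0
while frontier:
    print(f"depth {depth}: {frontier}")
    frontier = filter_unvisited(advance(adj, frontier), visited)
    depth += 1
# depth 0: [0] / depth 1: [1, 2] / depth 2: [3]
```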

    A Partition-centric Distributed Algorithm for Identifying Euler Circuits in Large Graphs

    Finding an Eulerian circuit in a graph is a classic problem, but one inadequately explored for parallel computation. With such circuits finding use in neuroscience and the Internet of Things for large graphs, designing a distributed algorithm for finding the Euler circuit is important. Existing parallel algorithms are impractical for commodity clusters and Clouds. We propose a novel partition-centric algorithm to find the Euler circuit over large graphs partitioned across distributed machines and executed iteratively using a Bulk Synchronous Parallel (BSP) model. The algorithm finds partial paths and cycles within each partition, and refines these into longer paths by recursively merging the partitions. We describe the algorithm, analyze its complexity, validate it on Apache Spark for large graphs, and offer experimental results. We also identify memory bottlenecks in the algorithm and propose an enhanced design to address them.
    Comment: To appear in Proceedings of the 5th IEEE International Workshop on High-Performance Big Data, Deep Learning, and Cloud Computing, in conjunction with the 33rd IEEE International Parallel and Distributed Processing Symposium (IPDPS 2019), Rio de Janeiro, Brazil, May 20th, 2019
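
    A minimal sketch of the per-partition step, assuming a Hierholzer-style greedy stitching of each partition's local edges into edge-disjoint trails (function names are illustrative, not from the paper). Later BSP supersteps would then recursively merge these partial paths across partitions:

```python
# Greedily stitch one partition's local edges into partial paths/cycles,
# leaving cross-partition merging to later BSP supersteps.
def partial_paths(edges):
    """Consume a partition's undirected edges into edge-disjoint trails."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    trails = []
    for start in list(adj):
        while adj.get(start):          # more unused edges at this vertex
            trail, v = [start], start
            while adj.get(v):
                u = adj[v].pop()       # consume edge (v, u) ...
                adj[u].remove(v)       # ... in both directions
                trail.append(u)
                v = u
            trails.append(trail)
    return trails

# One partition owning a small cycle plus a pendant edge toward vertex 9,
# which lives in another partition.
print(partial_paths([(0, 1), (1, 2), (2, 0), (2, 9)]))
# [[0, 2, 9], [0, 1, 2]] -- two partial trails awaiting merging
```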

    A Survey of Graph Pre-processing Methods: From Algorithmic to Hardware Perspectives

    Graph-related applications have experienced significant growth in academia and industry, driven by the powerful representation capabilities of graphs. However, efficiently executing these applications faces various challenges, such as load imbalance, random memory access, etc. To address these challenges, researchers have proposed various acceleration systems, including software frameworks and hardware accelerators, all of which incorporate graph pre-processing (GPP). GPP serves as a preparatory step before the formal execution of applications, involving techniques such as sampling, reordering, etc. However, GPP execution often remains overlooked, as the primary focus is directed towards enhancing graph applications themselves. This oversight is concerning, especially considering the explosive growth of real-world graph data, where GPP becomes essential and can even dominate system running overhead. Furthermore, GPP methods exhibit significant variations across devices and applications due to high customization. Unfortunately, no comprehensive work systematically summarizes GPP. To address this gap and foster a better understanding of GPP, we present a comprehensive survey dedicated to this area. We propose a double-level taxonomy of GPP, considering both algorithmic and hardware perspectives. By listing relevant works, we illustrate our taxonomy and conduct a thorough analysis and summary of diverse GPP techniques. Lastly, we discuss challenges in GPP and potential future directions.
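
    To make GPP concrete, here is one small example of the kind of technique such a survey catalogs: degree-based vertex reordering, which renumbers hub vertices first to improve locality for the downstream application. A generic sketch of one common method, not any particular system's implementation:

```python
# Degree-based reordering: relabel high-degree (hub) vertices with the
# smallest ids so later graph processing touches them with better locality.
def degree_reorder(adj):
    """Return old->new id map (hubs first) and the relabeled adjacency."""
    order = sorted(adj, key=lambda v: len(adj[v]), reverse=True)
    remap = {old: new for new, old in enumerate(order)}
    new_adj = {remap[v]: sorted(remap[u] for u in nbrs)
               for v, nbrs in adj.items()}
    return remap, new_adj

adj = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}
remap, new_adj = degree_reorder(adj)
print(remap)     # {1: 0, 0: 1, 2: 2, 3: 3} -- the hub becomes id 0
print(new_adj)
```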

    Dynamic re-optimization techniques for stream processing engines and object stores

    Large-scale data storage and processing systems are strongly motivated by the need to store and analyze massive datasets. The complexity of a large class of these systems is rooted in their distributed nature, extreme scale, need for real-time response, and streaming nature. The use of these systems in multi-tenant cloud environments with potential resource interference necessitates fine-grained monitoring and control. In this dissertation, we present efficient, dynamic techniques for re-optimizing stream-processing systems and transactional object-storage systems.

    In the context of stream-processing systems, we present VAYU, a per-topology controller. VAYU uses novel methods and protocols for dynamic, network-aware tuple routing in the dataflow. We show that the feedback-driven controller in VAYU helps achieve high pipeline throughput over long execution periods, as it dynamically detects and diagnoses any pipeline bottlenecks. We present novel heuristics to optimize overlays for group communication operations in the streaming model.

    In the context of object-storage systems, we present M-Lock, a novel lock-localization service for distributed transaction protocols on scale-out object stores to increase transaction throughput. Lock localization refers to the dynamic migration and partitioning of locks across nodes in the scale-out store to reduce cross-partition acquisition of locks. The service leverages the observed object-access patterns to achieve lock clustering and deliver high performance. We also present TransMR, a framework that uses distributed, transactional object stores to orchestrate and execute asynchronous components in amorphous data-parallel applications on scale-out architectures.
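
    As a loose illustration of the lock-localization idea behind M-Lock (the data shapes and placement policy here are assumptions, far simpler than the dissertation's protocol), migrating each lock to the node that acquires it most often turns most acquisitions into local ones:

```python
# Toy lock localization: place each lock on its most frequent acquirer.
from collections import Counter, defaultdict

def localize(acquisitions):
    """acquisitions: iterable of (lock_id, node_id) observed accesses.
    Returns a lock -> home-node placement from the access pattern."""
    counts = defaultdict(Counter)
    for lock, node in acquisitions:
        counts[lock][node] += 1
    return {lock: c.most_common(1)[0][0] for lock, c in counts.items()}

trace = [("L1", "n0"), ("L1", "n0"), ("L1", "n2"), ("L2", "n2"), ("L2", "n2")]
placement = localize(trace)
print(placement)  # {'L1': 'n0', 'L2': 'n2'}
remote = sum(placement[l] != n for l, n in trace)
print(f"remote acquisitions after migration: {remote} of {len(trace)}")
```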

    Modernizing Parallel Code with Pattern Analysis


    Time and Memory Efficient Parallel Algorithm for Structural Graph Summaries and two Extensions to Incremental Summarization and k-Bisimulation for Long k-Chaining

    We developed a flexible parallel algorithm for graph summarization based on vertex-centric programming and parameterized message passing. The base algorithm supports infinitely many structural graph summary models defined in a formal language. An extension of the parallel base algorithm allows incremental graph summarization. In this paper, we prove that the incremental algorithm is correct and show that updates are performed in time O(Δ·d^k), where Δ is the number of additions, deletions, and modifications to the input graph, d the maximum degree, and k the maximum distance in the subgraphs considered. Although the iterative algorithm supports values of k > 1, it requires nested data structures for the message passing that are memory-inefficient. Thus, we extended the base summarization algorithm with a hash-based messaging mechanism to support a scalable iterative computation of graph summarizations based on k-bisimulation for arbitrary k. We empirically evaluate the performance of our algorithms using benchmark and real-world datasets. The incremental algorithm almost always outperforms the batch computation. We observe in our experiments that the incremental algorithm is faster even in cases when 50% of the graph database changes from one version to the next. The incremental computation requires a three-layered hash index, which has a low memory overhead of only 8% (±1%). Finally, the incremental summarization algorithm outperforms the batch algorithm even with fewer cores. The iterative parallel k-bisimulation algorithm computes summaries on graphs with over 10M edges within seconds. We show that the algorithm processes graphs of 100+M edges within a few minutes while having a moderate memory consumption of <150 GB. For the largest BSBM1B dataset with 1 billion edges, it computes the k = 10 bisimulation in under an hour.
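
    A minimal sketch of hash-based k-bisimulation in the spirit of the messaging mechanism described above: each round, a vertex's hash is recomputed from its label and its neighbors' previous hashes, so vertices with identical k-hop structure end up with identical hashes. A generic illustration, not the paper's exact algorithm:

```python
# Hash-based k-bisimulation: k rounds of hashing a vertex's label together
# with its out-neighbors' current hashes.
import hashlib

def khash(label, neighbor_hashes):
    """Hash a vertex's label with the multiset of its neighbors' hashes."""
    data = label + "|" + ",".join(sorted(neighbor_hashes))
    return hashlib.sha256(data.encode()).hexdigest()[:12]

def k_bisimulation(labels, adj, k):
    # Round 0: hash of the vertex label alone.
    h = {v: khash(labels[v], []) for v in labels}
    for _ in range(k):
        # Each round extends the structural view by one hop.
        h = {v: khash(labels[v], [h[u] for u in adj.get(v, ())])
             for v in labels}
    return h

labels = {v: "person" for v in "abcd"}
adj = {"a": ["b"], "b": ["c"], "c": [], "d": []}
h = k_bisimulation(labels, adj, k=2)
print(h["c"] == h["d"])  # True: identical labels and 2-hop structure
print(h["a"] == h["b"])  # False: their 2-hop neighborhoods differ
```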