1,914 research outputs found
Parallel Processing of Large Graphs
More and more large data collections are gathered worldwide in various IT
systems. Many of them possess the networked nature and need to be processed and
analysed as graph structures. Due to their size they require very often usage
of parallel paradigm for efficient computation. Three parallel techniques have
been compared in the paper: MapReduce, its map-side join extension and Bulk
Synchronous Parallel (BSP). They are implemented for two different graph
problems: calculation of single source shortest paths (SSSP) and collective
classification of graph nodes by means of relational influence propagation
(RIP). The methods and algorithms are applied to several network datasets
differing in size and structural profile, originating from three domains:
telecommunication, multimedia and microblog. The results revealed that
iterative graph processing with the BSP implementation always and
significantly, even up to 10 times outperforms MapReduce, especially for
algorithms with many iterations and sparse communication. Also MapReduce
extension based on map-side join usually noticeably presents better efficiency,
although not as much as BSP. Nevertheless, MapReduce still remains the good
alternative for enormous networks, whose data structures do not fit in local
memories.Comment: Preprint submitted to Future Generation Computer System
Maiter: An Asynchronous Graph Processing Framework for Delta-based Accumulative Iterative Computation
Myriad of graph-based algorithms in machine learning and data mining require
parsing relational data iteratively. These algorithms are implemented in a
large-scale distributed environment in order to scale to massive data sets. To
accelerate these large-scale graph-based iterative computations, we propose
delta-based accumulative iterative computation (DAIC). Different from
traditional iterative computations, which iteratively update the result based
on the result from the previous iteration, DAIC updates the result by
accumulating the "changes" between iterations. By DAIC, we can process only the
"changes" to avoid the negligible updates. Furthermore, we can perform DAIC
asynchronously to bypass the high-cost synchronous barriers in heterogeneous
distributed environments. Based on the DAIC model, we design and implement an
asynchronous graph processing framework, Maiter. We evaluate Maiter on local
cluster as well as on Amazon EC2 Cloud. The results show that Maiter achieves
as much as 60x speedup over Hadoop and outperforms other state-of-the-art
frameworks.Comment: ScienceCloud 2012, TKDE 201
Astronomy in the Cloud: Using MapReduce for Image Coaddition
In the coming decade, astronomical surveys of the sky will generate tens of
terabytes of images and detect hundreds of millions of sources every night. The
study of these sources will involve computation challenges such as anomaly
detection and classification, and moving object tracking. Since such studies
benefit from the highest quality data, methods such as image coaddition
(stacking) will be a critical preprocessing step prior to scientific
investigation. With a requirement that these images be analyzed on a nightly
basis to identify moving sources or transient objects, these data streams
present many computational challenges. Given the quantity of data involved, the
computational load of these problems can only be addressed by distributing the
workload over a large number of nodes. However, the high data throughput
demanded by these applications may present scalability challenges for certain
storage architectures. One scalable data-processing method that has emerged in
recent years is MapReduce, and in this paper we focus on its popular
open-source implementation called Hadoop. In the Hadoop framework, the data is
partitioned among storage attached directly to worker nodes, and the processing
workload is scheduled in parallel on the nodes that contain the required input
data. A further motivation for using Hadoop is that it allows us to exploit
cloud computing resources, e.g., Amazon's EC2. We report on our experience
implementing a scalable image-processing pipeline for the SDSS imaging database
using Hadoop. This multi-terabyte imaging dataset provides a good testbed for
algorithm development since its scope and structure approximate future surveys.
First, we describe MapReduce and how we adapted image coaddition to the
MapReduce framework. Then we describe a number of optimizations to our basic
approach and report experimental results comparing their performance.Comment: 31 pages, 11 figures, 2 table
GiViP: A Visual Profiler for Distributed Graph Processing Systems
Analyzing large-scale graphs provides valuable insights in different
application scenarios. While many graph processing systems working on top of
distributed infrastructures have been proposed to deal with big graphs, the
tasks of profiling and debugging their massive computations remain time
consuming and error-prone. This paper presents GiViP, a visual profiler for
distributed graph processing systems based on a Pregel-like computation model.
GiViP captures the huge amount of messages exchanged throughout a computation
and provides an interactive user interface for the visual analysis of the
collected data. We show how to take advantage of GiViP to detect anomalies
related to the computation and to the infrastructure, such as slow computing
units and anomalous message patterns.Comment: Appears in the Proceedings of the 25th International Symposium on
Graph Drawing and Network Visualization (GD 2017
- …