Zorro: zero-cost reactive failure recovery in distributed graph processing

Abstract

Distributed graph processing frameworks have become increasingly popular for processing large graphs. However, existing frameworks either lack the ability to recover from failures or support only proactive recovery methods. Proactive recovery methods such as checkpointing incur high overheads during failure-free execution, making failure recovery an expensive operation. Our hypothesis is that reactive recovery from failures in graph processing, which provides a zero-overhead alternative to expensive proactive failure recovery mechanisms, is feasible, novel, and useful. We support the hypothesis with Zorro, a recovery protocol that reactively recovers from machine failures. Zorro utilizes the vertex replication inherent in existing graph processing frameworks to collectively rebuild the state of failed servers. Surviving servers transfer the states of inherently replicated vertices back to replacement servers, which rebuild their state using the received values. This fast recovery mechanism prioritizes high-degree vertices, ensuring high accuracy of graph processing applications. We have implemented our approach in two existing distributed graph processing frameworks: LFGraph and PowerGraph. Experiments using graph applications on real-world graphs show that Zorro is able to recover 87-92% of the graph state when half the cluster fails and maintains at least 97% accuracy in all experimental failure scenarios.
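
The recovery idea described above can be illustrated with a minimal Python sketch. It is not the Zorro implementation: the function name `rebuild_failed_state`, the dictionary layout, and the toy values are all assumptions made for illustration. Each surviving server holds replicated copies of vertices owned elsewhere; after a failure it sends the copies that belong to failed servers, highest degree first, and any vertex with no surviving replica is reported as lost.

```python
from typing import Dict, List, Set, Tuple


def rebuild_failed_state(
    replicas: Dict[int, Dict[int, float]],  # server id -> {vertex id: replicated value}
    owner: Dict[int, int],                  # vertex id -> id of the server that owns it
    degree: Dict[int, int],                 # vertex id -> vertex degree
    failed: Set[int],                       # ids of failed servers
) -> Tuple[Dict[int, Dict[int, float]], List[int]]:
    """Hypothetical sketch: rebuild failed servers' state from surviving replicas."""
    recovered: Dict[int, Dict[int, float]] = {s: {} for s in failed}

    for server, copies in replicas.items():
        if server in failed:
            continue  # replicas held by a failed server are gone too
        # Transfer high-degree vertices first (assumed prioritization order).
        for v in sorted(copies, key=lambda u: degree.get(u, 0), reverse=True):
            target = owner[v]
            if target in failed and v not in recovered[target]:
                recovered[target][v] = copies[v]

    # Vertices owned by failed servers that had no surviving replica.
    lost = sorted(
        v for v, s in owner.items() if s in failed and v not in recovered[s]
    )
    return recovered, lost


# Toy cluster: server 1 fails; server 0 holds replicas of vertices 3 and 5.
owner = {1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
degree = {1: 2, 2: 1, 3: 3, 4: 1, 5: 7}
replicas = {0: {3: 0.5, 5: 0.9}, 1: {1: 0.2}}

recovered, lost = rebuild_failed_state(replicas, owner, degree, failed={1})
```

In this toy run, vertices 5 and 3 are restored on the replacement for server 1 (vertex 5 first, because it has the higher degree), while vertex 4 has no surviving replica and stays lost. In a single-process sketch the transfer order does not change the final result; it matters only when transfers take time and the application resumes before all of them complete.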
