Reliable Actors with Retry Orchestration

Abstract

Enterprise cloud developers have to build applications that are resilient to failures and interruptions. We advocate for, formalize, implement, and evaluate a simple, albeit effective, fault-tolerant programming model for the cloud based on actors, reliable message delivery, and retry orchestration. Our model guarantees that (1) failed actor invocations are retried until success, (2) in a distributed chain of invocations only the last one may be retried, (3) pending synchronous invocations with a failed caller are automatically cancelled. These guarantees make it possible to productively develop fault-tolerant distributed applications ranging from classic problems of concurrency theory to complex enterprise applications. Built as a service mesh, our runtime system can interface application components written in any programming language and scale with the application. We measure overhead relative to reliable message queues. Using an application inspired by a typical enterprise scenario, we assess fault tolerance and the impact of fault recovery on application performance.Comment: 14 pages, 6 figure

    Similar works

    Full text

    thumbnail-image

    Available Versions