Distributed replication systems based on the replicated state machine model
have become ubiquitous as the foundation of modern database systems. To ensure
availability in the presence of faults, these systems must be able to
dynamically replace failed nodes with healthy ones via dynamic reconfiguration.
MongoDB is a document-oriented database with a distributed replication
mechanism derived from the Raft protocol. In this paper, we present
MongoRaftReconfig, a novel dynamic reconfiguration protocol for the MongoDB
replication system. MongoRaftReconfig utilizes a logless approach to managing
configuration state and decouples the processing of configuration changes from
the main database operation log. The protocol's design was influenced by
engineering constraints faced when attempting to redesign an unsafe, legacy
reconfiguration mechanism that existed previously in MongoDB. We provide a
safety proof of MongoRaftReconfig, along with a formal specification in TLA+.
To our knowledge, this is the first published safety proof and formal
specification of a reconfiguration protocol for a Raft-based system. We also
present results from model checking its safety properties on finite protocol
instances. Finally, we discuss the conceptual novelties of MongoRaftReconfig,
show how it can be understood as an optimized and generalized version of the
single-server reconfiguration algorithm of Raft, and present an experimental
evaluation of how its optimizations can provide performance benefits for
reconfigurations.

Comment: 35 pages, 2 figures