Raft
Overview
Raft is a consensus algorithm: it replicates a state machine across multiple nodes so that operations are applied consistently on all of them. Consensus algorithms provide fault tolerance by guaranteeing progress and consistency in the face of server failures, so long as some condition is upheld. In the case of Raft, the condition is that a majority of servers are online: e.g. so long as 3 of the 5 servers that comprise the Raft cluster stay online, we can still make progress.
Each Raft server is in one of three states (sketched below):
- Follower: accepts commands from the leader.
- Candidate: runs for election when, from its own point of view, there is no valid leader.
- Leader: relays commands from the client to the rest of the servers.
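As a rough illustration (the enum and its names are hypothetical, not necessarily the repo's actual classes), the three roles can be modeled as a simple enum that the rest of the server keys off:

```python
from enum import Enum, auto

class Role(Enum):
    FOLLOWER = auto()   # accepts commands from the leader
    CANDIDATE = auto()  # runs for election when it sees no valid leader
    LEADER = auto()     # relays client commands to the other servers
```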
Raft primarily uses the AppendEntries() and RequestVote() Remote Procedure Calls (RPCs, a fancy term for a function call that runs on another server). Because of the intricate, interwoven state logic, most Raft implementations use semaphores or locking. This implementation, however, is fully asynchronous and requires no locks.
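For reference, the two RPCs carry roughly the fields below, taken from the Raft paper; these dataclasses are a sketch, and the repo's data/ classes may name things differently:

```python
from dataclasses import dataclass

@dataclass
class Entry:
    term: int        # term in which the entry was created
    command: object  # opaque state-machine command

@dataclass
class AppendEntriesReq:
    term: int             # leader's term
    leader_id: int
    prev_log_index: int   # index of the entry immediately preceding the new ones
    prev_log_term: int
    entries: list[Entry]  # empty for a heartbeat
    leader_commit: int    # leader's commit index

@dataclass
class VoteReq:
    term: int             # candidate's term
    candidate_id: int
    last_log_index: int   # index of the candidate's last log entry
    last_log_term: int
```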
This version of Raft uses ROS 2 and the same functional methodology as RT-DDS (paper to be linked when published). See the image below for how the servers are composed in a pub-sub manner, so that much of the per-server logic, e.g. responding to an RPC, is handled at the unary unmarshall/marshall layer.
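A minimal sketch of that pub-sub composition in rclpy (the topic names and the String message are illustrative stand-ins, not the repo's actual interfaces):

```python
import rclpy
from rclpy.node import Node
from std_msgs.msg import String  # stand-in; a real server would use typed RPC messages

class RaftServerNode(Node):
    def __init__(self, server_id: int):
        super().__init__(f'raft_server_{server_id}')
        # Outgoing RPCs are published; peers subscribe to this topic.
        self.pub = self.create_publisher(String, f'/raft/from_{server_id}', 10)
        # Incoming RPCs arrive via subscription; the unary unmarshall
        # logic runs inside the callback, before any n-ary logic.
        self.sub = self.create_subscription(String, '/raft/broadcast', self.on_msg, 10)

    def on_msg(self, msg: String) -> None:
        self.get_logger().info(f'received: {msg.data}')

def main():
    rclpy.init()
    node = RaftServerNode(0)
    rclpy.spin(node)

if __name__ == '__main__':
    main()
```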
Structure
The function space is partitioned by arity into two groups:
- unary: takes one input
- n-ary: takes more than one input
Each function may have state associated with it, which is encapsulated in the corresponding class for simplicity and to keep the code object-oriented, since this is a Python implementation.
All data (which flows through the pub/sub graph) and all state live in files separate from the functions. This significantly decreases lines of code and allows for a more scalable design.
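To make the partition concrete, here is one hypothetical shape for a unary function and an n-ary function (all names invented for illustration; real Raft vote logic also checks votedFor and log freshness):

```python
class VoteReqHandler:
    """Unary: consumes a single incoming vote request."""
    def __init__(self, state):
        self.state = state  # per-function state is encapsulated with the function

    def __call__(self, req):
        # Simplified: grant only if the candidate's term is current.
        return req.term >= self.state.current_term

def leader(state, responses):
    """N-ary: acts once a condition holds over all inputs, e.g.
    advance commit_index after a majority of followers have acked."""
    acks = sum(1 for r in responses if r.success)
    if acks + 1 > state.num_servers // 2:  # +1 counts the leader itself
        state.commit_index += 1
```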
.
├── communication/ # handles transfer of data
├── data/
│ ├── append_entries_req.py
│ ├── append_entries_resp.py
│ ├── client_cmd_req.py
│ ├── client_cmd_resp.py
│ ├── entry.py
│ ├── vote_req.py
│ └── vote_resp.py
├── functions/
│ ├── n_ary/ # i.e. if forall X, then do Y
│ │ ├── _candidate.py
│ │ ├── _follower.py
│ │ ├── _leader.py
│ │ ├── client.py
│ │ └── server.py
│ └── unary/ # for handling messages; note the alphabetical name-grouping of RPC steps
│ ├── append_entries_req.py
│ ├── append_entries_req_resp.py
│ ├── append_entries_resp_recv.py
│ ├── client_cmd_req.py
│ ├── client_cmd_req_recv.py
│ ├── client_cmd_resp.py
│ ├── client_cmd_resp_recv.py
│ ├── recv.py
│ ├── send.py
│ ├── vote_req.py
│ ├── vote_req_resp.py
│ └── vote_resp_recv.py
└── state/ # server state
├── persistent.py
├── raft_server_state.py
├── state_machine.py
├── timer.py
└── volatile.py
Pseudocode
Given the clear decomposition of functions (unary and n-ary), state, data, and communication logic, each Raft server operates with the following (pseudocode) logic:
```python
# Stage 1: unary unmarshall inputs
for _input in inputs:
    common_unmarshall(_input)  # check term; update self.term if behind
    if _input is response:
        resp_recv(response, state)

# Stage 2: n-ary
n_ary = [follower, candidate, leader]
before = state.status
n_ary[before](state)
after = state.status
converted = (before != after)
if converted:
    n_ary[after](state)

# Stage 3: unary marshall subscribers (request or execute a response from their input)
for subscriber, corresponding_input in zip(subscribers, inputs):
    # OK to ignore a request, e.g. a new leader sending AppendEntries to an old leader
    if corresponding_input is request and should_respond(state, corresponding_input):
        req_resp(request, state)
    elif timeout_allows:  # i.e. election_timeout, heartbeat_timeout
        req_send(subscriber, state)
```
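Distilling Stage 2 into a runnable form (a sketch; the string statuses and dispatch table are assumptions, not the repo's exact code). Note that if the role converts during a tick, the new role's logic runs immediately in the same tick:

```python
def follower(state): ...   # placeholders for the real n-ary functions
def candidate(state): ...
def leader(state): ...

N_ARY = {'follower': follower, 'candidate': candidate, 'leader': leader}

def stage_two(state) -> None:
    before = state.status            # e.g. 'follower'
    N_ARY[before](state)             # run the current role's logic
    if state.status != before:       # the role converted during this tick...
        N_ARY[state.status](state)   # ...so run the new role's logic immediately
```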
Tests
This repo also includes a CLI parser which takes arguments or a config.json that can configure the following (a sample config.json is shown after the list):
- num_servers: int
- scheduled_downtime: list[(server_index, start_ms, end_ms)]
- scheduled_config_change: list[(change, time)]
- scheduled_partition: list[(set_A, set_B, ..., start_ms, end_ms)]
- writes: list[(int, time_ms)], for when the client writes a value change to the Raft
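A hypothetical config.json exercising these fields (the values and the change-string format are illustrative only):

```json
{
  "num_servers": 5,
  "scheduled_downtime": [[2, 1000, 3000]],
  "scheduled_config_change": [["add_server", 5000]],
  "scheduled_partition": [[[0, 1], [2, 3, 4], 2000, 4000]],
  "writes": [[42, 1500], [7, 6000]]
}
```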