Enough With All The Raft
50 points by 355E3B
It is weird that people are not taught to fear distributed systems and avoid them until no other choice remains. Instead it is fashionable to design for google-scale without any idea how much brain-power google retains to debug google-scale distributed systems.
It is also amusing that even the Google-scale system that powers most of Google's compute is not as complex as k8s.
Yeah I think many people don't realize that both the Borg Master and GFS Master ran on a single machine in memory for 10+ years [1]
e.g. Borg was rolled out in 2004 or so, and by 2014 I recall that Borg was still more or less the original architecture (unlike GFS, there wasn't a paper written about Borg until much later, and the paper wasn't written by the original authors of the software)
Paxos was used to elect the masters; it wasn't used to store the state of the cluster (which I think Kubernetes does with etcd?). The state of the cluster was in memory
That is, Google became the most profitable business of all time [2], without very much Paxos (and no Raft). The core ads database was in MySQL clusters forever, which was outside of GFS or Borg IIRC. YouTube also used MySQL forever.
It's true that hosting in multiple geographically-dispersed data centers was a huge problem 10-20 years ago, but Kubernetes doesn't seem to solve that either.
And apparently Amazon US-East is still a single point of failure for half of the Internet. So the problem still seems unsolved for 99% of apps (and maybe it doesn't need to be solved)
So:
Google clusters started in the era of 2-core CPUs IIRC! Now we have 128+ cores. It is definitely true you can speed up many workloads by 64x with such hardware -- that's the size of a small cluster.
By definition, the workloads that run well on a cluster will also run well on a single machine! Jeff Dean even said that a modern multi-core server is "just" a bunch of CPUs with really fast connections ... i.e. by distributing, you've already gotten away from the shared memory paradigm
[1] And I'm pretty sure the Colossus masters did too; Colossus being the successor to GFS. Although I think the metadata was in Bigtable. (This was a long time ago, so I could be wrong.)
I was curious if my memory about Borg was right, but, as mentioned, it was only described 10 years after its initial deployment:
In 2015, Borg did use Paxos, as described in the paper. But that was a recent development. The initial implementation was probably similar to GFS (but there's probably no way to know exactly how it worked other than asking people who worked on it)
Having a single master vastly simplifies our design
All metadata is kept in the master’s memory. The first two types (namespaces and file-to-chunk mapping) are also kept persistent by logging mutations to an operation log stored on the master’s local disk and replicated on remote machines. Using a log allows us to update the master state simply, reliably, and without risking inconsistencies in the event of a master crash.
i.e. this was simple C++ code that used the memory of the machine well, not fancy distributed consensus algorithms. Paxos/Chubby were only used for master election.
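To make that concrete, here's a minimal Go sketch of the pattern the paper describes -- all state in memory, mutations appended to a local operation log and replayed on startup. The types and log format are mine, not GFS's, and replication of the log to remote machines is elided.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
)

// Master holds all metadata in memory, as the GFS paper describes; the only
// durable artifact is an append-only operation log on local disk. Everything
// here is illustrative -- the real master also replicated the log remotely.
type Master struct {
	chunks map[string][]string // file -> chunk handles, entirely in memory
	log    *os.File            // append-only operation log
}

// NewMaster rebuilds the in-memory state by replaying the operation log.
func NewMaster(logPath string) (*Master, error) {
	f, err := os.OpenFile(logPath, os.O_CREATE|os.O_RDWR|os.O_APPEND, 0o644)
	if err != nil {
		return nil, err
	}
	m := &Master{chunks: map[string][]string{}, log: f}
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		var file, chunk string
		if _, err := fmt.Sscanf(sc.Text(), "%s %s", &file, &chunk); err == nil {
			m.chunks[file] = append(m.chunks[file], chunk)
		}
	}
	return m, sc.Err()
}

// AddChunk logs the mutation and fsyncs before applying it in memory, so a
// crash can always be recovered from by replaying the log.
func (m *Master) AddChunk(file, chunk string) error {
	if _, err := fmt.Fprintf(m.log, "%s %s\n", file, chunk); err != nil {
		return err
	}
	if err := m.log.Sync(); err != nil {
		return err
	}
	m.chunks[file] = append(m.chunks[file], chunk)
	return nil
}

func main() {
	m, err := NewMaster("/tmp/oplog")
	if err != nil {
		panic(err)
	}
	if err := m.AddChunk("/gfs/foo", "chunk-0001"); err != nil {
		panic(err)
	}
	fmt.Println(m.chunks)
}
```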
It is weird that people are not taught to fear distributed systems and avoid them until no other choice remains.
What qualifies as a "distributed system"?
If I have a web app with, say, a PostgreSQL DB server and a Redis cache, does that count? Because even in that simple case I have to think about how to keep the DB and cache agreeing with each other on the current state.
Heck, just with the DB and no cache I have to worry about multiple users interacting and modifying data simultaneously and figure out what they each should see as the current state.
So I'd flip your comment around: the kinds of problems that you're worried about exist even in very simple systems that people don't think of as "distributed", and learning about them and learning strategies for them early on helps you avoid the terror of a sudden massive learning curve later on.
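As a concrete (and entirely hypothetical) illustration of the DB-plus-cache case: here's a cache-aside read path sketched in Go, with the Redis and Postgres calls stubbed out. Even this tiny pattern has a window where a concurrent write plus cache invalidation can leave stale data in the cache until the TTL expires.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

type User struct {
	ID   int64
	Name string
}

var errCacheMiss = errors.New("cache miss")

// Stubs standing in for real Redis / Postgres clients -- purely illustrative.
func cacheGet(ctx context.Context, id int64) (User, error)                    { return User{}, errCacheMiss }
func cacheSet(ctx context.Context, id int64, u User, ttl time.Duration) error { return nil }
func dbSelect(ctx context.Context, id int64) (User, error)                    { return User{ID: id, Name: "someone"}, nil }

// getUser is the classic cache-aside read path.
func getUser(ctx context.Context, id int64) (User, error) {
	if u, err := cacheGet(ctx, id); err == nil {
		return u, nil // cache hit
	}
	u, err := dbSelect(ctx, id) // cache miss: read from the DB
	if err != nil {
		return User{}, err
	}
	// A concurrent UPDATE followed by a cache invalidation can land right
	// here; the Set below then re-populates the cache with stale data that
	// survives until the TTL expires. That is already a distributed-systems
	// consistency problem, with only two moving parts.
	_ = cacheSet(ctx, id, u, 5*time.Minute)
	return u, nil
}

func main() {
	u, err := getUser(context.Background(), 42)
	fmt.Println(u, err)
}
```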
Lots of people work on these kinds of web applications without any real knowledge of distributed systems, or even awareness that they are working on one. I think that's one of the reasons the term "feral concurrency" exists.
Paxos and Raft are algorithms meant for consistent distributed state, which is both a hard computer science problem and a hard engineering problem. I believe the first-ever deployment of such algorithms was at Google in the early 2000s [1]
Postgres and Redis aren't concerned with consistent distributed state. They do have replication, but it follows a leader/follower (primary/replica) strategy.
In systems that use Paxos or Raft, any single node -- including the leader -- can go down and the state machine keeps making progress, as long as a majority survives. This is not true for Postgres or Redis -- the primary is a single point of failure
So I would say there are two kinds of distributed systems: those that genuinely need consistent distributed state (Paxos/Raft territory), and those that get by with a single source of truth plus replication.
But a main point of this thread is that you don't need consistent distributed state for very many things in practice. It is rightly a niche topic.
For most distributed systems, you can and should use simpler mechanisms
Notably, Spanner is a distributed database (aiming to avoid single points of failure) but it didn't rely on Paxos/Raft, because those algorithms are expensive. Instead they used special time hardware to balance consistent state and performance
Related: https://en.wikipedia.org/wiki/Paxos_(computer_science)#Production_use_of_Paxos
[1] Hm actually Gemini claimed there was a Petal File system at DEC that implemented Paxos first. And then came Chubby by Mike Burrows, who worked at DEC before working at Google
Spanner is a distributed database (…) but it didn't rely on Paxos/Raft,
https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf
second sentence:
At the highest level of abstraction, it is a database that shards data across many sets of Paxos state machines in data-centers spread all over the world.
OK I should have said it relies on Paxos + atomic clocks / GPS. From section 3:
The underlying time references used by TrueTime are GPS and atomic clocks.
So that is not "just Paxos" -- i.e. if someone wants to "just use Paxos/Raft for everything", that is probably not a good design.
If you want to do a Google-like design, you may also need special hardware in your data center. (Although I'm not sure what CockroachDB does.)
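To make the "special hardware" point concrete, here's a toy Go sketch of the commit-wait idea from the Spanner paper: TrueTime reports a bounded uncertainty interval, and a transaction isn't acknowledged until its commit timestamp is definitely in the past. The names and the simulated epsilon are mine; real TrueTime derives the bound from the GPS and atomic clocks mentioned above.

```go
package main

import (
	"fmt"
	"time"
)

// Interval models TrueTime's answer to "what time is it?": the true time is
// guaranteed to lie somewhere in [Earliest, Latest].
type Interval struct {
	Earliest, Latest time.Time
}

// now simulates TT.now() with a fixed uncertainty bound epsilon. Real
// TrueTime gets epsilon (a few milliseconds in the paper) from GPS
// receivers and atomic clocks in each datacenter.
func now(epsilon time.Duration) Interval {
	t := time.Now()
	return Interval{Earliest: t.Add(-epsilon), Latest: t.Add(epsilon)}
}

// commitWait picks a commit timestamp and then waits until that timestamp is
// certainly in the past everywhere (the paper's "TT.after(s)" condition).
// Waiting out the uncertainty is what buys externally consistent timestamps.
func commitWait(epsilon time.Duration) time.Time {
	s := now(epsilon).Latest // commit timestamp
	for now(epsilon).Earliest.Before(s) {
		time.Sleep(epsilon / 4)
	}
	return s
}

func main() {
	s := commitWait(7 * time.Millisecond)
	fmt.Println("safe to acknowledge commit at", s)
}
```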
The distributed consensus algorithms are useful in limited circumstances, which is what the original post is saying
Postgres and Redis aren't concerned with consistent distributed state.
An application which has both a Postgres DB and a Redis cache very much is concerned with consistent distributed state, which is the claim I was making. Similarly, a multi-user app in which two or more users might modify the same data at the same time in Postgres (or any other DB) is concerned with consistent distributed state (on remote clients). What do you think all those complicated transaction isolation levels and locking primitives in your DB engine are for?
Even very simple web applications like these display properties of distributed systems, which is entirely the point I'm getting at. Rather than telling people that a "distributed system" is a strange and exotic thing they'll likely never encounter and should be terrified of, we should teach them that most of the things we build are, in at least some sense, distributed systems.
Or more succinctly: "distributed systems" is a term which does not only consist of "things which have leader elections".
You asked:
What qualifies as a "distributed system"?
Postgres and Redis are often part of distributed systems, but they don't solve the problem of consistent distributed state.
Specifically, you can't set up 5 Postgres servers and have any 2 of them go down without incurring downtime of the system itself.
That's what Paxos and Raft solve
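For concreteness, here's the arithmetic behind that "5 servers, 2 failures" example: a majority-quorum system of n nodes needs n/2+1 votes, so it tolerates floor((n-1)/2) failures.

```go
package main

import "fmt"

// quorum is the number of votes a majority-quorum (Paxos/Raft-style) group
// of n nodes needs; faults is how many nodes can be down while the group
// still makes progress.
func quorum(n int) int { return n/2 + 1 }
func faults(n int) int { return (n - 1) / 2 }

func main() {
	for _, n := range []int{1, 3, 5, 7} {
		fmt.Printf("n=%d  quorum=%d  tolerates %d failures\n", n, quorum(n), faults(n))
	}
}
```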
But like I said, you don't need this for most distributed systems. You can use Postgres or Redis.
Repeating myself:
Or more succinctly: "distributed systems" is a term which does not only consist of "things which have leader elections".
If you thought I disagreed with that, then you misread my comments
I said
there are two kinds of distributed systems
A single source of truth is something you do not get in a distributed system. You can also manage to lose it in a simpler system. Good point.
In this case, concurrent access is another problem to try to avoid if possible. Note that we can make your simple Postgres example into a distributed system with replication... and make things even more complex
It is also amusing that even the Google-scale system that powers most of Google's compute is not as complex as k8s.
While I can’t speak to Google, this is definitely true at other big tech (where I previously worked, not speaking about or for my current employer).
Internally at these companies, nobody would touch k8s. It’s just awful in every way. Not even remotely a consideration.
In AWS terms, you can build just about anything you want with some combination of ECS Fargate, DDB, S3, SQS/SNS, and then some networking basics like Route53, ALBs, NLBs. An API Gateway if you want to pay for it. Maybe some Redshift if you need OLAP. That’s it.
It is also amusing that even the Google-scale system that powers most of Google's compute is not as complex as k8s.
Interesting. Is there someplace to read more about this?
This doesn't directly answer the question, but it's a detailed post by an engineer who worked on both Borg and Kubernetes:
The Technical History of Kubernetes
https://itnext.io/the-technical-history-of-kubernetes-2fe1988b522a
A highly read-worthy write-up for sure; but I also feel it is somewhat answering its own question already:
There is a correct counter-argument here, and it’s that you cannot solve consensus with two failures using three nodes. So when raft is electing a new leader or changing its replicas, it can do that itself. Reconfiguration-based replication needs some external consensus service to lean on.
Not requiring an external consensus service is a very attractive property of Raft.
Am I correct in saying this is the difference between requiring a consensus algorithm and a replication algorithm?
I think Raft, Paxos and friends are often introduced and discussed as consensus algorithms but they can be adapted for replication as well. The post is comparing Raft against others as a replication algorithm.
I think this is a subtlety I didn’t appreciate until recently and something I wish Alex’s post explained a bit more. As someone below mentions [1] you can combine a replication algorithm that isn’t a consensus algorithm (like chain replication) with a consensus algorithm that requires more coordination. Then you can end up with a system that meets your design requirements and is performant.
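For anyone unfamiliar with it, here's a toy Go sketch of chain replication as a pure replication protocol: writes enter at the head and flow node-to-node to the tail, reads are served by the tail, and deciding who the head/middle/tail are is exactly the part you delegate to a separate consensus or configuration service (not shown). Names and structure are illustrative, not from any particular implementation.

```go
package main

import "fmt"

type node struct {
	name string
	data map[string]string
	next *node // nil for the tail
}

// write is applied locally, then forwarded down the chain; only the tail
// acknowledges, so an acked write is on every replica.
func (n *node) write(key, val string) {
	n.data[key] = val
	if n.next != nil {
		n.next.write(key, val)
		return
	}
	fmt.Printf("tail %s acks write %s=%s\n", n.name, key, val)
}

// read is forwarded to the tail, which only ever holds fully replicated data.
func (n *node) read(key string) string {
	if n.next != nil {
		return n.next.read(key)
	}
	return n.data[key]
}

func main() {
	tail := &node{name: "c", data: map[string]string{}}
	mid := &node{name: "b", data: map[string]string{}, next: tail}
	head := &node{name: "a", data: map[string]string{}, next: mid}
	head.write("x", "1")
	fmt.Println("read via tail:", head.read("x"))
}
```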
Consensus is super expensive in a distributed system. Having a mechanism to achieve initial consensus is important (if the distributed system is in any way dynamic in terms of servers/membership), but once initial consensus is achieved, everything should run based on finite state automata, so that communication is not necessary in most cases to make decisions. The only exception to that rule is dealing with multiple simultaneous node failure (or a few variants of failure that appear to be multiple simultaneous node failure), in which case the surviving nodes should re-achieve consensus on the shared state that they use for finite state automata.
I was part of the team that released the first (AFAIK) dynamically-sharded scale-out HA-replicated clustered data service (24 years ago next month), and we had a saying about consensus: "If you need to go to committee, you've already lost." But we ended up being unable to avoid consensus for reliably diagnosing and recovering from split brain and other forms of simultaneous multi-node failure. I'm not saying that it's impossible, but we couldn't find a better solution.
Consensus is super expensive in a distributed system
we had a saying about consensus: "If you need to go to committee, you've already lost."
Yeah I meant the distinction between consensus and replication seems important and they are sometimes conflated. I hypothesize that conflation leads to the overuse of Raft for replication the author discusses.
I’m curious if others agree?
These are just my musings though! Could be wrong
I agree! To go a step further, I think separating replication from consensus gives you similar benefits as a control/data plane split, in that it pushes you to implement system(s) with greater focus/separation of responsibilities.
Eh. I don’t disagree with the talk’s arguments so much as I don’t think they’re a big deal for Raft’s frequent use cases. I rarely see it used for a primary data store (which I agree is not a great fit), but instead acting as an “external consensus service” for leader election and configuration data. In those cases I don’t think storage efficiency or tail latency are as important as being robust, strongly-consistent, and able to recover automatically from failures.
Compared to other consensus algorithms, Raft’s advantages are not so much theoretical, but are mostly in being a strongly-consistent consensus algorithm which is easy to understand (a goal it was literally developed for) and which has existing high-quality implementations.
Indeed, Raft’s popularity may be an accident of history in that it was used by the CoreOS etcd project which was later adopted by Kubernetes. IIRC (I can’t find the reference), etcd adopted Raft instead of Multi-Paxos specifically because it was easy to understand and implement. Which seems like a strong enough reason for its use, theoretical arguments aside.
etcd was in turn popular because it offered an easy HTTP API and was simple to operate… unlike Zookeeper which was the most popular strongly-consistent key-value store at the time, and which was a pain in the butt to deal with in almost any capacity.
(I may be biased, I used to get paged semi-regularly due to Zookeeper shenanigans and I’m still a bit salty about it.)
I haven't seen it said, so I will mention this talk which points out that a huge point of Raft is that it's (in theory anyways) much easier to implement than other options. Built for understandability, as they say.
One might say "well that's fine for toy projects but what about anything more serious" but well... there's a certain value in being able to hold some fundamental properties of your system in your head and to be convinced of their safety.
I love this blog post/talk and have been sending it to anyone who’d be tangentially interested in this subject. I don’t know if Alex is on lobsters, but I’m immensely thankful for him doing a literature review of this space.
Speaking generally, I think replication-based systems are severely underrated: FoundationDB and AWS’ internal journal database are some of the largest deployments of “fancy” replication-based systems that I know (e.g., they’re using something like chain replication or CRAQ, but they have their own Paxos implementation to handle consensus). I’ve been turning this idea over in my head for a bit, but if you were developing a replication-based database in late 2025, external consensus is a lot easier to handle nowadays—you’d just (famous last words…) take a dependency on something like etcd or object storage, take a sans-io approach to networking, and test with something like Turmoil, and you’ve got the building blocks for something that doesn’t suck. Plus, if your consensus system is based on object storage, you’ve got shockingly good portability across cloud providers and pretty darn good economics at the cost of slightly slower reconfigurations.
(I do not have the time to implement this myself, but I do have the time to write this comment on the off-chance I can nerdsnipe someone into writing this!)
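On the off-chance it helps the nerdsniping: a minimal, hypothetical Go sketch of "lean on etcd for the consensus part", using etcd's clientv3 concurrency package for leader election only. The endpoint, key prefix, and node name are made up, and the actual replication data path is entirely elided.

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // assumed local etcd
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// The session's lease expires if this process dies, releasing leadership.
	sess, err := concurrency.NewSession(cli, concurrency.WithTTL(10))
	if err != nil {
		log.Fatal(err)
	}
	defer sess.Close()

	// Hypothetical key prefix and node name for illustration only.
	election := concurrency.NewElection(sess, "/my-replicated-db/leader")
	if err := election.Campaign(context.Background(), "node-1"); err != nil {
		log.Fatal(err)
	}

	// From here on, this node owns reconfiguration: picking replica chains,
	// bumping epochs, etc. The actual replication traffic never touches etcd.
	log.Println("node-1 is the configuration leader")
	<-sess.Done() // block until the lease is lost
}
```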
Amazing article. Raft is often a go-to choice because of its popularity, and I wish I saw it less in the wild.
Sure, let's go back to out-of-process ZooKeeper, because that never caused any Kafka downtime in enterprises.
This presentation was far from simple advice. Until there is a clear, evidence-backed prescription of "use this for this case, use that for that case", Raft is an excellent default choice.
Something that's not mentioned in this article is that Raft is by default not live in the face of network faults ( https://decentralizedthoughts.github.io/2020-12-12-raft-liveness-full-omission/ ), and this has caused a Cloudflare outage before.
There are some fixes mentioned in the article, but it's worth considering what types of network and non-crash faults you want to be able to handle, and whether you might want a Byzantine fault tolerant (BFT) algorithm instead (the blockchain hype era spawned some very scalable BFT algorithms), or maybe a cross-fault tolerant (XFT) one, though I've only seen that in this one paper: https://www.usenix.org/system/files/conference/osdi16/osdi16-liu.pdf
edit: I'll throw in a link to my favorite BFT algorithm/paper, which I think is quite understandable, though I often think in Merkle-DAGs so maybe it just fits my existing mental models: https://arxiv.org/pdf/2102.08325
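To make the liveness failure mode above a bit more concrete: one commonly cited mitigation is the Pre-Vote extension from Ongaro's Raft dissertation, where a node checks that an election could actually succeed before bumping its term and disrupting a healthy leader. Below is a rough, illustrative Go sketch; the struct and field names are mine, and a real implementation also compares log terms, not just indexes.

```go
package main

import (
	"fmt"
	"time"
)

// peer holds the little bit of state a follower consults when deciding
// whether to grant a pre-vote. Purely illustrative.
type peer struct {
	lastHeardFromLeader time.Time
	lastLogIndex        uint64
	electionTimeout     time.Duration
}

// grantPreVote: agree only if we haven't heard from a live leader recently
// and the candidate's log is at least as complete as ours.
func (p *peer) grantPreVote(candidateLastIndex uint64) bool {
	leaderAlive := time.Since(p.lastHeardFromLeader) < p.electionTimeout
	return !leaderAlive && candidateLastIndex >= p.lastLogIndex
}

// shouldStartElection: only start a real, term-incrementing election if a
// majority (including ourselves) would grant the pre-vote. A node stuck
// behind a flaky link then stops forcing the rest of the cluster through
// pointless elections.
func shouldStartElection(peers []*peer, candidateLastIndex uint64) bool {
	votes := 1 // our own vote
	for _, p := range peers {
		if p.grantPreVote(candidateLastIndex) {
			votes++
		}
	}
	return votes > (len(peers)+1)/2
}

func main() {
	peers := []*peer{
		{lastHeardFromLeader: time.Now(), lastLogIndex: 10, electionTimeout: 150 * time.Millisecond},
		{lastHeardFromLeader: time.Now().Add(-time.Second), lastLogIndex: 10, electionTimeout: 150 * time.Millisecond},
	}
	fmt.Println("start election?", shouldStartElection(peers, 10))
}
```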