Making Postgres distributed with FoundationDB
56 points by fabian
This is really exciting to see someone taking on this project. There are very few OSI-approved distributed SQL databases (Vitess, TiDB, Yugabyte). Frankly, I have a hard time buying into Yugabyte and TiDB because of commercial ownership and potential licensing changes, especially after our company (Apache 2.0 open-core) was rug-pulled by the CockroachDB, Nomad, and Redis license changes.
I’ve also spent some time developing a few flavors of SQLite VFS on FDB (akin to mvsqlite with different tradeoffs). It’s a fun time.
How are you tackling PG transactions with the 5s & 10 MB limit on FDB transactions?
Glad you like it!
I hear you on the rug pull thing. Neon being acquired by Databricks the other day also raised concerns for many, with Neon having played a pretty visible role in the Postgres ecosystem as of late. One of the reasons I’m so fascinated by FoundationDB is exactly because it’s open source and will probably remain so, despite being owned and bankrolled by Apple.
I’ve also spent some time developing a few flavors of SQLite VFS on FDB (akin to mvsqlite with different tradeoffs). It’s a fun time.
Really good fun! Anything you want to share?
How are you tackling PG transactions with the 5s & 10 MB limit on FDB transactions?
Basically not at all: those limits apply to each Postgres transaction in pgfdb. My take at the moment is that pgfdb will be a plainly OLTP-focused project, allowing it to make full use of FoundationDB.
What are the cons of running PostgreSQL on FoundationDB in prod (were pgfdb not experimental)? Reading around about the core KV store, I found:
compromises that software transactional memories have, and FDB roughly matches the semantics of those: retry on failure, a maximum transaction size, a maximum transaction time … These constraints are very inconvenient for many kinds of applications, so, OK, you’d like a wrapper library that handles them gracefully and hides the details (for example, count of range).
Promises like lock-freedom, f of f+1 failure tolerance, etc. are amazing and simplify some backend code! But what application types would pgfdb not be for? Is it always better as soon as you need even 2 shards?
Yes, the transaction timeout and size limit might be the most obvious limits of FoundationDB. It would technically be possible to work around them by layering Postgres’ own transaction processing on top of FoundationDB, but you would then likely lose a ton of performance and wouldn’t really be using FoundationDB the way it was meant to be used.
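To make the “count of range” wrapper idea from the quoted comment concrete, here is a minimal sketch using the standard FoundationDB Python bindings (the function names are illustrative, not pgfdb’s API): it splits one logical count across many short transactions so no single transaction runs into the ~5 s / 10 MB limits.

    import fdb

    fdb.api_version(710)  # assumes an FDB 7.1 client and cluster
    db = fdb.open()

    @fdb.transactional
    def _count_chunk(tr, begin, end, limit):
        # Each chunk runs as its own short transaction, staying well
        # under FDB's per-transaction time and size limits.
        kvs = list(tr.get_range(begin, end, limit=limit))
        return len(kvs), (kvs[-1].key if kvs else None)

    def count_range(db, begin, end, chunk=10_000):
        # Chunks run in separate transactions, so the total is not a
        # single consistent snapshot; that caveat is exactly the kind
        # of detail a fuller wrapper library would document or hide.
        total = 0
        cursor = begin
        while True:
            n, last_key = _count_chunk(db, cursor, end, chunk)
            total += n
            if n < chunk:
                return total
            # Resume strictly after the last key, in a fresh transaction.
            cursor = fdb.KeySelector.first_greater_than(last_key)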
But what application types would pgfdb not be for?
It would definitely not be for anything that is not squarely OLTP, like a hybrid workload that mixes long running analytical queries with short transactional ones. Postgres works perfectly fine for hybrid and OLAP workloads to some degree, but scaling out that kind of workload can bottleneck the database and impact transaction processing. I recently read this nice article on how OpenAI has been battling exactly that (they seem to be pushing plain Postgres to its limits, really cool to see!): https://www.pixelstech.net/article/1747708863-openai%3A-scaling-postgresql-to-the-next-level
This looks really interesting.
Could you say more about how Postgres concepts like schemas and tables map onto FoundationDB’s key-value store?
I imagine mapping PG transactions onto FoundationDB transactions is where a lot of the heavy lifting took place, but it’s not obvious to me how schemas/tables efficiently map onto the key-value store.
In any case, congrats on the release! This looks like a significant effort.
Sure thing! Schemas I haven’t mapped out yet, only tables and indices, which is where the actual data is stored and queried.
The mapping is actually rather simple at the moment. It looks something like this currently (in an abstract form):
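(a reconstruction consistent with the description below; the exact key encoding is pgfdb’s own)

    /tables/{table_id}/{row_id} = <encoded row>
    /indices/{index_id}/{indexed_value} = {row_id}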
Here, row ID is an internal, randomly generated ID for each row. Could potentially make that the primary key in the future for even better performance. Will probably also nest indices under their respective tables by prefix for better data locality.
It uses FoundationDB’s tuple encoding for the mapping so that index reads can be done efficiently. If you have an index over an integer column, for example, and do an index search for > 5, pgfdb will perform a range read for any key “larger” than /indices/{index_id}/5.
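For a sense of what that range read looks like against FDB directly, a rough sketch with the standard FoundationDB Python tuple layer (the subspace names mirror the abstract layout above and are illustrative, not pgfdb’s actual keys):

    import fdb

    fdb.api_version(710)
    db = fdb.open()

    # Illustrative subspace; pgfdb's real key layout may differ.
    indices = fdb.Subspace(('indices',))

    @fdb.transactional
    def index_scan_gt(tr, index_id, value):
        # Returns row IDs for indexed values strictly greater than
        # `value`, simplified to a unique index (one key per value).
        index = indices[index_id]
        # The tuple encoding is order-preserving, so every value > 5
        # sorts directly after the key for 5 within this subspace.
        begin = fdb.KeySelector.first_greater_than(index.pack((value,)))
        end = index.range().stop
        return [kv.value for kv in tr.get_range(begin, end)]

    # e.g. index_scan_gt(db, some_index_id, 5) for the "> 5" search above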
Surprisingly, maybe, the mapping from PG transactions to FoundationDB transactions is the easy part! The harder part is making the query patterns fit the Postgres model while making the best use of FoundationDB at the same time. Postgres’ architecture makes some assumptions that make sense in a “reading pages from disk” model but don’t fit as well when the data is distributed, which takes some working around.
What are the tradeoffs of using FoundationDB as the engine, vs Aurora, Citus or Neon?
I’m a huge fan of FoundationDB and its semantics. But I don’t intimately know the other distributed PG options’ properties.
This is a very enticing combination though.
This is really cool, but I desperately wish the post had actual architectural info! It is full of teasers, talking up how the extension model made this possible and how Postgres is operating statelessly, but how? Please share more!
Noted! Would love to write a more extensive deep-dive kind of post on this eventually. I do feel I need to make a bit more progress on the project to get an even better sense for the architectural challenges, to be able to write a post detailed enough to be worth reading!
Not sure if I can offer a valuable yet concise explanation here, but at a high level, pgfdb relies on three extension points in Postgres:
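(The concrete list is missing here; a plausible reading, based on how the project is described elsewhere in the thread, is: the table access method API for storing rows in FDB, the index access method API for FDB-backed indexes, and planner/executor hooks for pushing scans down to FDB.)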
The key thing remaining to make Postgres truly stateless here is syncing DDL to FDB as well, which would mean you could simply spin up an all-new Postgres server, point it at your FDB database, and it would be able to serve queries. You can scale out compute by adding new Postgres servers, and scale down to zero as well if you share a single FoundationDB cluster between multiple databases (what some might call a serverless model).
That’s pretty cool!
One of the core issues I encountered with FDB in the past was “optimal cluster size”. From the docs and forum posts, it seems like 100 TB is the limit for a single cluster. However, for OLTP use cases, I don’t think a lot of projects will hit such a size.
Another problem with FDB is how greedy with space it is. The docs mention that write amplification is 7x, which is considerable. In the end, I was getting something like $1/GB on AWS for a moderately sized cluster. However, this was with storage volumes instead of local disks, which is expensive and unnecessary since FDB already handles fault tolerance.
Looks exciting! Would native extensions compiled for Postgres still work with Postgres on FoundationDB?
It depends a bit! If the extension relies in any way on adding custom index storage or similar, then it naturally wouldn’t work. Take, for example, pgvector, which implements its own index types for approximate nearest neighbor search on vectors.
What would work, though, is the non-index parts of pgvector, like the custom data types and operators. One thing I’m pretty excited about is taking extensions like pgvector and keeping them as is, but implementing an FDB-backed alternative to their indexing. Postgres actually enables this because it disaggregates data types, operators, and index types. That way one could get all the same syntax and ergonomics as a user of the extension, but behind the scenes the implementation would be distributed and FDB-optimized.
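To make that split concrete, a sketch in Python with psycopg, using pgvector’s documented syntax (whether each statement would run on pgfdb is the hypothetical here; the type and operator parts are the ones expected to carry over):

    import psycopg  # assumes a Postgres server with pgvector installed

    with psycopg.connect("dbname=example") as conn:
        conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
        conn.execute(
            "CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3))"
        )
        # Custom data type + operator: no index storage involved, so
        # this is the part that could keep working unchanged.
        conn.execute(
            "SELECT id FROM items ORDER BY embedding <-> '[1,2,3]' LIMIT 5"
        )
        # Custom index access method: the part an FDB-backed alternative
        # would need to reimplement behind the same syntax.
        conn.execute("CREATE INDEX ON items USING hnsw (embedding vector_l2_ops)")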