Making Postgres distributed with FoundationDB

56 points by fabian

nathanflurry

This is really exciting to see someone taking on this project. There are very few OSI-approved distributed SQL databases (Vitest, TiDB, Yugabyte). Frankly I have a hard time buying in to Yuagbyte & TiDB because of commercial ownership & potential licensing changes – esp after our company (Apache 2.0 open-core) was rug pulled by CockroachDB, Nomad, and Redis licenses.

I’ve also spent some time developing a few flavors of SQLite VFS on FDB (akin to mvsqlite with different tradeoffs). It’s a fun time.

How are you tackling PG transactions with the 5s & 10 MB limit on FDB transactions?

fabian

Glad you like it!

I hear you on the rug pull thing. Neon being acquired by Databricks the other day also raised concerns with many, with Neon playing a pretty visible role in the Postgres ecosystem as of late. One of the reasons I’m so fascinated by FoundationDB is exactly because it’s open source and probably will remain so despite being owned and bankrolled by Apple.

I’ve also spent some time developing a few flavors of SQLite VFS on FDB (akin to mvsqlite with different tradeoffs). It’s a fun time.

Really good fun! Anything you want to share?

How are you tackling PG transactions with the 5s & 10 MB limit on FDB transactions?

Basically not at all, those limits apply to each Postgres transaction in pgfdb. My take at the moment is that pgfdb will be a plainly OLTP-focused project, allowing it to make full use of FoundationDB.
- veqq
  
  What are the cons of running Postgresql on FoundationDB in prod (were pgfdb not experimental)? Reading around, about the core KV storeI found:
  
  compromises that software transactional memories have, and fdb roughly matches the semantics of those: retry on failure, a maximum transaction size, a maximum transaction time … These constraints very inconvenient for many kinds of applications so, ok, you’d like a wrapper library that handles them gracefully and hides the details (for example count of range).
  
  Promises like lock-freeness, f of f+1 faillure tolerance etc. are amazing and simplify some backend code! But what application types would pgfdb not be for? Is it always better as soon as you need even 2 shards?
  - fabian
    
    Yes, the transaction timeout and size limit might be the most obvious limits of FoundationDB. It would technically be possible to work around them by basically using Postgres’ transaction processing on top of FoundationDB, but you would then likely lose out on a ton of performance and wouldn’t really be using FoundationDB the way it was meant to be used.
    
    But what application types would pgfdb not be for
    
    It would definitely not be for anything that is not squarely OLTP, like a hybrid workload that mixes long running analytical queries with short transactional ones. Postgres works perfectly fine for hybrid and OLAP workloads to some degree, but scaling out that kind of workload can bottleneck the database and impact transaction processing. I recently read this nice article on how OpenAI has been battling exactly that (they seem to be pushing plain Postgres to its limits, really cool to see!): https://www.pixelstech.net/article/1747708863-openai%3A-scaling-postgresql-to-the-next-level
- adriano
  
  This looks really interesting.
  
  Could you say more about how Postgres concepts like schemas and tables map on to FoundationDB’s key-value store?
  
  I imagine mapping PG transactions onto FoundationDB transactions is where a lot of the heavy lifting took place, but it’s not obvious to me how schemas/tables efficiently map onto the key-value store.
  
  In any case, congrats on the release! This looks like a significant effort.
  - fabian
    
    Sure thing! Schemas I haven’t mapped out yet, only tables and indices where the actual data is stored and queried.
    
    The mapping is actually rather simple at the moment. It looks something like this currently (in an abstract form):
    
    /tables/{table_id}/{row_id} -> encoded row data
    
    /indices/{index_id}/{encoded_column_data}/{row_id} -> empty
    
    Where row ID is an internal randomly generated ID for each row. Could potentially make that into the primary key in the future for even better performance. Will probably also nest indices under their respective tables by prefix for better data locality.
    
    It uses FoundationDB’s tuple encoding for the mapping so that index reads can be done efficiently. If you have an index over an integer column column for example and do an index search for > 5, pgfdb will perform an index read for any key “larger” than /indices/{index_id}/5.
    
    Surprisingly maybe the mapping from PG transactions to FoundationDB transaction is the easy part! The harder part is just making the querying patterns fit into the Postgres model and making the best use out of FoundationDB at the same time. Postgres architecture makes some assumptions that makes sense from a “reading pages from disk” model that don’t fit as well when the data is distributed, which takes some working around.
- amw-zero
  
  What are the tradeoffs of using FoundationDB as the engine, vs Aurora, Citus or Neon?
  
  I’m a huge fan of FoundationDB and its semantics. But I don’t intimately know the other distributed PG options’ properties.
  
  This is a very enticing combination though.
- mqudsi
  
  This is really cool, but I desperately wish the post had actual architectural info! It is full of teasers, talking up how the extension model made this possible and how Postgres is operating statelessly but how? Please share more!
  - fabian
    
    Noted! Would love to write a more extensive deep-dive kind of post on this eventually. I do feel I need to make a bit more progress on the project to get an even better sense for the architectural challenges, to be able to write a post detailed enough to be worth reading!
    
    Not sure if I can offer a valuable yet concise explanation here but on a high-level, pgfdb relies on three extension points in Postgres:
    
    Custom table access method. pgfdb implements its own that does away with all writing to disk of table data and also WAL handling. Instead it encodes and writes new data to FDB in the format explained in my other comment. pgfdb identifies each table row with a randomly generated ID, this takes the place internally of what Postgres calls a TID that usually identifies the page and offset into that page where a row is located. Table scans are implemented simply by doing a range scan over the table’s FDB subspace and decoding each value.
    
    Custom index access method. pgfdb implements its own here as well and uses the FDB key format I also described in another comment. When doing an index scan, the index access method translates the query into an FDB range scan that only scans the parts of the index that fit the query filters. The index scan results in row IDs that are returned to Postgres as TIDs, which Postgres the calls the table access method to get the actual row data for. There is also an optimization here that issues the table lookups for an index scan ahead of time, to avoid sequentially waiting for individual reads to FDB when doing index scans.
    
    Transaction hooks. pgfdb hooks into when a Postgres transaction is committed or aborted and performs the equivalent operations on the FDB transaction. Each Postgres transaction maps 1:1 with a FDB transaction.
    
    The key thing remaining to making Postgres truly stateless here is syncing DDL to FDB as well, which would mean you could simply spin up an all new Postgres server, point it to your FDB database, and it would be able to serve queries. You can scale out compute by adding new Postgres servers, and scale down to zero as well if you share a single FoundationDB cluster between multiple databases (what some might call a serverless model).
- laplab
  
  That’s pretty cool!
  
  One of the core issues I encountered with FDB in the past was “optimal cluster size”. From the doc and forum posts it seems like 100 TB is the limit for a single cluster. However, for OLTP use cases, I don’t think a lot of projects will hit such size
  
  Another problem with FDB is how greedy with space it is. The doc mentions that write amplification is 7x, which is considerable. In the end, I was getting something like 1$/GB on AWS for moderately-sized cluster. However, this was with storage volumes instead of disks, which is expensive and unnecessary as FDB already handles fault tolerance.
- rjzak
  
  Looks exciting! Would native extensions compiled for Postgres still work with Postgres on FoundationDB?
  - fabian
    
    It depends a bit! If the extension in any way relies on adding custom index storage or similar, then it naturally wouldn’t work. Take for example pgvector that implements its own index types for approximate nearest neighbor search on vectors.
    
    What would work though is the non-index parts of pgvector, like the custom data types and operators. One thing I’m pretty excited for is taking extensions like pgvector and keeping them as is, but implementing a FDB-backed alternative to their indexing. Postgres actually enables this because it disaggregates data types, operators and index types. That way one could get all the same syntax and ergonomics as a user of the extension, but behind the scenes the implementation would be distributed and FDB-optimized.
- ClashTheBunny
  
  Why AGPL?