Scaling Rails: 41M Req/Hour, 8 DBs, disable_joins: true
17 points by andatki
17 points by andatki
Great post Andy!
I'd love to hear too about any issues you might have run into where transactions involved tables that ended up on opposite sides of the boundary. Did you have to make tradeoffs around transactional correctness?
Thank you! Ah, I meant to add more about trade-offs to the post. I’ll add a section to the post but will also reply here.
Briefly, one possible blocker could have been database foreign key constraints that would no longer work for tables in separate databases. There were no FK constraints in use though so that was not an actual blocker (leaving aside whether absence of FK constraints is good or not).
For transactions, I don’t recall any explicit transactions in use that grouped commits for multiple tables that we had to break up, where it was a big deal. There might have been one or two we had to stop using. The general approach is mainly the "Rails Way" of Active Record callbacks and lots of background jobs, and lots of “find or create” types of patterns that retry/re-create missing dependent data. The schema design plays a role here too, where the tables might be relatively wide for the tastes of some, but kind of keep a bunch of things grouped together. The tables moved were mostly driven by insert/update rate and their size, but to a lesser extent how coupled they were.
Hang on. What was the actual constraint here? CPU or IO?
And what IO did you have configured on RDS?
Hi there. You may want to check out the Part 1 post which is more Postgres-focused. The main business goal was reliable Postgres throughout the holiday peak season, where we get a 4-5x increase in load (some work as high as 15x) on Christmas Day. The DB work includes user-facing query performance via the API, and background Postgres work like vacuuming dead tuples etc.
With high insert and update rates from new customers uploading millions of photos and videos, setting up hundreds of thousands of new frames, write pressure/IOPS I’d say is the most scarce resource. There was also some CPU spike issues we saw on a single instance, but IO was more of a limiting factor.
Primarily the additional instances brought in much more IOPS resources. That also brought more CPU and Memory (buffer cache, which helped with the client workload and the background work. The instances were over-provisioned but the business wanted to invest that way if it meant improved reliability. Instances had the io2 storage type, upsized provisioned IOPS, dedicated log volumes, and small schema changes were made like minimizing indexes, constraints, and rebuilding indexes, all related to reducing IO.
When load returned to normal levels, we scaled instance resources down to save costs. This included moving to smaller instances, reducing PIOPS, switching from io2 to gp3, and reducing gp3 throughput.
Ah I see. Thanks.
You do realise that even a consumer NVMe can do 4x+ the IOPS as the top tier IO you provisioned? Plus the latency is 2-3 orders of magnitude less (which is a huge boon for this kind of IO). Enterprise NVMe on PCIe5 with RAID0 can go dramatically faster than that.
I'm 90% sure had you moved this to bare metal you could have continued on the old database structure for 10x the load, at probabaly 10% of the cost. And you wouldn't have had the outage not the multiple months of engineering work.
Maybe good reasons not to, but something to keep in mind for next time. I've seen so many super complicated database migrations like you wrote here which would be solved with proper IO. I suspect if you moved it over to bare metal you'd see enormous latency decreases fwiw even not under Xmas day load.
Hello. I can share that with the team, however I think they’re happy with the trade-offs of RDS which uses EBS storage and it being "managed." A trade-off being less storage performance compared with locally attached drives. io2 storage currently supports a max of 256K IOPS and we haven’t needed quite that much yet, in part from gaining the IOPS resources from multiple instances. I think the things we’d give up and would then need to figure for a team of ~5 infrastructure engineers is possible, but would then take away time from other investment areas.
We use Multi-AZ now where the replica in the second AZ is available for failover if needed. We'd need to figure out how to do something similar. We use snapshotting and can share snapshots among AWS accounts.
Elasticity: we can provision more or less IOPS when needed or not needed, or change to lower cost storage (e.g. from io2 to gp3) when that lower performance level is sufficient.
Memory cache hit ratio is really high, in part due to using memory optimized instance classes with a lot of memory available (e.g. up to 1.5 TB). This way the application query work load doesn't need much disk IOPS, as content is accessed mostly from memory. This was another advantage of more instances, we then had more memory available for more Postgres buffer cache.
Just to be clear, it's not just IOPS that's faster, its IO latency. NVMe direct attached latency is maybe 10-20microseconds, on even your super expensive IO on RDS it is more like 500-1000microseconds. This really matters for your workload, because you are write process contended and it means (probably) every time a fsync is happening your entire server is getting blocked on that write getting onto disk. If that write happened 10x faster, a lot of other stuff would be a lot faster.
It's not hard to built a multi AZ bare metal system, take a look at https://severalnines.com/clustercontrol/. They will also provide RDS like support for your bare metal on the enterprise packages.
And honestly you are paying $25k/month for RDS provisioned (but slow!) IOPS unless you have some custom deal. For that kind of rate you can massively overprovision stuff.
Just a thought, I know how hard AWS gets entrenched in organisations but there are very good alternatives that get round a lot of the really slow hardware AWS has. I think I'd recommend you do load testing on even a consumer PC with NVMe for your prod database, you'll be astonished how quick it is compared to RDS :).
Yes I understand you’re talking about reduced latency, e.g. 10-20 microseconds, e.g. 10x reduced from EBS/io2/AWS and not about IOPS capability. io2 block express advertises an average of sub 500 microseconds latency, 4,000 MB/s throughput, a max of 256K IOPS, for 16KiB I/O size with SSDs but I think 500 microseconds to 1 millisecond is probably what we see. To get that 10-20 microseconds latency though even if this whole project was avoided, I’d argue we would have needed an equivalent amount of time to do a high risk significant migration for Postgres. To stay within AWS I believe we could migrate to Metal instances to get locally attached storage, but then we’d lose hosted Postgres via RDS, and need to self-host, which means all this additional software and operations. We likely would need another staff member who is focused on storage and Postgres (a DBA). To get the RDS-like hosting experience to move outside of AWS is a big lift in a company that's entrenched in AWS. There are definitely existing instance reservations for discounts etc. that complicate that. Then, what about the possibility of continued annual business growth, and still needing to shard in some form down the road?
That all said, I appreciate you shining the light on the performance numbers for locally attached SSDs, and how that may benefit the platform. It definitely touches on lots of ROI questions on hosted Postgres, hosted instances, storage capabilities, IO and CPU bound work and I will share this thread with the engineering leadership. Thank you.
Are the images stored in Postgres?
Nope, metadata is but S3 holds the files. Here’s a related post from the AWS Storage Blog: https://aws.amazon.com/blogs/storage/how-aura-improves-database-performance-using-amazon-s3-express-one-zone-for-caching/
It is an impressive amount of business to need such scale in Postgres for purely metadata. Congrats!
Thank you! When I said “metadata” I may have been underselling the value prop. 😅 Frames are typically used by a handful of people acting as a small, private social network, contributing photos, videos, interactions, and comments on frames and via their apps. The frames generate personalized feeds and the backend needs to perform that generation for millions of connected frames all year, with an especially big load of new frames and uploads on Christmas Day. The company also continues to expand to new countries each year, selling frames online and in retail locations in those countries. That all means we're planning on even more database load in the future!