Scaling a Monolith to 1M LOC: 113 Pragmatic Lessons from Tech Lead to CTO

29 points by semicolonandsons


byroot

Kill Worker Processes Regularly as Insurance Against Memory Leaks.

This one IMO is wrong. Resilience to leaks is indeed very important, but restarting processes at regular intervals only hides the problem until it becomes so dire that you need to restart even more often.

The better solution is to restart once a fixed memory threshold has been reached. This way you can instrument how often processes need to restart, and a rising restart rate tells you that a leak or bloat has been introduced.
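To make the idea concrete, here is a minimal Python sketch, not taken from the linked post. `MemoryGuard`, `check`, and `restarts_since` are hypothetical names, and it assumes a Unix host where the standard `resource` module is available. Note that `ru_maxrss` reports peak rather than current resident memory, which is good enough for a "restart once we cross the line" check.

```python
import resource
import sys
from dataclasses import dataclass, field


def peak_rss_bytes() -> int:
    """Peak resident set size of this process, in bytes (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    return peak if sys.platform == "darwin" else peak * 1024


@dataclass
class MemoryGuard:
    """Hypothetical helper: decides when to restart and records each restart."""
    limit_bytes: int
    restart_times: list = field(default_factory=list)

    def check(self, rss_bytes: int, now: float) -> bool:
        """Return True when the worker should exit after its current request."""
        if rss_bytes < self.limit_bytes:
            return False
        # Record the restart so the rate can be graphed and alerted on.
        self.restart_times.append(now)
        return True

    def restarts_since(self, cutoff: float) -> int:
        """Number of threshold restarts at or after `cutoff`."""
        return sum(1 for t in self.restart_times if t >= cutoff)
```

A worker would call `guard.check(peak_rss_bytes(), time.time())` between requests and exit cleanly when it returns True. In a real setup the restart log would live in the supervisor or a metrics system rather than in the worker, since the worker's memory goes away on restart.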

Self plug on that exact subject: https://byroot.github.io/ruby/performance/2025/02/09/guard-rails-are-not-code-smells.html

lpil

Lots of good and sensible advice here!

I think the title might suggest the wrong audience. Folks who are expecting to run a 50k LOC service for the first time would benefit from reading this. You don't need to be aiming to run such a large application to benefit.

coxley

Some minor, unsolicited nits:

Reasonable advice though. The "100 lessons" format isn't usually as on the nose. Any single lesson here would resonate with a senior engineer, but I'd be challenged to make such a comprehensive list if asked; well done. :)

Random commentary while reading below


Page Counts Are a Major (but Surprising) Source of DB Performance Issues at Scale

Surprised to not see probabilistic data structures like HLL mentioned here — but perhaps that's what is meant by "for very large tables, we use an estimated count paginator".
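For readers who haven't met HLL: a minimal HyperLogLog sketch in Python, purely illustrative and not what the article describes. It estimates the number of distinct items in fixed memory, which is a different job from estimating a table's row count. The class name, the choice of SHA-1 as the hash, and the default precision `p=12` are all assumptions.

```python
import hashlib
import math


class HyperLogLog:
    """Approximate distinct counter using 2**p small registers."""

    def __init__(self, p: int = 12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        # Standard bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item: str) -> None:
        # Take 64 bits of a well-mixed hash of the item.
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)   # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1 bit in `rest`.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self) -> float:
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return estimate
```

With `p=12` the sketch keeps 4096 registers and the typical relative error is around 1.04 / sqrt(4096), roughly 1.6%. Adding the same item again never changes the registers, so duplicates are ignored for free.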

Delete Useless Data

I wish this was easier, sometimes. I've been places with either contracts that prevented us from deleting expensive data or sales teams wanting us to hold churned client data for longer (at their approval) in hopes of winning them back.

Prevent Long PRs

I'm looking forward to stacked PRs in GitHub soon. This is a skill that's atrophied for me since leaving BigCo.

Observability CLI Tooling Is Your Number One Force Multiplier for AI

I wish I had leaned into using LLMs for investigations sooner, because my mind has been blown by this very thing. They're surprisingly (or maybe not) good at spotting trends in telemetry data.