Replacing a 3 GB SQLite database with a 10 MB FST (finite state transducer) binary

69 points by jmillikin


isuffix

The article is quite interesting, it's lovely to see a well-suited tool applied to a well-suited task achieving an order-of-magnitude improvement! But I was personally more struck by the final footnote than the body content. It states clearly a worry which has also paralyzed me at times, so I'll surface it here in part:

One could say in the first quarter-century of my life, that while I was always fascinated by programming, I could never overcome the guilt of not really knowing whether the tool I am building right now isn’t already superceded by some much better implementation someone else has already written 30 or 40 years ago; I could write a TSV-aware search and replace, or I could find out about awk and solve that entire class of problems in one fell swoop, for example. My central conceit is that this is a trap. You need to reinvent a couple of wheels to get to the edge of what we know about wheel-making, not a thousand wheels, and not zero; probably four or five is sufficient in most domains, maybe closer to twenty or thirty in the most epistemically rigorous and developed fields like mathematics or computer science. Each wheel you reinvent, and every directed question you ask along the way, will propel you faster to the true frontier than that same amount of time spent in idle study, or even five times that amount.

lemon

Very cool! Reminded me of: Compressing Icelandic name declension patterns into a 3.27 kB trie

yokljo

There's a related data structure called a Dawg (Directed Acyclic Word Graph) which I discovered when implementing a Scrabble bot, and apparently is very common for that purpose. I believe the main difference from the fst crate is that it doesn't map each word to a unique integer.

I had a similar experience with size reduction, and also insane performance wins. A Scrabble bot needs to basically filter the entire dictionary by words that match a simple regex like ..cat.. We'll, not one simple regex, but every possible simple regex on the board right now. I went from waiting around for a minute to an unnoticeable delay.