Replacing a 3 GB SQLite database with a 10 MB FST (finite state transducer) binary
69 points by jmillikin
The article is quite interesting, it's lovely to see a well-suited tool applied to a well-suited task achieving an order-of-magnitude improvement! But I was personally more struck by the final footnote than the body content. It states clearly a worry which has also paralyzed me at times, so I'll surface it here in part:
One could say in the first quarter-century of my life, that while I was always fascinated by programming, I could never overcome the guilt of not really knowing whether the tool I am building right now isn’t already superseded by some much better implementation someone else has already written 30 or 40 years ago; I could write a TSV-aware search and replace, or I could find out about
awk and solve that entire class of problems in one fell swoop, for example. My central conceit is that this is a trap. You need to reinvent a couple of wheels to get to the edge of what we know about wheel-making, not a thousand wheels, and not zero; probably four or five is sufficient in most domains, maybe closer to twenty or thirty in the most epistemically rigorous and developed fields like mathematics or computer science. Each wheel you reinvent, and every directed question you ask along the way, will propel you faster to the true frontier than that same amount of time spent in idle study, or even five times that amount.
I think an important related remark in the text is that the SQLite DB was an inefficient brute force, but having a working brute-force reference at all is what made building the efficient replacement possible.
This is great. I didn’t get all the way through the article myself, so thanks for surfacing this part.
It’s a problem I’m dealing with right now. I am paralysed from writing something that’s optimised for my problem space because I know the general problem has been solved before. But when I look at the existing solutions, they all carry baggage I don’t need. I know from experience that my ideas are probably worth pursuing, but I keep thinking I’m missing something.
After reading this I’m gonna do it. Doesn’t matter if it doesn’t work, I’m gonna learn something from the experience.
There's a related data structure called a DAWG (directed acyclic word graph), which I discovered when implementing a Scrabble bot, and which is apparently very common for that purpose. I believe the main difference from the fst crate is that it doesn't map each word to a unique integer.
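For intuition: a DAWG is essentially a trie whose identical suffix subtrees have been merged into one node each. A minimal stdlib-Rust trie sketch (without the suffix merging, and with hypothetical names of my own) might look like:

```rust
use std::collections::BTreeMap;

// Minimal trie sketch. A real DAWG would additionally merge equivalent
// subtrees, so e.g. the "-alking" endings of "talking" and "walking"
// share one chain of nodes -- that merging is where the size win comes from.
#[derive(Default)]
struct Node {
    children: BTreeMap<u8, Node>,
    terminal: bool, // true if a word ends at this node
}

impl Node {
    fn insert(&mut self, word: &str) {
        let mut n = self;
        for b in word.bytes() {
            n = n.children.entry(b).or_default();
        }
        n.terminal = true;
    }

    fn contains(&self, word: &str) -> bool {
        let mut n = self;
        for b in word.bytes() {
            match n.children.get(&b) {
                Some(child) => n = child,
                None => return false,
            }
        }
        n.terminal
    }
}

fn main() {
    let mut dict = Node::default();
    for w in ["car", "card", "cart", "dog"] {
        dict.insert(w);
    }
    // Prefixes that are not whole words are rejected.
    println!("{} {}", dict.contains("card"), dict.contains("ca")); // prints "true false"
}
```

A DAWG construction pass would then hash-cons nodes bottom-up, replacing structurally identical subtrees with a single shared instance.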
I had a similar experience with size reduction, and also insane performance wins. A Scrabble bot needs to basically filter the entire dictionary by words that match a simple regex like "..cat..". Well, not one simple regex, but every possible simple regex on the board right now. I went from waiting around for a minute to an unnoticeable delay.
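Since board patterns like "..cat.." only need single-letter wildcards, the per-slot filtering described above can be sketched without a regex engine at all. This is an illustrative stand-in with hypothetical names, not the bot's actual code:

```rust
// Does `word` fit a Scrabble-style slot pattern, where '.' matches any
// single letter and any other character must match exactly?
fn matches_pattern(word: &str, pattern: &str) -> bool {
    word.len() == pattern.len()
        && word
            .bytes()
            .zip(pattern.bytes())
            .all(|(w, p)| p == b'.' || w == p)
}

fn main() {
    // Toy dictionary; a real bot would scan the full word list
    // (or, faster, walk a DAWG/FST guided by the pattern).
    let dictionary = ["locates", "vacates", "muscats", "bobcats", "catcall"];
    let hits: Vec<&str> = dictionary
        .iter()
        .copied()
        .filter(|w| matches_pattern(w, "..cat.."))
        .collect();
    println!("{:?}", hits); // prints ["locates", "vacates"]
}
```

The brute-force scan above is linear in the dictionary; intersecting the pattern automaton with a DAWG or FST instead prunes whole subtrees at once, which is where the minute-to-unnoticeable speedup comes from.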
Appel and Jacobson’s 1988 DAWG paper was posted here not long ago, and previously as well.
FWIW, I ended up posting that because I was trying to implement a word game and researching the data structures used. I ended up writing a compact format for it in C, but got distracted and haven’t implemented the game. Oops.
fst is also what I ended up using for srgn's German support -- after Andrew himself recommended it :-). It's the same problem space of compressing common prefixes and suffixes. It works remarkably well.
Another requirement I was going for was minimal startup cost (this is a CLI tool), and the fst crate supports zero-copy loading, giving zero run-time overhead (the FST is built at compile time). Just great stuff.