A Spellchecker Used to Be a Major Feat of Software Engineering (2008)
35 points by runxiyu
Sure, you could come up with some ways to decrease the load time or reduce the memory footprint, but that’s icing and likely won’t be needed.
That statement was probably more innocuous before the ubiquity of Electron apps. In 2025 I still have a ~5 second loading screen when loading into Discord, and then it uses 1.5 GB of RAM, about 10% of what I have total.
Loading every possible word in the language into a set works if you’re only checking English - it’s a pretty simple language. The largest language I know of is Brazilian Portuguese (pt-BR) and loading that into a hash set is really not feasible.
Could you expand on why it is a big language? I don’t know much about spell checking, nor Portuguese - is it because it is a fusional language and a given word has multiple forms? Because as far as I know English has by far the largest vocabulary (mostly due to simply taking words wholesale from another languages).
I believe there are far more “nefarious” languages in the agglutinative category. (As a data point, I have never seen a Hungarian spell checker that could handle the more complicated word formations.)
German grammar is productive with simply concatenating words together, so I have no idea how a German spellchecker would deal with “Rinderkennzeichnungs- und Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz” which is, in German, a perfectly cromulent word.
Hunspell has a “compounding” feature that I believe German takes advantage of. Dictionaries can define patterns for valid ways to combine stems. I haven’t tried that one but I wouldn’t be surprised if it checks correctly!
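For a rough idea of what that looks like, here is a hypothetical, annotated sketch of a dictionary pair using Hunspell's compounding directives (COMPOUNDFLAG marks stems allowed to join; COMPOUNDMIN sets the minimum length of each part). The stems and file names are made up for illustration, and real dictionary files have stricter formatting than shown here:

```
# example.aff  (affix file sketch)
COMPOUNDFLAG X
COMPOUNDMIN 3

# example.dic  (word count, then stems; /X marks compoundable entries)
3
Rind/X
Fleisch/X
Etikett/X
```

With settings along these lines Hunspell can accept concatenations of the flagged stems; real German dictionaries layer further rules on top (linking elements, case checks, limits on the number of parts).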
Could you expand on why it is a big language?
A picture is worth 10^10 words…
https://www.reddit.com/r/polandball/comments/211ogu/conjugation/
But Finnish is an agglutinative language, just like Hungarian; my question was more about Portuguese (which is not).
Nonetheless, funny image, thanks!
Yes, I know and it’s true. I think as a loose parallel it works though.
I’ve been learning Czech for over a decade and it is similar, with many forms of a single word: dog, 2–4 dogs, 5+ dogs, of the dog, going to the dog, belonging to the dog, from the dog, in the dog, addressing a person named Dog, giving to a dog, and so on.
The issues are far more modest for English spellcheckers. Weird English grammatical complexities, of which there are plenty – like phrasal verbs – don’t affect a spellchecker.
It has a lot of stems that combine with a really large number of prefixes and/or suffixes, so the set of possible words grows multiplicatively, conceptually like a Cartesian product.
For an example of “stem” / “prefix” / “suffix”: the en-US dictionary has a stem of “write” where you can apply a prefix like “re” and a suffix like “ing”.
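As a toy illustration of that multiplicative growth (made-up mini word lists, and naive string concatenation rather than real affix rules, which also handle spelling changes like write + ing → writing):

```python
from itertools import product

# Hypothetical mini-dictionary: 3 stems, 3 prefixes, 3 suffixes.
stems = ["write", "do", "play"]
prefixes = ["", "re", "un"]      # "" means no prefix
suffixes = ["", "ing", "er"]     # "" means no suffix

# Naive concatenation: the candidate set grows like a Cartesian product.
forms = {p + stem + s for p, stem, s in product(prefixes, stems, suffixes)}
print(len(forms))  # 27 candidate forms from only 3 + 3 + 3 dictionary entries
```

Real affix rules (as in Hunspell .aff files) constrain which affixes attach to which stems and apply spelling adjustments, but the combinatorics are the point: dictionary entries grow additively while the recognized word set grows multiplicatively.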
I get that - my point is that from what a quick Google search showed, Portuguese is not particularly bad on that count. Agglutinative languages, on the other hand…
“Development of a spelling list”, Doug McIlroy:
The word list used by the UNIX spelling checker, SPELL, was developed from many sources over several years. As the spelling checker may be used on minicomputers, it is important to make the list as compact as possible. Stripping prefixes and suffixes reduces the list below one third of its original size, hashing discards 60 percent of the bits that remain, and data compression halves it once again. This paper tells how the spelling checker works, how the words were chosen, how the spelling checker was used to improve itself, and how the (reduced) list of 30 000 English words was squeezed into 26 000 16-bit machine words.
Similar issues may still be relevant, for example, when packaging and serving a (specialised) dictionary over the web.
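The “hashing discards 60 percent of the bits” step amounts to a probabilistic membership structure: accept a tiny false-positive rate (a misspelling that slips through) in exchange for a much smaller table. A Bloom-filter-style sketch of the idea, with hypothetical parameters and class names, not the actual SPELL code:

```python
import hashlib

class BloomSet:
    """Probabilistic set: no false negatives, small false-positive rate."""

    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 7):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, word: str):
        # Derive k bit positions from k salted hashes of the word.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{word}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, word: str) -> None:
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(word))

words = BloomSet()
for w in ["spell", "checker", "list"]:
    words.add(w)
print("spell" in words, "spelll" in words)
```

As I understand it, McIlroy's actual scheme hashed each word once into a large sparse bit table and then compressed the gaps between set bits; the Bloom-filter framing above is a modern cousin of that trade-off, not a reconstruction.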