Regular Expression Matching Can Be Simple And Fast (2007)
21 points by LesleyLai
This is a classic. At some point, I did a lot of work related to regular expressions, and I kept coming back to Russ Cox’s articles to understand things better.
I’m at that point in life, as we all once were or will be, where I have to strongly consider writing yet another RegExp engine. This does not make me particularly happy, as the engine in question would be an ECMAScript RegExp engine taking WTF-8 strings as input (both pattern and haystack). This brings about some fairly unfortunate challenges, like that the “sloppy/loose mode” of ECMAScript’s RegExp will happily match individual surrogates found in surrogate pairs. In a UTF-16 world that makes a silly sort of sense, in that those individual surrogates are there in the input string, but in a WTF-8 world that sense is lost: a well-formed pair of surrogates becomes a well-formed UTF-8 code point made up of (likely) 4 bytes, and there is no way to split that up into two halves that would make a surrogate! (I think anyway…)
Yeah, I’m not looking forward to all that.
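To make the surrogate problem concrete, here is a minimal sketch of the encoding step, assuming WTF-8 as usually described: a paired lead and trail surrogate is merged into one supplementary code point (4 bytes), and only unpaired surrogates are written out as 3-byte sequences. The helper name `wtf8_encode` is hypothetical and not taken from any engine or library.

```python
def wtf8_encode(units):
    """Encode a list of UTF-16 code units as WTF-8 bytes (illustrative helper)."""
    out = bytearray()
    i = 0
    while i < len(units):
        u = units[i]
        nxt = units[i + 1] if i + 1 < len(units) else None
        if 0xD800 <= u <= 0xDBFF and nxt is not None and 0xDC00 <= nxt <= 0xDFFF:
            # Lead + trail surrogate: merge into one supplementary code point.
            cp = 0x10000 + ((u - 0xD800) << 10) + (nxt - 0xDC00)
            i += 2
        else:
            # BMP code unit, or a lone surrogate that WTF-8 keeps as-is.
            cp = u
            i += 1
        if cp < 0x80:
            out.append(cp)
        elif cp < 0x800:
            out += bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
        elif cp < 0x10000:
            out += bytes([0xE0 | (cp >> 12),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
        else:
            out += bytes([0xF0 | (cp >> 18),
                          0x80 | ((cp >> 12) & 0x3F),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
    return bytes(out)


pair = wtf8_encode([0xD83D, 0xDE00])   # U+1F600 as a surrogate pair
lead = wtf8_encode([0xD83D])           # the lead surrogate on its own
trail = wtf8_encode([0xDE00])          # the trail surrogate on its own

print(pair.hex(), lead.hex(), trail.hex())
# The byte form of either lone surrogate never occurs inside the paired form.
print(lead in pair, trail in pair)
```

Under these assumptions, a byte-level matcher looking for the lone lead surrogate has nothing to find inside the 4-byte sequence, so an engine that wants the UTF-16 behaviour would presumably have to special-case supplementary code points rather than rely on plain byte matching.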
I’m at that point in life, as we all once were or will be, where I have to strongly consider writing yet another RegExp engine.
I feel seen.