Why does my regular expression work in X but not in Y?
8 points by dasm
This is where tools like https://regex101.com/ come in handy; it supports 8 different regex flavours.
I’m now compelled to make a YouTube video called “standards are fake, actually” and point out how most file formats with a standard are more of a guidance, because none of the major players you’ll use ever really implement it faithfully.
So regexes are an example where most languages have roughly the same things happening, but you almost always have to look at your specific language’s documentation once you get to the fringes.
Another would be SQL. There’s a SQL standard, but Postgres, MySQL, MS SQL Server, and Oracle all have extensions outside it or ways where they differ from the standard.
Markdown would be another. It was never formally specified, so every Markdown parser does wildly different things (CommonMark was a way to try to unite them).
Even ISO8601: Python’s datetime iso8601 method, IIRC, only guarantees that it will be parsed into the same datetime object within the datetime library; I once had some kind of format error when I passed its output into a database that expected valid 8601.
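A small sketch of what that can look like, assuming the method meant here is `datetime.isoformat()` and its counterpart `datetime.fromisoformat()` (that mapping is my guess, not something stated above). It only shows how the output shape changes with the input; whether a given database accepts each form is not something this checks.

```python
from datetime import datetime, timezone

# Same wall-clock time, once without and once with timezone information.
naive = datetime(2020, 1, 2, 3, 4, 5)
aware = datetime(2020, 1, 2, 3, 4, 5, tzinfo=timezone.utc)

naive_text = naive.isoformat()           # no UTC offset at all
aware_text = aware.isoformat()           # offset written as +00:00, not "Z"
spaced_text = naive.isoformat(sep=" ")   # space instead of the "T" separator

# Within the datetime library itself, every variant parses back to the
# object it came from.
round_trips = (
    datetime.fromisoformat(naive_text) == naive
    and datetime.fromisoformat(aware_text) == aware
    and datetime.fromisoformat(spaced_text) == naive
)

print(naive_text)   # 2020-01-02T03:04:05
print(aware_text)   # 2020-01-02T03:04:05+00:00
print(spaced_text)  # 2020-01-02 03:04:05
print(round_trips)  # True
```

A consumer that insists on an explicit offset, or on the "T" separator, could plausibly reject the first or third string even though the library round-trips all three.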
There’s also a funny storyline about Scheme here: the RnRS standard system, even after forking to big and small versions of the language, didn’t prevent each Scheme from making up its own world. I think the Steering Committee had a line in one of their documents like “the only benefit of a standardization process and spec is that things are consistent across implementations; and yet that hasn’t happened here either.”
Not throwing a value judgement on any of these examples, they just come to mind.
most file formats with a standard are more of a guidance, because none of the major players you’ll use ever really implement it faithfully
Sometimes the standard isn’t even correct. When APPNOTE.TXT, which defines the ZIP format, was updated to add Unicode support, it also stated that if the Unicode flag isn’t set, the name should be treated as CP437 for backwards compatibility.
Issue is, it’s only in CP437 on old archives if the archive was from a US machine. Like everything else in the pre-Unicode codepage nightmare, it was in whatever codepage the creator was using, which gets really obvious when you find ZIP files from Russia or Japan.
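To make the codepage problem concrete, here is a minimal sketch using only Python's built-in codecs. The filename and the choice of CP866 (a DOS Cyrillic codepage) are illustrative assumptions; no actual ZIP file is involved.

```python
# Hypothetical filename as a Russian DOS-era tool might have stored it.
original_name = "привет.txt"

# The archive holds raw bytes in the creator's codepage (assumed CP866 here).
stored_bytes = original_name.encode("cp866")

# A reader that follows the "no Unicode flag means CP437" rule sees this:
misread = stored_bytes.decode("cp437")

# Recovery only works if you can guess the real codepage: undo the CP437
# decoding to get the raw bytes back, then decode them correctly.
recovered = misread.encode("cp437").decode("cp866")

print(misread)                      # mojibake, not the original name
print(recovered == original_name)   # True
```

The round trip is lossless because CP437 assigns a character to every byte value, so the hard part in practice is guessing which codepage the creator used, not reversing the wrong decode.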
I would definitely split the standard situation in three:
Of course, you have the whole issue of “implementing it faithfully” on top of this.
Maybe instead of a YouTube video, you can just share the relevant xkcd and not have a competing medium trying to be the standard rant regarding standards 🤣
Answer is missing an intermediate level: FA-based engines tend to have PCRE features, but not all of them (specifically not lookaheads or lookbehinds).
special characters \n, \t, etc.; word boundaries \b and \B, word constituents \w and \W, …
Note: the backslashed character classes (\d, \w, \s, …) are Perl / PCRE extensions to EREs.
There’s another subtlety that I learned recently from a paper about JavaScript regexes: JavaScript doesn’t follow the traditional Unix/Perl leftmost-longest rule, so ambiguous ( ) matches can vary between at least Henry Spencer-style and JavaScript and possibly other different regex engines.
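For a feel of what an "ambiguous" match looks like, here is a sketch in Python, whose `re` module is a backtracking engine that takes the first alternative that leads to a match. It is not a reproduction of the paper's JavaScript examples; the patterns are made up for illustration.

```python
import re

# The order of alternatives decides the match in a backtracking engine.
first_short = re.match(r"a|ab", "ab").group()
first_long = re.match(r"ab|a", "ab").group()

# With capture groups, the overall match is the same either way, but
# how the text is split between the groups is ambiguous.
groups = re.match(r"(a|ab)(c|bcd)(d*)", "abcd").groups()

print(first_short)  # a
print(first_long)   # ab
print(groups)       # ('a', 'bcd', '')
```

An engine applying a leftmost-longest rule would be expected to return "ab" for both of the first two patterns and to prefer the longer "ab" for the first group in the last one, so the same pattern and input can yield different captures depending on the engine.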
There are many differences between the engines, for example:
A small nitpick:
Java requires to escape all slashes;
This isn’t really related to the regexp engine, rather to the handling of string literals in Java. The same thing affects Emacs FWIW.
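The same layering can be sketched in Python, which has both ordinary string literals (where backslashes must be doubled, as in Java) and raw string literals (where they need not be). The variable names are just for illustration.

```python
import re

# Two spellings of the same three-character pattern: backslash, d, plus.
digits_escaped = "\\d+"   # ordinary literal: the string parser eats one backslash
digits_raw = r"\d+"       # raw literal: backslashes are kept as written

same_pattern = digits_escaped == digits_raw
pattern_length = len(digits_escaped)

# Matching one literal backslash needs two backslashes in the regex,
# which becomes four in an ordinary string literal.
literal_backslash = "\\\\"
found = re.search(literal_backslash, "a\\b").group()

print(same_pattern)            # True
print(pattern_length)          # 3
print(len(literal_backslash))  # 2
print(len(found))              # 1
```

The regex engine receives exactly the same pattern in both spellings, which is why the doubling is a property of the host language's string syntax and not of the regex flavour.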