Why does my regular expression work in X but not in Y?

8 points by dasm

number5

This is where tools like https://regex101.com/ come in handy; it supports 8 different regex flavours.

srpablo

I’m now compelled to make a YouTube video called “standards are fake, actually” and point out how most file formats with a standard treat it as more of a guideline, because none of the major players you’ll use ever really implement it faithfully.

So regexes are an example where most languages kind of have the same things happening, but you almost always have to look at your specific language’s documentation to use them on the fringes.

Another would be SQL. There’s a SQL standard, but Postgres, MySQL, MS SQL Server, and Oracle all have extensions outside it or ways where they differ from the standard.

Markdown would be another. It was never formally specified, and every Markdown parser does wildly different things (CommonMark was an attempt to unite them).

Even ISO 8601: Python’s datetime ISO 8601 support (isoformat() and fromisoformat()), IIRC, only guarantees that the output will be parsed back into the same datetime object within the datetime library; I once had some kind of format error when I passed its output into a database that expected valid 8601.
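To make the Python point concrete, here is a minimal sketch of how the shape of isoformat() output depends on the value being formatted, which is one way a strict consumer could trip over it. Whether this matches the database error described above is an assumption; the timestamps are made up for illustration.

```python
from datetime import datetime, timezone

# Illustrative values only; not taken from the incident described above.
naive = datetime(2024, 1, 2, 3, 4, 5)
with_micros = datetime(2024, 1, 2, 3, 4, 5, 250000)
aware = datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)

naive_text = naive.isoformat()          # no UTC offset at all
micros_text = with_micros.isoformat()   # fractional seconds appear only when non-zero
aware_text = aware.isoformat()          # offset is written as +00:00, not "Z"

# Round-tripping inside the datetime module works in every case.
round_trips = all(
    datetime.fromisoformat(d.isoformat()) == d
    for d in (naive, with_micros, aware)
)
print(naive_text, micros_text, aware_text, round_trips)
```

A consumer that insists on one fixed layout (say, always an offset, or never fractional seconds) would accept some of these strings and reject others.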

There’s also a funny storyline about Scheme here: the RnRS standard system, even after forking to big and small versions of the language, didn’t prevent each Scheme from making up their own world. I think the Steering Committee had a line in one of their documents like “the only benefit of a standardization process and spec is that things are consistent across implementations; and yet that hasn’t happened here either.”

Not throwing a value judgement on any of these examples, they just come to mind.

DustyFuzzy

> most file formats with a standard are more of a guidance, because none of the major players you’ll use ever really implement it faithfully

Sometimes the standard isn’t even correct. When APPNOTE.TXT, which defines the ZIP format, was updated to add Unicode support, it also stated that if the Unicode flag isn’t set, filenames should be treated as CP437 for backwards compatibility.

The issue is that old archives are only in CP437 if they came from a US machine. Like everything else in the pre-Unicode codepage nightmare, filenames were in whatever codepage the creator was using, which gets really obvious when you find ZIP files from Russia or Japan.
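A rough Python sketch of what that looks like in practice, assuming a hypothetical filename written by a Russian DOS-era tool using CP866 (the name and the choice of codepage are illustrative, not taken from a real archive):

```python
# Hypothetical filename as a pre-Unicode Russian DOS tool might have stored it.
original_name = "привет.txt"
raw_bytes = original_name.encode("cp866")

# What a reader following the CP437 fallback would display.
as_cp437 = raw_bytes.decode("cp437")
# What the archive's creator actually meant.
as_cp866 = raw_bytes.decode("cp866")

print(as_cp437)
print(as_cp866)
```

The ASCII part of the name survives either way, which is why the problem only shows up on non-English archives.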

I would definitely split the standard situation into three:
- There are multiple independent standards referred to by the same name. They don’t claim to have any form of compatibility with each other. (This is the regex situation.)
- There is one “core” standard. All implementations extend this standard, but you can theoretically stick to the “core” standard for compatibility with everything, even if it will be a lot of extra work. (This is the SQL situation.)
- There was no standard, just an implementation, and other implementations just tried to copy how this implementation behaved, and possibly extend it. (This is the Markdown situation.)
Of course, you have the whole issue of “implementing it faithfully” on top of this.

einacio

Maybe instead of a YouTube video, you can just share the relevant xkcd and not have a competing medium trying to be the standard rant regarding standards 🤣

masklinn

Answer is missing an intermediate level: FA-based engines tend to have PCRE features, but not all of them (specifically not lookaheads or lookbehinds).

> special characters \n, \t, etc.; word boundaries \b and \B, word constituents \w and \W, …

Note: the backslashed character classes (\d, \w, \s, …) are Perl / PCRE extensions to EREs.
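Even where the backslashed classes exist, what they match can differ. A small sketch using Python's re module as one example engine (other engines may define \d differently, so treat this as an illustration rather than a general rule):

```python
import re

arabic_indic_three = "\u0663"  # "٣", a decimal digit in Unicode category Nd

# For str patterns, Python's \d follows Unicode unless re.ASCII is set.
unicode_digit = re.fullmatch(r"\d", arabic_indic_three) is not None
ascii_only = re.fullmatch(r"\d", arabic_indic_three, re.ASCII) is not None
bracket_class = re.fullmatch(r"[0-9]", arabic_indic_three) is not None

print(unicode_digit, ascii_only, bracket_class)
```

So a pattern ported from [0-9] to \d, or the other way round, can quietly change what it accepts.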

fanf

There’s another subtlety that I learned recently from a paper about JavaScript regexes: JavaScript doesn’t follow the traditional Unix (POSIX) leftmost-longest rule, so ambiguous ( ) matches can vary between at least Henry Spencer-style and JavaScript and possibly other different regex engines.
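Here is a small sketch of that kind of ambiguity, using Python's re module as a stand-in for a backtracking engine. It is not taken from the paper and does not reproduce any JavaScript-specific behaviour; the comments about leftmost-longest results describe what that rule asks for, not output from a POSIX engine run here.

```python
import re

# A backtracking engine tries alternatives left to right and keeps the first that works.
first = re.match(r"a|ab", "ab").group()

# Reordering the alternatives changes the result, so this is leftmost-first.
# Under a leftmost-longest rule both patterns would report "ab".
second = re.match(r"ab|a", "ab").group()

# The overall match is the same here, but how it is split between groups
# depends on the engine's rule; leftmost-longest would give group 1 "ab".
groups = re.match(r"(a|ab)(c|bcd)(d*)", "abcd").groups()

print(first, second, groups)
```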

abareplace

There are many differences between the engines, for example:

- JavaScript did not support inline flag modifiers until recently;
- Go does not support backreferences for performance reasons;
- Java requires escaping all backslashes;
- advanced PCRE features based on backtracking (atomic grouping, conditional subpatterns, or recursive patterns) are not widely supported elsewhere.
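As an illustration of the backreference item, here is a minimal sketch in Python, whose re module is a backtracking engine and accepts \1. Per the list above, an engine without backreference support, such as Go's, would not accept this pattern; the sample sentences are made up.

```python
import re

# \1 must repeat whatever group 1 captured, so this finds a doubled word.
doubled_word = re.compile(r"\b(\w+) \1\b")

hit = doubled_word.search("it was the the best of times")
miss = doubled_word.search("it was the best of times")

print(hit.group(1) if hit else None, miss)
```

Porting a pattern like this to an engine without backreferences usually means capturing the word and comparing it in ordinary code instead.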

vifon

A small nitpick:

> Java requires escaping all backslashes;

This isn’t really related to the regexp engine, but rather to the handling of string literals in Java. The same thing affects Emacs FWIW.
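The same distinction can be sketched in Python, which has both ordinary and raw string literals. This illustrates the general point about literal syntax, not Java or Emacs specifically:

```python
import re

# Both literals produce the same three-character pattern: backslash, "d", "+".
escaped_literal = "\\d+"   # ordinary literal: the backslash must be doubled
raw_literal = r"\d+"       # raw literal: written as the regex engine sees it

same_pattern = escaped_literal == raw_literal
found = re.findall(raw_literal, "a1 b22 c333")

print(same_pattern, len(raw_literal), found)
```

The regex engine receives identical input in both cases; only the source-code spelling differs.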