Regular expressions that work “everywhere”
6 points by susam
6 points by susam
Backreferences are not supported by Rust's regex crate because of the underlying algorithm (I think it's Thompson's NFA or somethingin the similar vein.)
The problem with backrefs is they make regex matching NP-hard, so a fast algorithm should probably not include the feature. (In most cases backrefs can be replaced with a slightly more relaxed regex and a subsequent check that the relaxed substrings match the earlier occurrence as expected; but that refactoring can be stymied if the backrefs lack unambiguous delimiters.)
See also Linear Matching of JavaScript Regular Expressions which shows how to match unrestricted lookarounds quickly.
This is subtle, because there are two different things we can ask:
n?n, what is the complexity of matching it against a string of length m?And while the answer to the second is NP-hard, the answer to the first is probably polynomial (admittedly, I have never worked through the details of the construction linked in Geoff Langdale's blog). They're both reasonable questions, but I think the first one is the most natural understanding of "how complex is regular expression matching with backreferences?"
See https://news.ycombinator.com/item?id=40430093 for some discussion.
This does not change the fact that efficient regex libraries should not include backreferences. No one has found a practically efficient way to match them, even if there is a polynomial algorithm.