Ambiguity in C
16 points by icefox
16 points by icefox
I don't understand why this article insists on referring to context-sensitivity as “ambiguity”. C's syntax can't be expressed in BNF, but it's not ambiguous, although the BNF given in the standard of course is.
If you’ve ever written a parser for C, you’ll know the most annoying part is the ambiguities.
I can't really agree with that. The solution is inelegant but it's also easy, self-contained, and doesn't interact multiplicatively with other things.
OTOH a nice, user-friendly compiler wants to do things like trace tokens back through preprocessor expansions, issue context-sensitive warnings, keep looking for more errors after the first one, and so on. You can add one or two things like that to a basic compiler without major problems, but if you add all the modern conveniences you will quickly end up with a spaghetti nightmare mess where everything knows about everything else, changing any single thing takes weeks, and the whole thing will come crashing down if someone sneezes near it.
That is annoying. I'm not sure if it's specific to C though.
Yeah, the author knows what the problem is, but has decided not to use the standard solution, and is complaining at length because they have made the problem worse. D’oh.
The only correct solution is to build a type table, which we’ve already decided against
It's worse than just needing a symbol table. Distinguishing between a cast and a structured initializer of a struct type requires deep lookahead or delayed reduction. And there are places where the type check you need to do is several tokens ahead, rather than where you currently are or only a single token ahead. Also the exact semantics of where and when you update the symbol table are very obtuse; most people would probably write something that gets the following program wrong:
typedef int T;
T f(T x, T (*T)(T x))
{
return (T)(x);
}
Yeah, C’s grammar is fairly closely fitted to LALR bottom-up parsers and a bit awkward for LL.
Distinguishing between a cast and a structured initializer of a struct type requires deep lookahead or delayed reduction
Do you mean a compound literal? Isn’t that disambiguated by the { ?
the exact semantics of where and when you update the symbol table are very obtuse
Yeah the standard is often written in torturous style. For this kind of thing the standard is usually clear and explicit about where the scope of an identifier starts and ends, which tells the compiler writer when to update the symbol table.
(But there’s currently a discussion about adding more explicit coupling between pointers and their associated lengths. It’s particularly troublesome because it’s common to declare the pointer before the length, so the pointer declarator can’t use something simple like dynamically-sized array syntax because that refers forwards to the length which is not yet in scope.)
Do you mean a compound literal? Isn’t that disambiguated by the { ?
That's exactly right. But (because type names can be arbitrarily long) that { can be arbitrarily far forward from the original ( if you're trying to tell between the two before parsing the target type. And if you decide which kind of thing you're in after parsing the target type, you're basically manually left-factoring that part of the grammar and deciding after the fact what rule you were in. It works and these both keep you in O(n) territory for parsing the full source text, but it's not obvious looking at the grammar that these are your options unless you're already very experienced.
(Part of the problem is that the cast operator and compound literal expression rules are specified at different precedence levels in the grammar! In order to avoid problems around that, in my parser, I chose a third thing: my lexer pairs up brackets/parens, and I do a lookahead past the current (...) pair and see if it's a {.)
Like you said, it's a lot easier in LR/LALR land than LL. Which, IMO, is kind of an impedance mismatch with how the symbol state updates make you want to go the LL way.
The whole last section of the post is about why the standard solution is undesirable: when you're parsing an editor buffer that contains an incomplete program more often than not, the expectation that you can parse the entire prefix (including all the header files) to make sense of a declaration is often broken.
Context-sensitivity often devolves to ambiguity when you don't have the context, which is the use case the author cares about here.
More generally it is totally possible to design a language where all these user-friendly features are implementable without a spaghetti nightmare mess. C just happens to have made a bunch of choices that make that difficult. (Though it may be hard to blame its creators, who were working in a very different environment.)
Jim Roskind also has a writeup about some of these issues. It includes the funny line:
since you copied the grammar from the ANSI Standard, your grammar is probably wrong.