Hundred year mistakes

by ericlippert, from Fabulous adventures in coding

My manager and I got off on a tangent in our most recent one-on-one on the subject of the durability of design mistakes in programming languages. A particular favourite of mine is the worst of the operator precedence problems of C; the story of how it came about is an object lesson in how sometimes gradual evolution produces weird results. Since he was not conversant with all the details, I thought I might write it up and share the story today.

First off, what is the precedence of operators in C? For our purposes today we'll consider just three operators: &&, & and ==, which I have listed in order of increasing precedence.

What is the problem? Consider:

int x = 0, y = 1, z = 0;
int r = (x & y) == z; // 1
int s = x & (y == z); // 0
int t = x & y == z;   // ?

Remember that before 1999, C had no Boolean type and that the result of a comparison is either zero for false, or one for true.

Is t supposed to equal r or s?

Many people are surprised to find out that t is equal to s! Because == is higher precedence than &, the comparison result is an input to the &, rather than the & result being an input to the comparison.

Put another way: reasonable people think that

x & y == z

should be parsed the same as

x + y == z

but it is not.
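
To see the parse concretely, here is a minimal compilable sketch of the fragment above; the printf line is my own addition, and compiling it as C prints r=1 s=0 t=0:

#include <stdio.h>

int main(void) {
    int x = 0, y = 1, z = 0;
    int r = (x & y) == z; // (0 & 1) == 0  ->  1
    int s = x & (y == z); // 0 & (1 == 0)  ->  0
    int t = x & y == z;   // parsed as x & (y == z), so 0, not 1
    printf("r=%d s=%d t=%d\n", r, s, t); // prints: r=1 s=0 t=0
    return 0;
}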

What is the origin of this egregious error that has tripped up countless C programmers? Let's go way back in time to the very early days of C. In those days there was no && operator. Rather, if you wrote

if (x() == y & a() == b) consequence;

the compiler would generate code as though you had used the && operator; that is, this had the same semantics as

if (x() == y) if (a() == b) consequence;

so that a() is not called if the left hand side of the & is false. However, if you wrote

int z = q() & r();

then both sides of the & would be evaluated, and the results would be binary-anded together.

That is, the meaning of & was context sensitive; in the condition of an if or while it meant what we now call &&, the "lazy" form, and everywhere else it meant binary arithmetic, the "eager" form.
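
In modern C terms, those two behaviours of the old & correspond to today's && and &. A small sketch, with made-up side-effecting functions a and b purely for illustration, shows the difference:

#include <stdio.h>

// Hypothetical side-effecting predicates, just for illustration.
int a(void) { puts("a() called"); return 0; }
int b(void) { puts("b() called"); return 1; }

int main(void) {
    // The "lazy" form: a() returns 0, so the right-hand side is
    // never evaluated and b() is not called.
    if (a() && b()) puts("both nonzero");

    // The "eager" form: both a() and b() are called, and their
    // results are bitwise-anded together.
    int z = a() & b();
    printf("z=%d\n", z); // prints z=0
    return 0;
}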

However, in either context the & operator was lower precedence than the == operator. We want

if(x() == y & a() == b)

to be

if((x() == y) & (a() == b))

and certainly not

if((x() == (y & a())) == b)

This context-sensitive design was quite rightly criticized as confusing, and so Dennis Ritchie, the designer of C, added the && operator, so that there were now separate operators for bitwise-and and short-circuit-and.

The correct thing to do at this point from a pure language design perspective would have been to make the operator precedence ordering &&, ==, &. This would mean that both

if(x() == y && a() == b)

and

if(x() & a() == y)

would mean exactly what users expected.

However, Ritchie pointed out that doing so would cause a potential breaking change. Any existing program that had the fragment if(a == b & c == d) would remain correct if the precedence order was &&, &, ==, but would become an incorrect program if the operator precedence was changed without also updating it to use &&.
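
To make the breakage concrete, here is a small sketch that spells out both readings of that fragment with explicit parentheses; the variable values are my own, chosen so the two readings disagree:

#include <stdio.h>

int main(void) {
    int a = 1, b = 1, c = 2, d = 2;

    // How if(a == b & c == d) actually parses, with & below ==:
    int old_reading = (a == b) & (c == d); // 1 & 1  ->  1

    // How the same fragment would parse if & were moved above ==:
    int new_reading = (a == (b & c)) == d; // (1 == 0) == 2  ->  0

    printf("old=%d new=%d\n", old_reading, new_reading); // old=1 new=0
    return 0;
}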

There were several hundred kilobytes of existing C source code in the world at the time. SEVERAL HUNDRED KB. What if you made this change to the compiler and failed to update one of the & to &&, and made an existing program wrong via a precedence error? That's a potentially disastrous breaking change.

You might say "just search all the source code for that pattern," but this was two years before grep was invented! The tooling of the day was as primitive as can be.

So Ritchie made the precedence order &&, &, ==, preserving backwards compatibility forever and effectively adding a little bomb to C that goes off every time someone treats & as though it parses like +, all to stay compatible with a version of C that only a handful of people ever used.

But wait, it gets worse.

C++, Java, JavaScript, C#, PHP and who knows how many other languages largely copied the operator precedence rules of C, so they all have this bomb in them too. (Swift, Go, Ruby and Python get it right.) Fortunately it is mitigated somewhat in languages that impose type system constraints; in C# it is an error to treat an int as a bool, but still it is vexing to require parentheses where they ought not to be necessary were there justice in the world. (And the problem is also mitigated in more modern languages by providing richer abstractions that obviate the need for frequent bit-twiddling.)

The moral of the story is: The best time to make a breaking change that involves updating existing code is now, because the bad designs that result from maintaining backwards compat unnecessarily can have repercussions for decades, and the amount of code to update is only going to get larger. It was a mistake to not take the breaking change when there were only a few tens of thousands of lines of C code in the world to update. It's fifty years since this mistake was made, and since it has become embedded in popular successor languages we'll be dealing with its repercussions for fifty more at least, I'd wager.

UPDATE: The most common feedback I've gotten from this article is "you should always use parentheses when it is unclear". Well, obviously, yes. But that rather misses the point, which is that there is no reason for the novice developer to suppose that the expression x & y == z is under-parenthesized when x + y == z works as expected. The design of a language should lead us to naturally write correct code without having to think "will I be punished for my arrogance in believing that code actually does what it looks like it ought to?"
