Rob Pike's 5 Rules of Programming
63 points by dhruvp
In my experience #5 has proven to be true far more often than you would expect. Shuffle data around into the right places and use the right structures, and everything else usually writes itself.
It smells a bit like Fred Brooks'
Show me your flowchart and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowchart; it’ll be obvious.
What's interesting to me about #5 is that Go (which Rob Pike co-created) famously is lacking any sort of tagged union data structure. Maybe I'm spoiled, but to me that seems like such a fundamental data type.
Rule 5. Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.
Reading stuff like this feels just a tad ironic, when one considers how Go went about doing things. Yes, I know the language is wildly successful, but it's also such a pain in the neck in its error handling and not having any sum or union types.
I know he wasn't the only person who designed Go, but he is easily the "face" of the language.
It's one of those weird places where people will declare it a matter of experience to know when to ignore the guidelines. For example, I immediately noticed that rules #3 and #4 would indicate that we should avoid using hash maps. It's much simpler to just have an array of key-value pairs (aka an association list). Sure, look-up and deletion are now O(n), but rule #3 tells us not to care about that. Plus, hashing is a fancy algorithm that's hard to get right, while string comparison is really simple (and might also be O(n), but we still don't care about that). However, Rob Pike had enough experience to know to ignore rules #3 and #4 when adding hash maps to the language.
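The association list the comment describes really is about as "stupid" as a data structure gets. A minimal Go sketch (names are mine, not from the comment):

```go
package main

import "fmt"

// pair is one entry in an association list: the deliberately
// simple structure rules #3 and #4 would point you toward.
type pair struct {
	key, value string
}

// get does a linear O(n) scan with plain string comparison,
// exactly the simplicity-over-speed trade-off described above.
func get(alist []pair, key string) (string, bool) {
	for _, p := range alist {
		if p.key == key {
			return p.value, true
		}
	}
	return "", false
}

func main() {
	alist := []pair{{"go", "gopher"}, {"rust", "crab"}}
	v, ok := get(alist, "rust")
	fmt.Println(v, ok) // crab true
}
```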
I find most advice about programming is like advice about roulette. There's one simple rule for making money at roulette: Always bet on black. Once you have the experience to know when not to follow that advice, you'll win 96% of the spins.
rules #3 and #4 would indicate that we should avoid using hash maps
True if you mean the literal hash-map implementation, not true if you mean the associative map interface with certain performance characteristics. Which is a great excuse to bring up my favorite-not-favorite bit of lisp history, in which a bunch of Common Lispers absolutely flame Erann Gat to hell for suggesting that it would be nice for a runtime to provide associative maps which automatically switch from alists to hash maps as they grow. Which is exactly what Clojure later did and it was indeed quite pleasant and useful.
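The promotion idea is easy to sketch. Here is a toy Go version of an associative map that starts as a slice of pairs and switches to a hash map once it grows; the threshold and all names are illustrative, not taken from Clojure or any real runtime:

```go
package main

import "fmt"

const threshold = 8 // illustrative cutoff for promotion

type entry struct{ k, v string }

// hybridMap stores entries in a small slice while it is small,
// and promotes itself to a real map once it passes the threshold.
type hybridMap struct {
	small []entry           // used before promotion
	big   map[string]string // non-nil once promoted
}

func (m *hybridMap) Set(k, v string) {
	if m.big != nil {
		m.big[k] = v
		return
	}
	for i := range m.small {
		if m.small[i].k == k {
			m.small[i].v = v
			return
		}
	}
	m.small = append(m.small, entry{k, v})
	if len(m.small) > threshold {
		// Promote: copy every entry into a hash map.
		m.big = make(map[string]string, len(m.small))
		for _, e := range m.small {
			m.big[e.k] = e.v
		}
		m.small = nil
	}
}

func (m *hybridMap) Get(k string) (string, bool) {
	if m.big != nil {
		v, ok := m.big[k]
		return v, ok
	}
	for _, e := range m.small {
		if e.k == k {
			return e.v, true
		}
	}
	return "", false
}

func main() {
	var m hybridMap
	for i := 0; i < 20; i++ {
		m.Set(fmt.Sprintf("k%d", i), fmt.Sprintf("v%d", i))
	}
	v, ok := m.Get("k15")
	fmt.Println(v, ok, m.big != nil) // v15 true true
}
```

The point is that the caller only sees the associative-map interface; the alist-versus-hash-map choice stays an implementation detail.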
Most of the issues #4 is trying to avoid are caused by on-the-fly DIY implementations, not by implementing a fancy algorithm like Bagwell's tries once at the language level, which IMO doesn't count as a fancy data structure from the POV of the user.
A lot of the optimisations I've done at my current job have been converting (association) lists into Maps or Sets. For N=100 it already makes a difference. (Also removing code that constructs a string for each character.)
If the language already provides you with a Map/Set/Hashmap implementation in its standard library then I think that counts as simple, and you should use them if it makes sense. It might also express constraints better (such as not having duplicate entries).
Things are a bit different in C where you'd have to hand-roll one.
I see Go's error handling as a consequence of rule #4. Sure, it's verbose and redundant, but you'll have a hard time arguing that it's fancy or hard to implement. The error type also has a very simple and coherent structure, which makes it really reliable for catching and handling errors.
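For anyone who hasn't written Go, the verbose-but-plain style being described looks like this; `parsePort` is my own illustrative example, not from the comment:

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// Errors are ordinary values returned alongside results and
// checked explicitly at every call site.
func parsePort(s string) (int, error) {
	n, err := strconv.Atoi(s)
	if err != nil {
		return 0, fmt.Errorf("parse port %q: %w", s, err)
	}
	if n < 1 || n > 65535 {
		return 0, errors.New("port out of range")
	}
	return n, nil
}

func main() {
	if _, err := parsePort("99999"); err != nil {
		fmt.Println("error:", err) // error: port out of range
	}
}
```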
And the same goes for unions. I like unions, but they are "complex" data types. Explicit and rigid typing makes it easier to create "stupid" data structures that are easier to deal with.
I think calling tagged unions complex is a mistake. They're not as common as structs, but "value a and value b" (struct) is about as complex as "value a or value b" (sum types).
I exaggerated and used quotes because I know this is not really complex in essence. My point is that of all data types, unions are the first one to ignore because they're redundant with structs, and are context-dependent, which can make their use more "complex" than other types. If you try to make the simplest language possible, supporting unions would probably not be your first goal.
I like unions, but they are "complex" data types.
What makes you think so? Algebraic data types (like Haskell's data or Rust's enum) are from the 1970s. They're described in the "simple extensions" chapter of Types and Programming Languages, after structs and before lists. I think they're harder to implement in practice than the type theory for them would suggest -- there's a bunch of details around match behavior -- but even so they're easy. Much easier than type parameters, which I understand why the Go developers were hesitant to add before they were certain it would be worth the complexity.
(Union types are algebraic data types where the language doesn't store the discriminant and requires the programmer to store it, and has bad behavior if they get it wrong. I doubt unions were ever on the table for Go because they're so extremely unsafe, but algebraic data types should have been.)
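Since Go lacks sum types, the usual workaround is a "sealed" interface with an unexported marker method. A minimal sketch of emulating `Shape = Circle | Rect` that way (names are illustrative); note the compiler cannot check the switch for exhaustiveness the way a real ADT's pattern match would be checked:

```go
package main

import "fmt"

// Shape plays the role of the sum type; the unexported method
// prevents types outside this package from implementing it.
type Shape interface{ isShape() }

type Circle struct{ R float64 }
type Rect struct{ W, H float64 }

func (Circle) isShape() {}
func (Rect) isShape()   {}

func area(s Shape) float64 {
	switch s := s.(type) {
	case Circle:
		return 3.14159 * s.R * s.R
	case Rect:
		return s.W * s.H
	default:
		// Unlike a real sum type, nothing proves this switch is
		// exhaustive, so a runtime escape hatch is needed.
		panic("unhandled shape")
	}
}

func main() {
	fmt.Println(area(Rect{W: 3, H: 4})) // 12
}
```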
While I agree with the conclusion, at the very least Go's decision to treat errors as data is in line with this, even if the ergonomics are not good.
Pike's rules 1 and 2 restate Tony Hoare's famous maxim "Premature optimization is the root of all evil."
Weird, I thought that was a Knuth quote.
There is no doubt that the grail of efficiency leads to abuse. Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.
Donald E. Knuth, Structured Programming with go to Statements, Computing Surveys, Vol. 6, No. 4, December 1974, pg. 268 (link)
The next paragraph is also pretty good:
Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified. It is often a mistake to make a priori judgments about what parts of a program are really critical, since the universal experience of programmers who have been using measurement tools has been that their intuitive guesses fail. After working with such tools for seven years, I've become convinced that all compilers written from now on should be designed to provide all programmers with feedback indicating what parts of their programs are costing the most; indeed, this feedback should be supplied automatically unless it has been specifically turned off.
I like the sound of 5, but I have to admit I don’t immediately know or understand how to evaluate if I’ve chosen the right data structures.
It's a simple flow chart:
mutable vs immutable (depends on whether your language supports immutable data structures; if it does, immutable data structures are usually easier to reason about)
thread safety (if it isn't thread safe, document it; if it is, add unit tests that actually use multiple threads)
make invalid states unrepresentable
at the external API level all operations on your data structure must ensure it maintains desirable invariants (it may temporarily break those invariants in the implementation as long as a caller cannot observe that, e.g. in multithreaded code)
flexibility vs hard coding (e.g. using a 'user' type, even if the user is currently a string)
units of measure (try to avoid having everything as integers, especially when using different quantities or values at different scales; it is easy to substitute one for the other and the compiler won't help). Same for strings: if everything is a string you don't really have much type safety. E.g. a record/variant containing a string is better, because you get compile-time errors if you use the wrong one.
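The units-of-measure point is cheap to get in Go with defined types; a minimal sketch (the unit names are mine):

```go
package main

import "fmt"

// Meters and Feet are both float64 underneath, but the compiler
// refuses to mix them, so a unit bug becomes a compile error
// instead of a silently wrong answer.
type Meters float64
type Feet float64

func MetersToFeet(m Meters) Feet { return Feet(m * 3.28084) }

func main() {
	d := Meters(100)
	fmt.Println(MetersToFeet(d))
	// MetersToFeet(Feet(100)) // compile error: Feet is not Meters
}
```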
It might be easier to identify the wrong data structures:
Do you find yourself using two complementary types/structs/classes in almost every function that operates on either? Maybe they should be bundled together
Do you find yourself regularly using just one or two fields from a big type? Maybe it should be split apart
Are your loops slow even though the inner computations are fast? You might want to switch to arrays of primitive types (or struct of array patterns, etc.)
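The array-of-structs versus struct-of-arrays switch mentioned in the last point can be sketched in Go; `Particle` and its fields are illustrative:

```go
package main

import "fmt"

// Array-of-structs: each X is 56 bytes from the next, so a loop
// over X drags the cold Label data through the cache too.
type Particle struct {
	X, Y, Z float64
	Label   [32]byte
}

func sumXAoS(ps []Particle) float64 {
	var s float64
	for i := range ps {
		s += ps[i].X
	}
	return s
}

// Struct-of-arrays: all Xs are contiguous, so the same loop
// touches only the data it needs.
type Particles struct {
	X, Y, Z []float64
	Label   [][32]byte
}

func sumXSoA(ps Particles) float64 {
	var s float64
	for _, x := range ps.X {
		s += x
	}
	return s
}

func main() {
	aos := []Particle{{X: 1}, {X: 2}, {X: 3}}
	soa := Particles{X: []float64{1, 2, 3}}
	fmt.Println(sumXAoS(aos), sumXSoA(soa)) // 6 6
}
```

Both compute the same thing; the difference only shows up as cache behavior once N is large.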
Broadly, learn the APIs for the major data structures. If you find yourself doing a lot of, e.g., lookups on an array, try switching to a dict.
Tables are often underused, they get you a lot of the performance benefits of arrays, a host of analytics functions (like summing over a column), and the row-level bundling of a list of structs (so you don't accidentally pull two fields belonging to different records.)
There is a tradeoff in Rule 4 about using simpler data structures. A list is simpler than a Map or Set, but it is the wrong choice in many situations, and can be a lot slower for large N.
A good rule of thumb is to use the data structure that most naturally models the problem domain and its constraints. That at least gets you in approximately the right kind of big O complexity. Then, if needed, you can fine tune what kind of tree you use/etc based on benchmarks/profiling, but you've avoided the most obvious traps (such as O(N^2) time to load N elements if you used lists)
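The O(N^2) trap mentioned above is easy to see side by side. A small Go sketch of deduplicating a list both ways (function names are mine):

```go
package main

import "fmt"

// uniqueSlice keeps uniqueness with a slice: every insert does a
// linear "contains" scan, so loading N items costs O(N^2).
func uniqueSlice(items []string) []string {
	var out []string
	for _, it := range items {
		found := false
		for _, u := range out { // O(N) scan per item
			if u == it {
				found = true
				break
			}
		}
		if !found {
			out = append(out, it)
		}
	}
	return out
}

// uniqueMap models the "set of seen items" directly with a map,
// making each insert O(1) on average and the load O(N).
func uniqueMap(items []string) []string {
	seen := make(map[string]bool, len(items))
	var out []string
	for _, it := range items {
		if !seen[it] {
			seen[it] = true
			out = append(out, it)
		}
	}
	return out
}

func main() {
	in := []string{"a", "b", "a", "c", "b"}
	fmt.Println(uniqueSlice(in)) // [a b c]
	fmt.Println(uniqueMap(in))   // [a b c]
}
```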
You can see Pike's Rule #5 in his paper The Text Editor Sam, https://research.swtch.com/sam.pdf
Pike describes a number of the data structures used by sam in depth.
Pike also talks about using Gerard Holzmann's spin protocol verifier. Pike must have known about CSP style concurrency for decades before designing Go - the spin implementation language uses CSP.
Pike must have known about CSP style concurrency for decades before designing Go - the spin implementation language uses CSP
Rob Pike worked at the start of the 80s on PAN, SPIN's ancestor, with Holzmann. And then he worked throughout the 80s on Squeak (not the Smalltalk implementation) then Newsqueak. Go's concurrency has indeed its roots in that lineage, along Alef and Limbo.
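The visible descendant of that lineage in Go is goroutines communicating over typed channels. A minimal sketch of the style (my example, not from any of those languages):

```go
package main

import "fmt"

// squares is a CSP-style generator: an independent process
// sending values over a channel, which it closes when done.
func squares(n int) <-chan int {
	ch := make(chan int)
	go func() {
		for i := 1; i <= n; i++ {
			ch <- i * i
		}
		close(ch)
	}()
	return ch
}

func main() {
	sum := 0
	for v := range squares(3) {
		sum += v
	}
	fmt.Println(sum) // 14
}
```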
If the topic is of interest, I can recommend Russ Cox's article; it's a good reference.