SRE School: No Haunted Forests

All industrial codebases contain bad code. To err is human, and situations get very human when you’re staring down the barrel of a launch deadline. You’ve heard the euphemism “tech debt”, where, like a car loan, you hold a recurring obligation in exchange for immediate liquidity. But this is misleading: bad code is not merely overhead, it also reduces optionality for all teams that come in contact with it. Imagine being unable to get indoor plumbing because your neighbor has a mortgage!

Thus a better analogy for bad code is a haunted forest. Bad code negatively affects everything around it, so engineers will write ad-hoc scripts and shims to protect themselves from direct contact with the bad code. After the authors move to other projects, their hard work will join the forest.

Healthy engineering orgs do not tolerate the presence of haunted forests. When one is discovered you must move vigorously to contain, understand, and eradicate it.

Make this the motto of your team: No Haunted Forests!

Engineer debugging a Puppet manifest (2018, colorized)

Identifying a Haunted Forest

Not all intimidating or unmaintained codebases are haunted forests. Code may be difficult for a newcomer to come up to speed on, or it might be a stable implementation of some RFC. A couple of rules of thumb to identify code worthy of a complete rewrite:

Haunted Environmentalists

Fresh graduates often push for a rewrite at the first sign of complexity, because they’ve spent the last four years in an environment where codebase lifetimes are measured in weeks. After their first unsuccessful rewrite they will evolve into Junior Engineers, repeating the parable of Chesterton’s Fence and linking to that old Joel Spolsky thunkpiece about Netscape [3].

Be careful not to confuse this reactive anti-rewrite sentiment with true objections to your particular rewrite. Remind them that Joel wrote that when source control meant CVS.

Clearing Haunted Forests

Rewriting an existing codebase should be modeled as a special case of a migration. Don’t try to replace the whole thing at once: systematize how users interact with the existing code, insert strong API boundaries between subsystems, and make changes intentionally.

User Interaction will make or break your rewrite. You must understand what the touch-points are for users of the existing system, so that you can maintain UI compatibility and avoid exposing them to unexpected changes. Often rewrites mandate some changes, so try to put them all near the start (if you know what the final state should be) or delay them to the end (when you can make it seem like a big-bang migration). If the user-facing changes are significant, see if you can arrange for separate opt-in and opt-out periods during which both interaction modes co-exist.

Subsystem API Boundaries let you carve up the old system into chunks that are easier to reason about. Be fairly strict about this: run the components in separate processes, separate machines, or whatever is needed to guarantee that your new API is the only mechanism they have to communicate. Do this recursively until the components are small enough that rewriting them from scratch is tedious instead of frightening.
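As a rough sketch of what such a boundary might look like in Python: the `UserStore` interface, both implementations, and the `welcome_message` caller below are all hypothetical names invented for illustration. The sketch keeps everything in one process for brevity, which is weaker than the process-level or machine-level isolation recommended above; it only shows the shape of the interface.

```python
from __future__ import annotations

from typing import Protocol


class UserStore(Protocol):
    """Hypothetical boundary: the only way callers may reach user data."""

    def get_email(self, user_id: int) -> str: ...


class LegacyUserStore:
    """Adapter that hides the old code path behind the new boundary."""

    def __init__(self, legacy_rows: dict[int, dict[str, str]]) -> None:
        self._rows = legacy_rows

    def get_email(self, user_id: int) -> str:
        # The old system's quirks (untrimmed, mixed-case values under an
        # upper-case key) stay inside the adapter and never leak to callers.
        return self._rows[user_id]["EMAIL"].strip().lower()


class NewUserStore:
    """Rewritten component; normalizes on write instead of on read."""

    def __init__(self) -> None:
        self._emails: dict[int, str] = {}

    def add(self, user_id: int, email: str) -> None:
        self._emails[user_id] = email.strip().lower()

    def get_email(self, user_id: int) -> str:
        return self._emails[user_id]


def welcome_message(store: UserStore, user_id: int) -> str:
    # Callers depend only on the boundary, so either implementation works.
    return f"Welcome, {store.get_email(user_id)}!"
```

Because `welcome_message` only sees `UserStore`, swapping the legacy adapter for the rewritten component is a one-line change at the call site, and the two can be compared side by side before the swap.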

Intentional Changes happen when the new codebase’s behavior is forced to deviate from the old. At this point you should have a good idea which behavior, if either, is correct. If there’s no single correct behavior, it’s fine to settle for “predictable” or (in the limit) “deterministic”. By making changes intentionally you minimize the chances of forced rollbacks, and may even be able to detect users depending on the old behavior.
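One way to make deviations intentional is to run both implementations side by side and classify every difference. The sketch below is an assumption about how that could be done, not a prescribed method: `ShadowRunner`, `old_slug`, `new_slug`, and `has_irregular_whitespace` are hypothetical names, and the lower-casing in `new_slug` is a deliberately planted regression so the example has something unexpected to catch.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


def old_slug(title: str) -> str:
    # Old behavior: stray whitespace turns into stray dashes.
    return title.replace(" ", "-")


def new_slug(title: str) -> str:
    # Intended change: collapse whitespace. Planted bug: also lower-cases.
    return "-".join(title.lower().split())


def has_irregular_whitespace(title: str) -> bool:
    # Inputs for which a difference between old and new is expected.
    return title != " ".join(title.split())


@dataclass
class ShadowRunner:
    old: Callable[[str], str]
    new: Callable[[str], str]
    is_intentional: Callable[[str], bool]
    depends_on_old: list[str] = field(default_factory=list)
    unexpected: list[tuple[str, str, str]] = field(default_factory=list)

    def __call__(self, request: str) -> str:
        old_result = self.old(request)
        new_result = self.new(request)
        if old_result != new_result:
            if self.is_intentional(request):
                # A caller is exercising behavior we plan to change.
                self.depends_on_old.append(request)
            else:
                # A difference nobody signed off on.
                self.unexpected.append((request, old_result, new_result))
        return old_result  # the old system stays authoritative for now
```

Anything that lands in `unexpected` is a bug in the rewrite (or in your understanding of the old system), while `depends_on_old` is a list of users to warn before the switch.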

Work incrementally. A good rewrite is valid and fully functional at any given checkpoint, which might be commits or nightly builds or tagged releases. The important thing is that you never get into a state where you’re forced to roll back a functional part of the new system due to breakage in another part.

Common Features of Haunted Forests

All bad code is bad in its own special way, but there are some properties that are especially likely to make it hard to refactor incrementally. These are generally programming styles that hide state, obscure control flow, or permit type confusion.

Hidden State means mutable global variables and dynamic scoping. Both of these inhibit a reader’s understanding of what code will do, and force them to resort to logging or debuggers. They’re like catnip for junior developers, who value succinct code but haven’t yet been forced to debug someone else’s succinct code at 3 AM on a Sunday.
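A minimal illustration of the difference, using made-up names (`set_region`, `bucket_name_hidden`, `Config`, and `bucket_name` are all hypothetical):

```python
from dataclasses import dataclass

# Hidden state: the result depends on a module-level variable that any
# caller, anywhere in the program, may have changed.
_current_region = "us-east"


def set_region(region: str) -> None:
    global _current_region
    _current_region = region


def bucket_name_hidden(service: str) -> str:
    # Nothing in the signature tells the reader that region matters.
    return f"{service}-{_current_region}"


# Explicit state: everything the function depends on is in its signature.
@dataclass(frozen=True)
class Config:
    region: str


def bucket_name(service: str, config: Config) -> str:
    return f"{service}-{config.region}"
```

The first version is shorter at every call site, which is exactly its appeal. The second can be understood, and tested, by reading the function alone.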

Non-Local Control Flow prevents a reader from understanding what path execution will take. In the old times this meant setjmp and longjmp, but nowadays you’ll see it in the form of callbacks and event loops. Python’s Twisted and Ruby’s EventMachine can easily turn into global callback dispatchers, preventing static analysis and rendering stack traces useless.
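To see why stack traces stop helping, here is a toy callback dispatcher. It is a sketch built only on the standard library, not a model of Twisted or EventMachine internals, and `schedule`, `run_loop`, `submit_order`, and `charge_card` are hypothetical names.

```python
from __future__ import annotations

import traceback
from collections import deque
from typing import Callable

# A global queue of pending callbacks, standing in for an event loop.
_queue: deque[Callable[[], None]] = deque()


def schedule(callback: Callable[[], None]) -> None:
    _queue.append(callback)


def run_loop() -> list[list[str]]:
    """Run queued callbacks; return the traceback function names per failure."""
    failures: list[list[str]] = []
    while _queue:
        callback = _queue.popleft()
        try:
            callback()
        except Exception as exc:
            frames = traceback.extract_tb(exc.__traceback__)
            failures.append([frame.name for frame in frames])
    return failures


def charge_card() -> None:
    raise RuntimeError("card declined")


def submit_order() -> None:
    # Returns immediately; charge_card runs later, from inside run_loop.
    schedule(charge_card)
```

Calling `submit_order()` and then `run_loop()` yields a single failure whose traceback names only `run_loop` and `charge_card`. The function that actually decided to charge the card, `submit_order`, has already returned and appears nowhere.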

Dynamic Types require careful and thoughtful programming practices to avoid turning into “type soup”. Highly magical metaprogramming like __getattr__ or method_missing is trivially easy to abuse in ways that make even trivial bug fixes too risky to attempt. Tooling such as Mypy and Flow can help here, but introducing them into an existing haunted forest is unlikely to have significant impact. Use them in the new codebase from the start, and they might be able to reclaim portions of the original code.
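A small sketch of the problem, with hypothetical `MagicRecord` and `Host` classes: the first resolves attributes at runtime through `__getattr__`, the second declares them up front.

```python
from dataclasses import dataclass


class MagicRecord:
    """Attribute access is resolved at runtime via __getattr__."""

    def __init__(self, **fields: object) -> None:
        self._fields = fields

    def __getattr__(self, name: str) -> object:
        # Only called when normal lookup fails. Private names are rejected
        # so that a missing _fields cannot cause infinite recursion.
        if name.startswith("_"):
            raise AttributeError(name)
        # Any other name "exists": a typo quietly evaluates to None.
        return self._fields.get(name)


@dataclass(frozen=True)
class Host:
    """Declared fields: a misspelled attribute fails loudly."""

    name: str
    port: int
```

With `MagicRecord(name="db1", port=5432)`, reading `.prot` instead of `.port` returns `None` and the bug surfaces somewhere far downstream. The same typo on a `Host` raises `AttributeError` at the point of the mistake, and a checker such as Mypy can report it before the code runs at all.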

Distributed Systems can become haunted forests through sheer size, if no single person is capable of understanding the entire API surface they provide. Note that microservices don’t automatically prevent this, because merely splitting up a monolith turns the internal structure into API surface. Each of the above per-process issues has distributed analogues: for example, S3 is global mutable state and JSON-over-HTTP is dynamically typed.

  1. A codebase where nobody knows what behavior it currently has is materially different from one where nobody understands what behavior it should have. The former doesn’t need to be rewritten, because you can grind its test coverage up and then safely refactor.
  2. The real reason Netscape failed is they wrote a dreadful browser, then spent three years writing a second dreadful browser. The fourth rewrite (Firefox) briefly had a chance at being the most popular browser, until Google’s rewrite of Konqueror took the lead. The moral of this story: rewrites are a good idea if the new version will be better.