Chesterton's Fence and linking to that old Joel Spolsky thunkpiece about Netscape
Be careful not to confuse this reactive anti-rewrite sentiment with true objections to your particular rewrite. Remind them that Joel wrote that when source control meant CVS.
Rewriting an existing codebase should be modeled as a special case of a migration. Don't try to replace the whole thing at once: systematize how users interact with the existing code, insert strong API boundaries between subsystems, and make changes intentionally.
User Interaction will make or break your rewrite. You must understand what the touch-points are for users of the existing system, so that you can maintain UI Compatibility and avoid exposing them to unintended changes. Often rewrites mandate some changes, so try to put them all near the start (if you know what the final state should be) or delay them to the end (when you can make it seem like a big-bang migration). If the user-facing changes are significant, see if you can arrange for separate opt-in and opt-out periods during which both interaction modes co-exist.
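One way to picture the separate opt-in and opt-out periods is as a small rollout switch. This is only a sketch: the names RolloutPhase, uses_new_interface, and preference are hypothetical, and a real system would presumably store the phase and each user's choice somewhere durable.

```python
from enum import Enum
from typing import Optional


class RolloutPhase(Enum):
    OPT_IN = "opt-in"    # old interface is the default; users may try the new one
    OPT_OUT = "opt-out"  # new interface is the default; users may stay on the old one


def uses_new_interface(phase: RolloutPhase, preference: Optional[bool]) -> bool:
    """Return True if this user should see the new interface.

    `preference` is the user's explicit choice (True = new, False = old),
    or None if they have not chosen anything.
    """
    if preference is not None:
        return preference
    # No explicit choice: fall back to the default for the current phase.
    return phase is RolloutPhase.OPT_OUT
```

In both phases an explicit choice wins, so both interaction modes co-exist; only the default flips between the two periods.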
Subsystem API Boundaries let you carve up the old system into chunks that are easier to reason about. Be fairly strict about this: run the components in separate processes, separate machines, or whatever is needed to guarantee that your new API is the only mechanism they have to communicate. Do this recursively until the components are small enough that rewriting them from scratch is tedious instead of frightening.
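As a rough in-process illustration of such a boundary (the paragraph above recommends going further, with separate processes or machines), the sketch below puts an old and a new implementation behind one narrow interface. Everything here is hypothetical: PriceLookup, the string-formatted legacy price table, and order_total are invented for the example.

```python
from typing import Protocol


class PriceLookup(Protocol):
    """The only surface callers may use to reach the pricing subsystem."""

    def price_cents(self, sku: str) -> int: ...


class LegacyPriceLookup:
    """Adapter that keeps the old subsystem's quirks behind the boundary."""

    def __init__(self, legacy_table: dict[str, str]) -> None:
        # Assumption for the example: the old code stored prices as
        # strings such as "12.50" or "3".
        self._table = legacy_table

    def price_cents(self, sku: str) -> int:
        dollars, _, cents = self._table[sku].partition(".")
        # Pad or truncate the fractional part to exactly two digits.
        return int(dollars) * 100 + int((cents + "00")[:2])


class NewPriceLookup:
    """Rewritten subsystem; callers cannot tell it from the legacy one."""

    def __init__(self, prices: dict[str, int]) -> None:
        self._prices = prices

    def price_cents(self, sku: str) -> int:
        return self._prices[sku]


def order_total(lookup: PriceLookup, skus: list[str]) -> int:
    # Caller code depends only on the boundary, never on an implementation.
    return sum(lookup.price_cents(sku) for sku in skus)
```

Once callers depend only on PriceLookup, swapping LegacyPriceLookup for NewPriceLookup is a change on one side of the boundary.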
Intentional Changes happen when the new codebase's behavior is forced to deviate from the old. At this point you should have a good idea which behavior, if either, is correct. If there's no single correct behavior, it's fine to settle for "predictable" or (in the limit) "deterministic". By making changes intentionally you minimize the chances of forced rollbacks, and may even be able to detect users depending on the old behavior.
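One possible way to find out where the new behavior deviates is to run both implementations over the same inputs and record every disagreement. The sketch below assumes two hypothetical slug functions, legacy_slug and new_slug; the helper find_divergences is likewise invented for illustration.

```python
from typing import Callable, Iterable, List, Tuple


def legacy_slug(title: str) -> str:
    # Hypothetical old behavior: every space becomes a hyphen.
    return title.lower().replace(" ", "-")


def new_slug(title: str) -> str:
    # Hypothetical new behavior: runs of whitespace collapse to one hyphen.
    return "-".join(title.lower().split())


def find_divergences(
    old: Callable[[str], str],
    new: Callable[[str], str],
    inputs: Iterable[str],
) -> List[Tuple[str, str, str]]:
    """Return (input, old_output, new_output) for every input that differs."""
    divergences = []
    for value in inputs:
        old_out, new_out = old(value), new(value)
        if old_out != new_out:
            divergences.append((value, old_out, new_out))
    return divergences
```

Each recorded divergence is a decision point: keep the old behavior, adopt the new one, or go looking for users who depend on the difference.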
Work incrementally. A good rewrite is valid and fully functional at any given checkpoint, which might be commits or nightly builds or tagged releases. The important thing is that you never get into a state where you're forced to roll back a functional part of the new system due to breakage in another part.
All bad code is bad in its own special way, but there are some properties that are especially likely to make it hard to refactor incrementally. These are generally programming styles that hide state, obscure control flow, or permit type confusion.
Hidden State means mutable global variables and dynamic scoping. Both of these inhibit a reader's understanding of what code will do, and force them to resort to logging or debuggers. They're like catnip for junior developers, who value succinct code but haven't yet been forced to debug someone else's succinct code at 3 AM on a Sunday.
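A small contrast may make the point concrete. In this sketch (all names are hypothetical) the first function reads a mutable module-level variable, while the second takes the same information as an explicit, immutable argument.

```python
from dataclasses import dataclass

# Hidden state: any code anywhere in the program may rebind this, so a
# reader cannot know the result of the call below from the call site.
_discount_rate = 0.25


def discounted_price_hidden(gross: float) -> float:
    return gross * (1 - _discount_rate)


@dataclass(frozen=True)
class PricingConfig:
    """Explicit, immutable configuration passed in by the caller."""

    discount_rate: float


def discounted_price_explicit(gross: float, config: PricingConfig) -> float:
    # Everything the result depends on is visible in the signature.
    return gross * (1 - config.discount_rate)
```

The explicit version is slightly longer at every call site, which is the succinctness trade-off the paragraph above describes.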
Non-Local Control Flow prevents a reader from understanding what path execution will take. In the old times this meant setjmp and longjmp, but nowadays you'll see it in the form of callbacks and event loops. Python's Twisted and Ruby's EventMachine can easily turn into global callback dispatchers, preventing static analysis and rendering stack traces useless.
Dynamic Types require careful and thoughtful programming practices to avoid turning into "type soup". Highly magical metaprogramming hooks like __getattr__ or method_missing are trivially easy to abuse in ways that make even trivial bug fixes too risky to attempt. Tooling such as Mypy and Flow can help here, but introducing them into an existing haunted forest is unlikely to have significant impact. Use them in the new codebase from the start, and they might be able to reclaim portions of the original code.
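To illustrate how __getattr__ can hide mistakes, here is a minimal sketch with two hypothetical classes: MagicRecord resolves attributes dynamically, while User declares its fields so that a type checker (or the interpreter itself) can catch a typo.

```python
from dataclasses import dataclass


class MagicRecord:
    """Dynamic attribute lookup: any name is accepted."""

    def __init__(self, fields: dict) -> None:
        self._fields = fields

    def __getattr__(self, name: str):
        # Python calls __getattr__ only when normal lookup fails, so a
        # misspelled attribute silently evaluates to None here.
        return self._fields.get(name)


@dataclass
class User:
    """Declared fields: a misspelled attribute is an error."""

    name: str
    email: str


record = MagicRecord({"name": "Ada", "email": "ada@example.com"})
user = User(name="Ada", email="ada@example.com")

typo_result = record.emial  # no error; the bug surfaces somewhere else later
```

Accessing user.emial raises AttributeError at runtime, and a checker such as Mypy would be expected to flag it before the code runs.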
Distributed Systems can become haunted forests through sheer size, if no single person is capable of understanding the entire API surface they provide. Note that microservices don't automatically prevent this, because merely splitting up a monolith turns the internal structure into API surface. Each of the above per-process issues has distributed analogues; for example, S3 is global mutable state and JSON-over-HTTP is dynamically typed.
A codebase where nobody knows what behavior it currently has is materially different from one where nobody understands what behavior it should have. The former doesn't need to be rewritten, because you can grind its test coverage up and then safely refactor.
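One common way to grind test coverage up is with characterization tests: record what the code does today, then require a refactored version to reproduce it. The sketch below is an assumption-laden toy; legacy_format_name, refactored_format_name, and record_behavior are hypothetical names.

```python
from typing import Callable, Dict, Iterable, Tuple

Case = Tuple[str, str]


def legacy_format_name(first: str, last: str) -> str:
    # Stand-in for tangled legacy code whose behavior nobody has written down.
    result = ""
    if last:
        result = result + last.upper()
    if first:
        if result:
            result = result + ", "
        result = result + first[0].upper() + first[1:]
    return result


def refactored_format_name(first: str, last: str) -> str:
    parts = [last.upper()] if last else []
    if first:
        parts.append(first[0].upper() + first[1:])
    return ", ".join(parts)


def record_behavior(func: Callable[[str, str], str], cases: Iterable[Case]) -> Dict[Case, str]:
    """Capture the current output for each input, without judging it."""
    return {case: func(*case) for case in cases}


CASES = [("ada", "lovelace"), ("", "turing"), ("grace", ""), ("", "")]

# Step 1: pin down what the legacy code does today.
golden = record_behavior(legacy_format_name, CASES)

# Step 2: the refactored code must reproduce exactly that behavior.
matches = record_behavior(refactored_format_name, CASES) == golden
```

The recorded outputs make no claim about what the code should do; they only freeze what it does, which is enough to refactor safely.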
You will sometimes hear objections from people who have not worked directly on the bad code, but have opinions about it anyway. Let them know that they're welcome to help out and you can arrange for a temporary rotation into the role of Forest Ranger.
The real reason Netscape failed is they wrote a dreadful browser, then spent three years writing a second dreadful browser. The fourth rewrite (Firefox) briefly had a chance at being the most popular browser, until Google's rewrite of Konqueror took the lead. The moral of this story: rewrites are a good idea if the new version will be better.