By any other CNAME

When an engineer joins Google, they are issued a workstation – a physical computer in tower[1] form factor that sits under their desk. Workstations have names, which the engineer gets to choose, and the fully-qualified hostname consists of this name plus an office-specific suffix. Workstations in Mountain View are in .mtv.corp.google.com, workstations in New York are in .nyc.corp.google.com, and so on – this is all tracked in various databases and synced to DNS. Intranet services that aren't office-specific, like the go/ URL shortener, are on .corp.google.com directly.

For convenience, Google uses a DNS feature named the DNS search path to let users reference workstations by short names. If I am sitting next to you, and I want to SSH into your workstation[2], I can type ssh yourbox instead of ssh yourbox.mtv.corp.google.com and it'll work. This wasn't only for workstation names; you could also use it for any sort of .corp.google.com name. In a browser you could type http://go/somelink and it would resolve to http://go.corp.google.com/somelink. In a PAM module you could #define LDAP_ADDRESS "ldap" and it would direct queries to ldap.corp.google.com. The halls rang with the sound of protobuf engineering, and all was at peace.

Some of you have noticed the problem.

One day I arrived at the office and discovered that I couldn't unlock my screen. This wasn't especially alarming, because nobody else in the building could log in either, and network outages happen sometimes. But then news started trickling in over working comms[3] that this wasn't a network outage. A few early risers had working desktop sessions, and the network was fine – only attempts to log in, SSH, or sudo were hanging.

Also, it was affecting the entire Mountain View campus.

The usual debugging process was followed with unusual haste and the issue was narrowed down to DNS. One of the new hires starting that day had requested their workstation be named ldap, per their initials, and as soon as that hostname hit the network it hijacked every LDAP client that had been configured to talk to "ldap". Unfortunately that was a big 'every' because (1) if the wrong value is easier to type then it outcompetes the correct value, and (2) the chances of a misconfiguration being discovered are the inverse of how often it happens. So pretty much everything was broken.
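For anyone who wants to see the shadowing in slow motion, here is a minimal sketch of search-path resolution. It is a toy model rather than how any real resolver (or Google's) is implemented: the candidates and resolve helpers, the addresses, and the suffix order (office suffix first, then .corp.google.com) are all assumptions made for illustration, and real stub resolvers have extra rules, such as ndots, that are ignored here.

```python
# Toy model of DNS search-path expansion. Everything here is illustrative:
# the helper names, the addresses (RFC 5737 documentation range), and the
# suffix order are assumptions, not details from the incident.

def candidates(name, search_path):
    """Return the fully-qualified names to try for `name`, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as already fully qualified.
        return [name.rstrip(".")]
    return [f"{name}.{suffix}" for suffix in search_path]


def resolve(name, search_path, records):
    """Return (fqdn, address) for the first candidate that exists, else None."""
    for fqdn in candidates(name, search_path):
        if fqdn in records:
            return fqdn, records[fqdn]
    return None


# Assumed order: the office-specific suffix is tried before the global one.
SEARCH_PATH = ["mtv.corp.google.com", "corp.google.com"]

records = {
    "ldap.corp.google.com": "192.0.2.10",         # the real LDAP service
    "yourbox.mtv.corp.google.com": "192.0.2.20",  # a coworker's workstation
}

# Before the new hire arrives, the short name falls through to the service.
before = resolve("ldap", SEARCH_PATH, records)

# A workstation named "ldap" is registered in the Mountain View zone...
records["ldap.mtv.corp.google.com"] = "192.0.2.99"

# ...and the same short name now lands on the workstation instead.
after = resolve("ldap", SEARCH_PATH, records)

print(before)
print(after)
```

Under these assumptions, a client configured with the full name ldap.corp.google.com (or with a trailing dot) never consults the search path, so it is unaffected by the new record. Only the clients that took the shortcut get redirected.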

This story has a happy ending because Google does regular disaster recovery tests. The tests are always something outlandish, like aliens have invaded and all contact with California has been lost, and everyone has a good laugh around the coffee robot. The recovery procedure for total DNS outage involves taking a laptop into the panic room, a locked room with a direct connection to the Prod network. This was done, the owners of the machine inventory were able to delete the bad record, and new safety checks were installed around the important hostnames.


  1. There was a brief experiment involving decommissioned Warp 19s, which were Google-designed rackmount machines with infamously sharp edges and the sound profile of a motorcycle.

  2. This is not as alarming as it sounds; workstations at Google are (were?) fairly untrusted, and it was common to SSH into a coworker's machine if your own was overloaded or doing software updates or whatever.

  3. Engineers almost universally used IRC for quick conversations, team chatter, and coordinating incident response. Although Slack has some good points, I find myself missing good ol' port 6697 every time loading a channel makes my fan spin up.
