When an engineer joins Google, they are issued a workstation – a physical computer in tower form. Workstations in Mountain View are in .mtv.corp.google.com, workstations in New York are in .nyc.corp.google.com, and so on – this is all tracked in various databases and synced to DNS. Intranet services that weren't office-specific, like the go/ URL shortener, were on .corp.google.com.
For convenience, Google uses a DNS feature named the DNS search path to let users reference workstations by short names. If I am sitting next to you, and I want to SSH into your workstation, I can type ssh yourbox instead of ssh yourbox.mtv.corp.google.com and it'll work. This wasn't only for workstation names; you could also use it for any sort of .corp.google.com name. In a browser you could type http://go/somelink and it would resolve to http://go.corp.google.com/somelink. In a PAM module you could #define LDAP_ADDRESS "ldap" and it would direct queries to ldap.corp.google.com. The halls rang with the sound of protobuf engineering, and all was at peace.
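To make the mechanism concrete, here is a minimal sketch of how a resolver might expand a short name using a search path. The search list order (office domain first, then the shared corp domain) and the toy record table are assumptions for illustration, not Google's actual configuration, and real resolvers have extra rules (such as the ndots option) that this sketch ignores.

```python
# Hypothetical search list for a Mountain View workstation (assumed order).
SEARCH_PATH = ["mtv.corp.google.com", "corp.google.com"]

# Toy stand-in for DNS: the set of fully qualified names that exist.
RECORDS = {
    "yourbox.mtv.corp.google.com",
    "go.corp.google.com",
    "ldap.corp.google.com",
}


def candidates(name, search_path=SEARCH_PATH):
    """Return the fully qualified names to try, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as already fully qualified.
        return [name.rstrip(".")]
    return [f"{name}.{domain}" for domain in search_path] + [name]


def resolve(name, records=RECORDS):
    """Return the first candidate that exists, or None if none do."""
    for fqdn in candidates(name):
        if fqdn in records:
            return fqdn
    return None


print(resolve("yourbox"))  # yourbox.mtv.corp.google.com
print(resolve("go"))       # go.corp.google.com
print(resolve("ldap"))     # ldap.corp.google.com
```

The important property is that the first match wins: whatever exists under an earlier search domain is returned before later domains are even considered.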
Some of you have noticed the problem.
One day I arrived at the office and discovered that I couldn't unlock my screen. This wasn't especially alarming, because nobody else in the building could log in and network outages happen sometimes. But then news started trickling in over working comms
Also, it was affecting the entire Mountain View campus.
The usual debugging process was followed with unusual haste and the issue was narrowed down to DNS. One of the new hires starting that day had requested their workstation be named ldap, per their initials, and as soon as that hostname hit the network it hijacked every LDAP client that had been configured to talk to "ldap". Unfortunately that was a big 'every' because (1) if the wrong value is easier to type then it outcompetes the correct value, and (2) the chances of a misconfiguration being discovered are the inverse of how often it happens. So pretty much everything was broken.
This story has a happy ending because Google does regular disaster recovery tests. The tests are always something outlandish, like aliens have invaded and all contact with California has been lost, and everyone has a good laugh around the coffee robot. The recovery procedure for total DNS outage involves taking a laptop into the panic room, a locked room with a direct connection to the Prod network. This was done, the owners of the machine inventory were able to delete the bad record, and new safety checks were installed around the important hostnames.
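The story doesn't say what those safety checks looked like. One plausible shape, sketched here under assumptions, is a validation step in the machine inventory that refuses any workstation name matching a short name that shared services already rely on. The reserved list and the function name are hypothetical.

```python
# Hypothetical set of short names used by shared intranet services that
# live directly under the corp domain (for example ldap.corp.google.com).
RESERVED_LABELS = {"ldap", "go"}


def validate_workstation_name(label, reserved=RESERVED_LABELS):
    """Return the normalized hostname, or raise if it would shadow a service."""
    # DNS names are case-insensitive, so compare in lowercase.
    normalized = label.strip().lower()
    if not normalized:
        raise ValueError("hostname must not be empty")
    if normalized in reserved:
        raise ValueError(f"hostname {normalized!r} is reserved for a shared service")
    return normalized


print(validate_workstation_name("YourBox"))  # yourbox

try:
    validate_workstation_name("ldap")
except ValueError as err:
    print(err)  # hostname 'ldap' is reserved for a shared service
```

A check like this only guards the names someone thought to list; clients that use fully qualified names are protected whether or not the list is complete.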
There was a brief experiment involving decommissioned Warp 19s, which were Google-designed rackmount machines with infamously sharp edges and the sound profile of a motorrad.
This is not as alarming as it sounds; workstations at Google are (were?) fairly untrusted, and it was common to SSH into a coworker's machine if your own was overloaded or doing software updates or whatever.
Engineers almost universally used IRC for quick conversations, team chatter, and coordinating incident response. Although Slack has some good points, I find myself missing good ol' port 6697 every time loading a channel makes my fan spin up.