Even with Kubernetes, you can clearly see what is deploying to what nodes. Not s...

lclarkmichalek · on June 4, 2019

When you're dealing with this scale of system, the number of config changes, automated or human, can make determining which config change is at fault the harder issue. Once you've found out what the issue is, you probably also want the revert to go via the normal flows, for fear that your revert could exacerbate the situation. Both of those add time to remediation.

reilly3000 · on June 4, 2019

I'm guessing it was something lower level than Kubernetes/Borg, since it was able to affect all of their networking bandwidth across multiple regions. ¯\_(ツ)_/¯

shereadsthenews · on June 4, 2019

The interesting tidbit in here (really the only piece of information at all) is that the outage itself prevented remediation of the outage. That indicates that they might have somehow hosed their DSCP markings such that all traffic was undifferentiated. Large network operators typically implement a "network control" traffic class that trumps all others, for this reason.
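
To make the "network control trumps all others" idea concrete, here is a toy strict-priority scheduler in Python. It is only a sketch of the general technique, not a claim about how Google's network is configured: the class and function names are made up for illustration, and the only real-world detail is that CS6 (DSCP 48) is the code point RFC 4594 recommends for network control traffic.

```python
import heapq
from itertools import count

# DSCP code points. CS6 (48) is the RFC 4594 recommendation for network
# control; 0 is default/best effort. Everything else here is hypothetical.
CS6_NETWORK_CONTROL = 48
DEFAULT_BEST_EFFORT = 0


class StrictPriorityQueue:
    """Toy egress queue: network-control packets always drain first."""

    def __init__(self):
        self._heap = []
        self._seq = count()  # tie-breaker that keeps FIFO order within a class

    def enqueue(self, dscp, packet):
        # Rank 0 for network control, rank 1 for everything else.
        rank = 0 if dscp == CS6_NETWORK_CONTROL else 1
        heapq.heappush(self._heap, (rank, next(self._seq), packet))

    def dequeue(self):
        return heapq.heappop(self._heap)[2]


def drain(packets):
    """Enqueue (dscp, packet) pairs, then return packets in service order."""
    queue = StrictPriorityQueue()
    for dscp, packet in packets:
        queue.enqueue(dscp, packet)
    return [queue.dequeue() for _ in range(len(packets))]


# Markings intact: the control packet jumps the queue.
marked = [(0, "user-1"), (CS6_NETWORK_CONTROL, "bgp-keepalive"), (0, "user-2")]
print(drain(marked))

# Markings lost (everything arrives as DSCP 0): control traffic waits its turn.
unmarked = [(0, "user-1"), (0, "bgp-keepalive"), (0, "user-2")]
print(drain(unmarked))
```

With the markings intact, the keepalive is served ahead of user traffic no matter how much of it is queued. Once every packet carries the same marking, the scheduler has nothing to distinguish on, and the traffic you need for remediation competes with everything else.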