Greg Freemyer wrote:
I don't think they do. They are highly redundant. Google does resiliency testing by turning off a rack of computers at a time to see if any data is lost. That is an intentional level of testing, or so I understand.
They also get unintentional testing at the data center (DC) level: what happens when a full DC has an outage? I can't recall how often that happens, but I think it does happen from time to time.
This reminds me of Netflix using a test suite to turn off random servers, which allowed them to verify that the infrastructure and staff would respond appropriately. I found the software again: it is appropriately named Chaos Monkey and was published on GitHub. It can be found here: https://github.com/Netflix/SimianArmy/wiki/Chaos-Monkey
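
To make the idea concrete, here is a toy simulation of that kind of test. It is only a sketch under my own assumptions, not Chaos Monkey's code or API and not how Google or Netflix actually do it: the function names, the server names, and the 3-copy replication factor are all made up for illustration. It places each data block on several servers, turns off servers at random, and checks whether any block has lost all of its copies.

```python
import random


def place_replicas(blocks, servers, copies, rng):
    """Assign each block to `copies` distinct servers chosen at random.

    Hypothetical placement policy; real systems are rack- and DC-aware.
    """
    return {block: set(rng.sample(servers, copies)) for block in blocks}


def kill_random_server(alive, rng):
    """Remove one randomly chosen server from `alive` and return its name."""
    victim = rng.choice(sorted(alive))
    alive.remove(victim)
    return victim


def lost_blocks(placement, alive):
    """Return the blocks that have no replica left on any live server."""
    return [b for b, holders in placement.items() if not (holders & alive)]


if __name__ == "__main__":
    rng = random.Random()
    servers = [f"server-{i}" for i in range(10)]  # made-up names
    placement = place_replicas(range(100), servers, copies=3, rng=rng)
    alive = set(servers)

    # Turn off three servers, one at a time, and check for data loss each time.
    for round_no in range(1, 4):
        victim = kill_random_server(alive, rng)
        lost = lost_blocks(placement, alive)
        print(f"round {round_no}: killed {victim}, blocks lost: {len(lost)}")
```

With three copies of every block, the first two failures can never lose data in this model; only the third can, and only for blocks whose copies all happened to sit on the three dead servers.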