Which reminds me - another thing we learned is that more chaos can actually be easier on a distributed system than less chaos.
One of the robustness suites tries as hard as it can to permanently destroy a Neo4j cluster, looking for things like distributed deadlocks, faults in the leader election, data inconsistencies between replicas and so on.
It does that by applying a randomized load of all operations the system supports: reads, writes and schema changes. That's then combined with induced hardware faults, "regular" operational restarts and cluster configuration changes.
The problem was that, early on, the test really would create unresponsive clusters, but then the "chaos" would continue, stirring up enough dust to get the cluster going again before the "unresponsive" timeouts triggered. The result was false green tests.
Hence: today this suite plays out a "chaos" scenario, but then heals network partitions, turns unpowered instances back on and so forth, and sits back waiting for the cluster to recover "on its own".
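That "apply chaos, heal everything, then wait for unaided recovery" pattern can be sketched roughly like this. To be clear, this is a toy illustration, not Neo4j's actual test harness: `FakeCluster`, `Fault` and `chaos_round` are hypothetical names, and the real suite would run its randomized read/write/schema load between injection and healing.

```python
import time

class FakeCluster:
    """Illustrative stand-in for a real cluster; just tracks active faults."""
    def __init__(self):
        self.active_faults = set()

    def is_healthy(self):
        return not self.active_faults

class Fault:
    """Illustrative fault (partition, power-off, ...): inject and heal."""
    def __init__(self, name):
        self.name = name

    def inject(self, cluster):
        cluster.active_faults.add(self.name)

    def heal(self, cluster):
        cluster.active_faults.discard(self.name)

def chaos_round(cluster, faults, timeout_s=5.0):
    """Apply faults, then heal everything and wait for the cluster to
    recover on its own before the test is allowed to go green."""
    for f in faults:
        f.inject(cluster)        # partitions, power-offs, restarts...
    # (the randomized read/write/schema-change load would run here)
    for f in faults:
        f.heal(cluster)          # reconnect cables, power machines back on
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if cluster.is_healthy():  # recovery must happen unaided
            return True
        time.sleep(0.1)
    raise AssertionError("cluster failed to recover after healing")
```

The key point is the order of operations: healing happens before the health check starts, so ongoing chaos can no longer mask an unresponsive cluster the way the original suite allowed.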
This is what's considered multiple unrelated failures. You usually don't want to (at least to start) design a system to handle this.
Handle a machine power-down? Yes. Handle a network cable unplug? Yes. Handle one machine power-down and one network cable unplug at the same time? That may be impossible to handle, or it may be an order of magnitude harder.
Get all the single-failure issues nailed down. Then perhaps you can do a little work on multiple unrelated failures, but it gets insanely complex fast.
That suggests your original chaos was more "random" than "chaotic": actually chaotic processes go through indefinite periods of slack in which they appear predictable. I.e. a better "chaos monkey" might traverse a fractal system rather than simply aiming for high-entropy noise.
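One minimal way to get that behavior is to drive fault injection from a chaotic map instead of a uniform random source. The sketch below uses the logistic map, x → r·x·(1 − x), which for r near 3.9 is deterministic but chaotic and naturally produces calm stretches punctuated by bursts; the function name and threshold are my own invention, purely for illustration.

```python
def logistic_chaos_schedule(steps, r=3.9, x0=0.4, threshold=0.7):
    """Return a list of booleans: True means "inject a fault this tick".

    The logistic map x -> r*x*(1-x) is deterministic but chaotic for
    r ~ 3.9. Unlike uniform random noise, it yields runs of quiet ticks
    followed by clustered bursts of activity, closer to the "traversal
    of a fractal system" idea than to high-entropy noise.
    """
    x = x0
    schedule = []
    for _ in range(steps):
        x = r * x * (1 - x)           # one step of the logistic map
        schedule.append(x > threshold)  # fire only on high excursions
    return schedule
```

Being deterministic, such a schedule is also replayable: the same (r, x0) pair reproduces the exact same sequence of calm and burst phases, which is handy when a chaos run does find a bug.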