The first step is simply to merge my temporary-hack fixes (e.g., removing the I/O rate limiting during recovery operations) into the main Tarsnap server codebase. I lost about 3 hours to those.
The second step is to change the order of S3 GETs and adjust the parallelization of those; I think I can easily cut that from the ~20 hours it took down to ~5 hours.
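The shape of that second-stage optimization can be sketched as follows. This is not Tarsnap's actual code; `fetch_object` is a stand-in for a real S3 GET (e.g. `boto3`'s `get_object`), and the point is only that independent GETs issued through a worker pool overlap their latency instead of paying it serially.

```python
# Hedged sketch: overlapping many independent S3 GETs with a thread pool.
# fetch_object is a hypothetical stand-in for a real S3 GET call.
from concurrent.futures import ThreadPoolExecutor

def fetch_object(key):
    # In reality: s3.get_object(Bucket=..., Key=key)["Body"].read()
    return b"data-for-" + key.encode()

def fetch_all(keys, workers=32):
    # GETs are latency-bound and independent, so N workers give close to
    # an N-fold wall-clock speedup until bandwidth or S3 request-rate
    # limits become the bottleneck.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_object, keys))
```

Reordering the GETs matters for the same reason: issuing the requests whose results are needed first, rather than in storage order, keeps the replay pipeline fed.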
The third step is to parallelize the third stage (replaying log entries on a per-machine basis); that should cut from ~10 hours down to ~1 hour given a sufficiently hefty EC2 instance.
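The per-machine structure is what makes that third stage parallelizable: entries must be replayed in order within one machine's log, but different machines' logs are independent. A minimal sketch of that partition-then-replay pattern (with illustrative entry fields, not Tarsnap's real log format):

```python
# Hedged sketch: parallel log replay where ordering matters only within
# each machine's stream. Entry fields ("machine", "op") are illustrative.
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def replay_machine(entries):
    # Apply one machine's entries strictly in log order.
    state = []
    for entry in entries:
        state.append(entry["op"])
    return state

def replay_all(log, workers=8):
    # Iterating the global log in order preserves per-machine ordering
    # within each bucket; buckets then replay concurrently.
    by_machine = defaultdict(list)
    for entry in log:
        by_machine[entry["machine"]].append(entry)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {m: pool.submit(replay_machine, es)
                   for m, es in by_machine.items()}
        return {m: f.result() for m, f in futures.items()}
```

On a large instance the worker count can be raised until CPU or the metadata store's write throughput saturates, which is where the ~10× estimate comes from.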
After that it's a question of profiling and experimenting. I'm sure the second stage can be sped up considerably, but I don't know exactly where the bottleneck is right now. I know the first stage can be sped up by pre-emptively "rebundling" the metadata from each S3 object so that I need fewer S3 GETs, but I'm not sure if that's necessarily the best option.
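The "rebundling" idea mentioned above amounts to trading one GET per object for one GET per bundle. A hypothetical sketch, assuming the per-object metadata blobs can simply be concatenated with an offset index (the real format and trade-offs are not specified in the source):

```python
# Hedged sketch of metadata rebundling: pack many small per-object
# metadata blobs into larger bundles with an offset index, so recovery
# needs one GET per bundle instead of one per object.
def build_bundles(metadata_blobs, bundle_size=100):
    # metadata_blobs: list of (object_key, metadata_bytes) pairs.
    bundles = []
    for i in range(0, len(metadata_blobs), bundle_size):
        chunk = metadata_blobs[i:i + bundle_size]
        index = {}   # object_key -> (offset, length) within the bundle body
        body = b""
        for key, blob in chunk:
            index[key] = (len(body), len(blob))
            body += blob
        bundles.append((index, body))
    return bundles
```

The cost is that bundles must be rebuilt (or appended to) as objects change, which is presumably why it's not obviously the best option.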
In the longer term, I'm reworking the entire back-end metadata store, so some of the above won't be relevant any more.
Will those changes also finally speed up client restores? The extremely slow restores are one of those things that remain terrifying as a customer.
We've had to consider moving to a different backup system for disaster recovery. Tarsnap covers the cases of fat-fingering and versioned backups nicely, but not disaster recovery of large-ish data sets.
That's a different issue, but related to the long-term back-end reworking. (It's not one back-end, it's several pieces of back-end, some of which are necessary for speeding up extracts and some of which aren't.)
The simplest mitigation for this is to host your primary systems in a different location from your backups. That way, the likelihood of both your primaries and your backups becoming unavailable at the same time is significantly reduced.
I don't see how that mitigates the issue at all. The case I'm talking about is when your primaries totally bite the dust. It's obvious that a backup system shouldn't be in the same physical location as the primary system.
The issue I'm referring to isn't about lack of availability of tarsnap's backups, it's that it's very slow to restore backups if you need to do a complete restore (rather than just grab a few accidentally trashed files).
Would it be feasible to store snapshots of the metadata periodically as well, so you only have to replay mutations performed after the last snapshot in an emergency?
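The snapshot-plus-log scheme the question describes is standard checkpointing; a toy in-memory sketch (names and structure illustrative, nothing to do with Tarsnap's actual metadata format):

```python
# Hedged sketch: periodic snapshots bound recovery time to replaying only
# the mutations logged since the last snapshot, instead of the full history.
import copy

class MetadataStore:
    def __init__(self, snapshot_every=1000):
        self.state = {}
        self.log = []        # mutations since the last snapshot
        self.snapshot = {}   # last persisted snapshot (S3, in reality)
        self.snapshot_every = snapshot_every

    def apply(self, key, value):
        self.state[key] = value
        self.log.append((key, value))
        if len(self.log) >= self.snapshot_every:
            # Persist a full snapshot and truncate the replay log.
            self.snapshot = copy.deepcopy(self.state)
            self.log = []

    def recover(self):
        # Emergency path: load the snapshot, replay only recent mutations.
        state = copy.deepcopy(self.snapshot)
        for key, value in self.log:
            state[key] = value
        return state
```

Recovery time then scales with `snapshot_every` rather than with total history length, at the cost of periodically writing out the full snapshot.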