
I believe the original problem definition called for the data to be randomly spread over 1000 computers. Bringing together 32GB of data from 1000 nodes is going to stress your network, and pre-emptively sharding would be worse.
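(Rough numbers, assuming the 256-bucket histogram scheme from the sibling comment and 8-byte counts: each node sends 256 counts, i.e. 2 KB per node per pass, so about 2 MB of histogram traffic per pass across 1000 nodes, roughly 8 MB over 4 passes, versus shipping the full 32 GB of raw data.)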

I think the best way to use histograms is 4 passes with 256 buckets each, one pass per byte of a 32-bit value (plus bit-twiddling micro-optimizations), but other radix widths are possible.
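
A minimal single-machine sketch of that byte-at-a-time scheme (my code, not anything from the thread; select_kth and all the names are made up): each pass histograms one byte, most significant first, over the values that still match the prefix fixed so far, then narrows to the bucket containing rank k.

    #include <stdint.h>
    #include <stddef.h>

    /* Sketch: k-th smallest (0-based) of n uint32 values via 4 passes
     * of 256-bucket histograms, fixing one byte per pass, MSB first. */
    uint32_t select_kth(const uint32_t *data, size_t n, size_t k)
    {
        uint32_t prefix = 0;   /* answer bytes fixed so far */
        uint32_t mask = 0;     /* which high bits of prefix are fixed */
        for (int shift = 24; shift >= 0; shift -= 8) {
            size_t hist[256] = {0};
            /* Count current-byte values among prefix matches. */
            for (size_t i = 0; i < n; i++)
                if ((data[i] & mask) == prefix)
                    hist[(data[i] >> shift) & 0xFF]++;
            /* Walk buckets to find the one holding rank k;
             * fix that byte and reduce k to a rank within it. */
            for (int b = 0; b < 256; b++) {
                if (k < hist[b]) {
                    prefix |= (uint32_t)b << shift;
                    break;
                }
                k -= hist[b];
            }
            mask |= (uint32_t)0xFF << shift;
        }
        return prefix;
    }

The lower median of n values is k = n/2 with this 0-based rank. Each pass is one sequential scan of the data, and in the distributed setting each node would presumably scan locally and ship only its 256 counts for merging.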


