Indeed, reduce is the difficult part. OTOH, I think this limitation shows up in many algorithms at a fairly fundamental level, and is not just an artefact of MapReduce (MR). The only alternative framework I can think of for dealing with really large datasets in a distributed manner is sampling-based methods built on one-pass (or mostly one-pass) algorithms.
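The comment doesn't name a specific method. One well-known example of a one-pass, sampling-based algorithm is reservoir sampling ("Algorithm R"). It keeps a uniform random sample of `k` items from a stream of unknown length, using O(k) memory and touching each item once. The sketch below is illustrative only. The function name `reservoir_sample` and its parameters are hypothetical, and the sketch handles a single stream. It does not show how to merge samples across machines.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Hypothetical helper: uniform sample of k items from a stream in one pass.

    Sketch of classic reservoir sampling (Algorithm R). If the stream has
    fewer than k items, all of them are returned in their original order.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # The first k items go straight into the reservoir.
            reservoir.append(item)
        else:
            # Item i (0-based) is kept with probability k / (i + 1):
            # randint is inclusive, so j is uniform over i + 1 values.
            j = rng.randint(0, i)
            if j < k:
                # Replace a uniformly chosen slot in the reservoir.
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # The stream is a generator, so it can only be consumed once.
    sample = reservoir_sample((x * x for x in range(1_000_000)), 5,
                              random.Random(42))
    print(sample)
```

The design choice that matters is that no step ever needs the full dataset or its length up front. That is what makes this kind of method attractive when a global reduce is expensive.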