Even leaving aside the cost of storing an index, if you're only going to use that dataset for a small number of queries then it's cheaper to just perform them directly rather than construct an index first. As the paper points out at the end, Hadoop is fundamentally batchy; it works best when you have discrete chunks of data (e.g. weekly) and you want to reduce each one into some summarized form, then never use the raw data again.