Asymptotically, yes: prediction = compression (if you have a model for a bitstream that produces probabilities for the next bit, you can feed it into an arithmetic encoder and you now have a compressor). In this case, though, it's not practically helpful. A VGG is 528MB all on its own, so you need to compress a lot of images to make back that 0.5GB, plus runtime dependencies.
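To make the "prediction = compression" point concrete, here is a minimal sketch. It does not implement an arithmetic coder; it only adds up -log2(p) for each bit, which is the size an arithmetic coder gets to within a few bits. The count-based predictor and the name `ideal_code_length_bits` are my own illustrative assumptions, not anything from a real codec.

```python
import math

def ideal_code_length_bits(bits):
    """Sum of -log2 p(bit) under a simple adaptive predictor.

    The predictor is a hypothetical stand-in for "a model for a bitstream":
    it estimates p(next bit) from Laplace-smoothed counts of the bits seen
    so far. An arithmetic coder driven by the same probabilities would emit
    roughly this many bits.
    """
    counts = [1, 1]  # smoothed counts of 0s and 1s seen so far
    total = 0.0
    for b in bits:
        p = counts[b] / (counts[0] + counts[1])  # predicted probability of the bit that occurred
        total += -math.log2(p)                   # cost of coding that bit
        counts[b] += 1                           # update the model after coding
    return total

skewed = [0] * 950 + [1] * 50   # predictable stream
balanced = [0, 1] * 500         # this predictor learns nothing useful here

print(len(skewed), round(ideal_code_length_bits(skewed)))
print(len(balanced), round(ideal_code_length_bits(balanced)))
```

The better the predictor, the fewer bits; a predictor that learns nothing gives you about one bit per input bit, i.e. no compression.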
They don't necessarily use the same budget. The compressor can come prepackaged or be downloaded over a fast connection, and then be used when you have poor mobile internet.
They don't necessarily have to, that's true. But if there were any appetite for gargantuan 500MB+ decompression libraries to save 20 or 30% on downloaded bytes while browsing, we would have seen much more uptake of existing, simpler compression schemes like SDCH, which takes a tiny step in that direction with a relatively large (but still tiny) pre-built dictionary for the WWW.
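For a rough sense of scale, here is the break-even arithmetic using only the numbers in this thread (a 528MB model, 20 to 30% savings). It assumes the model download counts against the same byte budget as the traffic it saves, which is exactly the assumption questioned above, and `break_even_traffic_mb` is just an illustrative helper.

```python
def break_even_traffic_mb(model_mb, savings_fraction):
    """Uncompressed-scheme traffic (MB) needed before the bytes saved
    equal the size of the model itself."""
    return model_mb / savings_fraction

for savings in (0.20, 0.30):
    mb = break_even_traffic_mb(528, savings)
    print(f"{savings:.0%} savings: break even after ~{mb / 1000:.1f}GB of traffic")
```

So you would need somewhere around 1.8 to 2.6GB of image traffic before the download pays for itself, ignoring runtime dependencies and compute cost.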