Based on the reverse engineering done by Parth Thakkar [1], the model used by Copilot is probably about 10x as large (12B parameters), so I would expect Copilot to still win pretty handily (especially since the Codex models are generally a lot better trained than Salesforce CodeGen or InCoder). It's also a bit hard to compare directly because, as Parth documents, a lot of extra smarts go into Copilot on the client side.
The SantaCoder paper does have some benchmarks on MultiPL-E though, so you could compare them to the Codex results on that benchmark reported here (but keep in mind that code-davinci-002 is probably even larger than the model used by Copilot): https://arxiv.org/abs/2208.08227
OpenAI hasn't said exactly how they trained code-davinci-002, so this is speculative, but I'm reasonably sure it was trained on more data and more languages than CodeGen, and for longer. It was also trained using fill-in-the-middle [1].
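For anyone unfamiliar with fill-in-the-middle, here is a rough sketch of the general idea as I understand it: a training document is cut into prefix, middle, and suffix, and the middle is moved to the end so an ordinary left-to-right model learns to infill. The sentinel strings and function names below are placeholders I made up for illustration; this is not OpenAI's actual format or pipeline, which as far as I know isn't public.

```python
import random

# Placeholder sentinels (assumption): real models define their own special tokens.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"

def to_fim_example(doc: str, rng: random.Random) -> str:
    """Cut doc at two random points and move the middle span to the end."""
    i, j = sorted(rng.randint(0, len(doc)) for _ in range(2))
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    # Prefix-suffix-middle ordering: the model sees both sides, then predicts the gap.
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"

def from_fim_example(example: str) -> str:
    """Undo the rearrangement (assumes doc itself contains no sentinel strings)."""
    rest = example[len(PRE):]
    prefix, rest = rest.split(SUF, 1)
    suffix, middle = rest.split(MID, 1)
    return prefix + middle + suffix

doc = "def add(a, b):\n    return a + b\n"
example = to_fim_example(doc, random.Random(0))
```

At inference time you would supply the text before and after the cursor in the same layout and let the model generate whatever follows the middle sentinel.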
[1] https://thakkarparth007.github.io/copilot-explorer/posts/cop...