This is the gap that kills most AI features. Devs test with queries they already know the answer to. Users come in with vague questions using completely different words. I learned to test by asking my kids to use my app - they phrase things in ways I would never predict.
Ironically, pitting a LLM (ideally a completely different model) up against what you're testing, letting it write human "out of the ordinary" queries to use as test cases tend to work well too, if you don't have kids you can use as a free workforce :)