This is the gap that kills most AI features. Devs test with queries they already...

embedding-shape · 2025-11-29T13:00:31 1764421231

Ironically, pitting an LLM (ideally a completely different model) against what you're testing, and letting it write human-like "out of the ordinary" queries to use as test cases, tends to work well too, if you don't have kids you can use as a free workforce :)

scosman · 2025-11-29T16:48:49 1764434929

I built a system to do exactly this: https://docs.kiln.tech/docs/evaluations/evaluate-rag-accurac...

Basically it:

- iterates over your docs to find knowledge specific to the content

- generates hundreds of pairs of [synthetic query, correct answer]

- evaluates different RAG configurations for recall
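For anyone who wants to see the shape of that last step, here is a rough sketch in plain Python. It is not Kiln's implementation or API. The names `recall_at_k`, `compare_configs`, and `make_keyword_retriever` are made up for illustration, a "RAG configuration" is assumed to be just a function from a query to a ranked list of text chunks, and a hit is assumed to mean the expected answer appears verbatim in one of the top-k chunks (a real evaluator would likely use a fuzzier match or an LLM judge).

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Assumption: a retriever takes a query and returns chunks, best match first.
Retriever = Callable[[str], List[str]]
QAPair = Tuple[str, str]  # (synthetic query, correct answer)


def recall_at_k(retriever: Retriever, qa_pairs: Sequence[QAPair], k: int = 3) -> float:
    """Fraction of queries whose answer text appears in the top-k chunks."""
    if not qa_pairs:
        return 0.0
    hits = 0
    for query, answer in qa_pairs:
        top_chunks = retriever(query)[:k]
        # Naive substring match; stands in for a real relevance check.
        if any(answer.lower() in chunk.lower() for chunk in top_chunks):
            hits += 1
    return hits / len(qa_pairs)


def compare_configs(
    configs: Dict[str, Retriever], qa_pairs: Sequence[QAPair], k: int = 3
) -> List[Tuple[str, float]]:
    """Score every named configuration and return them sorted best first."""
    scores = [(name, recall_at_k(r, qa_pairs, k)) for name, r in configs.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


def make_keyword_retriever(chunks: Sequence[str]) -> Retriever:
    """Toy retriever for the demo: ranks chunks by shared lowercase words."""
    def retrieve(query: str) -> List[str]:
        terms = set(query.lower().split())
        return sorted(
            chunks,
            key=lambda c: len(terms & set(c.lower().split())),
            reverse=True,
        )
    return retrieve
```

To use it, you'd pass a dict like `{"keyword": make_keyword_retriever(chunks), "embeddings": my_vector_search}` along with the generated pairs, then look at which configuration comes out on top. The synthetic pairs do the real work here; the scoring loop is the easy part.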