
It depends on how you test it. I recently found that the way devs test a RAG system differs radically from how users actually use it. When we first built ours, it showed promising results (around 90% recall on large knowledge bases). However, when the first actual users tried it, it could barely answer anything (closer to 30% recall). It turned out we had relied too heavily on exact keywords when testing: we knew the test knowledge base, so we formulated our questions in a way that helped the RAG find what we expected it to find. Real users don't know the exact terminology used in the articles. We had to rethink the whole thing. Lexical search is certainly not enough. Sure, you can run an agent on top of it, but that blows up latency - users aren't happy when they have to wait more than a couple of seconds.
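To make the tradeoff concrete, here's a rough sketch of the kind of hybrid setup (BM25 plus embeddings, fused with reciprocal rank fusion) that sits between plain lexical search and a full agent loop. This is illustrative only, not our actual stack - the corpus, model choice, and RRF constant are assumptions.

    # Hybrid retrieval sketch: BM25 for exact terms, embeddings for paraphrased
    # user queries, rankings fused with reciprocal rank fusion (RRF).
    # Corpus, model choice, and constants are illustrative only.
    import numpy as np
    from rank_bm25 import BM25Okapi
    from sentence_transformers import SentenceTransformer

    docs = ["How to reset your password", "Billing and invoices", "API rate limits"]

    bm25 = BM25Okapi([d.lower().split() for d in docs])
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
    doc_vecs = encoder.encode(docs, normalize_embeddings=True)

    def hybrid_search(query: str, k: int = 3, rrf_k: int = 60) -> list[str]:
        # Rank docs lexically and semantically, then fuse the two rankings.
        lex_rank = np.argsort(-bm25.get_scores(query.lower().split()))
        q_vec = encoder.encode([query], normalize_embeddings=True)[0]
        sem_rank = np.argsort(-(doc_vecs @ q_vec))
        scores = {}
        for ranking in (lex_rank, sem_rank):
            for pos, doc_idx in enumerate(ranking):
                # RRF: docs ranked high by either signal float to the top.
                scores[doc_idx] = scores.get(doc_idx, 0.0) + 1.0 / (rrf_k + pos + 1)
        top = sorted(scores, key=scores.get, reverse=True)[:k]
        return [docs[i] for i in top]

    print(hybrid_search("I can't log in anymore"))  # embeddings should help surface the password doc

Latency stays close to a single retrieval call, which is the main reason to try this before reaching for an agent.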


This is the gap that kills most AI features. Devs test with queries they already know the answer to. Users come in with vague questions using completely different words. I learned to test by asking my kids to use my app - they phrase things in ways I would never predict.


Ironically, pitting an LLM (ideally a completely different model) against what you're testing and letting it write out-of-the-ordinary, human-sounding queries to use as test cases tends to work well too - if you don't have kids you can use as a free workforce :)
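Something like this, as a rough sketch - a second model rewrites your articles into the vague, everyday-language questions a real user might ask. The model name and prompt are assumptions, not a known-good recipe.

    # Sketch: generate "user-shaped" test queries from an article with a
    # different model, so the test set doesn't inherit your own terminology.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def user_style_queries(article_text: str, n: int = 5) -> list[str]:
        prompt = (
            "You are an end user who has never read our documentation. "
            f"Write {n} short support questions that this article answers, "
            "using everyday words instead of the article's own terminology. "
            "One question per line.\n\n" + article_text
        )
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # assumed; any reasonably capable model works
            messages=[{"role": "user", "content": prompt}],
        )
        text = resp.choices[0].message.content or ""
        # One query per line, dropping bullet markers and blank lines.
        return [line.strip("- ").strip() for line in text.splitlines() if line.strip()]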


I built a system to do exactly this: https://docs.kiln.tech/docs/evaluations/evaluate-rag-accurac...

Basically it:

- iterates over your docs to find knowledge specific to the content

- generates hundreds of pairs of [synthetic query, correct answer]

- evaluates different RAG configurations for recall (rough sketch of that step below)
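For anyone curious what the recall step boils down to, a toy sketch - the pair format, retriever, and corpus here are stand-ins for illustration, not Kiln's actual API.

    # Recall@k over [synthetic query, expected doc id] pairs, assuming the
    # pairs were already generated in the previous step.
    from typing import Callable

    Retriever = Callable[[str], list[str]]  # query -> ranked doc ids

    def recall_at_k(retriever: Retriever, pairs: list[tuple[str, str]], k: int = 5) -> float:
        # Fraction of queries whose expected doc shows up in the top-k results.
        hits = sum(expected in retriever(query)[:k] for query, expected in pairs)
        return hits / len(pairs)

    def keyword_retriever(query: str) -> list[str]:
        # Deliberately naive lexical retriever over a three-doc toy corpus.
        corpus = {
            "doc_pw": "reset your password",
            "doc_bill": "billing and invoices",
            "doc_api": "api rate limits",
        }
        return sorted(
            corpus,
            key=lambda d: len(set(query.lower().split()) & set(corpus[d].split())),
            reverse=True,
        )

    pairs = [("I can't log in anymore", "doc_pw"), ("why was I charged twice", "doc_bill")]
    print(f"keyword-only recall@1 = {recall_at_k(keyword_retriever, pairs, k=1):.2f}")
    # Expect a low score: user phrasing shares almost no tokens with the docs.

Running the same pairs against each candidate configuration gives you a single recall number per config to compare.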


How did you end up changing it? Creating new evals to measure the actual user experience seems easy enough, but how did that inform your stack?



