
It depends on how you test it. I recently found that the way devs test a RAG system differs radically from how users actually use it. When we first built ours, it showed promising results (around 90% recall on large knowledge bases). However, when the first actual users tried it, it could barely answer anything (closer to 30% recall). It turned out we had relied too heavily on exact keywords when testing: we knew the test knowledge base, so we formulated our questions in a way that helped the RAG find what we expected it to find. Real users don't know the exact terminology used in the articles. We had to rethink the whole thing. Lexical search is certainly not enough. Sure, you can run an agent on top of it, but that blows up latency - users aren't happy when they have to wait more than a couple of seconds.
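To make the tradeoff concrete, here's a rough sketch of the kind of hybrid setup (BM25 plus embeddings, fused with reciprocal rank fusion) that sits between plain lexical search and a full agent loop. This is illustrative only, not our actual stack - the corpus, model choice, and RRF constant are assumptions.

    # Hybrid retrieval sketch: BM25 for exact terms, embeddings for paraphrased
    # user queries, rankings fused with reciprocal rank fusion (RRF).
    # Corpus, model choice, and constants are illustrative only.
    import numpy as np
    from rank_bm25 import BM25Okapi
    from sentence_transformers import SentenceTransformer

    docs = ["How to reset your password", "Billing and invoices", "API rate limits"]

    bm25 = BM25Okapi([d.lower().split() for d in docs])
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
    doc_vecs = encoder.encode(docs, normalize_embeddings=True)

    def hybrid_search(query: str, k: int = 3, rrf_k: int = 60) -> list[str]:
        # Rank docs lexically and semantically, then fuse the two rankings.
        lex_rank = np.argsort(-bm25.get_scores(query.lower().split()))
        q_vec = encoder.encode([query], normalize_embeddings=True)[0]
        sem_rank = np.argsort(-(doc_vecs @ q_vec))
        scores = {}
        for ranking in (lex_rank, sem_rank):
            for pos, doc_idx in enumerate(ranking):
                # RRF: docs ranked high by either signal float to the top.
                scores[doc_idx] = scores.get(doc_idx, 0.0) + 1.0 / (rrf_k + pos + 1)
        top = sorted(scores, key=scores.get, reverse=True)[:k]
        return [docs[i] for i in top]

    print(hybrid_search("I can't log in anymore"))  # embeddings should help surface the password doc

Latency stays close to a single retrieval call, which is the main reason to try this before reaching for an agent.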


This is the gap that kills most AI features. Devs test with queries they already know the answer to. Users come in with vague questions using completely different words. I learned to test by asking my kids to use my app - they phrase things in ways I would never predict.


Ironically, pitting an LLM (ideally a completely different model) against what you're testing and letting it write out-of-the-ordinary, human-sounding queries to use as test cases tends to work well too - if you don't have kids you can use as a free workforce :)
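Something like this, as a rough sketch - a second model rewrites your articles into the vague, everyday-language questions a real user might ask. The model name and prompt are assumptions, not a known-good recipe.

    # Sketch: generate "user-shaped" test queries from an article with a
    # different model, so the test set doesn't inherit your own terminology.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def user_style_queries(article_text: str, n: int = 5) -> list[str]:
        prompt = (
            "You are an end user who has never read our documentation. "
            f"Write {n} short support questions that this article answers, "
            "using everyday words instead of the article's own terminology. "
            "One question per line.\n\n" + article_text
        )
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # assumed; any reasonably capable model works
            messages=[{"role": "user", "content": prompt}],
        )
        text = resp.choices[0].message.content or ""
        # One query per line, dropping bullet markers and blank lines.
        return [line.strip("- ").strip() for line in text.splitlines() if line.strip()]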


I built a system to do exactly this: https://docs.kiln.tech/docs/evaluations/evaluate-rag-accurac...

Basically it:

- iterates over your docs to find knowledge specific to the content

- generates hundreds of pairs of [synthetic query, correct answer]

- evaluates different RAG configurations for recall (rough sketch of that step below)
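For anyone curious what the recall step boils down to, a toy sketch - the pair format, retriever, and corpus here are stand-ins for illustration, not Kiln's actual API.

    # Recall@k over [synthetic query, expected doc id] pairs, assuming the
    # pairs were already generated in the previous step.
    from typing import Callable

    Retriever = Callable[[str], list[str]]  # query -> ranked doc ids

    def recall_at_k(retriever: Retriever, pairs: list[tuple[str, str]], k: int = 5) -> float:
        # Fraction of queries whose expected doc shows up in the top-k results.
        hits = sum(expected in retriever(query)[:k] for query, expected in pairs)
        return hits / len(pairs)

    def keyword_retriever(query: str) -> list[str]:
        # Deliberately naive lexical retriever over a three-doc toy corpus.
        corpus = {
            "doc_pw": "reset your password",
            "doc_bill": "billing and invoices",
            "doc_api": "api rate limits",
        }
        return sorted(
            corpus,
            key=lambda d: len(set(query.lower().split()) & set(corpus[d].split())),
            reverse=True,
        )

    pairs = [("I can't log in anymore", "doc_pw"), ("why was I charged twice", "doc_bill")]
    print(f"keyword-only recall@1 = {recall_at_k(keyword_retriever, pairs, k=1):.2f}")
    # Expect a low score: user phrasing shares almost no tokens with the docs.

Running the same pairs against each candidate configuration gives you a single recall number per config to compare.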


How did you end up changing it? Creating new evals to measure the actual user experience seems easy enough, but how did that inform your stack?



