Two things are holding back current LLM-style AI of being of value here:
* Latency. LLM responses are measured in order of 1000s of milliseconds, where this project targets 10s of milliseconds, that's off by almost two orders of magnitute.
* Determinism. LLMs are inherently non-deterministic. Even with temperature=0, slight variations of the input lead to major changes in output. You really don't want your DB to be non-deterministic, ever.
From what I understand, in practice it often is true[1]:
Matrix multiplication should be “independent” along every element in the batch — neither the other elements in the batch nor how large the batch is should affect the computation results of a specific element in the batch. However, as we can observe empirically, this isn’t true.
In other words, the primary reason nearly all LLM inference endpoints are nondeterministic is that the load (and thus batch-size) nondeterministically varies! This nondeterminism is not unique to GPUs — LLM inference endpoints served from CPUs or TPUs will also have this source of nondeterminism.
"But why aren’t LLM inference engines deterministic? One common hypothesis is that some combination of floating-point non-associativity and concurrent execution leads to nondeterminism based on which concurrent core finishes first."