It's true that averages can be misleading, but we encourage users to think of it instead as a percentage of inputs. In practice the error distribution is very bimodal, the two modes being "basically fine" (a few ulps of error) and "garbage" (usually 0 instead of some actual value).
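A toy illustration of the "garbage" mode (my own sketch, not one of Herbie's benchmarks): catastrophic cancellation can wipe out the entire result, returning 0 where the true answer is small but nonzero.

    /* For tiny x, (1 + x) - 1 should equal x, but in double precision
       1 + x rounds to exactly 1, so the subtraction returns 0. */
    #include <stdio.h>

    int main(void) {
        double x = 1e-17;
        double computed = (1.0 + x) - 1.0;   /* evaluates to 0.0 */
        printf("true: %g  computed: %g\n", x, computed);
        return 0;
    }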
Author here. The speedup is modeled throughput, though the model is relatively naive. It's possible to disable branches by turning off the regimes flag; see https://herbie.uwplse.org/doc/1.0/options.html
Author here. I've got a few papers about this problem (including one in submission), but it is very very hard to do, especially with acceptable overhead. The state of the art is maybe 100x overhead.
It is; there's a page in the documentation about how errors are defined. Let me also add: Herbie generally gives the most accurate option it found first, and then the other stuff might be useful for speed (0.5x is way faster than two square roots and a divide!) but it's not as accurate.
Author here! Yes, the float distribution isn't what you want in practice, but a distribution selector isn't really the right thing either, because a low-probability bad result can still be pretty bad! Hence the range selector; the float distribution is good at picking extreme values that trigger FP error.
We usually recommend looking for 90%+ accuracy or carefully examining the accuracy plot
In calculus the core issue is that the concept of a "function" was undefined but generally understood to be something like what we'd call today an "expression" in a programming language. So, for example, "x^2 + 1" was widely agreed to be a function, but "if x < 0 then x else 0" was controversial. What's nice about the "function as expression" idea is that, generally speaking, these functions are continuous, analytic [1], etc., and the set of such functions is closed under differentiation and integration [2]. There's a good chance that if you took AP Calculus you basically learned this definition.
The formal definition of "function" is totally different! This is typically a big confusion in Calculus 2 or 3! Today, a function is defined as literally any input→output mapping, and the "rule" by which this mapping is defined is irrelevant. This definition is much worse for basic calculus—most mappings are not continuous or differentiable. But it has benefits for more advanced calculus; the initial application was Fourier series. And it is generally much easier to formalize because it is "canonical" in a certain sense, it doesn't depend on questions like "which exact expressions are allowed".
This is exactly what the article is complaining about. The non-rigorous intuition preferred for basic calculus and the non-rigorous intuition required for more advanced calculus are different. If you formalize, you'll end up with one rigorous definition, which necessarily will have to incorporate a lot of complexity required for advanced calculus but confusing to beginners.
Programming languages are like this too. Compare C and Python. Some things must be written in C, but most things can be more easily written in Python. If the whole development must be one language, the more basic code will suffer. In programming we fix this by developing software as assemblages of different programs written in different languages, but mechanisms for this kind of modularity in formal systems are still under-studied and, today, come with significant untrusted pieces or annoying boilerplate, so this solution isn't yet available.
[1] Later it was discovered that not all such functions are in fact analytic, but that wasn't known for a long time.
[2] I am being imprecise; integrating and solving various differential equations often yields functions that are nice but aren't defined by combinations of named functions. The solution at the time was to name these newly discovered functions.
> If you formalize, you'll end up with one rigorous definition
Can't you just formalize both definitions and pick the one to work with based on what you want to do? Surely the only obstacle here is the time and effort it takes to write the formalization?
Or, alternatively, just because you've formalized the advanced calculus version doesn't mean you need to use the formalization when teaching basic calculus. The way we've proven something and the way we teach that something don't have to be the same.
> the concept of a "function" was undefined but generally understood to be something like what we'd call today an "expression" in a programming language. So, for example, "x^2 + 1" was widely agreed to be a function, but "if x < 0 then x else 0" was controversial
Good answer, but not the best example: in many programming languages, the latter is easily written as an expression.
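For instance, in C it's just a conditional expression (an illustration of my own; the original comment's example may have been in another language):

    /* "if x < 0 then x else 0", written as a single expression */
    double f(double x) {
        return x < 0 ? x : 0;
    }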
“a closed form expression or formula is one that is formed with constants, variables, and a set of functions considered as basic and connected by arithmetic operations (+, −, ×, /, and integer powers) and function composition. Commonly, the basic functions that are allowed in closed forms are nth root, exponential function, logarithm, and trigonometric functions”
and
“For example, if one adds polynomial roots to the basic functions, the functions that have a closed form are called elementary functions”
That would put the trigonometric functions in the basic set allowed in elementary functions.
They're cheap but not free, especially at the front end of the CPU, where it's just a lot more instructions to churn through. What the branch predictor gets you is that branches, which would normally cause a pipeline bubble, execute like straight-line code if they're predicted right. It's a bit like a tracing JIT. But you will still have a bunch of extra instructions to, like, compute the branch predicate.
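As a rough sketch (my own example; a compiler is free to use a conditional move here instead): even if the guard below is never taken and always predicted correctly, the compare that computes its predicate is still a real instruction the front end has to fetch and execute.

    /* Hypothetical never-taken guard: predicted not-taken every time,
       but the compare feeding the branch still runs. */
    double safe_div(double a, double b) {
        if (b == 0.0)
            return 0.0;
        return a / b;
    }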
Worse, IMO, is a never-taken branch taking up space in branch prediction buffers, which will cause worse predictions elsewhere (when this branch's IP collides with a legitimate IP). Unless I missed a subtlety and never-taken branches don't get assigned any resources until they are taken (which would be pretty smart, actually).
Horner's form is typically also more accurate; or at least, it's not bit-identical to the naive form, so the compiler won't do the rewrite unless you pass -funsafe-math-optimizations (or -ffast-math), and maybe not even then.
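A small sketch of the two forms for a cubic (my own example): because floating-point addition and multiplication aren't associative, the two functions can disagree in the last bits, which is exactly why the compiler treats the rewrite as unsafe.

    /* p(x) = a3*x^3 + a2*x^2 + a1*x + a0, evaluated two ways */
    double poly_naive(double x, const double a[4]) {
        return a[3]*x*x*x + a[2]*x*x + a[1]*x + a[0];
    }

    double poly_horner(double x, const double a[4]) {
        return ((a[3]*x + a[2])*x + a[1])*x + a[0];
    }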
Deterministic output is incompatible with batching, which in turn is critical to high utilization on GPUs, which in turn is necessary to keep costs low.
Batching doesn't mean the computation suddenly becomes non-deterministic. Ideally, it just means you perform the same computation on multiple token streams in the batch simultaneously, without the values interacting with each other. Vectorization, basically.
Batching leads to cross-contamination in practice because of things like MoE load-balancing within the batch, or supporting different batch sizes with different kernels that have different numerical behavior. But a careful implementation could avoid such issues while still benefiting from the higher efficiency of batching.
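A toy sketch of the underlying issue (not taken from any real inference stack): floating-point addition isn't associative, so two kernels that merely sum the same values in a different order can already return different results.

    #include <stdio.h>

    /* Left-to-right sum */
    double sum_seq(const double *v, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++) s += v[i];
        return s;
    }

    /* Pairwise (tree) sum, as a parallel reduction might do */
    double sum_pairwise(const double *v, int n) {
        if (n == 1) return v[0];
        int h = n / 2;
        return sum_pairwise(v, h) + sum_pairwise(v + h, n - h);
    }

    int main(void) {
        double v[8] = {1e16, 1.0, -1e16, 1.0, 1e16, 1.0, -1e16, 1.0};
        /* prints different totals (1 vs 0) for the same inputs */
        printf("seq: %g  pairwise: %g\n", sum_seq(v, 8), sum_pairwise(v, 8));
        return 0;
    }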