This is a bit of an irritation of mine. To be clear, if a function is nondiffere...

This is a bit of an irritation of mine. To be clear, if a function is nondifferentiable, automatic differentiation will not magically fix this. Ever.

There are plenty of nondifferentiable, elementary functions. For example, `sqrt` is nondifferentiable at 0. Depending on if you want to include it as an elementary function, `abs` is another. Now, what people get away with is that really pathological functions such as the Dirichlet function aren't readily representable on a computer. Most of the time, points of discontinuity or nondifferentiability occur at a small finite number of locations. As such, we pretend like we can just ignore these points.

The unfortunate part is that these points of nondifferentiability to come up quite often because a lot of the interesting action in a function occurs near these points. Take `abs`, if we were doing something like `min abs(x)` we'd be cruising along moving toward the origin until we hit the kink, which turns out to be the interesting part of this function. At this point, most algorithms will have trouble.

Now, we could also say that, "Well, a subderivative/subgradient exists and the routine can just return that and it will be fine." That's all well and true if you're using an optimization algorithm that correctly deals with subderivatives and is notified of them. Most optimization algorithms do not handle subderivatives. The ones that do generally need to know that there are many subderivatives in a particular location and get to choose the one they want. A typical AD routine will not return this information.

If you want an example from the deep learning field, look at the old sigmoid activation functions. If we set aside the pretense that this is a simulation of the brain, sigmoids are a differentiable representation of the step (Heaviside) function. Recall how these algorithms had trouble with vanishing gradients? Well, the derivative of the step function is zero everywhere except at the origin where it doesn't exist. How are we supposed to optimize with that? We can't. That's why we used sigmoid functions, at the time.

Ok, does that mean we can't use nondifferentiable functions when optimizing? No. However, I would encourage everyone to clearly document the points of nondifferentiability because it makes debugging these issues vastly easier. Outside of this, the way we fix them depends on a point by point case. For example, if we wanted to solve the problem `min abs(f(x))` we could instead solve the problem `min y st y>=f(x) y>=-f(x)`. Assuming that `f` is differentiable, the new optimization problem is as well, but we now have to handle inequality constraints. For a soft variety, I like `sqrt(1+f(x)^2)-1`. In the deep learning field, step functions are replaced by sigmoids and linear rectifiers are replaced by soft rectifiers. And, yes, there are plenty of examples where soft rectifiers weren't used and things worked great. That's great when it works. I'll contend that such techniques don't have to break every time for it to be a problem.

Anyway, the bottom line is that nondifferentiability is a problem because virtually all gradient based optimization solvers require this property. Subgradients are only good enough if the underlying algorithm explicitly works with this information and most solvers don't. None of this means that everything has to break every time. Certainly, if we find an optimal solution away from points of nondifferentiability, we're good. However, it does create some really difficult to debug situations. Finally, in case it wasn't clear, yes, the article is talking about deep learning, but the kind of deep learning techniques under discussion are using optimization solvers to fit their particular model to the data, so the general discussion holds.