
Sorry, hiddencost is correct here.

While we understand the math of backpropagation, we don't really understand why it lets neural networks converge as well as they do. If you asked a mathematician, they'd point out that yes, backpropagation can work, but there are no guarantees, not even probabilistic ones, that it will. And yet it does work, often enough and in enough different cases to be useful.

Yes, interpretability can be an issue, but that's not what is being discussed here.

There have been some attempts at making models which humans can interpret. One program, Eureqa, fits the simplest mathematical expression it can to a set of samples. A biologist tried it on his data and found that it produced an expression which actually fit the data really well. But he couldn't publish it, because he couldn't explain why it worked. It just did. There was no understanding, no explanation.

This is a pretty well understood issue in the ML community. Different tools have different strengths: regression (sometimes), tree-based classifiers, and Bayesian nets are nice for explaining your data, but other methods (SVMs, neural networks, etc.) can be better for prediction in some circumstances.



Gradient descent is guaranteed to move the parameters toward a locally better solution; that's almost the definition of GD. I guess you are saying that we don't know why it doesn't just get stuck in local optima?
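To make the "moves the parameters downhill" claim concrete, here's a minimal toy sketch (my own example, not from the thread): plain gradient descent on the one-dimensional convex loss f(w) = (w - 3)^2, where each step with a small enough learning rate cannot increase the loss.

```python
# Toy sketch of plain gradient descent on the convex loss
# f(w) = (w - 3)^2. Each update moves w downhill along the gradient,
# so the loss is non-increasing here; on a non-convex surface the
# same rule can instead stall at a local optimum.
def grad(w):
    return 2.0 * (w - 3.0)  # derivative of (w - 3)^2

w, lr = 0.0, 0.1
losses = []
for _ in range(50):
    losses.append((w - 3.0) ** 2)
    w -= lr * grad(w)  # w_{t+1} = w_t - lr * f'(w_t)
```

On this convex problem the iterates contract geometrically toward the minimizer w = 3; the interesting question in the thread is why anything similar happens on the wildly non-convex loss surfaces of deep nets.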

I'm not certain, but for one, stochastic gradient descent is usually used, which helps break out of local optima. And second, local optima don't seem to be as much of a problem as you add more hidden units. I was under the impression there was a mathematical explanation for both of those, though I haven't done much research.
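A hedged sketch of the first point (my own toy example, with made-up data): SGD estimates the gradient from one random sample per step instead of the full dataset, so its updates are noisy. That noise is the informal reason it can jitter out of shallow local optima; on a convex problem like the least-squares fit below, it still converges close to the true parameter.

```python
# Stochastic gradient descent for one-parameter least squares on
# synthetic data y = 2x. Each step uses the gradient on a single
# randomly chosen sample, so individual updates are noisy, yet the
# iterate still homes in on the true weight.
import random

random.seed(0)
true_w = 2.0
data = [(x, true_w * x) for x in [random.uniform(-1, 1) for _ in range(200)]]

w, lr = 0.0, 0.1
for _ in range(2000):
    x, y = random.choice(data)      # one random sample per step
    g = 2.0 * (w * x - y) * x       # d/dw of the per-sample loss (w*x - y)^2
    w -= lr * g
```

Full-batch GD on the same data would take a smooth path; the per-sample version wanders but lands in the same place here, and on non-convex surfaces that wandering is what lets it hop over small bumps.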

Off the top of my head: as you add more and more hidden units, a subset of them will fit the function well just by chance, and gradient descent will only strengthen their parameters. In fact, as long as the parameters start in some large basin around a correct solution, GD can move them downhill to the optimal solution. Not that we even want the optimal solution, just a good enough one.


At this point I'm going to point you to [1]. They interview Ilya Sutskever (ex Toronto, ex Google Brain, now director of OpenAI). He knows a little about neural networks[2] and possibly knows as much as anyone else about gradient descent.

In [1] he talks a lot about how there is no theoretical basis to think that a deep neural network should converge, and prior to around 2006 the accepted wisdom was that networks deep enough to outperform other methods of machine learning were useless because they couldn't be trained. Then they discovered that using larger random values for initialization means that they do converge, but that there is no theoretical basis to explain this at all.

[1] http://www.thetalkingmachines.com/blog/2015/1/15/machine-lea...

[2] https://scholar.google.com.au/citations?user=x04W_mMAAAAJ&hl...


I believe that has to do with the vanishing gradient issue which has been studied.

As for why nets converge at all, there's this paper which I believe tries to establish some theory of why bigger nets don't have such a big problem with local optima: http://arxiv.org/abs/1412.0233


Yes, exactly! Trying to understand why this happens is a very active area of research.


Neural nets virtually always get stuck in local optima. I have no idea what you're talking about.


http://arxiv.org/abs/1412.0233

>The number of local minima outside that band diminishes exponentially with the size of the network. We empirically verify that the mathematical model exhibits similar behavior as the computer simulations, despite the presence of high dependencies in real networks. We conjecture that both simulated annealing and SGD converge to the band of low critical points, and that all critical points found there are local minima of high quality measured by the test error. This emphasizes a major difference between large- and small-size networks where for the latter poor quality local minima have non-zero probability of being recovered.


Your link supports my point, not yours. I'm aware of research showing that local minima are close to the global minimum. That does not mean neural nets usually converge to the global minimum, only that the local minima they converge to are close to it.

> Finally, we prove that recovering the global minimum becomes harder as the network size increases



