
Sorry, hiddencost is correct here.

While we understand the math of backpropagation, we don't really understand why it lets neural networks converge as well as they do. If you asked a mathematician, they'd point out that yes, backpropagation can work, but there are no guarantees, not even probabilistic ones, that it will. And yet it does work, often enough and in enough different cases to be useful.

Yes, interpretability can be an issue, but that's not what is being discussed here.

There have been some attempts at making models which humans can interpret. One program, Eureqa, fits the simplest mathematical expression it can to a set of samples. A biologist tried it on his data and found that it produced an expression which actually fit the data really well. But he couldn't publish it, because he couldn't explain why it worked. It just did. There was no understanding, no explanation.

This is a pretty well understood issue in the ML community. Different tools have different strengths: regression (sometimes), tree-based classifiers, and Bayesian nets are nice for explaining your data, but other methods (SVMs, neural networks, etc.) can be better for prediction in some circumstances.



Gradient descent is guaranteed to move the parameters toward a locally better solution; that's almost the definition of GD. I guess you are saying that we don't know why it doesn't just get stuck in local optima?
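To make the "moves the parameters downhill" claim concrete, here's a minimal toy sketch (my own example, not from the thread): plain gradient descent on the one-dimensional convex loss f(w) = (w - 3)^2, where each step with a small enough learning rate cannot increase the loss.

```python
# Toy sketch of plain gradient descent on the convex loss
# f(w) = (w - 3)^2. Each update moves w downhill along the gradient,
# so the loss is non-increasing here; on a non-convex surface the
# same rule can instead stall at a local optimum.
def grad(w):
    return 2.0 * (w - 3.0)  # derivative of (w - 3)^2

w, lr = 0.0, 0.1
losses = []
for _ in range(50):
    losses.append((w - 3.0) ** 2)
    w -= lr * grad(w)  # w_{t+1} = w_t - lr * f'(w_t)
```

On this convex problem the iterates contract geometrically toward the minimizer w = 3; the interesting question in the thread is why anything similar happens on the wildly non-convex loss surfaces of deep nets.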

I'm not certain, but for one, stochastic gradient descent is usually used, which helps break out of local optima. And second, local optima don't seem to be as much of a problem as you add more hidden units. I was under the impression there was a mathematical explanation for both of those, though I haven't done much research.
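A hedged sketch of the first point (my own toy example, with made-up data): SGD estimates the gradient from one random sample per step instead of the full dataset, so its updates are noisy. That noise is the informal reason it can jitter out of shallow local optima; on a convex problem like the least-squares fit below, it still converges close to the true parameter.

```python
# Stochastic gradient descent for one-parameter least squares on
# synthetic data y = 2x. Each step uses the gradient on a single
# randomly chosen sample, so individual updates are noisy, yet the
# iterate still homes in on the true weight.
import random

random.seed(0)
true_w = 2.0
data = [(x, true_w * x) for x in [random.uniform(-1, 1) for _ in range(200)]]

w, lr = 0.0, 0.1
for _ in range(2000):
    x, y = random.choice(data)      # one random sample per step
    g = 2.0 * (w * x - y) * x       # d/dw of the per-sample loss (w*x - y)^2
    w -= lr * g
```

Full-batch GD on the same data would take a smooth path; the per-sample version wanders but lands in the same place here, and on non-convex surfaces that wandering is what lets it hop over small bumps.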

Off the top of my head: as you add more and more hidden units, a subset of them will fit the function well just by chance, and gradient descent will only strengthen their parameters. In fact, as long as the parameters start in some large basin around a correct solution, GD can move them downhill to the optimal solution. Not that we even want the optimal solution, just a good enough one.


At this point I'm going to point you to [1]. They interview Ilya Sutskever (ex Toronto, ex Google Brain, now director of OpenAI). He knows a little about neural networks[2] and possibly knows as much as anyone else about gradient descent.

In [1] he talks a lot about how there is no theoretical basis to think that a deep neural network should converge, and prior to around 2006 the accepted wisdom was that networks deep enough to outperform other methods of machine learning were useless because they couldn't be trained. Then they discovered that using larger random values for initialization means that they do converge, but that there is no theoretical basis to explain this at all.

[1] http://www.thetalkingmachines.com/blog/2015/1/15/machine-lea...

[2] https://scholar.google.com.au/citations?user=x04W_mMAAAAJ&hl...


I believe that has to do with the vanishing gradient issue which has been studied.

As for why nets converge at all, there's this paper which I believe tries to establish some theory of why bigger nets don't have such a big problem with local optima: http://arxiv.org/abs/1412.0233


Yes, exactly! Trying to understand why this happens is a very active area of research.


Neural nets virtually always get stuck in local optima. I have no idea what you're talking about.


http://arxiv.org/abs/1412.0233

>The number of local minima outside that band diminishes exponentially with the size of the network. We empirically verify that the mathematical model exhibits similar behavior as the computer simulations, despite the presence of high dependencies in real networks. We conjecture that both simulated annealing and SGD converge to the band of low critical points, and that all critical points found there are local minima of high quality measured by the test error. This emphasizes a major difference between large- and small-size networks where for the latter poor quality local minima have non-zero probability of being recovered.


Your link supports my point, not yours. I'm aware of research showing that local minima are close to the global minimum. That does not mean neural nets usually converge to the global minimum, only that the local minima they converge to are close to it.

> Finally, we prove that recovering the global minimum becomes harder as the network size increases



