That's what I thought too! But according to my friends on the Google Brain team, unsupervised pretraining is now thought to be an irrelevant detour.
In 2006, Hinton introduced greedy layer-wise pretraining, which was intended to solve the problem of backpropagation getting stuck in poor local optima. The theory was that you'd pretrain to find a good initial set of connection weights, then apply backprop to "fine-tune" discriminatively. And the theory seemed correct since the experimental results were good:
http://www.cs.toronto.edu/~hinton/absps/fastnc.pdf http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS20...
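(To make the recipe concrete, here's a rough numpy sketch of the idea as I understand it, with a tied-weight autoencoder standing in as the per-layer learner. It's only meant to show the shape of the procedure, not Hinton's actual code or any library's API.)

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def pretrain_layer(X, n_hidden, lr=0.1, epochs=50, rng=np.random):
        # Train one autoencoder layer: learn W so the hidden code
        # sigmoid(X @ W + b_h) can reconstruct X. Plain squared error
        # and batch gradient descent, just for illustration.
        n_visible = X.shape[1]
        W = 0.01 * rng.randn(n_visible, n_hidden)
        b_h = np.zeros(n_hidden)
        b_v = np.zeros(n_visible)
        for _ in range(epochs):
            H = sigmoid(X @ W + b_h)           # encode
            X_hat = sigmoid(H @ W.T + b_v)     # decode (tied weights)
            d_out = (X_hat - X) * X_hat * (1 - X_hat)
            d_hid = (d_out @ W) * H * (1 - H)
            W -= lr * (X.T @ d_hid + d_out.T @ H) / len(X)
            b_v -= lr * d_out.mean(axis=0)
            b_h -= lr * d_hid.mean(axis=0)
        return W, b_h

    def greedy_pretrain(X, layer_sizes):
        # Stack layers one at a time: each new layer is trained, without
        # labels, on the representation produced by the layers below it.
        # The resulting weights then initialize the full network before
        # supervised backprop "fine-tuning".
        weights, inputs = [], X
        for n_hidden in layer_sizes:
            W, b = pretrain_layer(inputs, n_hidden)
            weights.append((W, b))
            inputs = sigmoid(inputs @ W + b)
        return weights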
Does pretraining truly help solve the problem of poor local optima? In 2010, some empirical studies suggested the answer was yes: http://machinelearning.wustl.edu/mlpapers/paper_files/AISTAT...

But that same year, a student in Geoff Hinton's lab discovered that if you added information about the 2nd derivatives of the loss function to backpropagation ("Hessian-free optimization"), you could skip pretraining and get the same or better results:
http://machinelearning.wustl.edu/mlpapers/paper_files/icml20...
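(The core trick, as I understand it, is that you never build the full Hessian, only Hessian-vector products that get fed to an inner conjugate-gradient loop. Toy sketch below; IIRC the paper's actual method uses the R-operator / Gauss-Newton products plus damping, so treat this as a cartoon rather than the real implementation.)

    import numpy as np

    def hessian_vector_product(grad_fn, w, v, eps=1e-4):
        # Central-difference approximation of H @ v: the Hessian is never
        # formed explicitly, only its product with a direction v, which is
        # all that conjugate gradient needs to choose an update.
        return (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2 * eps)

    # Tiny check on a quadratic loss L(w) = 0.5 * w @ A @ w, whose Hessian is A:
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    grad = lambda w: A @ w
    v = np.array([1.0, -1.0])
    print(hessian_vector_product(grad, np.zeros(2), v))  # ~ A @ v = [2., -1.]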
And around 2012, a bunch of researchers reported that you don't even need 2nd-derivative information. You just have to initialize the neural net properly. Apparently, all the most recent results in speech recognition just use standard backpropagation with no unsupervised pretraining. (Although people are still trying more complex variants of unsupervised pretraining algorithms, often involving multiple types of layers in the neural network.)
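(By "properly" I believe people mostly mean scaling the random initial weights to each layer's size so activations and gradients neither blow up nor die out across layers. A minimal numpy sketch of that kind of heuristic, my reading of the Glorot & Bengio 2010 "normalized initialization", so take the exact constant with a grain of salt:)

    import numpy as np

    def init_layer(n_in, n_out, rng=np.random):
        # Scale the random weights by the layer's fan-in/fan-out so that
        # activations and gradients keep roughly constant variance from
        # layer to layer.
        limit = np.sqrt(6.0 / (n_in + n_out))
        W = rng.uniform(-limit, limit, size=(n_in, n_out))
        b = np.zeros(n_out)
        return W, b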
So now, after seven years of work, we're back where we started: the plain ol' backpropagation algorithm from 1974 worked all along.
This whole topic is really interesting to me from a history of science perspective. What other old, discarded ideas from the past might be ripe, now that we have millions of times more data and computation?
Yes, this is really interesting. I haven't read those other papers yet (definitely plan to now, thanks for the links), but Bengio's latest paper on denoising autoencoders from earlier this year (http://arxiv.org/abs/1305.6663) still uses unsupervised pretraining. The Theano implementation I run experiments with uses it as well (though that code could be a year or two old).
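(The denoising trick itself is small; here's my own rough numpy sketch of the objective, assuming binary-ish inputs and tied weights, not the Theano tutorial code:)

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def dae_loss(x, W, b_h, b_v, corruption=0.3, rng=np.random):
        # Denoising autoencoder objective: zero out a random subset of
        # the input dimensions, encode the corrupted input, then score
        # how well the decoder reconstructs the *clean* input.
        mask = rng.binomial(1, 1.0 - corruption, size=x.shape)
        x_tilde = x * mask
        h = sigmoid(x_tilde @ W + b_h)        # encoder
        x_hat = sigmoid(h @ W.T + b_v)        # tied-weight decoder
        eps = 1e-9                            # numerical safety for the log
        return -np.mean(x * np.log(x_hat + eps) + (1 - x) * np.log(1 - x_hat + eps))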
Definitely going to be researching this more throughout the year.
Very interesting. I was not aware that unsupervised pretraining was a distant second to the availability of data and FLOPs. So really, deep learning is essentially the same old MLP of recent peasant-like status (the 90's): stacks of backpropagating perceptrons with the ancient logistic regression on top, now with more stacking! This makes sense.
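(Concretely, that whole stack is just this at forward-pass time; a throwaway numpy sketch, not tied to any particular library:)

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def softmax(z):
        e = np.exp(z - z.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    def mlp_forward(x, hidden_layers, W_out, b_out):
        # "Stacking" in a nutshell: each hidden layer is a linear map plus
        # a squashing nonlinearity, and the output layer is ordinary
        # multinomial logistic regression over whatever representation the
        # stack produced.
        h = x
        for W, b in hidden_layers:
            h = sigmoid(h @ W + b)
        return softmax(h @ W_out + b_out)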
Machine learning is really just a form of non-human scripting. After all, every ML system running on a PC is either Turing-equivalent or less. An analogy would be something that tries to generate the minimal set of regular expressions (matching non-deterministically) that cover the given examples. The advantage of an ML model over a collection of regexes is that many interesting problems are amenable to calculus (optimization) or counting (probability, integration, etc.).
So like good notation, the stacking allows more complicated things to be said more compactly. But more complicated things need more explanation and more thinking to understand.
> And around 2012, a bunch of researchers reported that you don't even need 2nd-derivative information. You just have to initialize the neural net properly.
This sounds very interesting. How do you properly initialize the weights? Do you have a link to a paper about this?