Annealed dropout training of deep networks
Abstract
Recently it has been shown that when training neural networks on a limited amount of data, randomly zeroing, or 'dropping out', a fixed percentage of the outputs of a given layer for each training case can improve test set performance significantly. Dropout training discourages the detectors in the network from co-adapting, which limits the capacity of the network and prevents overfitting. In this paper we show that annealing the dropout rate from a high initial value to zero over the course of training can substantially improve the quality of the resulting model. As dropout (approximately) implements model aggregation over an exponential number of networks, this procedure effectively initializes the ensemble of models that will be learned during a given iteration of training with an ensemble of models that has a lower average number of neurons per network and a higher variance in the number of neurons per network. This regularizes the structure of the final model toward models that avoid unnecessary co-adaptation between neurons. Importantly, this regularization procedure is stochastic, and so promotes the learning of 'balanced' networks with neurons that have high average entropy and low variance in their entropy, by smoothly transitioning from 'exploration' with high dropout rates to 'fine tuning' with full support for co-adaptation between neurons where necessary. Experimental results demonstrate that annealed dropout leads to significant reductions in word error rate over standard dropout training.
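The abstract does not specify the form of the annealing schedule. As one concrete illustration only (not necessarily the authors' exact recipe), the sketch below linearly decays the dropout probability from an assumed initial value to zero over training and applies standard inverted dropout at the current rate; the function names annealed_dropout_rate and dropout_forward and the parameters p_init and p_final are hypothetical.

```python
import numpy as np

def annealed_dropout_rate(t, total_steps, p_init=0.5, p_final=0.0):
    # Illustrative linear schedule (an assumption): anneal the dropout
    # probability from p_init to p_final as training step t goes from
    # 0 to total_steps.
    frac = min(t / float(total_steps), 1.0)
    return p_init + frac * (p_final - p_init)

def dropout_forward(activations, p_drop, rng):
    # Standard inverted dropout: zero each unit with probability p_drop
    # and rescale the survivors so the expected activation is unchanged.
    if p_drop <= 0.0:
        return activations
    keep = 1.0 - p_drop
    mask = rng.binomial(1, keep, size=activations.shape)
    return activations * mask / keep

# Usage sketch: the dropout rate shrinks to zero as training proceeds,
# so late iterations train the full network with no units dropped.
rng = np.random.default_rng(0)
h = rng.standard_normal((4, 8))  # a batch of hidden activations
total_steps = 100
for t in (0, 50, 100):
    p = annealed_dropout_rate(t, total_steps, p_init=0.5)
    print(t, p, dropout_forward(h, p, rng).mean())
```

With this kind of schedule, early training behaves like aggressive dropout (an ensemble of small, high-variance subnetworks), while the final iterations train the full network with no units dropped, allowing co-adaptation where it is actually needed.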