Model Compression and Acceleration for Deep Neural Networks: The Principles, Progress, and Challenges
Abstract
In recent years, deep neural networks (DNNs) have received increased attention, have been applied to different applications, and achieved dramatic accuracy improvements in many tasks. These works rely on deep networks with millions or even billions of parameters, and the availability of graphics processing units (GPUs) with very high computation capability plays a key role in their success. For example, Krizhevsky et al. [1] achieved breakthrough results in the 2012 ImageNet Challenge using a network containing 60 million parameters with five convolutional layers and three fully connected layers. Usually, it takes two to three days to train the whole model on the ImagetNet data set with an NVIDIA K40 machine. In another example, the top face-verification results from the Labeled Faces in the Wild (LFW) data set were obtained with networks containing hundreds of millions of parameters, using a mix of convolutional, locally connected, and fully connected layers [2], [3]. It is also very time-consuming to train such a model to obtain a reasonable performance. In architectures that only rely on fully connected layers, the number of parameters can grow to billions [4].