In this blog post we will look at methods for compressing deep neural network models. The simple fact that bigger and deeper models train better leads to models that take up considerable space in memory. However, memory is usually limited, whether by the hardware budget or, as has become increasingly common, by the constraints of the mobile devices that models are now developed for.
Another important point for deploying neural networks in applications is inference time.
In general, larger models also mean longer inference time.
Fortunately, deep neural networks learn a lot of redundant parameters, and several methods for compressing trained models without loss of accuracy are under active research.
compression, pruning, quantization, Huffman coding, distillation, teacher model
From a very practical point of view, downloading a new image recognition model via an app store is usually limited to a few MB.
And even once the model is running on your mobile device, its size determines what kind of memory unit can be used and, in turn, how much energy the processing unit consumes, whether that is a CPU, GPU or mobile GPU.
The aforementioned redundancy in the weights of trained deep neural networks allows for a number of compression methods, which we are going to discuss.
By analogy with the development of the human brain, a method called pruning tries to reduce the number of parameters of trained neural networks.
The human brain forms a huge number of synapses from birth until the first year.
By the age of ten, however, this number shrinks again to roughly half.
This makes it plausible to assume that many connections are no longer needed once the brain has learned certain tasks, such as speaking.
Pruning hence tries to remove redundant connections between neurons in neural networks, even discarding some neurons entirely and sharing weights among the remaining ones.
In general, this allows compression rates of up to 10x without loss of accuracy, for both deep convolutional and recurrent neural networks.
Looking at individual layers, fully connected layers can be pruned by up to 96% of their parameters, convolutional ones typically by around 60%.
For the MNIST data set one finds that pruning happens mostly at the borders of the image, since the digits are mostly located in the center.
Retraining and regularizing the parameters of a model after pruning is vital to maintaining its accuracy.
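As a minimal sketch of the idea, magnitude-based pruning can be illustrated with NumPy: weights whose absolute value falls below a threshold are zeroed out, and a binary mask records which connections survive so they can be kept at zero during retraining. The sparsity level and layer shape below are illustrative assumptions, not values from the text.

```python
import numpy as np

def prune_by_magnitude(weights, sparsity=0.9):
    """Zero out the smallest-magnitude weights.

    sparsity: fraction of weights to remove (illustrative choice).
    Returns the pruned weights and a binary mask that can be used
    to keep pruned connections at zero during retraining.
    """
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = (np.abs(weights) > threshold).astype(weights.dtype)
    return weights * mask, mask

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256))                    # toy fully connected layer
pruned, mask = prune_by_magnitude(w, sparsity=0.96)
print(f"remaining parameters: {mask.mean():.0%}")  # ~4% of weights survive
```

In practice the mask is applied again after every retraining step, so that pruned connections stay removed while the surviving weights adapt.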
Another compression method is known by the name of quantization and deals with the way numbers are stored during computation.
In the early days of neural networks, high-precision floating-point calculations were considered necessary to preserve accuracy.
Typically, all modern deep learning frameworks use 32-bit floats for storing parameters and inputs during computation.
However, recent results indicate that inference is stable at lower precision, and even training low-precision models seems to be within reach.
Quantization works on high-precision models by clustering weights according to their values and then generating a code book.
For each weight only a cluster index is stored, while the code book maps indices to low-precision values, called centroids.
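A minimal sketch of this clustering step, using scikit-learn's k-means as the clustering algorithm (an assumption on my part; the bit width and layer shape are likewise illustrative): each weight is replaced by a small cluster index, and only the short code book of centroids is kept at higher precision.

```python
import numpy as np
from sklearn.cluster import KMeans

def quantize_weights(weights, bits=4):
    """Cluster weights into 2**bits centroids (the code book).

    Each weight is replaced by the index of its nearest centroid;
    only the small code book is stored at full precision.
    """
    k = 2 ** bits
    flat = weights.reshape(-1, 1)
    km = KMeans(n_clusters=k, n_init=1, random_state=0).fit(flat)
    codebook = km.cluster_centers_.ravel()   # 2**bits centroid values
    indices = km.labels_.astype(np.uint8)    # one small index per weight
    return codebook, indices

def dequantize(codebook, indices, shape):
    """Reconstruct an approximate weight matrix from the code book."""
    return codebook[indices].reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))                # toy weight matrix
codebook, idx = quantize_weights(w, bits=4)
w_hat = dequantize(codebook, idx, w.shape)
print("codebook size:", codebook.size)       # 16 centroids for 4 bits
print("max reconstruction error:", np.abs(w - w_hat).max())
```

Storing 4-bit indices instead of 32-bit floats already gives roughly an 8x reduction for this layer, before any entropy coding is applied.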
An 8-bit fixed-point format seems to be a lower bound for now.
Results for image recognition models like AlexNet or VGG16 show no loss in accuracy for 8/5-bit fixed-point quantization, and only small losses for 8/4- and even 4/2-bit quantization.
Additionally, quantization can be combined with pruning, so even higher compression rates can be achieved.
From an information-theoretic point of view, cluster indices and centroid values can be encoded even more efficiently with Huffman coding, where more common indices or values get fewer bits and less common ones more.
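The idea can be sketched with Python's standard library: build a Huffman tree over the symbol frequencies so that frequent cluster indices receive short bit strings. The skewed toy index stream below is an illustrative assumption.

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a Huffman code: frequent symbols get shorter bit strings."""
    freq = Counter(symbols)
    if len(freq) == 1:                       # degenerate case: one symbol
        return {next(iter(freq)): "0"}
    # heap entries: (frequency, tie-breaker, {symbol: code-so-far})
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # merge the two rarest subtrees
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

# toy stream of cluster indices with a skewed distribution
indices = [0] * 50 + [1] * 30 + [2] * 15 + [3] * 5
codes = huffman_code(indices)
total_bits = sum(len(codes[s]) for s in indices)
print(codes, total_bits)   # 170 bits vs 200 for a fixed 2-bit code
```

The savings grow with how skewed the index distribution is, which is exactly the situation after pruning and quantization, where a few cluster indices dominate.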
Overall, applying pruning, quantization and Huffman coding leads to remarkable compression rates of 35x for AlexNet and 49x for VGG16.
Retraining models after applying pruning or quantization is of great importance, as mentioned earlier.
Another method that relies on retraining is called distillation.
Distillation tries to transfer the knowledge of a trained model, called the teacher model, to an untrained student model with fewer parameters.
The training procedure for the student model can be optimized by using the teacher model’s probabilities as soft targets instead of the one-hot targets present in the data.
The soft targets contain additional information about the similarity between classes, which reduces the variance of the gradients during training and helps the student model learn faster.
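A common way to realize this, sketched here with NumPy, is to mix the usual cross-entropy on the hard labels with a cross-entropy against the teacher's softened probabilities. The temperature and mixing weight below are illustrative assumptions, as are the toy logits.

```python
import numpy as np

def softmax(z, T=1.0):
    """Softmax with temperature T; higher T gives softer probabilities."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)    # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Mix hard-label cross-entropy with cross-entropy against the
    teacher's softened probabilities (soft targets).

    T (temperature) and alpha (mixing weight) are illustrative choices.
    """
    p_student = softmax(student_logits)
    hard_loss = -np.log(p_student[np.arange(len(labels)), labels]).mean()
    soft_teacher = softmax(teacher_logits, T)
    log_soft_student = np.log(softmax(student_logits, T))
    soft_loss = -(soft_teacher * log_soft_student).sum(axis=-1).mean()
    return alpha * hard_loss + (1 - alpha) * soft_loss

# toy batch of 2 examples with 3 classes
teacher = np.array([[5.0, 2.0, 0.1], [0.2, 4.0, 1.0]])
student = np.array([[2.0, 1.0, 0.0], [0.5, 2.0, 0.3]])
labels = np.array([0, 1])
loss = distillation_loss(student, teacher, labels)
print(f"distillation loss: {loss:.3f}")
```

At high temperature the teacher's output spreads probability mass over the wrong classes too, and it is exactly this inter-class similarity signal that the hard one-hot labels lack.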
In the near future we will see more and more models making use of compression methods, either to fit into the small memory of mobile devices or to reduce energy costs in big data centers.