Conditional Neural Network Architectures
Today we are going to have a look at conditional neural network architectures and present some of the findings in the recent papers “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer” and “PathNet: Evolution Channels Gradient Descent in Super Neural Networks”.
The interest in conditional models stems mainly from their ability to hold far more parameters than standard architectures without a corresponding increase in the computation required per input, and hence without the need for more powerful hardware.
Furthermore, such models seem able to reduce training time and are of interest for multi-task learning, online learning and transfer learning.
gating mechanism, general artificial intelligence, mixture-of-experts, distributed network architecture, multi-task learning
Conditional architectures rely heavily on gating mechanisms, which allow them to shut down certain areas of the network and process an input with specific submodules only.
Gating mechanisms come in different forms: binary, sparse or continuous, and stochastic or deterministic. Depending on the type, one can train them with standard backpropagation or has to resort to reinforcement learning or possibly evolutionary algorithms.
Experimenting with conditional architectures raises several challenges.
Batches are usually split up among the submodules, and the reduced batch sizes decrease the computational efficiency of the active submodules.
If the modules are distributed over different machines, network bandwidth becomes important for synchronizing gradients and outputs between units.
Moreover, the gating mechanism makes it necessary to introduce new terms in the loss function to achieve the desired sparsity, which increases the complexity of training.
Finally, an increase in model parameters is usually pointless if we deal with small data sets, as is the case for vision tasks.
An architecture which tries to deal with these problems was recently introduced under the name of sparsely-gated mixture-of-experts model.
So-called mixture-of-experts models can be built from any model, or any combination of models, one can think of.
An expert is then just a model which takes care of a certain subtask necessary to solve the main task.
In principle one can also think of hierarchical mixture-of-experts models, where an expert is itself a mixture of experts.
The above-mentioned paper looks in particular at distributed architectures and investigates simple feedforward and convolutional layers for the experts.
After a common embedding layer, several experts in an expert layer are activated by a gating network; their outputs, weighted by the gate values, are summed and processed in a task-dependent layer.
The gating network decides which of the experts should deal with a certain input and does so by generating a sparse vector.
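The gating step can be illustrated with a minimal sketch in plain Python. It assumes a top-k scheme in which only the k largest gate logits survive and are renormalized with a softmax; the paper additionally adds noise to the logits, which is left out here. The names `top_k_gates`, `moe_layer` and `gate_logits_fn`, as well as the toy experts, are hypothetical and chosen only for illustration.

```python
import math

def top_k_gates(logits, k):
    """Sparse gates: softmax over the k largest logits, all others are dropped.

    Returns a dict {expert_index: gate_value}; missing indices mean gate 0.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_layer(x, experts, gate_logits_fn, k=2):
    """Gate-weighted sum of the outputs of the k selected experts."""
    gates = top_k_gates(gate_logits_fn(x), k)
    out = [0.0] * len(x)
    for i, gate in gates.items():
        # Only the selected experts are evaluated; the rest cost nothing.
        y = experts[i](x)
        out = [acc + gate * v for acc, v in zip(out, y)]
    return gates, out

# Toy usage: four "experts" that simply scale their input (assumption).
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
gate_logits_fn = lambda x: [0.1, 2.0, -1.0, 2.0]  # stands in for a learned gating network
gates, y = moe_layer([1.0, -1.0], experts, gate_logits_fn, k=2)
```

In this toy run only experts 1 and 3 are evaluated, which is where the saving in computation comes from.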
In general the gating mechanism is trained together with the whole network and hence tends to favor experts that are already better trained, which reinforces itself during training.
To overcome this imbalance one has to introduce additional terms in the loss function which penalize the variation in importance across the experts for a processed batch.
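One way to write such a balancing term is sketched below. It assumes that the importance of an expert is the sum of its gate values over the batch and that the penalty is the squared coefficient of variation of these importances, scaled by a weight. The function name `importance_loss` and the default `weight=0.1` are assumptions made for this example.

```python
def importance_loss(gate_batch, weight=0.1):
    """Penalty on unequal expert usage within one batch.

    gate_batch: list of gate vectors, one per example, each of length n_experts.
    weight: hypothetical scaling factor of the penalty in the total loss.
    """
    n_experts = len(gate_batch[0])
    # Importance of an expert = sum of its gate values over the batch.
    importance = [sum(g[i] for g in gate_batch) for i in range(n_experts)]
    mean = sum(importance) / n_experts
    var = sum((v - mean) ** 2 for v in importance) / n_experts
    # Squared coefficient of variation; the small constant avoids division by zero.
    return weight * var / (mean ** 2 + 1e-10)

balanced = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
collapsed = [[1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]
loss_balanced = importance_loss(balanced)
loss_collapsed = importance_loss(collapsed)
```

The penalty vanishes when all experts receive equal total gate weight and grows when the gating collapses onto a single expert.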
To increase computational efficiency in the network, and thereby allow for shorter training times, a certain combination of data and model parallelism is necessary.
Data parallelism in the non-expert layers, such as the embedding layer, allows several batches to be processed simultaneously, so that the activated experts do not have to deal with batches that are too small.
If the conditioning happens, for example, at the word level, it may make sense to process the whole sentence first and increase the number of samples sent to an expert by also batching them over the recurrent steps.
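The effect on the batch size seen by a single expert can be made concrete with a little arithmetic. The sketch assumes that the gating spreads examples evenly, so that each of n experts receives roughly k * b / n examples when k experts are active per example and the batch size is b. The function name and all numbers are illustrative assumptions, not values from the paper.

```python
def expected_expert_batch(batch_size, n_experts, k, n_replicas=1, n_steps=1):
    """Approximate number of examples one expert sees per step.

    Assumes perfectly balanced gating. n_replicas models data-parallel copies
    of the non-expert layers feeding the same experts, n_steps models batching
    over recurrent time steps.
    """
    return k * batch_size * n_replicas * n_steps / n_experts

single = expected_expert_batch(batch_size=256, n_experts=64, k=4)
with_replicas = expected_expert_batch(256, 64, 4, n_replicas=16)
with_steps = expected_expert_batch(256, 64, 4, n_steps=20)
```

With these toy numbers an expert sees only 16 examples per step on a single replica, but 256 once 16 data-parallel replicas feed the same experts, and 320 when 20 recurrent steps are batched together.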
As already stated, the largest data sets for vision tasks contain around a million examples and are not big enough to test the increased model capacity.
Hence the authors base their experiments on language modeling and machine translation.
Available data sets easily contain over one billion words, and for machine translation several language-pair data sets can be combined.
The architecture used then consists of two LSTM layers with a layer of experts in between.
The results show that perplexity improves with the number of experts up to a certain point, while computational efficiency remains stable.
On the machine translation task, both the encoder and the decoder are augmented with expert layers between the LSTM layers.
This is found to reduce training time substantially.
Surprisingly, a mixture-of-experts model trained on a multilingual corpus is able to beat monolingual models on more than half of the language pairs.
The reason for including the second paper, even though it deals with a different set of tasks, is that it can be considered a stacked mixture-of-experts model trained by an evolutionary algorithm.
After a certain number of gradient descent training steps, the algorithm determines whether a chosen path, meaning a certain set of experts in each expert layer, is more successful in solving a task than another one.
If so, the less successful path is overwritten with the winning one and the copy is mutated, which makes exploration possible.
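The selection step can be sketched as a binary tournament in plain Python. This is a simplified illustration under several assumptions: a path is a list of module indices per layer, `fitness_fn` stands in for "train along the path with gradient descent, then evaluate", and the mutation simply resamples a module index with a small probability, which may differ from the operator used in the paper. All names and sizes are hypothetical.

```python
import random

def mutate(path, n_modules, rng, rate=None):
    """Resample each module index independently with a small probability."""
    n_genes = sum(len(layer) for layer in path)
    rate = 1.0 / n_genes if rate is None else rate  # assumed default rate
    return [[rng.randrange(n_modules) if rng.random() < rate else m
             for m in layer] for layer in path]

def tournament_step(population, fitness_fn, n_modules, rng):
    """Compare two random paths; the loser becomes a mutated copy of the winner."""
    a, b = rng.sample(range(len(population)), 2)
    fit_a = fitness_fn(population[a])  # placeholder for training plus evaluation
    fit_b = fitness_fn(population[b])
    winner, loser = (a, b) if fit_a >= fit_b else (b, a)
    population[loser] = mutate(population[winner], n_modules, rng)
    return winner, loser

# Toy setup (assumed sizes): 3 layers, 10 modules per layer, 3 active per layer.
rng = random.Random(0)
n_layers, n_modules, per_layer = 3, 10, 3
population = [[[rng.randrange(n_modules) for _ in range(per_layer)]
               for _ in range(n_layers)] for _ in range(8)]
# Toy fitness: number of times module 0 is used (purely illustrative).
toy_fitness = lambda path: sum(m == 0 for layer in path for m in layer)

best_before = max(toy_fitness(p) for p in population)
for _ in range(200):
    tournament_step(population, toy_fitness, n_modules, rng)
best_after = max(toy_fitness(p) for p in population)
```

Because the winner of each tournament is left untouched, the best fitness in the population never decreases in this toy setting with a deterministic fitness function.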
By the end of training, the weights of the final path are fixed.
All other parameters can be reinitialized without the network losing its knowledge for solving the task.
In this way the authors show that the network is able to learn a second task by retraining the network without updating the weights that are important for the first task.
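The freezing and reinitialization can be illustrated with a toy parameter dictionary. The sketch assumes one scalar weight per module, a uniform reinitialization range and a plain gradient step; the module names, `prepare_for_second_task` and `sgd_step` are hypothetical and only meant to show the bookkeeping.

```python
import random

def prepare_for_second_task(params, best_path_modules, rng):
    """Keep the modules on the winning path, reinitialize all others."""
    frozen = set(best_path_modules)
    new_params = {name: (w if name in frozen else rng.uniform(-0.1, 0.1))
                  for name, w in params.items()}
    return new_params, frozen

def sgd_step(params, grads, frozen, lr=0.1):
    """Gradient step that leaves frozen parameters untouched."""
    return {name: (w if name in frozen else w - lr * grads[name])
            for name, w in params.items()}

# Toy parameters: one scalar per module (assumption for brevity).
params = {"l0_m1": 0.7, "l0_m4": -0.2, "l1_m2": 0.3, "l1_m5": 0.9}
best_path = ["l0_m1", "l1_m2"]  # modules on the final path of the first task

rng = random.Random(0)
params_task2, frozen = prepare_for_second_task(params, best_path, rng)
grads = {name: 1.0 for name in params_task2}  # dummy gradients of the second task
params_after = sgd_step(params_task2, grads, frozen)
```

The weights on the first task's path survive training on the second task unchanged, while all other modules are free to adapt.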
This problem is commonly known as transfer learning.
The way it is addressed in the above paper shows improvements over fine-tuning on the second task, and even over training a network from scratch, with respect to training time and accuracy.
Both papers point to a possible direction for increasing the number of parameters in neural networks while reducing the training time by keeping computational efficiency high.
By introducing experts or submodules into the network architecture, multi-task learning seems to become feasible, giving a glimpse of general artificial intelligence.