Graphics Processing Units for Neural Networks

General Purpose Computing on GPUs – Graphics Processing Units for Neural Networks

Goal
Neural networks are typically trained on graphics processing units (GPUs), and computing architectures that combine several of them to build huge models are growing quickly.

Motivation
The main motivation for running neural network computations on GPUs is that neural networks consist of many identical units, each of which performs its computation largely independently of the rest of the architecture.
The physical architecture of GPUs, with its much larger number of cores, is well suited to performing these computations in parallel.
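
The independence of the units can be illustrated with a small sketch. The layer below, its weights, and the helper `unit_output` are hypothetical and chosen only for illustration; the thread pool stands in for the many cores of a GPU and shows that the units can be evaluated in any order, not that Python threads are fast.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import repeat

def unit_output(weights, bias, inputs):
    """One unit: weighted sum of the shared inputs, followed by a ReLU."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return max(0.0, activation)

inputs = [1.0, 2.0, 3.0]

# Hypothetical layer with four units: one weight row and one bias per unit.
layer_weights = [
    [0.5, -1.0, 0.25],
    [1.0, 1.0, 1.0],
    [-2.0, 0.0, 1.0],
    [0.0, 0.5, -0.5],
]
layer_biases = [0.0, -1.0, 0.5, 1.0]

# Sequential evaluation, one unit after another.
sequential = [
    unit_output(w, b, inputs) for w, b in zip(layer_weights, layer_biases)
]

# Concurrent evaluation: no unit needs the result of any other unit.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(
        pool.map(unit_output, layer_weights, layer_biases, repeat(inputs))
    )

print(sequential)
print(parallel == sequential)
```

Both evaluations give the same outputs, which is the property a GPU exploits when it assigns the units of a layer to different cores.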

Ingredients
CPU, RAM, PCIe, SSD, FLOPS

Steps
Most people know the central processing unit (CPU) as the component that largely determines the performance of a PC.
Traditionally, the metric to look at is the clock rate.
In recent years, however, physical limitations related to heat production have become more apparent: clock rates are stuck at around 4 GHz, and a CPU can currently handle about 200 GFLOPS.
Development has therefore shifted toward increasing the number of cores.
Nevertheless, the CPU is still optimized for sequential processing and remains the central control element in every PC architecture.

In recent years, graphics processing units have developed considerably.
Their major difference from CPUs lies in their huge number of simplified cores.
These cores allow a GPU to handle thousands of computations in parallel, as is necessary for computer graphics as well as for the computations within neural networks, and make about 8 TFLOPS possible.
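
To get a feeling for what the two FLOPS figures mean, here is a rough back-of-the-envelope sketch. The layer size is a hypothetical example, and the timings are ideal peak values that ignore memory transfers and any other overhead.

```python
def matmul_flops(m, k, n):
    """Approximate floating-point operations for an (m x k) times (k x n) product.

    Each of the m * n outputs needs about k multiplications and k additions.
    """
    return 2 * m * k * n

def ideal_seconds(flops, device_flops_per_second):
    """Time at peak throughput, ignoring memory access and launch overhead."""
    return flops / device_flops_per_second

CPU_FLOPS = 200e9  # about 200 GFLOPS, the CPU figure quoted in the text
GPU_FLOPS = 8e12   # about 8 TFLOPS, the GPU figure quoted in the text

# Hypothetical fully connected layer: mini-batch of 128, 4096 inputs, 4096 outputs.
work = matmul_flops(128, 4096, 4096)
cpu_time = ideal_seconds(work, CPU_FLOPS)
gpu_time = ideal_seconds(work, GPU_FLOPS)

print(f"operations: {work}")
print(f"CPU: {cpu_time * 1e3:.2f} ms, GPU: {gpu_time * 1e3:.3f} ms")
print(f"ideal speedup: {cpu_time / gpu_time:.0f}x")
```

Under these assumptions the GPU is ahead by the ratio of the two peak rates, a factor of 40. Real speedups depend on how well the workload keeps the cores busy.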

Since GPUs are no longer used solely for graphics processing, one now also speaks of general-purpose computing on GPUs (GPGPU).
Such architectures are employed in deep learning to carry out the computations of deep neural networks.
Usually such an architecture consists of one CPU and several GPUs.
The CPU does little computation itself; it is mostly responsible for running the program code, handling function calls, and launching function calls on the GPU.
Furthermore, it creates the mini-batches and passes them to the GPUs, where the computationally intensive steps are usually handled in the form of large matrix multiplications.
Contrary to what one might expect, performance is in general limited not by the number of FLOPS but by the rate at which memory can be read and passed between the CPU and the GPUs, and among the GPUs themselves.
For PCIe 3.0, the bandwidth between CPU RAM and GPU RAM for transferring batches or gradients is around 16 GB/s.
This bottleneck can be mitigated by asynchronous mini-batch allocation: the CPU prepares the next batch in advance and passes it to GPU RAM before the computation on the previous batch is completed.
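
The idea of preparing the next batch while the current one is still being processed can be sketched with a background thread and a bounded queue. This is a minimal CPU-only illustration, not how any particular deep learning library implements it: `make_batches`, `prefetch`, and `train_step` are hypothetical names, and `train_step` merely stands in for the GPU computation.

```python
import queue
import threading

def make_batches(num_batches, batch_size):
    """Hypothetical CPU-side batch preparation (stand-in for loading and preprocessing)."""
    for i in range(num_batches):
        yield [i * batch_size + j for j in range(batch_size)]

def prefetch(batch_iter, buffer_size=2):
    """Yield batches while a background thread already prepares the next ones."""
    buffer = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel that marks the end of the data

    def producer():
        for batch in batch_iter:
            buffer.put(batch)  # blocks while the buffer is full
        buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = buffer.get()  # blocks only if the producer has fallen behind
        if batch is done:
            return
        yield batch

def train_step(batch):
    """Stand-in for the GPU computation on one mini-batch."""
    return sum(batch)

results = [train_step(batch) for batch in prefetch(make_batches(4, 3))]
print(results)
```

The bounded buffer keeps the producer at most `buffer_size` batches ahead, so memory use stays limited while the consumer rarely has to wait.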

Another limitation might come from the hard drive of the system.
Data sets for training neural networks can be huge, which is why the data has to be loaded into memory continuously.
Ultimately, we have to provide batches to the GPU at least as fast as it processes them.
Convolutional nets on uncompressed image data can require around 300 MB/s, which exceeds the speed of standard hard drives.
That is why it might be necessary to think about switching to solid-state drives (SSDs).
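
The required read rate follows directly from the image size and the rate at which the GPU consumes batches. The sketch below uses a hypothetical workload and assumed sustained read rates for the two drive types; none of these numbers are measurements.

```python
def required_bytes_per_second(width, height, channels, batch_size, batches_per_second):
    """Bytes per second the storage must deliver to keep the GPU busy."""
    bytes_per_image = width * height * channels  # uncompressed, one byte per value
    return bytes_per_image * batch_size * batches_per_second

# Hypothetical workload: 256x256 RGB images, 128 per batch, 12 batches per second.
needed = required_bytes_per_second(256, 256, 3, 128, 12)

# Assumed sustained read rates in bytes per second, for illustration only.
drives = {"hard drive": 120e6, "SSD": 500e6}
verdict = {name: rate >= needed for name, rate in drives.items()}

print(f"needed: {needed / 1e6:.0f} MB/s")
for name, ok in verdict.items():
    print(f"{name}: {'keeps up' if ok else 'too slow'}")
```

With these assumed values the workload needs roughly 300 MB/s, which the assumed hard drive cannot sustain but the assumed SSD can.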

Apart from the architecture, there is the question of how to run the training on several GPUs.
In general, one distinguishes between data parallelism and model parallelism.
In data parallelism, each mini-batch is split and its parts are passed to different GPUs, so that each GPU runs a copy of the same model on a different part of the same mini-batch.
At the end of each step the gradients have to be synchronized between the GPUs, which for large models can quickly mean that huge amounts of data have to be exchanged.
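
Data parallelism can be sketched on a toy linear model. The model, the data, and the function names are hypothetical, the "devices" are simulated one after another on the CPU, and the mini-batch is assumed to divide evenly among them.

```python
import math

def gradient(weights, batch):
    """Mean gradient of the squared error 0.5 * (w.x - y)**2 over (x, y) pairs."""
    grad = [0.0] * len(weights)
    for x, y in batch:
        error = sum(w * xi for w, xi in zip(weights, x)) - y
        for i, xi in enumerate(x):
            grad[i] += error * xi / len(batch)
    return grad

def data_parallel_gradient(weights, batch, num_devices):
    """Split the mini-batch, compute local gradients, then average them."""
    shard_size = len(batch) // num_devices  # assumes an even split
    shards = [
        batch[d * shard_size:(d + 1) * shard_size] for d in range(num_devices)
    ]
    # Every simulated device holds the same weights and sees only its shard.
    local_grads = [gradient(weights, shard) for shard in shards]
    # Synchronization step: one full gradient per device has to be exchanged.
    return [
        sum(g[i] for g in local_grads) / num_devices for i in range(len(weights))
    ]

weights = [0.5, -1.0]
batch = [
    ([1.0, 2.0], 1.0),
    ([0.0, 1.0], -1.0),
    ([2.0, 0.0], 0.5),
    ([1.0, 1.0], 0.0),
]

full = gradient(weights, batch)
split = data_parallel_gradient(weights, batch, num_devices=2)
print(full, split)
print(all(math.isclose(a, b) for a, b in zip(full, split)))
```

Averaging the per-shard gradients reproduces the full mini-batch gradient, but note that every device contributes a gradient as large as the whole weight vector, which is the synchronization cost for large models.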
Model parallelism means that each part of the model is handled on a different GPU and all parts work on the same data.
This makes it possible to split large neural models by splitting their weight matrices.
The cost of exchanging layer outputs between GPUs in the forward pass and errors in the backward pass is in general much lower than that of exchanging full gradients.
In this way we avoid synchronizing gradients and can handle bigger models than with data parallelism.
Sometimes this is the only option, because the weights no longer fit into the memory of a single GPU.
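
Splitting a weight matrix across GPUs can be sketched as follows. This is a simplified, CPU-only illustration of one possible split (by output units) for the forward pass of a single linear layer; the function names and the weights are hypothetical, and the number of output units is assumed to divide evenly among the devices.

```python
def matvec(weight_rows, x):
    """Outputs of a linear layer: one dot product per weight row (output unit)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight_rows]

def model_parallel_forward(weight_rows, x, num_devices):
    """Each simulated device stores a slice of the rows and sees the same input."""
    rows_per_device = len(weight_rows) // num_devices  # assumes an even split
    outputs = []
    for d in range(num_devices):
        shard = weight_rows[d * rows_per_device:(d + 1) * rows_per_device]
        # Only this device's outputs are exchanged, never its weights.
        outputs.extend(matvec(shard, x))
    return outputs

# Hypothetical layer with four output units and three inputs.
weight_rows = [
    [0.5, -1.0, 0.25],
    [1.0, 1.0, 1.0],
    [-2.0, 0.0, 1.0],
    [0.0, 0.5, -0.5],
]
x = [1.0, 2.0, 3.0]

single_device = matvec(weight_rows, x)
two_devices = model_parallel_forward(weight_rows, x, num_devices=2)

values_exchanged = len(two_devices)              # layer outputs
weights_stored = sum(len(r) for r in weight_rows)  # never leave their device
print(two_devices == single_device, values_exchanged, weights_stored)
```

Each device needs memory for only its share of the weights, and the exchanged data consists of the layer outputs, which here are fewer values than the weights themselves.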

Outlook
The speedup gained from several GPUs might not grow linearly with their number, but in general performance can be increased.
In the future we will also see more and more libraries adapted to make use of these new computing architectures.
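
Why the speedup tends to stay below linear can be illustrated with a deliberately simple toy model. It assumes that the compute time per step divides evenly among the GPUs and that synchronization adds a fixed cost as soon as more than one GPU is used; both the model and the numbers are assumptions for illustration, not measurements.

```python
def step_time(num_gpus, compute_seconds, sync_seconds):
    """Toy model: evenly divided compute plus a fixed synchronization cost."""
    sync = sync_seconds if num_gpus > 1 else 0.0
    return compute_seconds / num_gpus + sync

def speedup(num_gpus, compute_seconds=1.0, sync_seconds=0.1):
    """Speedup of one training step relative to a single GPU."""
    baseline = step_time(1, compute_seconds, sync_seconds)
    return baseline / step_time(num_gpus, compute_seconds, sync_seconds)

for n in (1, 2, 4, 8):
    print(f"{n} GPUs: {speedup(n):.2f}x")
```

In this model more GPUs always help, but the fixed synchronization cost keeps the speedup below the number of GPUs.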

Resources
“A Full Hardware Guide to Deep Learning” (WEB). Published March 2015. Accessed 21 April 2016.
“How to Parallelize Deep Learning on GPUs” (WEB). Published November 2014. Accessed 21 April 2016.
