Meet Musio

Dilated causal convolutions for audio and text generation



In today’s summary we dive into the architectures of WaveNet and its successor ByteNet, two autoregressive generative models for raw audio and for character-level text, respectively.


Both architectures are built from dilated causal convolutional layers, which have recently attracted much attention in image generation tasks as well.
Modeling sequential data with long-term dependencies, such as audio or text, seems to benefit particularly from dilated convolutions, which enlarge the receptive field.


dilation, causal convolution, residual blocks, skip connections, gated activation function


Without further introduction, we start right away with the main components of WaveNet, which will later reappear in the architecture of ByteNet.
The key ingredient is the so-called dilated causal convolution, which has some advantages over standard convolutions.
In principle, WaveNet is a stack of convolutional layers with a constant stride of one and no pooling layers.
As a result, the input and output have the same length, so we can use the network to model sequential data where we want to predict the next token from the previously seen ones.
To make sure that each output depends only on the current and previous inputs and not on future time steps, as it would with the kernel of a standard convolution, we have to introduce a certain type of masking.
For one-dimensional input data, such as raw audio or text, such a masked convolution is called a causal convolution.
The other important innovation is dilation.
Standard convolutional layers need either large filters to cover a sufficient range of input tokens, or a very deep stack of layers to reach a given receptive field size for each output.
Dilation simply means that a fixed number of input values is skipped between the taps of a convolutional filter.
Pooling and striding increase the receptive field in a similar fashion.
However, only dilated convolutions keep the dimensionality of the input data fixed.
This also means that the resolution of the input is preserved, and the network does not have to compress input sequences of varying length into a fixed-size thought vector.
If one now stacks several dilated convolutional layers on top of each other with exponentially increasing dilation factors, one can cover long-range dependencies without the network becoming too deep.
This is far more efficient than working with huge filters in standard convolutional layers.
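
To make the idea concrete, here is a minimal NumPy sketch of a single-channel dilated causal convolution and of the receptive field of a stack of such layers. The function names (causal_dilated_conv1d, receptive_field) and the zero left-padding are assumptions made for illustration; this is not the implementation used in WaveNet or ByteNet.

```python
import numpy as np

def causal_dilated_conv1d(x, w, dilation=1):
    """Single-channel causal convolution with dilation (illustrative sketch).

    Computes y[t] = sum_i w[i] * x[t - (k - 1 - i) * dilation],
    where inputs before the start of the sequence are treated as zero.
    """
    k = len(w)
    pad = (k - 1) * dilation
    # Left-pad with zeros so that y[t] never reads x[t + 1], x[t + 2], ...
    xp = np.concatenate([np.zeros(pad), x])
    y = np.zeros(len(x))
    for i in range(k):
        # Tap i looks (k - 1 - i) * dilation steps into the past.
        y += w[i] * xp[i * dilation : i * dilation + len(x)]
    return y

def receptive_field(kernel_size, dilations):
    """Number of input steps that one output of the whole stack can see."""
    return 1 + (kernel_size - 1) * sum(dilations)
```

With a kernel size of 2 and dilations 1, 2, 4, 8, four layers already cover 16 time steps, whereas four undilated layers with the same kernel size would cover only 5.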

To ease training, the dilated causal convolutional layers are grouped into residual blocks.
Residual connections around each block and skip connections from each block to the output speed up training, as observed in well-known deep learning architectures for vision tasks.
On top of that, the standard ReLU activation function is replaced by a gated activation unit, in which the tanh output of one dilated convolution is multiplied elementwise by the sigmoid output of a second dilated convolution.
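
The gated activation and the residual and skip paths can be sketched as follows. This is a simplified single-channel version: the 1x1 convolutions of the real network are reduced to the scalar weights w_res and w_skip, and all names are hypothetical and chosen for illustration only.

```python
import numpy as np

def causal_conv(x, w, dilation):
    """Single-channel dilated causal convolution with zero left-padding."""
    pad = (len(w) - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])
    return sum(w[i] * xp[i * dilation : i * dilation + len(x)]
               for i in range(len(w)))

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gated_residual_block(x, w_filter, w_gate, dilation, w_res=1.0, w_skip=1.0):
    """One residual block with a gated activation (illustrative sketch).

    z    = tanh(w_filter * x) * sigmoid(w_gate * x)   (* = dilated causal conv)
    out  = x + w_res * z     residual path, fed to the next block
    skip = w_skip * z        skip path, collected for the output layers
    """
    z = np.tanh(causal_conv(x, w_filter, dilation)) \
        * sigmoid(causal_conv(x, w_gate, dilation))
    return x + w_res * z, w_skip * z
```

The sigmoid branch acts as a gate between 0 and 1 that decides how much of the tanh branch is passed on, and the residual path lets a block fall back to the identity when the gated signal is zero.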

Experiments on multi-speaker speech generation data sets show that the network is able to learn a shared internal representation, and when it is conditioned on the actual text, the results get close to natural speech.

The ByteNet architecture is more suitable for language modeling and was first introduced in the field of character-level neural machine translation.
We can think of the previously described WaveNet as the decoder, which is stacked on top of an encoder that is also a stack of dilated convolutional layers.
In contrast to the decoder, the encoder does not have to be masked, so at every time step it computes a representation that also takes future tokens into account.
As already stated, the decoder is basically a WaveNet architecture built from a certain number of residual blocks with dilated causal convolutional layers.
The output sequence, i.e. a character-level sentence, is generated by dynamic unfolding, which means that generation continues step by step until the end-of-sequence symbol is produced.
One method that has recently attracted much attention in the vision community is batch normalization, which eases learning in the layers.
For the ByteNet decoder, batch normalization has to be masked, since future tokens must not be taken into account when normalizing.
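
The stopping rule of dynamic unfolding can be sketched as a plain generation loop. Here next_token is a hypothetical stand-in for one decoder step that returns the next character given the characters generated so far; the end-of-sequence marker "</s>" and the max_len safety limit are assumptions for this sketch as well.

```python
def generate(next_token, eos="</s>", max_len=100):
    """Generate tokens one at a time until the end-of-sequence symbol appears.

    next_token: callable that maps the list of tokens generated so far
                to the next token (stand-in for one decoder step).
    max_len:    safety limit in case the end-of-sequence symbol never comes.
    """
    output = []
    while len(output) < max_len:
        token = next_token(output)
        output.append(token)
        if token == eos:
            break
    return output
```

The length of the output is therefore not fixed in advance but is decided by the decoder itself.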

Compared with certain other encoder-decoder frameworks with attention, the advantages of the ByteNet architecture lie in its run time, which is linear in the sequence length, and in the fact that the convolutional computations can be sped up by parallelization during training.
Furthermore, models that rely on fixed-size representations have to learn to memorize the whole input sequence in a thought vector.
In contrast, ByteNet is resolution preserving, and the decoder is always conditioned on all previous encodings.
In recurrent networks, the number of steps that back-propagation has to traverse grows with the length of the input and target sequences, which makes long-range dependencies hard to learn, whereas in the ByteNet architecture the number of forward and backward computational steps between any input and any output is constant.
In principle, this allows for faster training.

The decoder part of ByteNet was evaluated on the Hutter Prize character prediction data set and achieved a new state-of-the-art result.
The architecture also showed very promising results for machine translation.


In the future, we might see dilated causal convolutional layers dethroning LSTMs when it comes to modeling sequential data.
It is also interesting to think about merging these two kinds of architectures.

