{"id":1735,"date":"2016-11-04T19:13:28","date_gmt":"2016-11-04T10:13:28","guid":{"rendered":"http:\/\/blog.themusio.com\/?p=1735"},"modified":"2024-05-01T10:55:34","modified_gmt":"2024-05-01T01:55:34","slug":"dilated-causal-convolutions-for-audio-and-text-generation","status":"publish","type":"post","link":"https:\/\/blog.themusio.com\/?p=1735","title":{"rendered":"Dilated causal convolutions for audio and text generation"},"content":{"rendered":"<div id=\"table-of-contents\">\n<h2>Table of Contents<\/h2>\n<div id=\"text-table-of-contents\">\n<ul>\n<li><a href=\"#org32d611a\">1. Dilated causal convolutions for audio and text generation&#xa0;&#xa0;&#xa0;<span class=\"tag\"><span class=\"causal\">causal<\/span>&#xa0;<span class=\"dilation\">dilation<\/span>&#xa0;<span class=\"convolution\">convolution<\/span><\/span><\/a>\n<ul>\n<li><a href=\"#org8a7e6cc\">1.1. goal<\/a><\/li>\n<li><a href=\"#orged0a5e7\">1.2. motivation<\/a><\/li>\n<li><a href=\"#org4044830\">1.3. ingredients<\/a><\/li>\n<li><a href=\"#org060d211\">1.4. steps<\/a><\/li>\n<li><a href=\"#orgc8e680a\">1.5. outlook<\/a><\/li>\n<li><a href=\"#org8eb2ab8\">1.6. 
resources<\/a><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/div>\n<\/div>\n<h1>Dilated causal convolutions for audio and text generation     <a id=\"org32d611a\"><\/a><\/h1>\n<h2>goal<a id=\"org8a7e6cc\"><\/a><\/h2>\n<p>In today&#8217;s summary we dive into the architecture of WaveNet and its successor ByteNet, which are autoregressive generative models for generating audio and character-level text, respectively.<\/p>\n<h2>motivation<a id=\"orged0a5e7\"><\/a><\/h2>\n<p>The architectures behind both models are based on dilated causal convolutional layers, which recently also received much attention in image generation tasks.<br \/>\nEspecially the modeling of sequential data with long-term dependencies, like audio or text, seems to benefit from convolutions with dilations that increase the receptive field.<\/p>\n<h2>ingredients<a id=\"org4044830\"><\/a><\/h2>\n<p>dilation, causal convolution, residual blocks, skip connections, gated activation function<\/p>\n<h2>steps<a id=\"org060d211\"><\/a><\/h2>\n<p>Without further introduction we start right away with the main components behind WaveNet, which will later also appear in the architecture of ByteNet.<br \/>\nThe key ingredient is the so-called dilated causal convolution, which has some advantages over a standard convolution.<br \/>\nIn principle, WaveNet is a stack of convolutional layers with a constant stride of one and without pooling layers.<br \/>\nThis keeps the input and output dimensionality the same, and hence we can use it to model sequential data, where we are interested in predicting the next token based on the previously seen ones.<br \/>\nIn order to base the convolutional computations only on the current and previous inputs, and not on future time steps as the kernel of a standard convolution would, we have to introduce a certain type of masking.<br \/>\nIn the case of one-dimensional input data, like raw audio or text, we speak of causal convolutions.<br \/>\nThe other important innovation is dilation.<br \/>\nStandard 
convolutional layers need either large filters to capture a sufficient range of input tokens, or the network has to become extremely deep to reach a given receptive field size at the output.<br \/>\nDilation here simply refers to the fact that a certain number of input values is skipped when the filter of a convolutional layer is applied.<br \/>\nPooling and striding increase the receptive field in a similar fashion.<br \/>\nHowever, only dilated convolutions keep the dimensionality of the input data fixed.<br \/>\nThis also means that the resolution of the input is preserved and the network does not have to compress input sequences of varying length into a constant thought vector.<br \/>\nIf one now stacks several dilated convolutional layers on top of each other with exponentially increasing dilation factors, one is able to cover long-range dependencies without the network becoming too deep.<br \/>\nThis is far more efficient than working with huge filters in standard convolutional layers.<\/p>\n<p>To ease training, the dilated causal convolutional layers are grouped into residual blocks.<br \/>\nResidual connections within each layer and skip connections between convolutional layers increase the training speed, as observed in well-known deep learning architectures for vision tasks.<br \/>\nOn top of that, the standard ReLU activation function is replaced by a gated activation function which takes as input the output of another dilated convolution.<\/p>\n<p>Experiments with multi-speaker speech generation data sets show that the network is able to learn a shared internal representation, and when conditioned on actual text the results come close to natural speech.<\/p>\n<p>The ByteNet architecture is more suitable for language modeling and was first introduced in the field of character-level neural machine translation.<br \/>\nWe can consider the previously described WaveNet as the decoding part, which is stacked on top of an encoder which is also a 
stack of dilated convolutional layers.<br \/>\nIn contrast to the decoder, the encoder does not have to be masked here and computes at every time step a representation which also takes future tokens into account.<br \/>\nAs already stated, the decoder is basically a WaveNet architecture built from a certain number of residual blocks with dilated causal convolutional layers.<br \/>\nThe output sequence, i.e. a sentence on character level, is generated by dynamic unfolding, which means that generation continues until the end-of-sequence symbol is produced.<br \/>\nOne method which recently got much attention in the vision community for easing the learning in deep layers is batch normalization.<br \/>\nFor the ByteNet architecture, batch normalization has to be altered to a masked variant, since future tokens should not be taken into account when normalizing.<\/p>\n<p>The advantages of the ByteNet architecture over certain other encoder-decoder frameworks with attention lie in its linear computational running time and in the fact that the convolutional computations can be sped up by parallelization during training.<br \/>\nFurthermore, architectures with representations of fixed size have to learn to memorize the input sequence in a thought vector.<br \/>\nIn contrast, the ByteNet is resolution preserving and the decoder is always conditioned on all previous encodings.<br \/>\nDepending on the length of the input and target sequences, backpropagation in recurrent networks can become infeasible, but in the ByteNet architecture the number of forward and backward computational steps between any input and output is constant.<br \/>\nThis in principle allows for faster training.<\/p>\n<p>The decoder part of ByteNet was evaluated on the Hutter Prize character prediction data set and achieved a new state-of-the-art result.<br \/>\nFor machine translation the architecture also showed very promising results.<\/p>\n<h2>outlook<a id=\"orgc8e680a\"><\/a><\/h2>\n<p>In the future we might see dilated 
causal convolutional layers dethroning LSTMs when it comes to modeling sequential data.<br \/>\nIt is also interesting to think about merging these two kinds of architectures.<\/p>\n<h2>resources<a id=\"org8eb2ab8\"><\/a><\/h2>\n<p><a href=\"https:\/\/arxiv.org\/abs\/1610.10099\">https:\/\/arxiv.org\/abs\/1610.10099<\/a><br \/>\n<a href=\"https:\/\/arxiv.org\/abs\/1609.03499\">https:\/\/arxiv.org\/abs\/1609.03499<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Table of Contents 1. Dilated causal convolutions for audio and text generation&#xa0;&#xa0;&#xa0;causal&#xa0;di [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_exactmetrics_skip_tracking":false,"_exactmetrics_sitenote_active":false,"_exactmetrics_sitenote_note":"","_exactmetrics_sitenote_category":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[3642,3640],"tags":[3656],"class_list":["post-1735","post","type-post","status-publish","format-standard","hentry","category-ai-en","category-all-en","tag-baggage-en"],"aioseo_notices":[],"jetpack_sharing_enabled":true,"jetpack_featured_media_url":"","_links":{"self":[{"href":"https:\/\/blog.themusio.com\/index.php?rest_route=\/wp\/v2\/posts\/1735","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.themusio.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.themusio.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.themusio.com\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.themusio.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1735"}],"version-history":[{"count":3,"href":"https:\/\/blog.themusio.com\/index.php?rest_route=\/wp\/v2\/posts\/1735\/revisions"}],"predecessor-version":[{"id":10868,"href":"https:\/\/blog.themusio.com\/index.php?rest_route=\/wp\
/v2\/posts\/1735\/revisions\/10868"}],"wp:attachment":[{"href":"https:\/\/blog.themusio.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1735"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.themusio.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1735"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.themusio.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1735"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}