Uploading large files to Azure Blob Storage through REST API

I did write about uploading files to Azure Storage via the REST API before. It turns out that implementation was rather naive: as soon as I tried uploading anything larger than 64MB, I hit a brick wall of exceptions.

Azure Blob Storage has two types of blobs: Page Blobs and Block Blobs. Page Blobs are optimised for random read-write operations; Block Blobs are optimised for plain storage and streaming of files. Here is more about Page vs Block blobs. If you are just storing files in Azure, you'll most likely use Block Blobs.

It turns out that Block Blob storage has some limitations on the upload front. In one go you can upload at most 64MB: 1024 * 1024 * 64 bytes. Add one extra byte and the API returns an error (I tested it). So instead of uploading a large file in a single request, you need to cut it into blocks and upload the pieces separately, each no larger than 4MB. Once all the pieces (Blocks) are uploaded, you commit them and specify the order in which they should appear.
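
At the REST level this maps onto two operations: Put Block for each piece and Put Block List for the final commit. Just to illustrate the shape of the requests (the snippet below is illustrative, not lifted from my implementation), the two calls look like this:

    // Each piece goes up as:  PUT {blobSasUri}&comp=block&blockid={id}
    // where {id} is a Base64-encoded string and every id has the same length
    var blockId = Convert.ToBase64String(Encoding.UTF8.GetBytes("0000000001"));
    var putBlockUri = blobSasUri + "&comp=block&blockid=" + Uri.EscapeDataString(blockId);

    // The commit goes to:     PUT {blobSasUri}&comp=blocklist
    // with an XML body listing the ids in the order the pieces should be glued together:
    //   <?xml version="1.0" encoding="utf-8"?>
    //   <BlockList>
    //     <Latest>MDAwMDAwMDAwMA==</Latest>
    //     <Latest>MDAwMDAwMDAwMQ==</Latest>
    //     ...
    //   </BlockList>
    var putBlockListUri = blobSasUri + "&comp=blocklist";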

My code from the previous blog-post is easily broken: just change the uploaded file size in the tests from a few bytes to 65MB. And if you know you need to upload large files, you are in trouble, because chopping a file into pieces, uploading the bits and then gluing them all back together is a whole new level of complexity. And if you want to do it fast, you'll need parallel programming to upload each chunk in a separate thread, which gives you even more things to think about.

Luckily the WindowsAzure.Storage library handles all of that for you: it uploads files in 4MB blocks and then joins them, and you don't have to do much yourself. Given the level of complexity, I'd recommend just using that library instead of a home-grown solution.
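
For comparison, here is roughly what the library route looks like when all you have is a SAS URI for the blob. A minimal sketch (the method name is mine; check UploadFromStream against the library version you are using):

    using System;
    using System.IO;
    using Microsoft.WindowsAzure.Storage.Blob;

    public void UploadFileWithLibrary(string fullFilePath, string blobSasUri)
    {
        // CloudBlockBlob can be constructed straight from a SAS URI
        var blob = new CloudBlockBlob(new Uri(blobSasUri));

        // the library chops the stream into blocks, uploads them
        // and commits the block list for you
        using (var fileStream = File.OpenRead(fullFilePath))
        {
            blob.UploadFromStream(fileStream);
        }
    }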

But just for kicks I made a successful attempt at implementing this myself. And because I had tests, I could verify that the implementation was correct; that was half of the job already done for me -)

Please note, this code is not ready for production and probably has issues that I can’t think of just now. Use at your own risk.

    public void UploadFile(string fullFilePath, string blobSasUri, Dictionary<string, string> metadata = null, int timeout = 60)
    {
        var blocks = new List<String>();
        var tasks = new List<Task>();

        // Cancel Signal Source
        var cancelSignal = new CancellationTokenSource();
        using (var fileStream = new FileStream(fullFilePath, FileMode.Open, FileAccess.Read))
        {
            var buffer = new byte[BlockSize];

            var bytesRead = 0;
            var blockNumber = 0;
            while ((bytesRead = fileStream.Read(buffer, 0, BlockSize)) > 0)
            {
                var actualBytesRead = new byte[bytesRead];
                // Copy into a fresh array: the shared buffer is overwritten on the
                // next read, and the upload task captures this copy instead
                Buffer.BlockCopy(buffer, 0, actualBytesRead, 0, bytesRead);
                var blockId = blockNumber++.ToString().PadLeft(10, '0');
                blocks.Add(blockId);

                // here could've used Task.Factory.StartNew(), but this reads better
                var task = new Task(() => UploadBlock(blobSasUri, blockId, actualBytesRead, timeout), cancelSignal.Token);
                task.Start();
                tasks.Add(task);
            }
        }
        try
        {
            // chaining the tasks together. CommitAllBlocks() will execute when all upload tasks finish.
            var continuation = Task.Factory.ContinueWhenAll(tasks.ToArray(),
                (t) => CommitAllBlocks(blobSasUri, blocks, metadata, timeout));

            // block the thread until commit task finishes
            continuation.Wait();
        }
        catch (AggregateException exception)
        {
            // when one of the tasks fails, we don't want the others to keep uploading.
            // Cancelling the token stops tasks that haven't started yet; already-running
            // uploads would need to observe the token themselves to stop early
            cancelSignal.Cancel();

            // AggregateException.InnerException contains the first exception thrown by the tasks.
            throw exception.InnerException;
        }
    }

This is not the full code; I'll give you a link to the full thing in a gist below. Just note how complex the solution became: from 15 lines of code in my naive implementation this sky-rocketed to juggling Tasks, cancellation tokens, byte arrays and AggregateExceptions. And here I'm presuming that all the blocks upload successfully. You could also implement a re-connect feature to resume uploading from the point where the connection dropped… Did I mention you should just use the Microsoft-provided library?
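
As for the two helpers the code above relies on, UploadBlock() is essentially a single Put Block request per piece. A simplified sketch of what it boils down to (not the exact code from the gist, and assuming the timeout parameter is in seconds):

    private void UploadBlock(string blobSasUri, string blockId, byte[] blockData, int timeout)
    {
        // block ids go on the query string, Base64-encoded
        var base64Id = Convert.ToBase64String(Encoding.UTF8.GetBytes(blockId));
        var uri = blobSasUri + "&comp=block&blockid=" + Uri.EscapeDataString(base64Id);

        var request = (HttpWebRequest)WebRequest.Create(uri);
        request.Method = "PUT";
        request.Timeout = timeout * 1000;      // HttpWebRequest.Timeout is in milliseconds
        request.ContentLength = blockData.Length;

        using (var requestStream = request.GetRequestStream())
        {
            requestStream.Write(blockData, 0, blockData.Length);
        }

        // a successful Put Block returns 201 Created; failures surface as a WebException
        using (request.GetResponse())
        {
        }
    }

CommitAllBlocks() follows the same pattern: a PUT to {blobSasUri}&comp=blocklist with the block-list XML shown earlier as the body, plus an x-ms-meta-{name} header for each metadata entry.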

Here is the Gist with the full solution