MapReduce with Windows Azure

MapReduce is a pattern for transforming large quantities of data across a cluster of machines.

A common example is to take an input file, such as a text document. The file is split up and distributed across a number of nodes. Each node runs a mapping process on its portion of the file. In this case the mapping process identifies every word in that portion. The output is an intermediate file, usually as large as or larger than the input. The reduce process then takes this intermediate data and transforms it into the desired output format. In this case, that is a count of every word.
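To make the word count example concrete, here is a minimal single-machine sketch in Python. It only illustrates the pattern: the function names (split_input, map_words, reduce_counts, word_count) are made up for this post, and the chunks are processed in a plain loop rather than on separate nodes.

```python
import re
from collections import defaultdict


def split_input(text, num_chunks):
    """Split the text into chunks of lines, roughly one chunk per node."""
    lines = text.splitlines()
    size = max(1, -(-len(lines) // num_chunks))  # ceiling division
    return ["\n".join(lines[i:i + size]) for i in range(0, len(lines), size)]


def map_words(chunk):
    """Map step: emit a (word, 1) pair for every word found in the chunk."""
    return [(word.lower(), 1) for word in re.findall(r"[A-Za-z']+", chunk)]


def reduce_counts(pairs):
    """Reduce step: sum the emitted values for each distinct word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


def word_count(text, num_chunks=3):
    """Run split, map and reduce in sequence (a real cluster maps in parallel)."""
    intermediate = []  # stands in for the intermediate file
    for chunk in split_input(text, num_chunks):
        intermediate.extend(map_words(chunk))
    return reduce_counts(intermediate)
```

Note that the intermediate list holds one pair per word occurrence, which is why the intermediate data tends to be at least as large as the input.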

Whilst this example is quite contrived, a real use case is not too dissimilar. Typical scenarios include parsing web log files to understand website usage patterns and referral information, and other analysis of unstructured data.

Project Daytona is an Azure-centric implementation of the MapReduce framework. When you compile the project you get two roles: a master (of which you run one instance) and a slave (of which you can run multiple instances). You then derive a few classes to provide your implementation of the algorithm, i.e. the logic to perform the mapping and reducing, and also the retrieval and splitting of the data.
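The shape of that arrangement, where the framework owns the plumbing and you supply the split, map and reduce logic, can be sketched roughly as below. To be clear, this is not Daytona's API: Daytona is a .NET project, and every name here (MapReduceJob, run_job, ReferrerCount) is hypothetical. The log format of "url referrer" per line is also an assumption made for the example.

```python
from abc import ABC, abstractmethod


class MapReduceJob(ABC):
    """Hypothetical contract a job author implements; not Daytona's real interface."""

    @abstractmethod
    def split(self, data, parts):
        """Divide the input into chunks, one per worker."""

    @abstractmethod
    def map(self, chunk):
        """Return a list of (key, value) pairs for one chunk."""

    @abstractmethod
    def reduce(self, key, values):
        """Combine all values emitted for one key."""


def run_job(job, data, workers=2):
    """Stand-in for the master role: hand out chunks, group by key, reduce."""
    grouped = {}
    for chunk in job.split(data, workers):
        for key, value in job.map(chunk):
            grouped.setdefault(key, []).append(value)
    return {key: job.reduce(key, values) for key, values in grouped.items()}


class ReferrerCount(MapReduceJob):
    """Example job: count hits per referrer, assuming 'url referrer' log lines."""

    def split(self, data, parts):
        return [data[i::parts] for i in range(parts)]

    def map(self, chunk):
        # Skip malformed lines that have no referrer field.
        return [(line.split()[1], 1) for line in chunk if len(line.split()) > 1]

    def reduce(self, key, values):
        return sum(values)
```

Calling run_job(ReferrerCount(), log_lines) with a list of log lines returns a dictionary of referrer to hit count. The point of the design is that run_job never needs to change when a new kind of job is written.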

What is interesting is how separate your implementation is from the Azure infrastructure. When you submit your MapReduce ‘job’ to the master node in your Azure deployment, Daytona uploads your assemblies behind the scenes. This means that you can submit new types of work without having to recompile and re-deploy your cloud project. Smart stuff.

I wouldn’t say it was straightforward to get Daytona up and running, but it’s certainly worth a look if you have large amounts of data in blobs or tables and you want to do some analysis.