MapReduce Tutorial. This article gives an overview of MapReduce and lists several resources which describes MapReduce.

1. Parallel Programming and MapReduce

1.1. Parallel Programming

Parallel programming describes a means to divide a problem into several smaller subproblems and solve these in parallel. Distributed programming emphasis also the usage of different resources, e.g. machines.

The requirement for a problem to be solved by parallel programming is that a part of the program can be identified which can be processed concurrently. Not all problems have this attribute; e.g. the calculation of Fibonacci Numbers can not be separated as the next Fibonacci number always depends on the previously calculated number.

1.2. MapReduce

MapReduce is a parallel and distributed solution approach developed by Google for processing large datasets. MapReduce is utilized by Google and Yahoo to power their websearch. MapReduce was first describes in a research paper from Google.

MapReduce has two key components. Map and Reduce. A map is a function which is used on a set of input values and calculates a set of key/value pairs. Reduce is a function which takes these results and applies another function to the result of the map function. Or with other words: Map transforms a set of data into key value pairs and Reduce aggregates this data into a scalar. A reducer receives all the data for a individual "key" from all the mappers.

The approach assumes that there are no dependencies between the input data. This make it easy to parallelize the problem. The number of parallel reduce task is limited by the number of distinct "key" values which are emitted by the map function.

MapReduce incorporates usually also a framework which supports MapReduce operations. A master controls the whole MapReduce process. The MapReduce framework is responsible for load balancing, re-issuing task if a worker as failed or is to slow, etc. The master divides the input data into separate units, send individual chunks of data to the mapper machines and collects the information once a mapper is finished. If the mapper are finished then the reducer machines will be assigned work. All key/value pairs with the same key will be sent to the same reducer.

1.3. Usage of MapReduce

The classical example for using MapReduce is logfile analysis. Big logfiles are split and a mapper search for different webpages which are accessed. Every time a webpage is found in the log a key / value pair is emitted to the reducer where the key is the webpage and the value is "1". The reducers aggregate the number of for certain webpages. As a end result you have the total number of hits for each webpage.

Another example if full text indexing. The mapper would map every phrase / word in one document to the document and the reducer would write these mappings to an index.

Other applications are:

  • Distributed Grep

  • Reverse Web-Link Graph: Map function outputs (URL target, source) from an input webpage (source). The reduce function concatenates the list of all source URLs associated with a give target URL and returns (target, list(sources))

  • Word count in a number of documents

MapReduce can also be applied to lots of other problems. For example Google uses MapReduce to calculate their Pagerank.

2. Implementation in Java

The best known Java implementation of MapReduce is Apache Hadoop.

3. Information

MapReduce is described in the following introduction http://code.google.com/edu/parallel/mapreduce-tutorial.html.

4. Links and Literature