12 Dec 2016

MapReduce: Simplified Data Processing on Large Clusters

The paper can be found here.

Gist in short

The paper is written in an elegant and easy-to-understand way and is divided into 6 parts, covering topics from the implementation to its usage at Google. It exposes the reader to the immense power of functional languages and explains how the MapReduce programming model is inspired by them.

MapReduce is a programming model that processes a set of input key-value pairs into a set of results. The user specifies a map function that processes the input key-value pairs to generate a new (intermediate) set of key-value pairs. These are then given to a user-specified reduce function that merges all intermediate values associated with the same intermediate key. Programs written in this functional style can be automatically parallelized without the user needing to understand the details of parallelization. The MapReduce runtime takes care of partitioning the input data, scheduling, and execution across the cluster.

Need for MapReduce?

A lot of tasks nowadays require processing large chunks of raw data with various techniques. Though most of these computations are simple and straightforward, there is no general method for parallelizing them across a cluster when working with large amounts of data, and developers end up wasting a lot of time trying to come up with a correct, working scheme for running their code on several machines. MapReduce targets this problem: it provides a general-purpose framework in which a simple program involving straightforward functional computations can be specified in the "map" and "reduce" blocks of the framework, and the computation is then parallelized automatically. The paper focuses on describing the map and reduce functions provided by the programming model and explains how they can be put to use for a general problem. [Note: not all problems fit this paradigm, and it may often require creativity to break down your program and massage it into the framework of the MapReduce programming model.]

Programming model

Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function.

[Input key-values] --> MAP FUNCTION --> [Intermediate key-values]

The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges together these values to form a possibly smaller set of values. Typically just zero or one output value is produced per Reduce invocation.

[Intermediate key-values] --> REDUCE FUNCTION --> [Final key-values]

Example

The following example comes directly from the paper; I recommend skimming the paper for the many more examples it explains, to get a good grasp of how this model may help you when the need arises.

Consider the problem of counting the number of occurrences of each word in a large collection of documents. The user would write code similar to the following pseudo-code:

map(String key, String value):
	// key: document name
	// value: document contents
	for each word w in value:
		EmitIntermediate(w, "1");

reduce(String key, Iterator values): 
	// key: a word
	// values: a list of counts
	int result = 0;
	for each v in values:
		result += ParseInt(v);
	Emit(AsString(result));

The map function emits each word plus an associated count of occurrences (just ‘1’ in this simple example). The reduce function sums together all counts emitted for a particular word.
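To see the whole pipeline end to end, here is a minimal runnable sketch in plain Python (my own illustration, not code from the paper) that simulates the map, shuffle, and reduce phases in a single process:

from collections import defaultdict

def map_fn(key, value):
    # key: document name, value: document contents
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # key: a word, values: a list of counts
    yield (key, sum(values))

documents = {"doc1": "the quick brown fox", "doc2": "the lazy dog"}

# Map phase: apply map_fn to every input key-value pair.
intermediate = defaultdict(list)
for name, contents in documents.items():
    for word, count in map_fn(name, contents):
        intermediate[word].append(count)  # shuffle: group by intermediate key

# Reduce phase: merge all values associated with the same key.
results = dict(kv for key, values in intermediate.items()
                  for kv in reduce_fn(key, values))
print(results)  # {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}

In a real MapReduce run the map and reduce calls execute on different machines and the grouping step happens via disk and network, but the data flow is exactly this.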

Though the previous pseudo-code is written in terms of string inputs and outputs, conceptually the map and reduce functions supplied by the user have associated types:

map (k1,v1) -> list(k2,v2)
reduce (k2,list(v2)) -> list(v2)

i.e., the input keys and values are drawn from a different domain than the output keys and values. Furthermore, the intermediate keys and values are from the same domain as the output keys and values.
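These type constraints can be written down directly, for instance with Python's typing module (a small sketch of my own, assuming Python 3.9+):

from typing import Callable, TypeVar

K1, V1, K2, V2 = TypeVar("K1"), TypeVar("V1"), TypeVar("K2"), TypeVar("V2")

# map (k1,v1) -> list(k2,v2): input domain in, intermediate domain out
MapFn = Callable[[K1, V1], list[tuple[K2, V2]]]

# reduce (k2,list(v2)) -> list(v2): stays within the intermediate/output domain
ReduceFn = Callable[[K2, list[V2]], list[V2]]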

Execution Overview

The Map function calls are distributed across multiple machines by automatically partitioning the input data into a set of M splits. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R).

When the program calls the MapReduce function, the following actions take place (a partitioning sketch follows the list):
- The library splits the input files into M pieces and then starts many copies of the user program on many machines.
- One of the program copies is special: it is called the master, and it controls and assigns work to the rest of the workers [this also makes it a single point of failure, where you must restart your job if it fails]. The master picks idle workers and assigns each one of the M map or R reduce tasks.
- A map worker reads the input corresponding to its task and computes on the input key-value pairs to produce intermediate key-value pairs. These intermediate key-values are buffered in the worker's memory [and are periodically written to local disk, partitioned into R regions by the partitioning function. The locations of these regions are passed back to the master, who forwards them to the reduce workers].
- A reduce worker, when notified by the master about the above locations, uses remote procedure calls to read the data from the map workers, sorts it by intermediate key to group all pairs with the same intermediate key, and then processes these groups to produce a set of output key-value pairs.
- Once all map and reduce tasks complete, the master wakes up the user program, which returns from the MapReduce call back to user code.
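As a rough illustration of how intermediate keys get routed to reduce tasks, the paper's default partitioning function is just a hash modulo R. A minimal Python sketch (my own, not Google's implementation):

import zlib

R = 4  # number of reduce tasks / intermediate partitions

def partition(key: str, r: int = R) -> int:
    # The paper's default partitioning: hash(key) mod R. A real system
    # needs a hash that is stable across machines (Python's built-in
    # hash() is randomized per process), so crc32 is used here.
    return zlib.crc32(key.encode()) % r

# Each map worker buckets its intermediate pairs into R regions on
# local disk; reduce task i later fetches region i from every mapper.
regions = [[] for _ in range(R)]
for key, value in [("apple", 1), ("banana", 1), ("apple", 1)]:
    regions[partition(key)].append((key, value))

Because every pair with the same key lands in the same region, a single reduce task is guaranteed to see all values for that key.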

Fault Tolerance

Most MapReduce implementations provide robust protection against a worker failing; a sketch of the rescheduling rule follows this list.
- When a map worker fails, the library reschedules, on other workers, all map tasks that were in progress on it, as well as map tasks it had already completed whose output (stored on that machine's local disk) had not yet been read by all the reduce workers.
- When a reduce worker fails, the library similarly reschedules the reduce tasks that were in progress on it; completed reduce tasks need not be re-executed, since their output is stored in the global file system.
- However, most libraries do not provide any protection against the master failing (as of the time of writing), and the program must be restarted.
- Support for skipping bad records is also provided in most implementations and can be enabled via a configuration setting, depending on the implementation.
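The rescheduling rule is compact: completed map output lives on the failed machine's local disk and is lost, while completed reduce output survives in the global file system. Here is a hedged sketch of the master's bookkeeping, using hypothetical names of my own:

IDLE, IN_PROGRESS, COMPLETED = "idle", "in-progress", "completed"

def handle_worker_failure(tasks, failed_worker):
    """On a worker failure, any map task (even a completed one) and any
    in-progress reduce task assigned to that worker goes back to idle;
    completed reduce tasks stay completed, since their output is already
    in the global file system."""
    for task in tasks:
        if task["worker"] != failed_worker:
            continue
        if task["kind"] == "map" or task["state"] == IN_PROGRESS:
            task["state"], task["worker"] = IDLE, None

tasks = [
    {"kind": "map", "state": COMPLETED, "worker": "A"},
    {"kind": "reduce", "state": COMPLETED, "worker": "A"},
    {"kind": "reduce", "state": IN_PROGRESS, "worker": "A"},
]
handle_worker_failure(tasks, "A")
# The map task and the in-progress reduce task go back to idle;
# the completed reduce task stays completed.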

Other issues to consider

  • Network bandwidth is a relatively scarce resource in this computing environment, so conserving it matters; a file system that exploits locality (scheduling map tasks on or near machines that already hold a replica of the input) saves a great deal of bandwidth.
  • For large jobs, the choice of M and R becomes important. There is a tradeoff in choosing their values, discussed below.

Ideally we would like M and R to be large, since having each worker perform many different tasks improves dynamic load balancing and speeds up recovery when a worker fails.

But the master must make O(M+R) scheduling decisions and keep O(M*R) pieces of state in memory, so M and R cannot grow too large.
R is also constrained by users, because the output of each reduce task ends up in a separate file; in practice R is kept a small multiple of the number of worker machines (the paper reports typical runs with M = 200,000 and R = 5,000 on 2,000 workers).

Conclusion

The remaining part of the paper discusses performance on a few jobs executed at Google on Google-specific infrastructure, where this programming model seems to have done really well. It is widely used today and looks like a promising paradigm that will only gain importance in the near future. This technique of automatic parallelization, for computations that can be written in a functional style, is a really handy way to get a simple program over a huge data set to run across a cluster with a robust fault-tolerance mechanism. I hope this gist gave you a quick introduction to and summary of the paper in under 7 minutes. Please feel free to contact me and suggest improvements.

