Map Reduce

From Biowiki
Jump to: navigation, search

Map Reduce

Notes on Google's MapReduce and closely similar technologies.


Google's mapreduce algorithm is an implementation of parallel functional programming for associative & commutative operators (functions). This allows things like ultra-fast Expectation Maximization (each E-step of the algorithm can be distributed, the results summed and fed to the M-step) as well as the usual parallelization of database searches ("max" and "sort" are associative & commutative).

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.

Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day... [2004]

See also the Google File System, which is probably important for the successful application of this in many cases.

See also Sawzall, a language built on top of MapReduce, and BigTable, a distributed data storage mechanism.


From the Hadoop! FAQ:

Hadoop is a distributed computing platform written in Java. It incorporates features similar to those of the Google File System and of Map Reduce. For some details, see HadoopMapReduce.

Sequoia @Stanford

Sequoia is a programming language that is designed to facilitate the development of memory hierarchy aware parallel programs that remain portable...

Sequoia is syntactically an extension to C, though the language constructs that Sequoia introduces result in a programming model that is quite different from C...

Sequoia's third iteration construct, mapreduce, is inspired by the reduce operator of functional languages, in which a collection of data is "collapsed" or "reduced" to a single data item. An example of a reduction is summing the numbers of an array: the input is a sequence of numbers, and the output is a single number, into which every number in the array was summed.

A reduction has two components: a collection of data items, and a combiner function which take two inputs, a data element from that collection and a value representing a partial aggregation of data elements, and produces as a single output an updated partial aggregation. In the example of summing the numbers of an array, the combiner function takes a single array element and a running tally as inputs, adds the array element to the running tally and returns it as an updated running tally. Once all the array elements have been processed, the running tally value will be the full sum of the array's elements.

An important property of reductions is that they can be done in parallel, since the order that elements are aggregated isn't important. This assumes that the combiner function is both associative and commutative...