The MapReduce framework assumes as input a large, unordered stream of input values of an arbitrary type. For instance, each input may be a line of text in some vast corpus. Computation proceeds in three steps.
- A map function is applied to each input, which outputs zero or more intermediate key-value pairs of an arbitrary type.
- All intermediate key-value pairs are grouped by key, so that pairs with the same key can be reduced together.
- A reduce function combines the values for a given key k; it outputs zero or more values, which are each associated with k in the final output.
To perform this computation, the MapReduce framework creates tasks (perhaps on different machines) that perform various roles in the computation. A map task applies the map function to some subset of the input data and outputs intermediate key-value pairs. A reduce task sorts and groups key-value pairs by key, then applies the reduce function to the values for each key. All communication between map and reduce tasks is handled by the framework, as is the task of grouping intermediate key-value pairs by key.
In order to utilize multiple machines in a MapReduce application, multiple mappers run in parallel in a map phase, and multiple reducers run in parallel in a reduce phase. In between these phases, the sort phase groups together key-value pairs by sorting them, so that all key-value pairs with the same key are adjacent.
Consider the problem of counting the vowels in a corpus of text. We can solve this problem using the MapReduce framework with an appropriate choice of map and reduce functions. The map function takes as input a line of text and outputs key-value pairs in which the key is a vowel and the value is a count. Zero counts are omitted from the output:
def count_vowels(line): """A map function that counts the vowels in a line.""" for vowel in 'aeiou': count = line.count(vowel) if count > 0: emit(vowel, count)
The reduce function is the built-in sum functions in Python, which takes as input an iterator over values (all values for a given key) and returns their sum.