Shrijith Venkatramana

MapReduce Basics (Part 2)

Hi there! I'm Shrijith Venkatramana, the founder of Hexmos. Right now, I’m building LiveAPI, a super-convenient tool that simplifies engineering workflows by generating awesome API docs from your code in minutes.

In my previous post, I introduced the MapReduce programming model at a high level. Head over to part 1 for that context.

In this post, we will dive deeper into the MapReduce programming model.

Mappers & Reducers

The basic data structure in MapReduce is key-value pairs.

Keys can be primitive data types (ints, strings, etc.) or more complex data types (lists, tuples, etc.).

For example, when representing web pages, the key could be the URL (string), and the value could be the page contents (string).

The programmer must define two functions with the following signatures:

map: (k1, v1) -> [(k2, v2)]
reduce: (k2, [v2]) -> [(k3, v3)]

Here, the convention [...] denotes a list.

The mapper is applied to an arbitrary number of input key/value pairs and produces an arbitrary number of intermediate key/value pairs.

The reducer then processes all the intermediate values associated with each intermediate key to generate the output key/value pairs.

Between the map and reduce phases, a "group by" operation is performed on the intermediate keys.

Each reducer receives its assigned intermediate keys in sorted order, along with all the values associated with each key.

The output consists of r files in the distributed file system, where r is the number of reducers.
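The whole flow above can be simulated in plain Python on a single machine (a sketch of the model, not a distributed runtime; the function names are my own):

```python
from itertools import groupby

def run_mapreduce(inputs, mapper, reducer):
    # Map phase: apply the mapper to every input key/value pair,
    # collecting all intermediate (k2, v2) pairs.
    intermediate = []
    for k1, v1 in inputs:
        intermediate.extend(mapper(k1, v1))

    # "Group by" phase: sort by intermediate key, then group the
    # values belonging to each key together.
    intermediate.sort(key=lambda kv: kv[0])
    output = []
    for k2, group in groupby(intermediate, key=lambda kv: kv[0]):
        values = [v for _, v in group]
        # Reduce phase: each key is processed with all of its values.
        output.extend(reducer(k2, values))
    return output
```

In a real framework the map and reduce calls run in parallel across machines, and the sort/group step happens during the shuffle; the data flow, however, is exactly this.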

Word Counter Example

The goal of the word counter is to take in a collection of documents, process each word, and compute the total count of each word in the entire collection.

Word counts are useful for building search indices, among other applications.

Here’s the mapper and reducer for the word counter:

(Figure: word-count mapper and reducer)
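In Python, the word-count mapper and reducer can be sketched as follows (assuming simple whitespace tokenization):

```python
def mapper(doc_id, text):
    # Emit (word, 1) for every word occurrence in the document.
    for word in text.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Sum all the partial counts collected for this word.
    yield (word, sum(counts))
```

The mapper ignores the document ID and emits a `1` per occurrence; the group-by step gathers all the `1`s for each word, and the reducer collapses them into a single total.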

A Bit More Nuance: Partitioners and Combiners

Partitioner

After the mapping phase, we are left with a collection of key-value pairs.

These key-value pairs must be assigned to reducers in some manner.

Partitioning determines the target reducer for each key-value pair.

By default, Hadoop uses the following formula:

partition = (key.hashCode() mod numReducers)

Note that in default partitioning, only the key is considered.

This usually results in a fairly balanced distribution. However, if the values vary significantly in size, it can lead to imbalances.

Adjusting the partitioning logic to be more sophisticated can help mitigate such imbalances.
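The default behavior can be illustrated in Python (Hadoop's `HashPartitioner` is Java and also masks off the sign bit of `hashCode()`; note too that Python randomizes string hashes across runs, unlike Java, so this is only an illustrative sketch):

```python
def default_partition(key, num_reducers):
    # Hash the key, mask off the sign bit so the result is
    # non-negative, and take the remainder modulo the number
    # of reducers. Only the key matters, never the value.
    return (hash(key) & 0x7FFFFFFF) % num_reducers
```

A custom partitioner would replace this function with one that accounts for known skew, for example by routing a few known-hot keys to dedicated reducers.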

Combiners (or "Aggregate Intermediate Results")

A combiner acts as a "pre-processing" step at the mapper before sending data to the reducer.

By aggregating intermediate results locally, it reduces the amount of data shuffled across the network and improves overall performance.

If the reduce function is both associative and commutative, it can often serve directly as the combiner, and using it this way can yield substantial performance gains.

Combiners are optional components in the MapReduce paradigm.
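For word count, summation is associative and commutative, so the reducer's logic can double as a combiner that shrinks each mapper's output before the shuffle (a sketch):

```python
from collections import defaultdict

def combine(mapper_output):
    # Locally aggregate the (word, 1) pairs emitted by a single
    # mapper, so each distinct word crosses the network once per
    # mapper instead of once per occurrence.
    totals = defaultdict(int)
    for word, count in mapper_output:
        totals[word] += count
    return list(totals.items())
```

A mapper that emitted `("the", 1)` a thousand times now sends a single `("the", 1000)` pair to the reducer, which is where the bandwidth savings come from.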
