Shrijith Venkatramana

MapReduce Basics (Part 1)

Hi there! I'm Shrijith Venkatrama, the founder of Hexmos. Right now, I’m building LiveAPI, a super-convenient tool that simplifies engineering workflows by generating awesome API docs from your code in minutes.

  • The MapReduce (MR) programming model was invented to solve large-scale data processing problems
  • In particular, Google, eBay and some specialized academic communities such as particle physics needed petabyte scale (10^15 bytes) data processing on a daily basis
  • MapReduce is a set of principles for solving large-scale data processing problems
  • The two main principles in action are: (1) Divide and Conquer (2) Parallelization
  • Divide and Conquer is about - how to decompose a large problem into smaller ones
  • Parallelization is about - how to solve each smaller subproblem in parallel and finally integrate them to get a consolidated solution to the original large problem
  • In most traditional parallelization solutions/frameworks, the developer has to worry about many details: how to decompose the problem, how to distribute compute across cores and machines, how to distribute data efficiently, how to deal with errors/failures in the distributed system, how to synchronize different workers, etc. These older approaches impose a big cognitive burden and, because of that, leave room for errors in implementation.
  • MapReduce provides solutions for handling petabyte data efficiently. Instead of moving data where computation will happen, MR brings computation to data.
  • MapReduce has roots in functional programming. In particular, it is rooted in two functions: map and fold. Given a list of values, the map function is applied to each element to produce a transformed value. Map is inherently parallelizable. Next comes the fold function. A fold function takes two values: an accumulator (the initial or previous result) and the next element (usually a value produced by map), and combines them into a new accumulated result.
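A minimal Python sketch of these two primitives, using the standard library's `map` and `functools.reduce` (`reduce` is Python's name for fold):

```python
from functools import reduce

# map: transform each element independently (inherently parallelizable)
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))  # [1, 4, 9, 16]

# fold: combine an accumulator with each next value, starting from
# an initial value (here 0), to aggregate a single result
total = reduce(lambda acc, x: acc + x, squares, 0)  # 30
```

Note that each `map` call touches one element in isolation, while the fold must see pairs of values together, which is exactly why fold is the trickier step to distribute.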

Map & Fold

  • In summary - map is the transformation operation, while fold is the aggregation operation
  • The fold operation requires at least two data elements to be "brought together" in a distributed setup (hence more tricky sometimes)
  • In real-world scenarios, fold is often not applied across all elements at once; instead, fold happens within "groups", leading to higher parallelization.
  • For commutative and associative operations, fold can be made much faster through local aggregation and sensible reordering
  • MR is practically implemented at Google (proprietary) and also open sourced in the Hadoop project.

Next Steps

In the next part of the article series, I will explore:

  1. Mappers and Reducers
  2. Partitioners and Combiners
