Shrijith Venkatramana

MapReduce Basics (Part 1)

Hi there! I'm Shrijith Venkatrama, the founder of Hexmos. Right now, I’m building LiveAPI, a super-convenient tool that simplifies engineering workflows by generating awesome API docs from your code in minutes.

  • The MapReduce (MR) programming model was invented to solve large-scale data processing problems
  • In particular, Google, eBay and some specialized academic communities such as particle physics needed petabyte scale (10^15 bytes) data processing on a daily basis
  • MapReduce is a set of principles for solving large-scale data processing problems
  • The two main principles in action are: (1) Divide and Conquer (2) Parallelization
  • Divide and Conquer is about - how to decompose a large problem into smaller ones
  • Parallelization is about - how to solve each smaller subproblem in parallel and finally integrate them to get a consolidated solution to the original large problem
  • In most traditional parallelization solutions/frameworks, the developer has to worry about many details: how to decompose the problem, how to distribute compute across cores and machines, how to distribute data efficiently, how to deal with errors/failures in the distributed system, how to synchronize different workers, etc. These older approaches impose a big cognitive burden and, because of that, leave room for errors in implementation.
  • MapReduce provides solutions for handling petabyte data efficiently. Instead of moving data where computation will happen, MR brings computation to data.
  • MapReduce has roots in functional programming. In particular, it is rooted in two functions: map and fold. Given a list of values, the map function is applied to each element to produce a transformed value. Map is inherently parallelizable. Next comes the fold function. A fold function takes two values: an accumulator (the initial or previous result) and the next element (usually a value produced by map), and combines them into a new accumulated result.
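A minimal Python sketch of these two primitives, using the standard library's `map` and `functools.reduce` (`reduce` is Python's name for fold):

```python
from functools import reduce

# map: transform each element independently (inherently parallelizable)
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))  # [1, 4, 9, 16]

# fold: combine an accumulator with each next value, starting from
# an initial value (here 0), to aggregate a single result
total = reduce(lambda acc, x: acc + x, squares, 0)  # 30
```

Note that each `map` call touches one element in isolation, while the fold must see pairs of values together, which is exactly why fold is the trickier step to distribute.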

Map & Fold

  • In summary - map is the transformation operation, while fold is the aggregation operation
  • The fold operation requires at least two data elements to be "brought together" in a distributed setup (hence more tricky sometimes)
  • In real-world scenarios, fold is often not applied across all elements at once; instead, fold happens within "groups", leading to higher parallelization.
  • For commutative and associative operations, fold can be made much faster through local aggregation and sensible reordering
  • MR is practically implemented at Google (proprietary) and also open sourced in the Hadoop project.

Next Steps

In the next part of the article series, I will explore:

  1. Mappers and Reducers
  2. Partitioners and Combiners
