Hi there! I'm Shrijith Venkatrama, the founder of Hexmos. Right now, I’m building LiveAPI, a super-convenient tool that simplifies engineering workflows by generating awesome API docs from your code in minutes.
- The MapReduce (MR) programming model was invented to solve large-scale data processing problems
- In particular, Google, eBay, and some specialized academic communities such as particle physics needed petabyte-scale (10^15 bytes) data processing on a daily basis
- MapReduce is a set of principles for solving large-scale data processing problems
- The two main principles in action are: (1) Divide and Conquer (2) Parallelization
- Divide and Conquer covers how to decompose a large problem into smaller subproblems
- Parallelization covers how to solve each subproblem in parallel and then integrate the partial results into a consolidated solution to the original problem
- In most traditional parallelization frameworks, the developer has to worry about many details: how to decompose the problem, how to distribute computation across cores and machines, how to distribute data efficiently, how to deal with errors and failures in the distributed system, how to synchronize the different workers, and so on. These older approaches carry a big cognitive burden and, as a result, leave plenty of room for implementation errors.
- MapReduce provides solutions for handling petabyte-scale data efficiently. Instead of moving data to where the computation happens, MR brings the computation to the data.
- MapReduce has roots in functional programming. In particular, it is rooted in two functions: map and fold. Given a list of values, map applies a function to each element to produce a transformed value; map is inherently parallelizable. Fold takes two values: an accumulator (the initial or previous value) and the next element (usually the output of map), and combines them into an updated accumulator.
- In summary - map is the transformation operation, while fold is the aggregation operation
- The fold operation requires at least two data elements to be "brought together", which can make it trickier in a distributed setup
- In real-world scenarios, fold often does not need to run over all elements at once; instead it happens in "groups", which allows higher parallelization
- For commutative and associative operations, fold can be made much faster through local aggregation and sensible reordering
- MR is implemented in practice at Google (as a proprietary system) and is also available as open source in the Hadoop project
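The map/fold roots and the local-aggregation idea above can be sketched in a few lines of Python. This is a minimal illustration, not how a real MR framework is implemented: the data, the squaring function, and the two-group partition are all made up for the example, and `functools.reduce` plays the role of fold.

```python
from functools import reduce

# Map: transform each element independently (inherently parallelizable).
# Hypothetical transformation: squaring each number.
values = [1, 2, 3, 4, 5, 6]
mapped = list(map(lambda x: x * x, values))  # [1, 4, 9, 16, 25, 36]

# Fold: aggregate the mapped values into a single result.
total = reduce(lambda acc, x: acc + x, mapped, 0)  # 91

# For a commutative and associative operation like +, the fold can be
# split into local aggregations over groups, then combined. Reordering
# and regrouping do not change the result, which is what makes local
# aggregation safe.
groups = [mapped[0:3], mapped[3:6]]                             # partition the data
partials = [reduce(lambda acc, x: acc + x, g, 0) for g in groups]  # local folds
combined = reduce(lambda acc, x: acc + x, partials, 0)          # final merge

assert combined == total  # same answer either way
```

The `partials` step is essentially what a combiner does in real MR systems: each worker folds its own slice locally, so far less data has to be "brought together" across the network.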
Next Steps
In the next part of the article series, I will explore:
- Mappers and Reducers
- Partitioners and Combiners