# MapReduce Simplified: Understand Distributed Processing with the Same Logic as SQL
If you’ve worked with relational databases, you’re probably familiar with operations like `SELECT`, `GROUP BY`, and `JOIN`. These operations are powerful for manipulating and analyzing structured data. But what happens when the data is so large it doesn’t fit on a single server? This is where MapReduce comes in: a programming model for processing large volumes of data in a distributed manner. And guess what? The logic behind MapReduce isn’t much different from the queries you’re already writing in SQL!
In this article, we’ll break down MapReduce and show you how it can be understood with the same mindset you use when writing SQL queries. Let’s dive in!
## 1. The MAP Step: Similar to SELECT
In SQL, the `SELECT` operation allows you to choose and transform data from a table. In MapReduce, the Map step does something similar: it takes input data, processes each record, and emits key-value pairs.
Consider the following sample dataset:
| Salesperson | Sale Amount | Department |
|---|---|---|
| John | 100 | Electronics |
| Mary | 200 | Clothing |
| Peter | 150 | Electronics |
| Anna | 300 | Clothing |
| John | 50 | Electronics |
| Mary | 100 | Clothing |
If we want to select the salesperson's name and sale amount in SQL, we would write:
```sql
SELECT salesperson, sale_amount FROM sales;
```
Result:
| Salesperson | Sale Amount |
|---|---|
| John | 100 |
| Mary | 200 |
| Peter | 150 |
| Anna | 300 |
| John | 50 |
| Mary | 100 |
In MapReduce, the Map step does the same thing: it processes the records and emits key-value pairs. The Map function in pseudocode would look like this:
```python
def map(record):
    salesperson = record['salesperson']
    sale_amount = record['sale_amount']
    emit(salesperson, sale_amount)
```
Map Output:
```
('John', 100)
('Mary', 200)
('Peter', 150)
('Anna', 300)
('John', 50)
('Mary', 100)
```
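The Map step above can be run as ordinary Python. This is a minimal sketch, assuming each record is a plain dict and simulating the framework's `emit` with a callback that collects pairs into a list:

```python
# A runnable version of the Map step: the dataset is the sample table
# from the article, and `emit` is simulated by appending to a list.
records = [
    {'salesperson': 'John',  'sale_amount': 100, 'department': 'Electronics'},
    {'salesperson': 'Mary',  'sale_amount': 200, 'department': 'Clothing'},
    {'salesperson': 'Peter', 'sale_amount': 150, 'department': 'Electronics'},
    {'salesperson': 'Anna',  'sale_amount': 300, 'department': 'Clothing'},
    {'salesperson': 'John',  'sale_amount': 50,  'department': 'Electronics'},
    {'salesperson': 'Mary',  'sale_amount': 100, 'department': 'Clothing'},
]

pairs = []  # collected Map output

def map_record(record, emit):
    # One (key, value) pair per input record: (salesperson, sale amount).
    emit(record['salesperson'], record['sale_amount'])

for record in records:
    map_record(record, lambda key, value: pairs.append((key, value)))

print(pairs)
# [('John', 100), ('Mary', 200), ('Peter', 150), ('Anna', 300), ('John', 50), ('Mary', 100)]
```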
## 2. The REDUCE Step: Similar to GROUP BY + Aggregation
In SQL, the `GROUP BY` operation groups data and allows you to apply aggregation functions like `SUM()`, `COUNT()`, or `AVG()`. In MapReduce, the Reduce step receives the key-value pairs from Map, groups them, and performs a final operation, like summing the values.
For instance, if we want to calculate the total sales per salesperson in SQL, we would write:
```sql
SELECT salesperson, SUM(sale_amount) AS total_sales
FROM sales
GROUP BY salesperson;
```
Result:
| Salesperson | Total Sales |
|---|---|
| John | 150 |
| Mary | 300 |
| Peter | 150 |
| Anna | 300 |
In MapReduce, the Reduce step would look like this (in pseudocode):
```python
def reduce(salesperson, values):
    total_sales = sum(values)
    emit(salesperson, total_sales)
```
Reduce Output:
```
('John', 150)
('Mary', 300)
('Peter', 150)
('Anna', 300)
```
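The Reduce step is also easy to run locally. This sketch takes the Map output from the previous section, performs the grouping that a MapReduce framework does between the two phases, and then reduces each group:

```python
from collections import defaultdict

# Map output from the previous step: (salesperson, sale amount) pairs.
pairs = [('John', 100), ('Mary', 200), ('Peter', 150),
         ('Anna', 300), ('John', 50), ('Mary', 100)]

# The framework's shuffle groups values by key before Reduce runs.
groups = defaultdict(list)
for salesperson, amount in pairs:
    groups[salesperson].append(amount)

def reduce_group(salesperson, values):
    # Sum all sale amounts for one salesperson.
    return salesperson, sum(values)

totals = dict(reduce_group(k, v) for k, v in groups.items())
print(totals)  # {'John': 150, 'Mary': 300, 'Peter': 150, 'Anna': 300}
```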
## 3. MapReduce with Grouping by Department
Now let’s group the sales by department and calculate the total sales per department. In SQL, we would write:
```sql
SELECT department, SUM(sale_amount) AS total_sales
FROM sales
GROUP BY department;
```
Result:
| Department | Total Sales |
|---|---|
| Electronics | 300 |
| Clothing | 600 |
In MapReduce, the process would look like this:
### MAP Step
First, we emit key-value pairs where the key is the department and the value is the sale amount.
```python
def map(record):
    department = record['department']
    sale_amount = record['sale_amount']
    emit(department, sale_amount)
```
Map Output:
```
('Electronics', 100)
('Clothing', 200)
('Electronics', 150)
('Clothing', 300)
('Electronics', 50)
('Clothing', 100)
```
### REDUCE Step
Then, we sum the values for each department.
```python
def reduce(department, values):
    total_sales = sum(values)
    emit(department, total_sales)
```
Reduce Output:
```
('Electronics', 300)
('Clothing', 600)
```
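Since only the choice of key changed between sections 2 and 3, the whole pipeline can be factored into one small helper. This is a single-machine sketch (the function name `map_reduce` and its parameters are made up for illustration), run here on the article's sample dataset grouped by department:

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    # Single-machine sketch of the whole pipeline:
    # 1) Map: turn each record into (key, value) pairs,
    # 2) Shuffle: group values by key,
    # 3) Reduce: collapse each group to one final value.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}

records = [
    {'salesperson': 'John',  'sale_amount': 100, 'department': 'Electronics'},
    {'salesperson': 'Mary',  'sale_amount': 200, 'department': 'Clothing'},
    {'salesperson': 'Peter', 'sale_amount': 150, 'department': 'Electronics'},
    {'salesperson': 'Anna',  'sale_amount': 300, 'department': 'Clothing'},
    {'salesperson': 'John',  'sale_amount': 50,  'department': 'Electronics'},
    {'salesperson': 'Mary',  'sale_amount': 100, 'department': 'Clothing'},
]

by_department = map_reduce(
    records,
    map_fn=lambda r: [(r['department'], r['sale_amount'])],
    reduce_fn=lambda dept, values: sum(values),
)
print(by_department)  # {'Electronics': 300, 'Clothing': 600}
```

Swapping `map_fn` to emit the salesperson instead of the department reproduces the per-salesperson totals from section 2.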
## 4. How MapReduce Works in a Distributed Way
Now, imagine that the data is too large to fit on a single server. MapReduce allows us to split the data across multiple servers and process it in parallel.
Example: Let’s split the sales between two servers.
Server 1:
| Salesperson | Sale Amount |
|---|---|
| John | 100 |
| Mary | 200 |
| John | 50 |
Server 2:
| Salesperson | Sale Amount |
|---|---|
| Peter | 150 |
| Anna | 300 |
| Mary | 100 |
### Step 1: MAP Step (in parallel)
Each server processes its data locally and emits key-value pairs.
Server 1:
- John → 100
- Mary → 200
- John → 50
Server 2:
- Peter → 150
- Anna → 300
- Mary → 100
### Step 2: Shuffle and Sort
The data is shuffled and sorted so that all sales for the same salesperson are grouped together and sent to the correct server.
Grouped Data:
- John: [100, 50] → Sent to Server A
- Mary: [200, 100] → Sent to Server B
- Peter: [150] → Sent to Server A
- Anna: [300] → Sent to Server B
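How does the framework decide which server receives each key? A common approach is to hash the key and take it modulo the number of reducers, so every mapper routes the same key to the same reducer without coordination. A minimal sketch (using `md5` for a hash that is stable across runs, since Python's built-in `hash` for strings is randomized; real frameworks use their own partitioners):

```python
import hashlib

def partition(key, num_reducers):
    # Hash the key deterministically so the same salesperson always
    # lands on the same reducer, no matter which server mapped it.
    digest = hashlib.md5(key.encode('utf-8')).digest()
    return int.from_bytes(digest, 'big') % num_reducers

# With 2 reducers, each key is routed to reducer 0 or 1.
assignments = {name: partition(name, 2) for name in ['John', 'Mary', 'Peter', 'Anna']}
print(assignments)
```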
### Step 3: REDUCE Step (in parallel)
Each server processes the grouped data and calculates the total sales per salesperson.
Server A:
- John: 100 + 50 = 150
- Peter: 150 = 150
Server B:
- Mary: 200 + 100 = 300
- Anna: 300 = 300
### Step 4: Final Output
The partial results from each server are combined to generate the final result.
| Salesperson | Total Sales |
|---|---|
| John | 150 |
| Mary | 300 |
| Peter | 150 |
| Anna | 300 |
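The four steps above can be simulated end to end. In this toy sketch, threads stand in for servers (a real framework would run each shard on a separate machine and move data over the network), but the structure is the same: parallel Map over shards, a shuffle that groups by key, and parallel Reduce over the groups:

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# The two shards from the example: each list plays the role of one server's data.
shards = [
    [('John', 100), ('Mary', 200), ('John', 50)],    # Server 1
    [('Peter', 150), ('Anna', 300), ('Mary', 100)],  # Server 2
]

def map_shard(shard):
    # Each "server" emits its (salesperson, amount) pairs independently.
    return [(salesperson, amount) for salesperson, amount in shard]

# Step 1: Map runs in parallel, one worker per shard.
with ThreadPoolExecutor(max_workers=2) as pool:
    mapped = [pair for result in pool.map(map_shard, shards) for pair in result]

# Step 2: shuffle and sort — group all values for the same salesperson.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Step 3: Reduce — each group is summed independently, also in parallel.
with ThreadPoolExecutor(max_workers=2) as pool:
    totals = dict(pool.map(lambda item: (item[0], sum(item[1])), groups.items()))

# Step 4: the merged result.
print(totals)  # {'John': 150, 'Mary': 300, 'Peter': 150, 'Anna': 300}
```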
## Conclusion
With these examples, it becomes clear that MapReduce follows a similar logic to SQL, but is designed to process large datasets in a distributed and parallel manner. The Map step works like `SELECT`, and the Reduce step is similar to `GROUP BY` with an aggregation function. The power of MapReduce lies in its ability to split large tasks across multiple servers, speeding up processing and allowing the system to scale as data grows.
Now that you understand the connection between MapReduce and SQL, are you ready to apply this knowledge in distributed projects? Feel free to share your questions or experiences in the comments! 😊