Judy

Posted on Mar 11

Unconventional Yet Common Aggregation Operations

#programming #devops #productivity #discuss

Standard SQL provides five fundamental aggregation operations: SUM, COUNT, AVG, MIN, and MAX—all compute a single value from a set.

For example, finding the first login time for user 1001 in a log table is simple using SQL:

SELECT MIN(time) FROM login WHERE user=1001

Aggregation operations are also frequently used along with grouping operations, such as determining the first login time for each user:

SELECT user,MIN(time) FROM login GROUP BY user

The only difference in syntax is the addition of the grouping field.

Python is similar:

login[login['user']==1001]['time'].min()
login.groupby('user')['time'].min()

These two lines of code seem to have a consistent style, but the result of the grouped aggregation is not a two-dimensional DataFrame.

If you want to return a DataFrame, you should write the code as:

login.groupby('user').agg({'time': 'min'})

However, this leads to a minor inconsistency with the syntax style used in entire set aggregation.

Of course, esProc SPL also supports this basic form of syntax:

login.select(user==1001).min(time)
login.groups(user;min(time))

Similar to SQL, both the object and syntax style remain consistent.

However, aggregation operations don’t always appear in such simple forms. In more complex scenarios, SPL’s advantages become clear.

1. Unconventional aggregate functions

**
Sometimes, we care more about the information related to aggregated values than the values themselves. For example, we might want to find the IP address, browser type, and other details of a user’s first login, rather than just the timestamp. In such cases, the aggregation result should be a record, rather than just a single value.

SQL code:

SELECT * FROM login
WHERE user =1001 and
time = (SELECT MIN(time) FROM login WHERE user=1001);

SQL doesn’t have aggregate functions that return records. First, you need a subquery to calculate the minimum login time; then, use an outer query to retrieve other information related to it. This requires the dataset to be traversed twice.

Combining aggregation and grouping operations makes SQL even more cumbersome to use. For example, finding the first login record for each user in SQL:

SELECT * FROM login
JOIN (
    SELECT user,MIN(time) AS first_login_time
    FROM login
    GROUP BY user
    ) sub ON user = sub.user AND time = sub.first_login_time;

A subquery first groups and aggregates to find the first login time for each user, and then joins with the original table to retrieve other fields.

SPL offers unconventional aggregate functions, minp and maxp, which can return the records corresponding to the minimum and maximum values. To change from calculating the “minimum value” to retrieving the “record corresponding to the minimum value,” simply replace the aggregate function min with minp:

login.select(user==1001).minp(time)

The same applies to grouping and aggregation; all you need to do is change the aggregate function:

login.groups(user;minp(time))

Python also provides similar functions:

login[login['user']==1001].nsmallest(1, 'time')

nsmallest(1, 'time') finds the row with the smallest time value; nlargest is a similar function.

However, when involving grouping and aggregation, nsmallest cannot be used directly within the agg function after groupby like min; instead, it requires using apply. This leads to a conceptual inconsistency and makes the code slightly more complex:

login.groupby('user').apply(lambda group: group.nsmallest(1, 'time'))

This is because Python treats nsmallest and nlargest as data filtering methods, and their calculation process differs from aggregate functions like min and max. Aggregate functions often allow calculations to be performed during data traversal, without needing to first obtain the complete grouped subset, which results in significantly lower memory usage.

SPL’s minp and maxp functions are also calculated in this way, but Python’s nsmallest and nlargest cannot; they can only aggregate the pre-computed grouped subset.

Of course, SPL also supports further aggregation on those grouped subsets:

login.group(user;~.minp(time))

The ~ symbol represents the grouped subset, making the code more concise than Python’s.

Aggregation operations can also return sets, such as the common top-N problem: finding the top 3 highest-paid employees.

SQL does not treat topN as an aggregate function and can only express it through a sorting operation:

SELECT * FROM employee ORDER BY salary DESC LIMIT 3;

This is written in MySQL; the syntax varies across databases, but all involve sorting.

When combining this calculation with grouping, for example, to find the top 3 highest-paid employees in each department, it becomes more cumbersome to code in SQL:

SELECT *
FROM(
    SELECT *,
 ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) as row_num
    FROM employee
    ) 
WHERE row_num<= 3;

The syntax differs significantly from that used for calculating topN on the entire set. It requires using window functions to generate sequence numbers before filtering, which is a rather convoluted approach.

In contrast, SPL treats topN as an aggregate operation, providing the aggregate function top:

employee.top(-3;salary)

The top function aggregates data into a small set of size N when traversing the original table.

It also works easily with grouping:

employee.groups(department;top(-3;salary))

Alternatively, you can use the grouped subset syntax:

employee.group(department;~.top(-3;salary))

Python’s nlargest function can also return a set:

employee.nlargest(3, 'salary')

But if you want to use it in grouping, you still need to use apply:

employee.groupby('department').apply(lambda group: group.nlargest(3, 'salary'))

2. Complex aggregation operations

**
More complex aggregation operations cannot be implemented with a single function; multiple steps are required.

If it’s an operation on the entire set, that’s not a problem; you can just write it in steps. But if it involves grouping, then these steps must be performed on the grouped subsets. For example,

A chain store has sales data for different categories of products across its branches. Calculate the average sales for each branch, then determine the total sales of products that exceed the average sales for each branch, as well as the total sales of electronic products among those products.

SQL handles such problems in a very cumbersome way because it lacks native support for grouped subsets:

WITH StoreAvgVolume AS (
    SELECT store, AVG(volume) AS avg_volume  
    FROM  sales
    GROUP BY store
),
AboveAvgVolume AS (
    SELECT s.store, s.product, s.category, s.volume, sav.avg_volume
    FROM sales s
    JOIN StoreAvgVolume sav ON s.store = sav.store
    WHERE s.volume > sav.avg_volume
)
SELECT store,avg_volume,
SUM(volume) AS total_volume,
SUM(CASE WHEN category = 'Electronics' THEN volume ELSE 0 END) AS electronic_volume
FROM AboveAvgVolume
GROUP BY store,avg_volume;

Step 1: Group by branch and calculate the average sales. Step 2: Use a join to filter for products with sales exceeding the branch average. Step 3: Group again to calculate the total sales for the filtered products and the total sales specifically for electronic products.

SPL retains grouped subsets after grouping, allowing for continuous multi-step calculations on those subsets.

sales.group(store;a=~.avg(volume):avg,(above=~.select(volume>a)).sum(volume):total,above.select(category=="Electronics").sum(volume):e_total)

During aggregation operations, some intermediate results need to be referenced repetitively. For example, the above in the code can be defined as an intermediate variable for reuse.

Python also supports grouped subsets, but its related data objects and functions are disorganized. You need to add Series objects, and field names must be written in the cumbersome form of tablename['fieldname']:

import pandas as pd
sales = pd.read_csv('sales.csv')
result = sales.groupby('store').apply(lambda group: pd.Series({
   'avg': (avg := group['volume'].mean()),
    'above': (above := group[group['volume'] > avg]),
    'total': (above['volume'].sum()),
    'e_total': (above[above['category'] == 'Electronics']['volume'].sum())
})).drop(columns=['above'])

The intermediate results in aggregation operations must be defined as computed columns, like above, and this column finally needs to be deleted. While logically correct, the code is still much more cumbersome than SPL’s.

In summary: SQL provides only the most basic aggregation operations, making even slightly complex tasks quite difficult. Python offers a significant improvement with richer concepts and operations for handling more complex calculations. However, its fragmented conceptual system and inconsistent syntax make it difficult to learn and remember, and result in verbose code. SPL, in contrast, offers a clear conceptual structure and consistent syntax, enabling the most concise and readable code.

Download esProc for free

DEV Community

Unconventional Yet Common Aggregation Operations

1. Unconventional aggregate functions

2. Complex aggregation operations

Top comments (0)

Read next

Automating MIDI Generation with Python: A Comprehensive Guide

Unlocking Local AI Magic: DeepSeek and CodeGPT Empower Developers to Evolve Efficiently

Exploring Cloudflare's AI Gateway: Simplifying AI Integration and Security

Rethinking AI in the Future of QA and Testing