Grouping is a common structured data calculation, with corresponding statements and functions available in both SQL and Python. However, these languages are far less powerful than the grouping operations offered by esProc SPL.
Ordinary grouping often involves aggregation. For example, to calculate the number of employees for each department, the SQL code would be:
select DEPT,count(1) emp_count from employee group by DEPT
Of course, SPL also supports such basic grouping and aggregation operations:
T("employee.csv").groups(DEPT;count(1):emp_count)
It’s similar to SQL.
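Python's pandas expresses the same basic aggregation in much the same shape. A minimal sketch, using a small hypothetical inline table in place of employee.csv:

```python
import pandas as pd

# Hypothetical inline data standing in for employee.csv
employee = pd.DataFrame({
    'NAME': ['Ann', 'Bob', 'Cid', 'Dee'],
    'DEPT': ['Sales', 'Sales', 'HR', 'Sales'],
})

# Group by department and count the rows in each group
emp_count = employee.groupby('DEPT').size().reset_index(name='emp_count')
print(emp_count)
```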
In addition, SPL offers richer grouping operations. Below are two scenarios:
- Grouped subset

We are sometimes interested not only in grouped aggregate values, but also in the detailed data contained in each group – i.e., the grouped subset.
For instance, to identify employees from departments with over 10 members, a straightforward approach is to divide all employees into subsets by department, forming a set of sets. Then, filter this set to select the grouped subsets with a length greater than 10. Finally, union these subsets to get the desired result.
This means aggregation must not be forced immediately after grouping; the grouped subsets themselves need to be retained.
SPL supports grouped subset:
T("employee.csv").group(DEPT).select(~.len()>10).conj()
The group function divides the data into grouped subsets by department, the select function filters the subsets with a length exceeding 10, and the conj function unions the filtered subsets.
SQL’s grouping is always bound to aggregation and lacks the concept of grouped subset. So, the easiest way to solve this problem in SQL is to traverse the data twice:
select *
from employee
where DEPT in
(select DEPT
from employee
group by DEPT
having count(1)>10)
order by DEPT
First, use a subquery to group and identify departments with more than 10 employees. Then, filter the employee data for those departments again in the outer query.
Python does not truly support grouped subset objects either; its groupby function does not return a set of grouped subsets. However, since this problem only requires one filtering step, it can be solved relatively concisely using lambda syntax:
import pandas as pd
employee = pd.read_csv('employee.csv')
groups = employee.groupby('DEPT').filter(lambda x: len(x) > 10)
result = groups.sort_values(by='DEPT')
It appears to follow a computation logic similar to SPL, essentially implementing the concept of grouped subsets. Here, the ‘x’ in the lambda serves the same purpose as the ‘~’ symbol in SPL. The difference is that the filter function does not return a set of sets, but automatically flattens the groups into a single-layer table, eliminating the need for SPL’s conj to union them.
As a result, the filter output is no longer a set of grouped subsets and cannot feed subsequent grouped subset operations.
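If actual grouped subsets are wanted in pandas, the closest workaround is to iterate the GroupBy object itself, which yields one sub-DataFrame per key. A sketch on a hypothetical inline table, with the 10-member threshold scaled down to 2 to fit the sample data:

```python
import pandas as pd

# Hypothetical inline data standing in for employee.csv
employee = pd.DataFrame({
    'NAME': ['Ann', 'Bob', 'Cid', 'Dee', 'Eve'],
    'DEPT': ['Sales', 'Sales', 'HR', 'Sales', 'HR'],
})

# Materialize the grouped subsets: one DataFrame per department
subsets = {dept: df for dept, df in employee.groupby('DEPT')}

# Keep only the subsets longer than 2 (scaled down from 10) ...
kept = [df for df in subsets.values() if len(df) > 2]

# ... then union (concatenate) them, mimicking SPL's conj
result = pd.concat(kept) if kept else employee.iloc[0:0]
```

The dict of DataFrames has to be rebuilt by hand after every flattening operation, which is exactly the limitation discussed below.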
Let’s explore more operations on the grouped subsets. After identifying departments with more than 10 employees, sort these departments by their average employee salary in descending order, while also sorting employees within each department by their individual salaries in descending order.
This requires performing sequential operations on the set of grouped subsets: first filtering, then sorting, followed by sorting each individual subset.
SPL can naturally continue this process, starting from the code above:
T("employee.csv").group(DEPT).select(~.len()>10).sort(-~.avg(SALARY)).(~.sort(-SALARY)).conj()
Because SQL lacks the concept of grouped subsets, it still requires traversing the data twice and using a join.
select * from employee
join (
select DEPT,avg(SALARY) as avg_salary
from employee
group by DEPT
having count(1) > 10
) dept_stats on employee.DEPT = dept_stats.DEPT
order by
dept_stats.avg_salary desc,
SALARY desc
Introducing the subquery that calculates the average salary forces both the inner and outer queries to be rewritten around a JOIN, making the code somewhat convoluted.
Python’s grouped subset mechanism falls short in this scenario, as any operation after grouping will flatten the results into a single-layer table. To perform further computations on the grouped subsets, the data must be regrouped to regenerate grouped subsets:
import pandas as pd
employee = pd.read_csv('employee.csv')
groups = employee.groupby('DEPT').filter(lambda x: len(x) > 10)
dept_avg_salary = groups.groupby('DEPT')['SALARY'].mean().reset_index(name='avg_salary')
merged_df = pd.merge(groups, dept_avg_salary, on='DEPT')
sorted_df = merged_df.sort_values(by=['avg_salary', 'SALARY'], ascending=[False, False])
final_result = sorted_df[employee.columns]
Since the filter operation in the first step returns a single-layer table, regrouping is required to compute the aggregate values. The subsequent code will also need to borrow ideas from the preceding SQL approach, using merge or join operations to bind the aggregate values to achieve the sorting objective.
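One pandas detail worth noting: the merge step can be avoided with transform, which broadcasts a group aggregate back onto the original rows, though the result is still a flat table rather than grouped subsets. A sketch on a hypothetical inline table, with the threshold scaled down to 2:

```python
import pandas as pd

# Hypothetical inline data standing in for employee.csv
employee = pd.DataFrame({
    'NAME': ['Ann', 'Bob', 'Cid', 'Dee', 'Eve'],
    'DEPT': ['Sales', 'Sales', 'HR', 'Sales', 'HR'],
    'SALARY': [100, 200, 300, 300, 500],
})

# Keep departments with more than 2 members (scaled down from 10)
groups = employee.groupby('DEPT').filter(lambda x: len(x) > 2)

# transform('mean') broadcasts each department's average onto its own
# rows, so no merge with a separate aggregate table is needed
avg = groups.groupby('DEPT')['SALARY'].transform('mean')
result = (groups.assign(avg_salary=avg)
                .sort_values(by=['avg_salary', 'SALARY'],
                             ascending=[False, False])
                .drop(columns='avg_salary'))
```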
Python lacks a true grouped subset mechanism. While lambda syntax can achieve the effect of grouped subsets in a single step, enabling more complex aggregation computations (a significant advantage over SQL), it still falls far short of SPL.
To group only once in Python, you could instead apply the SQL strategy shown earlier wholesale, but that would still traverse the data twice and produce even more verbose code, so the implementation is omitted here.
Computations involving grouped subsets are extremely common, as evidenced by the numerous related questions easily found on prominent Q&A platforms, such as:
https://stackoverflow.com/questions/78324132/how-do-i-perform-recursive-search-in-oracle
https://stackoverflow.com/questions/78346354/adding-tuple-rows-to-each-subtuple-group-in-sql
…
- Ordered grouping

Grouped subsets are often involved in ordered grouping operations.
For example, to find the longest consecutive days of price increase for a stock, besides the direct traversal method, you can also utilize a grouping approach: First, sort the stock’s closing prices by date and traverse the data. When the price increases on a given day, group that day with the previous day; when the price decreases, start a new group. After completing the traversal, consecutive days of price increases will be grouped together. Then, you simply need to identify which group has the most members.
However, this grouping is not conventional equivalence grouping, but an ordered grouping based on conditions during traversal.
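The direct traversal mentioned above is easy to sketch in plain Python. The helper name longest_rise is hypothetical, and the prices are assumed to be already sorted by date:

```python
def longest_rise(prices):
    """Longest run of consecutive price increases in a date-sorted list."""
    best = run = 0
    for prev, cur in zip(prices, prices[1:]):
        run = run + 1 if cur > prev else 0  # extend the streak or reset it
        best = max(best, run)
    return best

print(longest_rise([10, 11, 12, 11, 12, 13, 14]))  # 3 rising days in a row
```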
SPL supports this kind of ordered grouping based on changing conditions.
stock.sort(Date).group@i(Price<=Price[-1]).max(~.len())
By appending the @i option to the group function, a new group will be created whenever the expression Price<=Price[-1] evaluates to true, corresponding to scenarios where the stock price has not risen. SPL supports cross-row referencing. Specifically, Price[-1] represents the closing price of the preceding row.
SQL lacks this feature, making it much harder.
SELECT MAX(ContinuousDays)
FROM (
    SELECT NoRisingDays, COUNT(*) AS ContinuousDays
    FROM (
        SELECT SUM(RisingFlag) OVER (ORDER BY Date) AS NoRisingDays
        FROM (
            SELECT Date,
                CASE WHEN Price > LAG(Price) OVER (ORDER BY Date) THEN 0
                     ELSE 1 END AS RisingFlag
            FROM stock
        )
    )
    GROUP BY NoRisingDays
)
Due to the lack of ordered grouping support, SQL has to convert it to common equivalence grouping through “accumulating non-increasing days”, resulting in a highly convoluted process. Determining increases and calculating cumulative values both require window functions in conjunction with subqueries, resulting in a situation with two window functions nested four layers deep, making the code difficult to write and understand.
Python also lacks direct ordered grouping support. Like SQL, it requires deriving a cumulative value column and then converting it to equivalence grouping, following similarly convoluted logic. However, Python's stronger adjacent element computation capabilities make the code more concise than SQL:
import pandas as pd

stock = pd.read_csv('stock.csv')
stock.sort_values(by=['Date'], inplace=True)
stock['Rising'] = stock['Price'].diff() > 0
streaks = stock.groupby((~stock['Rising']).cumsum())['Rising'].cumsum().where(stock['Rising'], 0)
max_increase_days = streaks.max()
There are numerous practical problems involving ordered grouping, such as:
https://stackoverflow.com/questions/78442803/list-of-all-the-last-people-to-enter-the-elevator
https://stackoverflow.com/questions/64099063/oracle-sql-data-migration-row-to-column-based-in-month
…
Grouped subsets and ordered grouping can also be nested.
For instance, if we want to calculate the longest consecutive days of price increase for each stock, we can first group by stock code to keep grouped subsets of each stock. Then, within each subset, we can apply the aforementioned ordered grouping method to compute the longest consecutive days of price increase.
SPL supports both true grouped subset concept and ordered grouping, enabling effortless implementation of nested computations.
stock.sort(Date).group(Code;~.group@i(Price<=Price[-1]).max(~.len()):max_increase_days)
You simply need to combine the two methods described previously.
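Conceptually, the nesting amounts to running the single-stock streak logic inside each code’s subset. A plain-Python sketch, with hypothetical inline rows and the same hypothetical longest_rise helper (longest run of consecutive increases):

```python
from collections import defaultdict

def longest_rise(prices):
    # Longest run of consecutive increases in a date-sorted price list
    best = run = 0
    for prev, cur in zip(prices, prices[1:]):
        run = run + 1 if cur > prev else 0
        best = max(best, run)
    return best

# Hypothetical rows: (code, date, price), not necessarily sorted
rows = [
    ('A', 2, 11), ('A', 1, 10), ('A', 3, 12),
    ('B', 1, 20), ('B', 2, 19), ('B', 3, 21),
]

# Group the date-sorted prices by stock code (the grouped subsets) ...
by_code = defaultdict(list)
for code, date, price in sorted(rows, key=lambda r: (r[0], r[1])):
    by_code[code].append(price)

# ... then run the ordered-grouping logic inside each subset
max_increase_days = {code: longest_rise(p) for code, p in by_code.items()}
print(max_increase_days)  # {'A': 2, 'B': 1}
```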
SQL, on the other hand, requires adding the stock code as a partition field within the window function:
SELECT Code, MAX(ContinuousDays)
FROM(
SELECT Code, COUNT(*) AS ContinuousDays
FROM(
SELECT Code, Date, SUM(RisingFlag) OVER (PARTITION BY Code ORDER BY Code, Date) AS NoRisingDays
FROM(
SELECT Code, Date,
CASE WHEN Price > LAG(Price) OVER (PARTITION BY Code ORDER BY Code, Date) THEN 0
ELSE 1 END AS RisingFlag
FROM stock
)
)
GROUP BY Code, NoRisingDays
)
GROUP BY Code
Once you understand the previous code, making such modifications isn’t particularly difficult, but the code becomes even more cumbersome.
Python supports neither continuous operations on grouped subsets nor ordered grouping. The only approach is to convert the ordered grouping to equivalence grouping using the previously mentioned method, combined with repeated regrouping:
import pandas as pd
stock = pd.read_csv('stock.csv')
stock.sort_values(by=['Code', 'Date'], inplace=True)
stock['NoRisingDays'] = stock.groupby('Code')['Price'].diff().fillna(0).le(0).astype(int).cumsum()
grouped = stock.groupby(['Code', 'NoRisingDays']).size().reset_index(name='ContinuousDays')
max_increase_days = grouped.groupby('Code')['ContinuousDays'].max()
max_rise_df = max_increase_days.reset_index(name='max_increase_days')
The code is convoluted and cumbersome.
There are also numerous practical problems involving nested ordered grouping within grouped subsets:
https://stackoverflow.com/questions/78319976/postgresql-how-to-calculate-swipe-in-and-swipe-out-time
https://stackoverflow.com/questions/64116840/sql-formatting-to-user-friendly-date
https://stackoverflow.com/questions/78422651/identify-groups-of-sequential-records
…
SPL also offers industry-unique grouping methods such as alignment grouping, enumeration grouping, and sequence-based grouping, far surpassing the grouping capabilities of SQL and Python.
SPL is free and open source.