DEV Community

Tomas@PawSQL
Tomas@PawSQL

Posted on • Edited on

Outer Join Optimization

Channel of advanced SQL tuning

Types of Joins

The types of SQL joins mainly include inner join, outer join (left outer join, right outer join, full outer join), and cross join. C.L. Moffatt explains them very nicely in his article Visual Representation of SQL Joins in a visual way. You will find the information you need in following diagram he presented in his article.

Image description

How does Join Type affect Query Execution

As we know, the database optimizer uses dynamic programming algorithm to estimate the cost of possible execution paths and select the execution plan with the smallest cost for execution. When the optimizer plans for table joins, it mainly performs three types of planning: access mode of data tables, order of table joins, and method of table joins.

When the optimizer plans for table joins, for Nested Loop Join method, it only considers the plan that the outer table is accessed before the inner table for each join operation. Therefore, when using outer joins, the outer and inner tables are explicitly defined in SQL (the outer table is the left table of left join/right table of right join), and the optimizer's selection is limited, which may lead to a lower-performing execution plan.

Let's use the example from the MySQL official document to illustrate:

Consider a query of this form, where R(T2) greatly narrows the number of matching rows from table T2:

SELECT * T1 FROM T1 LEFT JOIN T2 ON P1(T1,T2) WHERE P(T1,T2) AND R(T2) 
Enter fullscreen mode Exit fullscreen mode

If the query is executed as written, the optimizer has no choice but to access the less-restricted table T1 before the more-restricted table T2, which may produce a very inefficient execution plan.

Outer Join to Inner Join Conversion

If R(T2) is a NULL Filtered Condition (NFC), the outer join above can be transformed into an inner join, i.e.,

SELECT * T1 FROM T1 JOIN T2 ON P1(T1,T2) WHERE P(T1,T2) AND R(T2) 
Enter fullscreen mode Exit fullscreen mode

In this way, the optimizer can first apply R(T2) to obtain a very small result set, and then associate it with T1.

NFC Condition (NULL Filtered Condition)

A NULL Filtered Condition refers to a condition that filters out a row if the input is NULL and its evaluation result is FALSE or UNKNOWN. The rules for determining whether a condition is an NFC of an outer join operation are simple:

  • It is of the form A IS NOT NULL, where A is any column of the inner table.
  • It is a predicate that contains a reference to the inner table, and when one of its parameters is NULL, it evaluates to UNKNOWN.
  • It is a "AND" combination of NFCs.
  • It is a "OR" combination of NFCs.

Therefore, for this outer join:

T1 LEFT JOIN T2 ON T1.A=T2.A 
Enter fullscreen mode Exit fullscreen mode

the following conditions are considered as NFCs:

  • T2.B IS NOT NULL
  • T2.B > 3
  • T2.C <= T1.C
  • T1.B < 3 AND T2.B IS NOT NULL
  • T2.B < 2 OR T2.C > 1

The following conditions are not considered as NFCs:

  • T2.B IS NULL
  • T1.B < 3 OR T2.B IS NOT NULL
  • T1.B < 3 OR T2.B > 3
  • T2.B in (1,2, NULL)

Outer Join Optimization in DBMS

Most of the optimizer in relational databases can provide the above outer join simplification rewrites. Here is an example of outer join simplification rewrites in MySQL and PostgreSQL.

  • Example SQL statement:
select c_custkey from orders left join customer on c_custkey=o_custkey where C_NATIONKEY  < 20
Enter fullscreen mode Exit fullscreen mode
  • Execution Plan in MySQL:
-> Inner hash join (orders.O_CUSTKEY = customer.C_CUSTKEY)  (cost=20541.08 rows=20013)
    -> Table scan on orders  (cost=2529.21 rows=200128)
    -> Hash
        -> Filter: (customer.C_NATIONKEY < 20)  (cost=0.35 rows=1)
            -> Table scan on customer  (cost=0.35 rows=1)
Enter fullscreen mode Exit fullscreen mode
  • Execution Plan in PostgreSQL:
Hash Join  (cost=100.19..410.47 rows=33 width=4)
  Hash Cond: (orders.o_custkey = customer.c_custkey)
  ->  Seq Scan on orders  (cost=0.00..284.01 rows=10001 width=4)
  ->  Hash  (cost=99.78..99.78 rows=33 width=4)
        ->  Bitmap Heap Scan on customer  (cost=4.54..99.78 rows=33 width=4)
              Recheck Cond: (c_nationkey < 20)
              ->  Bitmap Index Scan on c_nationkey_idx  (cost=0.00..4.53 rows=33 width=0)
                    Index Cond: (c_nationkey < 20)
Enter fullscreen mode Exit fullscreen mode

As seen from the execution plan, both MySQL and PostgreSQL support the rewrite optimization of outer join to inner join.

Outer Join Optimization in PawSQL

PawSQL implements similar rewrite optimization through the Outer2InnerConversionRewrite rule. It supports the rewrite optimization of left outer join, right outer join to inner join, and the rewrite optimization of full outer join to left outer join, right outer join, or inner join.

Note: The purpose of implementing the Outer2InnerConversionRewrite optimization in PawSQL is not to reproduce the rewrite optimization logic of the database optimizer. PawSQL implements it because it can trigger index recommendation functions that the database optimizer cannot do.

Step 1. Input SQL statement

select c_custkey from orders left join customer on c_custkey=o_custkey where C_NATIONKEY  < 20
Enter fullscreen mode Exit fullscreen mode

Step 2. After Outer2InnerConversionRewrite applied, left join is converted to inner join

select c_custkey from orders inner join customer on c_custkey=o_custkey where C_NATIONKEY  < 20
Enter fullscreen mode Exit fullscreen mode

Step 3. Recommended Index
Two following indexes recommended by the PawSQL index engine for the rewritten query.

CREATE INDEX CONCURRENTLY PAW_IDX1028958902 ON PUBLIC.ORDERS(O_CUSTKEY);
CREATE INDEX CONCURRENTLY PAW_IDX1196677611 ON PUBLIC.CUSTOMER(C_NATIONKEY,C_CUSTKEY);
Enter fullscreen mode Exit fullscreen mode

Step 4. Execution Plans

After incorporating the recommended indexes into the optimizer planning, the execution plans are:

  • Execution Plan in PostgreSQL:
Nested Loop  (cost=0.57..85.53 rows=33 width=4)
->  Index Only Scan using paw_idx1196677611 on customer  (cost=0.29..8.86 rows=33 width=4)
      Index Cond: (c_nationkey < 20)
->  Index Only Scan using paw
Enter fullscreen mode Exit fullscreen mode
  • Execution Plan in MySQL
-> Nested loop inner join  (cost=1.91 rows=6)
  -> Filter: (customer.C_NATIONKEY < 20)  (cost=0.35 rows=1)
      -> Index scan on customer using PAW_IDX1360881332  (cost=0.35 rows=1)
  -> Covering index lookup on orders using PAW_IDX1028958902 (O_CUSTKEY=customer.C_CUSTKEY)  (cost=1.56 rows=6)
Enter fullscreen mode Exit fullscreen mode

As we can see, through the entire optimization process of PawSQL, two indexes recommended by the index engine reduced the query cost from 410.47 to 85.53 on a PostgreSQL database with small size of data, resulting in a 379.91% performance improvement. Similarly, on a MySQL database server with a big size of data, the query cost decreased from 20541.08 to 1.91, resulting in a 553161.15% performance improvement.

Top comments (0)