World's Fastest Scalable Join

articles: 

One glance at my golf clubs would be enough to determine that I'm a terrible golfer. The pitching wedge is dirty. Nine-iron: dirty. Same with the eight, seven and six irons. Five, four and three irons are fairly clean. Woods: pristine. I play percentage golf (actually 110%, if you count penalties); I figure a 5-iron 150 meters down the fairway is a better bet than a 3-wood 200 meters into the trees.

So I've got a golf bag with 2 clubs that I paid for but never use. Madness? Well no, not really; but then I'm not paid to play golf. Can you imagine a professional golfer never using the driver? It wouldn't happen.

Can you picture an Oracle programmer never using the most powerful join method available? No? Get a mirror.

Scalability

The problem with most database joins is that they are not scalable. What do I mean by a scalable join? Two things:

  1. There is no upper limit to the size of the join. A join that uses TEMP space is not fully scalable because it is restricted by the availability of TEMP space.
  2. Performance is in (roughly) linear proportion to the size of the tables. eg. If joining A and B takes n seconds, then doubling the size of A (or B)should no more than double the join time.

Commonly used joins

The three most commonly used joins are Indexed Nested Loops, Hash Join, and Sort-Merge Join.

Indexed Nested Loops

The Nested Loop join is an iterative join: for each row in the first (inner) row source, lookup matching rows in the second (outer) row source. If the nested lookup of the second row source performs a Unique or Range Index Scan, then we call this Indexed Nested Loops.

Indexed Nested Loops is used primarily in low volume joins; it is efficient over small volumes and versatile enough to be used in a variety of situations. Although it is fully scalable, Indexed Nested Loops is inefficient over large data volumes.

Hash Join

The hash join is used for high-volume equi-joins (joins with equals predicates). Oracle performs a single read of the smaller row source (call this T1) and builds a hash table in memory. The join key is used as the hash-key of the hash table. Then a single pass of the larger row source (call this T2) is performed, hashing the join key of each row to obtain an address in the hash table where it will find matching T1 rows.

Provided T1 remains small enough to build the hash table in memory, T2 can be scaled up to any arbitrarily large volume without affecting throughput or exceeding temp space. If T1 cannot be hashed in memory, then a portion of the hash-table spills to disk. When the hash table is probed by T2, the rows with join keys that match those parts of the in-memory hash table are joined immediately; the rest are written to TEMP and joined in a second pass. The bigger T1 is, the smaller the proportion of the hash table that can fit in memory, and the larger the proportion of T2 that must be scanned twice. This slows the Hash Join down considerably and also makes the join non-scalable.

Sort-Merge

A sort-merge join works by reading each row-source in the join separately; sorting both sets of results on the join column(s); then concurrently working through the two lists, joining the rows with matching keys. Sort-Merge is generally faster than Indexed Nested Loops but slower than Hash Join for equi-joins. It is used almost exclusively for non-equi joins (>, <, BETWEEN) and will occasionally be used when one of the row sources is pre-sorted (eg. a GROUP BY inline view)

If both row sources are small then they may both be sorted in memory, however large sorts will spill to disk making then non-scalable.

There is no way to make a Sort-Merge join scalable. The only other way to resolve a non-equijoin is to use Nested Loops, which is slower. As volumes increase, Sort-Merge will continue to out-perform Nested Loops, but will eventually run out of Temp space. The only solution is to extend TEMP, or convert the join to Nested Loops (and then wait).

How can I tell if my join is using TEMP space?

Indexed Nested Loops may be slow over large volumes, but at least they don't use TEMP space. One of the subordinate operations (eg. you may be joining to a GROUP BY inline view) may use TEMP, but it would still do so even without the join.

You cannot determine whether a Sort-Merge join or a Hash join is using TEMP space unless you run it. For Sort-Merge, you can use SQL*Plus AUTOTRACE, and for Hash you can use a 10104 event trace.

Using AUTOTRACE to identify sorts using TEMP space

  • Copy your SQL into SQL*Plus (but don't run it!)
  • Wrap the SQL in SELECT * FROM ( ... ) WHERE ROWNUM > 1. This allows you to execute the SQL - preserving the execution plan - without fetching any of the results.
  • Switch on AUTOTRACE and run the SQL.

SQL> set autotrace on statistics
SQL>
SQL>select * from (
  2     select /*+ ORDERED USE_MERGE(customers)*/ *
  3     from sales
  4     join customers using (cust_id)
  5* ) where rownum > 1

no rows selected


Statistics
----------------------------------------------------------
         42  recursive calls
         13  db block gets
       3601  consistent gets
       6015  physical reads
        116  redo size
       1534  bytes sent via SQL*Net to client
        373  bytes received via SQL*Net from client
          1  SQL*Net roundtrips to/from client
          1  sorts (memory)
          1  sorts (disk)
          0  rows processed

Observe the second-last line in the Statistics: sorts (disk). If this value is anything but 0, then the SQL is performing a sort too large to be completed in memory. If your explain plan shows two or more sort steps (as is the case with a Sort-Merge join) then it is not possible to determine which step resulted in the disk sort.

Using 10104 event trace to identify Hash Joins using TEMP space

  • Start the 10104 event trace.
  • Copy your SQL into SQL*Plus (but don't run it!)
  • Wrap the SQL in SELECT * FROM ( ... ) WHERE ROWNUM > 1. This allows you to execute the SQL - preserving the execution plan - without fetching any of the results.
  • Run the SQL.
  • Search the trace file for "spilling"

SQL> alter session set events '10104 trace name context forever'
  2  /

Session altered.

SQL>
SQL> select * from (
  2          select *
  3          from orders
  4          join order_lines using (ord_num)
  5  ) where rownum > 1
  6  /

no rows selected

SQL>
SQL> select p.value || '/' || '*_ora_' || sys_context('USERENV','SESSIONID') || '.trc'
  2  from v$parameter p
  3  where p.name = 'user_dump_dest'
  4  /

P.VALUE||'/'||'*_ORA_'||SYS_CONTEXT('USERENV','SESSIONID')||'.TRC'
--------------------------------------------------------------------------------
/ora/oracle/admin/cdmdev/udump/*_ora_198129.trc

The following is the beginning of the 10104 trace; note the line in near the bottom: hash-join is spilling to disk.

*** 2007-02-22 16:55:01.639
*** ACTION NAME:() 2007-02-22 16:55:01.637
*** MODULE NAME:(SQL*Plus) 2007-02-22 16:55:01.637
*** SERVICE NAME:(CDMDEV) 2007-02-22 16:55:01.637
*** SESSION ID:(147.9146) 2007-02-22 16:55:01.637
kxhfInit(): enter
kxhfInit(): exit
*** RowSrcId: 3 HASH JOIN STATISTICS (INITIALIZATION) ***
Join Type: INNER join
Original hash-area size: 7572790
Memory for slot table: 5898240
Calculated overhead for partitions and row/slot managers: 1674550
Hash-join fanout: 16
Number of partitions: 16
Number of slots: 24
Multiblock IO: 15
Block size(KB): 16
Cluster (slot) size(KB): 240
Minimum number of bytes per block: 16352
Bit vector memory allocation(KB): 512
Per partition bit vector length(KB): 32
Maximum possible row length: 191
Estimated build size (KB): 33
Estimated Build Row Length (includes overhead): 35
# Immutable Flags:
  Not BUFFER(execution) output of the join for PQ
  Evaluate Left Input Row Vector
  Evaluate Right Input Row Vector
# Mutable Flags:
  IO sync
kxhfSetPhase: phase=BUILD
kxhfAddChunk: add chunk 0 (sz=32) to slot table
kxhfAddChunk: chunk 0 (lbs=0x40d0dd20, slotTab=0x40d0de74) successfuly added
kxhfWrite: hash-join is spilling to disk
kxhfSetPhase: phase=PROBE_1
qerhjFetch: max build row length (mbl=28)
*** RowSrcId: 3 END OF HASH JOIN BUILD (PHASE 1) ***
...

Scalable Joins

The bad news is that making joins scalable is not a tuning exercise, it's a design exercise. It should be taken into consideration when a table is first built, as it can be difficult and risky to retrofit post-implementation.

It's worth the effort to design tables for scalable joins; not only do you not have to worry about TEMP space, the joins also run faster.

Other than Indexed Nested Loops, which performs sub-optimally over large data volumes, there are three other scalable joins: Serial Partition-Wise joins, Hash Cluster Nested Loops, and Cluster Joins. Prior to writing this article, I was of the mistaken belief that Cluster Join was the fastest type of join. In the course of benchmarking these three joins, I discovered that the awesome memory management of the Hash Join when combined with the scalability of partitions performed up to twice as fast. Although Hash Cluster Nested Loops and Cluster Joins are both fully scalable, they are a specialised solution to problems other than scalability so will not be discussed further.

Serial Partition-Wise Joins

A Serial Partition-Wise Join is only possible when joining two equi-partitioned tables that are partitioned on the join key, ie. both tables have the same number of partitions with the same bounds. They work by breaking a very large join down into several smaller independent parts. Partition-Wise joins are most commonly used in parallel queries where the purpose is to improve performance by joining different chunks at the same time, but when run in serial mode they can also be used to limit or avoid the use of TEMP space.

Serial Partition-Wise Joins are not inherently scalable; you need to make sure that no pair of partitions is too large to be rewritten in available TEMP space. It is impractical to avoid using TEMP space altogether because you need to make sure that each partition of the smaller table can be hashed in memory; this would result in impractically small partitions.

Note the much reduced disk usage (and execution time) in the following example when the join does not use TEMP space.

Serial Partition-Wise Join - 52 sec Regular Hash Join - 93 sec
SELECT *
FROM
 ORDERS_P JOIN ORDER_LINES_P USING (ORD_NUM)


call     count       cpu    elapsed       disk      query
------- ------  -------- ---------- ---------- ----------
Parse        1      0.02       0.01          0          0
Execute      1      0.00       0.00          0          0
Fetch     2160     52.66      52.18      12017      13425
------- ------  -------- ---------- ---------- ----------
total     2162     52.68      52.19      12017      13425


Rows     Row Source Operation
-------  ---------------------------------------------------
2159524  PARTITION HASH ALL PARTITION
2159524   HASH JOIN
1016271    TABLE ACCESS FULL ORDERS_P PARTITION
2159524    TABLE ACCESS FULL ORDER_LINES_P PARTITION
SELECT *
FROM
 ORDERS JOIN ORDER_LINES USING (ORD_NUM)


call     count       cpu    elapsed       disk      query
------- ------  -------- ---------- ---------- ----------
Parse        1      0.03       0.03          0          0
Execute      1      0.00       0.00          0          0
Fetch     2160     92.73      93.71      20545      11625
------- ------  -------- ---------- ---------- ----------
total     2162     92.76      93.75      20545      11625


Rows     Row Source Operation
-------  ---------------------------------------------------
2159524  HASH JOIN
1016271   TABLE ACCESS FULL ORDERS
2159524   TABLE ACCESS FULL ORDER_LINES

That's great, but how practical is it considering most tables are partitioned by date but joined on some other foreign key?

Partition-wise joins are the primary reason for the existence of Composite partitioning (sub-partitions within partitions). If you want to enable a partition-wise join between two tables but one (or both) of them is already partitioned by a date, then you hash-sub-partition that table by the join-key. The other table must also be hash-partitioned or hash-sub-partitioned on the join key.

When to use Serial Partition-Wise Joins

  • Use Serial Partition-Wise Joins when you are joining two large tables (>1M rows), especially when one of them is monolithic (>10M rows). You will need to use hash-sub-partitions if one or both is already partitioned on a column other than the join-key.

Things to watch out for

  • Take care with locally partitioned indexes when partitioning and sub-partitioning tables. Range scans on locally partitioned indexes can dramatically degrade in performance because they have to scan many index segments instead of just the one. You should never physically reorganise a table (partitioning, clustering, index-organised) without prior performance analysis and regression testing.

Conclusion

If you are designing a physical data model with large tables, it is simply irresponsible not to consider the scalability of joins. Hash Joins are almost always the top-performers, however there are unusual circumstances with very large volumes (that I have been unable to reproduce here) described in the Oracle Performance Tuning Manual where a Hash Join requires as many as two reads an a re-write of the larger row-source in TEMP space.

Large tables should be designed to exploit partition-wise joins wherever possible, this will all but guarantee future join scalability. If scalability is not an issue then the joins can be parallelised making them even faster.

Partitioning and sub-partitioning a table for performance is not simple; it is very easy to detriment performance in an attempt to improve it. Before attempting this technique on a live system, make sure you read the relevant chapters for the Oracle Performance Tuning manual, perform copious benchmarking on full production volumes, and regression test all affected software.

Comments

Kevin Meade's picture

now this is the kind of stuff I wanted to read when I registered for this online outfit. Gi'me more dude.

This very nearly didn't get published. I've been working on it on-an-off for two weeks. Wasn't happy with it and chucked it in the 'maybe later' folder. Did a major edit yesterday changing the focus and deleting about 200 lines. The original article was about the benefits of clusters - perhaps another day.

You can join V$SESSION and V$SORT_USAGE views to get SQL statements that currently are using TEMP tablespace

I'll keep that one in mind. There's v$sql_workarea_active as well, but the article was already getting lengthy, and a side-track into v$ views just seemed like too much effort. Thanks for reading.

this one:
"You cannot determine whether a Sort-Merge join or a Hash join is using TEMP space unless you run it."
isn't completely true.

SQL> create table a as select * from dba_objects;

Table created.

SQL> insert into a select * from a;

49893 rows created.
...
SQL> /

798288 rows created.
SQL> explain plan for 
  2  select a1.* from a a1, a a2
  3  where a1.owner = a2.owner;

Explained.
SQL> select * from table(dbms_xplan.display());
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------
Plan hash value: 1585513397

-----------------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |   193G|    34T|       |  1681K (99)| 05:36:21 |
|*  1 |  HASH JOIN         |      |   193G|    34T|    43M|  1681K (99)| 05:36:21 |
|   2 |   TABLE ACCESS FULL| A    |  1584K|    25M|       |  4870   (1)| 00:00:59 |
|   3 |   TABLE ACCESS FULL| A    |  1584K|   267M|       |  4902   (2)| 00:00:59 |
-----------------------------------------------------------------------------------

TempSpc that's it!
It is a prediction of course and might be inaccurate but at least there is a prediction.