Parallel Execution and Partitioning

What is Parallel Execution?
At a very high level, parallel execution is a means for Oracle to “divide and conquer” the process of fetching data from Oracle tables.
Parallel execution in Oracle allows a single database operation to apply multiple CPU and IO resources (i.e., processes) that work together to retrieve the results. Parallel execution is often a desirable goal when:
  • A large amount of data needs to be retrieved.
  • Multiple CPU and IO resources are available on the database instance infrastructure to support additional processes.
  • Your system has performance requirements that can benefit from multiple processes working in concert to achieve faster query run-time.
You can use parallel execution for many operations in Oracle, including table scan, index fast full scan, and partitioned index scan access operations. Additionally, parallel execution can work for various join methods, DDL, DML, and so forth.
At a very high level, here’s how it works:
1. A SQL statement is issued that qualifies for parallel execution – this can be due to hints, degree of parallelism (DOP) settings on the underlying objects, or instance configuration settings. I’ll go into this in detail in a bit.
2. The user session takes the role of the “query coordinator”. The job of the query coordinator is to spawn and coordinate parallel servers.
3. The parallel servers will be given sets of data to “operate on”, called granules. In other words, the QC (query coordinator) delegates which sets of granules, or sets of database blocks, each parallel server is responsible for retrieving or operating on.
4. Each parallel server will issue the necessary IO calls to obtain the data, in parallel when possible. When finished, results are fed back to the QC for additional portions of execution.
5. The QC then sends the results to the user.
In the context of traditional Oracle consumer/producer roles, the parallel servers are producers – they produce rows that the consumer (QC, usually) will consume.
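To make step 1 above concrete, here’s a minimal sketch of the three common ways a statement becomes eligible for parallel execution (the SALES table name is hypothetical):

-- 1. A statement-level hint, requesting a DOP of 8 for this query only:
SELECT /*+ PARALLEL(8) */ COUNT(*) FROM sales;

-- 2. A DOP attribute on the underlying object:
ALTER TABLE sales PARALLEL 8;
SELECT COUNT(*) FROM sales;   -- now eligible for parallel execution

-- 3. Instance configuration - with Auto DOP enabled, the optimizer
--    computes a DOP for qualifying statements on its own:
ALTER SYSTEM SET parallel_degree_policy = AUTO;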
For the sake of reference, parallelism works the same on Exadata as it does on non-Exadata 11gR2. With Exadata, which you’ve presumably cost-justified based on your database’s performance requirements, business criticality, or some other reason, you want to get the most performance out of your investment, so exploiting the benefits of parallel query is important.
Parallel Execution and Partitioning
In many data warehouse and query-intensive applications, parallel execution and partitioning can complement each other from a performance and resource utilization perspective. In this section I’ll talk about a few reasons why and show some examples. The assumption is that the reader understands partitioning concepts; please refer to the Oracle documentation for details.
Let’s begin by looking at SOE.ORDER_ITEMS, a large table built using the SwingBench load test tool. This table is partitioned using HASH partitions, with 16 partitions created. There is also a collection of locally partitioned indexes and global indexes created on the table:
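The dictionary confirms the structure – a quick sketch, assuming the standard SOE object names:

-- Confirm the partitioning scheme and partition count
SELECT partitioning_type, partition_count
FROM   dba_part_tables
WHERE  owner = 'SOE'
AND    table_name = 'ORDER_ITEMS';

-- List the table's indexes and whether each is partitioned
SELECT index_name, partitioned
FROM   dba_indexes
WHERE  table_owner = 'SOE'
AND    table_name  = 'ORDER_ITEMS';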
Currently, the table is defined with noparallel DOP:
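You can verify this from the data dictionary – DEGREE shows 1 for a NOPARALLEL table:

SELECT table_name, degree
FROM   dba_tables
WHERE  owner = 'SOE'
AND    table_name = 'ORDER_ITEMS';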
Let’s start by running a sample query, summing QUANTITY by PRODUCT_ID:
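The test query is along these lines (column names follow the standard SwingBench SOE schema):

SELECT product_id, SUM(quantity)
FROM   soe.order_items
GROUP BY product_id;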
As we can see, we did a full table scan (TABLE ACCESS FULL) and it ran in 7.63 seconds. Let’s color this table and its indexes with a DOP of DEFAULT and compare results:
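Setting the DOP to DEFAULT is sketched below; the index name is a hypothetical example, and each index on the table gets the same treatment:

-- A DOP of DEFAULT on the table ...
ALTER TABLE soe.order_items PARALLEL;

-- ... and on its indexes (repeat for each; this index name is hypothetical)
ALTER INDEX soe.order_items_pk PARALLEL;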
The query ran in 4.89 seconds and spawned 16 PQ servers, with a DOP of 8, on each instance in the RAC cluster. With the DOP of ORDER_ITEMS being DEFAULT, we may have expected the parallel execution DOP to be 8, which is the computed value of parallel_degree_limit. So why did it spawn 16? In short, it does so because it can – the volume of data being requested is such that it can use the maximum number of PQ servers the system allows, so it’ll spawn them and use them. Let’s create an unpartitioned table as a copy of SOE.ORDER_ITEMS and do some additional tests to show how partitioning and parallel execution show up in an execution plan.
First, we’ll create our table:
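A minimal sketch of the copy (tablespace and storage clauses omitted):

CREATE TABLE soe.order_items_np
AS SELECT * FROM soe.order_items;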
After running a similar test query, let’s look at the execution plans for ORDER_ITEMS vs. ORDER_ITEMS_NP:
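One way to capture each plan is EXPLAIN PLAN plus DBMS_XPLAN – a sketch, shown for the unpartitioned copy; the same applies to ORDER_ITEMS:

EXPLAIN PLAN FOR
SELECT product_id, SUM(quantity)
FROM   soe.order_items_np
GROUP BY product_id;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);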
If you look closely at the two execution plans, you’ll see that they are nearly identical, and this is no surprise – our access method in both cases is a “TABLE ACCESS FULL” operation, which will scan either all hash partitions (in the case of ORDER_ITEMS) or the entire table (in the case of ORDER_ITEMS_NP).
Let’s run a more interesting example. In the SOE schema, we’ve got a ~1 million row ORDERS table, again partitioned via hash partitioning on ORDER_ID. Let’s create a copy of this as a range-partitioned table on ORDER_DATE and do some testing:
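A sketch of the range-partitioned copy – the partition boundaries are hypothetical placeholders:

CREATE TABLE soe.orders_p1
PARTITION BY RANGE (order_date) (
  PARTITION p_2010 VALUES LESS THAN (TO_DATE('2011-01-01','YYYY-MM-DD')),
  PARTITION p_2011 VALUES LESS THAN (TO_DATE('2012-01-01','YYYY-MM-DD')),
  PARTITION p_max  VALUES LESS THAN (MAXVALUE)
)
AS SELECT * FROM soe.orders;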
Once we’re all set up, let’s run a query with a predicate that should do partition elimination. At this point, we’re not going to give the table a default or other DOP – we want to make sure partition pruning is working.
Below, I’m going to use AUTOTRACE instead of SQL Monitor because the duration of the queries I’m running isn’t long enough to qualify (5 seconds of CPU or IO time, in the case of serial operations):
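The pruning test might look like the following; the date range is a hypothetical placeholder chosen to touch only a subset of partitions:

SET AUTOTRACE TRACEONLY

SELECT COUNT(*)
FROM   soe.orders_p1
WHERE  order_date >= TO_DATE('2011-01-01','YYYY-MM-DD')
AND    order_date <  TO_DATE('2011-07-01','YYYY-MM-DD');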
We can see above that partition pruning did occur, by nature of the PARTITION RANGE ITERATOR operation in the execution plan as well as the Pstart and Pstop columns. Now let’s parallelize ORDERS_P1:
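A sketch of the change and the re-run:

-- Give ORDERS_P1 a default DOP ...
ALTER TABLE soe.orders_p1 PARALLEL;

-- ... and re-run the same pruning query
SELECT COUNT(*)
FROM   soe.orders_p1
WHERE  order_date >= TO_DATE('2011-01-01','YYYY-MM-DD')
AND    order_date <  TO_DATE('2011-07-01','YYYY-MM-DD');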
In the above output, here are some things to note:
  • The PX BLOCK ITERATOR operation alone doesn’t tell you that partition pruning occurred, as this operation shows up in all PQ operations while the parallel servers query their block granules.
  • The Pstart and Pstop columns, as in the case when the query was not parallelized, are the indications that partition pruning occurred.
  • Note that since parallelism was used, we can use SQL Monitor to show the results of the above test.
So let’s see it – note that the Pstart and Pstop columns don’t show up in the SQL Monitor report:
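One way to pull the report is DBMS_SQLTUNE.REPORT_SQL_MONITOR; the &sql_id substitution variable stands in for the SQL_ID of the monitored statement:

SELECT DBMS_SQLTUNE.REPORT_SQL_MONITOR(
         sql_id => '&sql_id',
         type   => 'TEXT') AS report
FROM   dual;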
Now let’s run a two-table join against our partitioned tables – I’ll make all indexes invisible on both tables so we can see a really nice full-scan example. In this first example, I’ll run without a default DOP, so both tables will have a DOP of 1:
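The setup and join are sketched below; the index names, join columns, and date predicate are assumptions based on the SOE schema:

-- Hide the indexes so the optimizer favors full scans
-- (index names are hypothetical examples)
ALTER INDEX soe.order_pk       INVISIBLE;
ALTER INDEX soe.order_items_pk INVISIBLE;

-- Join the range-partitioned ORDERS_P1 to the hash-partitioned
-- ORDER_ITEMS; the date predicate enables range pruning, and the
-- ORDER_ID join key enables a partition join filter on the hash side
SELECT oi.product_id, SUM(oi.quantity)
FROM   soe.orders_p1   o,
       soe.order_items oi
WHERE  o.order_id   = oi.order_id
AND    o.order_date >= TO_DATE('2011-01-01','YYYY-MM-DD')
AND    o.order_date <  TO_DATE('2011-07-01','YYYY-MM-DD')
GROUP BY oi.product_id;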
From the above, we can see:
  • The query ran in 6.75 seconds.
  • The PARTITION RANGE ITERATOR step in line 6 indicates the partition-pruning step for the full scan on ORDERS_P1. The optimizer then created a partition join filter (PART JOIN FILTER CREATE, step 5) to hash join with the result of the scan on ORDER_ITEMS.
  • The partition scan on ORDER_ITEMS produced a PARTITION HASH JOIN-FILTER operation, the result of which was hash joined with the results from ORDERS_P1.
Now let’s paint both tables with a default DOP and re-run the test:
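Painting both tables uses the same ALTER as before:

ALTER TABLE soe.orders_p1   PARALLEL;
ALTER TABLE soe.order_items PARALLEL;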
  • The parallel partition-wise query ran in under 5 seconds.
  • Since the optimizer estimated that the result of the scan from ORDERS_P1 was substantially smaller than the result from ORDER_ITEMS, a bloom filter was created – this is evident in step 10, the PX SEND BROADCAST with :TQ10000 producing rows for the :BF00000 bloom filter to consume.
  • The :BF00000 bloom filter is fed to the parallel, partition-wise full scan on ORDER_ITEMS.
  • It’s worth noting that 16 PQ servers were used in this query – see below:
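A sketch of one way to confirm the PQ server count while the query runs, using GV$PX_SESSION:

-- Count parallel servers per query coordinator across the cluster;
-- the coordinator's own row has a NULL SERVER# and is excluded
SELECT qcsid, COUNT(*) AS px_servers
FROM   gv$px_session
WHERE  server# IS NOT NULL
GROUP BY qcsid;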