Measuring Parallel Execution

What is Parallel Execution?
At a very high level, parallel execution is a means for Oracle to "divide and conquer" the process of fetching data from Oracle tables.
Parallel execution in Oracle allows a single database operation to apply multiple CPU and IO resources (i.e., processes) that work together to retrieve the results. Parallel execution is often a desirable goal when:
  • A large amount of data needs to be retrieved.
  • Multiple CPU and IO resources are available on the database instance infrastructure to support additional processes.
  • Your system has performance requirements that can benefit from multiple processes working in concert to achieve faster query run-time.
You can use parallel execution for many operations in Oracle, including table scans, index fast full scans, and partitioned index scan access operations. Additionally, parallel execution works for various join methods, as well as DDL and DML statements.
At a very high level, here’s how it works:
1. A SQL statement is issued that qualifies for parallel execution – this can be due to hints, degree of parallelism (DOP) settings on the underlying objects, or instance configuration settings. I’ll go into this in detail in a bit.
2. The user session takes the role of the “query coordinator”. The job of the query coordinator is to spawn and coordinate parallel servers.
3. The parallel servers will be given sets of data to “operate on”, called granules.  In other words, the QC (query coordinator) delegates which sets of granules, or sets of database blocks, that each parallel server is responsible for retrieving or operating on.
4. Each parallel server will issue the necessary IO calls to obtain the data, in parallel when possible. When finished, results are fed back to the QC for additional portions of execution.
5. The QC then sends the results to the user.
In the context of traditional Oracle consumer/producer roles, the parallel servers are producers – they produce rows that the consumer (QC, usually) will consume.
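As a sketch of step 1 above, a statement can qualify for parallel execution either through a hint or through a DOP set on the underlying object (the table name here is illustrative; any large table will do):

```sql
-- Request a DOP of 8 for this statement via a hint:
SELECT /*+ PARALLEL(t, 8) */ COUNT(*)
  FROM myobj t;

-- Alternatively, set a default DOP on the object itself, so that
-- statements against it qualify for parallel execution without a hint:
ALTER TABLE myobj PARALLEL 8;
```

Either route produces the QC/parallel-server behavior described in steps 2 through 5.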
For the sake of reference, parallelism works the same on Exadata as it does on non-Exadata 11gR2. If you've cost-justified Exadata based on your database's performance requirements, business criticality, or some other reason, you want to get the most performance out of your investment, so exploiting the benefits of parallel query is important.
Parallel Query in Oracle Execution Plans
Being able to interpret a SQL execution plan is important in order to understand parallel execution behavior. In this section we’ll use SQL Monitor (dbms_sqltune.report_sql_monitor) after our test to show execution plan details. Using SQL Monitor is very valuable for many reasons:
  • It shows the execution plan, as dbms_xplan would.
  • It provides global statistics for the query.
  • It shows parallel execution details and drill-down statistics, for every SQL statement with a parallel operation.
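For example, a report for a recently executed statement can be generated along these lines (the sql_id substitution variable is a placeholder you'd supply from V$SQL or V$SQL_MONITOR):

```sql
-- Produce a text-format SQL Monitor report with full detail:
SELECT dbms_sqltune.report_sql_monitor(
         sql_id       => '&sql_id',   -- placeholder: the statement's SQL_ID
         type         => 'TEXT',
         report_level => 'ALL') AS report
  FROM dual;
```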
Below is an example of using dbms_sqltune.report_sql_monitor for our previous test case. First, our query:
The first section of the SQL Monitor report shows the Global Information.  The output is relatively self-explanatory:
Next, we see the global statistics from the query:
You can see that it took 3.69 seconds of elapsed time, of which 3.24 seconds were IO-related.  We read 44MB of data.  Next, it shows parallel execution details:
From the above we can see that a DOP of 8 was selected, and we can also see the relatively smooth distribution of IO across the PQ servers. Three of our PQ servers waited on direct path read waits, which is what you'd expect for parallel query slaves because they read data directly into the process PGA, bypassing the buffer cache. Next, we see an instance-level breakdown showing which PQ servers ran on each instance, along with the query coordinator (QC). You can see that instance 1 (XRAC1) spent slightly more time.
Finally, we see the execution plan:
One of the great things about SQL Monitor is that it reports both estimated rows and actual rows. From the above, we can see that the CBO estimated 446k rows on the full scan of MYOBJ but retrieved 452k rows. These are quite close; if they weren't, it would probably mean we had bad statistics on the table. Let's examine these execution plan details and talk through what each line means.
1. First, it does a parallel full-scan on MYOBJ. We know that this is a parallel operation because the operation above it is a “PX BLOCK ITERATOR” operation:
2. As mentioned, the operation associated with the full-scan is a “PX BLOCK ITERATOR” step.  PX BLOCK ITERATOR means that the parallel server processes will iterate over the generated block range granules to complete the table scan.
3. The next operation is a SORT AGGREGATE operation. Typically, a SORT AGGREGATE operation will return one row (and you can see later in the plan above that it does) – in the context of a parallel operation, however, it’s basically the sorted data from each of the parallel servers involved in the query. The SQL Monitor report shows a row estimate of “1” and an actual rows of 8, which means 1 row returned for each parallel server.
4. Next we see the "PX SEND QC (RANDOM)" operation. This is the point at which our PQ servers, acting as producers, send their rows to the query coordinator, which consumes them. In general, the operations below a "PX SEND%" line are performed by the producers, and the matching receiving operation belongs to the consumer. Because our query did a grouping function (count(*)), this step produces the rows for the QC's final sort operation.
5. The next operation is the PX COORDINATOR operation – the query coordinator itself – which receives the rows sent by the PQ servers and feeds them to the final sort operation. You'll note 9 "Execs" – one for each PQ server and one for the QC.
6. Finally, a SORT AGGREGATE operation is used to sort the results and produce the single row returned to the user.
Now let’s do a 2-table join and explain the execution plan:
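The original listing is not reproduced here; based on the plan walkthrough that follows, a query of roughly this shape would drive it (the join column is an assumption, since MYOBJ is a copy of SYS.OBJ$):

```sql
-- Hypothetical reconstruction of the 2-table join test:
SELECT /*+ PARALLEL(m, 8) */ COUNT(*)
  FROM test.myobj m
  JOIN sys.obj$ o ON o.obj# = m.obj#;
```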
Here’s how to read the plan:
1. The first thing done is at line 9 – an index fast full scan on the SYS.OBJ$.I_OBJ1 index. This is done in parallel, as indicated by the "PX SEND" line above it.
2. In line 8, we're doing a "PX SEND BROADCAST" operation. When joining tables in parallel, Oracle can choose either to broadcast the results (rows) from one operation to the servers scanning the other table, or to hash-distribute them with PX SEND HASH. In this case, our CBO determined that a BROADCAST was appropriate because the result set from the OBJ$ table was much smaller than that from the MYOBJ table.
3. Line 7, the PX RECEIVE step, is basically the consumer of the broadcasted rows in step 8.
4. Line 6 is an in-memory BUFFER SORT of the rows returned from the index scan on OBJ$.
5. Lines 11 and 10, respectively, show the full scan and the PX BLOCK ITERATOR operation for the granules distributed across the 8 PQ servers.
6. In line 5, Oracle is doing a hash join on the resulting rows from the parallel scans on MYOBJ and OBJ$.
7. Line 4 is a per-PQ server sort of data from the joined PQ servers.
8. Line 3 is the consumer QC step that holds the results from each of the PQ servers.
9. Line 2 is the PX Coordinator (QC) collecting, or consuming the rows of the joined data.
10. Line 1 is the final SORT AGGREGATE line that performs the grouping function.
Let’s do another test joining MYOBJ to itself to produce a larger result and see what changes in our execution plan:
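The exact statement isn't shown; a self-join of this general shape (hypothetical, with an assumed join column) matches the description:

```sql
-- Hypothetical self-join of MYOBJ to itself to produce a larger result:
SELECT /*+ PARALLEL(a, 8) PARALLEL(b, 8) */ COUNT(*)
  FROM test.myobj a
  JOIN test.myobj b ON a.obj# = b.obj#;
```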
Above, there are a few key differences between this test and the previous one, in addition to the obvious fact that we're joining to a different table:
  • The PX SEND BROADCAST changed to a PX SEND HASH because the optimizer knew that the result sets of each full parallel scan were roughly (or exactly) the same size.
  • The plan shows that we basically did two full parallel scans and hash joined the results together.
Specifically, each full scan operated on its own set of granules (each had a PX BLOCK ITERATOR operation), and each had a PX RECEIVE consumer to consume the rows of the parallel scan.
If we display the parallel execution statistics of the query, we can see 8 PQ servers established for each table scan in the join:
This points out something that’s worth noting – if we run a parallel-enabled join, Oracle will spawn PQ servers for each table that has a parallel DOP, so in this test we had 16 PQ servers accessing data. The more tables you join in parallel with a non-serial DOP, the more work your system will do. In this instance, we’ve got parallel_max_servers=80, so we could possibly see up to 80 servers established if we had a query that needed it. Let’s show what it would look like if we reduced parallel_max_servers to 10.
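On RAC, parallel_max_servers can be lowered for every instance with an ALTER SYSTEM; a sketch (the SCOPE and SID clauses depend on how your spfile is managed):

```sql
-- Cap each instance at 10 PQ servers, in memory only, on all instances:
ALTER SYSTEM SET parallel_max_servers = 10 SCOPE = MEMORY SID = '*';
```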
We still see 16 PQ servers established, although no more than 10 on a single instance, since parallel_max_servers works at the instance level. So let's set parallel_max_servers to 4 and see what we can see:
Above we can see the DOP calculated at 4 and only 4 PQ servers established per node. Let’s force our DOP to 8 when parallel_max_servers is only 4 and see what our plan looks like:
It looks similar to the previous test – we’ve requested 16 servers due to the DOP of 8 on the table, but it only allocated 8, 4 for each instance.
Measuring Parallel Execution with V$PQ_TQSTAT
Using an example on a table called MYOBJ – a copy of SYS.OBJ$, ballooned up with many more rows – I'll show some information from V$PQ_TQSTAT after running a parallel query. This MYOBJ table is decorated with a DEFAULT DOP, which, in the environment it's running in, equates to a DOP of 8. If you query this view immediately after running a parallel query, you've got a good chance of seeing how it was parallelized. V$PQ_TQSTAT shows DFOs, or "data flow operations" – parallel execution doesn't happen in the context of a query, but of a data flow operation. There are some things that make this difficult: first, the contents of the view disappear quickly, and second, the output can contain more than just the parallel detail of interest, since a single query can have multiple data flow operations.
But let’s test anyway:
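A query of this general shape works for the test; the column list follows the documented view, and the ordering is just a readability convenience:

```sql
-- Run immediately after the parallel query, from the same session;
-- the view's contents are transient and disappear quickly:
SELECT dfo_number, tq_id, server_type, num_rows, bytes, process, instance
  FROM v$pq_tqstat
 ORDER BY dfo_number, tq_id, server_type DESC, process;
```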
Things to note from the above:
  • There are 9 rows – one for each of the 8 producers (specified by the DOP) and one for the query coordinator.
  • The number of bytes processed per producer was equally sized, so we can be assured that each parallel server did the same amount of work.
  • The processes were split evenly between the different instances in the RAC cluster.
Let’s test one on a smaller table:
And now let’s do a test where we actually select a bunch of rows and see if the output is any different:
Now let's do another test with a join, so we can see the impact on the output from V$PQ_TQSTAT.
In the above we can see two QC DFO operations – one a consumer (as expected) and one a producer. If we look at the explain plan (below), we can see that the optimizer decided to do a hash join, performing a fast full scan on the SYS.I_OBJ1 index and broadcasting its results to the parallel full scan on TEST.MYOBJ. So in reading the output above, the operations associated with TQ_ID 1 represent the producer/consumer operations for the parallel operation on SYS.OBJ$, and the operations on TQ_ID 0 are associated with MYOBJ. We'll get into details on how to read parallel execution plans in a bit, but as you can see, it's not necessarily obvious how to read the results from V$PQ_TQSTAT.
Tracing Parallel Execution with “_px_trace”
Querying V$PQ_TQSTAT definitely leaves room for odd and unpredictable results. To help understand exactly what's happening with a parallel execution, we can trace it using "_px_trace". Please refer to MOS document 444164.1 for details. Let's show an example:
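Per MOS document 444164.1, the trace is enabled at the session level; a sketch matching the options used in this example:

```sql
-- Underscore parameters require the double-quoted syntax; this enables
-- tracing of PX compilation, execution, and messaging for the session:
ALTER SESSION SET "_px_trace" = "compilation", "execution", "messaging";
```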
Looking at the trace file, there is a lot of information generated because I've selected the compilation, execution, and messaging options. To see how many parallel slaves were started, you can do something like this:
You can also see which instances the slaves were started/used on by doing something like this – you’ll see that we’ve got 6 slaves started on the remote node and two on the local node: