Automatic Degree of Parallelism in Oracle 11gR2

What is Parallel Execution?
At a very high level, parallel execution is a means for Oracle to “divide and conquer” the process of fetching data Oracle tables.
Parallel execution in Oracle allows a single database operation to apply multiple CPU and IO resources (i.e., processes) to work together to retrieve the results.  Parallel execution is often a desirable goal when and if:
  • A large amount of data needs to be retrieved.
  • Multiple CPU and IO resources are available on the database instance infrastructure to support additional processes.
  • Your system has performance requirements that can benefit from multiple processes working in concert to achieve faster query run-time.
You can use parallel execution for many operations in Oracle, including table scan, index fast full scan, and partitioned indexed scan access operations. Additional, parallel execution can work for various join methods, DDL, DML, parallel query, and so forth.
At a very high level, here’s how it works:
1. A SQL statement is issued that qualifies for parallel execution – this can be due to hints, degree of parallelism (DOP) settings on the underlying objects, or instance configuration settings.  I’ll go into this in detail in a bit.
2. The user session takes the role of the “query coordinator”. The job of the query coordinator is to spawn and coordinate parallel servers.
3. The parallel servers will be given sets of data to “operate on”, called granules. In other words, the QC (query coordinator) delegates which sets of granules, or sets of database blocks, that each parallel server is responsible for retrieving or operating on.
4. Each parallel server will issue the necessary IO calls to obtain the data, in parallel when possible. When finished, results are fed back to the QC for additional portions of execution.
5. The QC then sends the results to the user.
In the context of traditional Oracle consumer/producer roles, the parallel servers are producers – they produce rows that the consumer (QC, usually) will consume.
For sake of reference, parallelism works the same on Exadata as it does on non-Exadata 11gR2.  With Exadata, which you’ve cost-justified based on your database’s performance requirements, business-criticality, or some other reason, you want to get the most performance out of your investment, so exploiting the benefits of parallel query is important.
Historical Context
  • Some of the historical challenges with parallel query and parallel execution have been:
  • How many parallel query slaves does my query need?
  • How many concurrent parallel operations will saturate my system?
  • How can I ensure that parallelism will scale up to the point of saturation but not exceed it?
  • Can I ensure that my query will only run if it can establish a minimum amount of parallel query slaves?
Prior to 11gR2, finding this balance has been difficult. With 11gR2, Oracle has introduced “automatic degree of parallelism”, or Auto DOP. Auto DOP automatically calculates the degree of parallelism on a per SQL statement basis, in addition to several other key features. In short, here’s with Auto DOP does:
  • Automatically calculates DOP based on size of segment being accessed (and their DOP parameters), combined with overall system load.
  • Provides parallel statement queuing, which ensures that parallel operations will queue when a threshold is met and not either downgrade or saturate the system.
  • Provides a new, optimized mechanism called “in memory parallel execution”, which essentially reads blocks from PQ operations into the buffer cache and makes them re-usable for use later.
Below are a couple of facts and tips for Auto DOP:
  • Auto DOP is enabled by setting parallel_degree_policy = AUTO or LIMITED.
  • When parallel_degree_policy is set to AUTO, all features of Auto DOP are enabled. This means that all SQL operations that qualify for parallelism (full table scan, fast full index scan, etc.) are evaluated, and a system-specified DOP is determined by the optimizer for the operations. It also means that parallel statement queuing is enabled, as well as in-memory parallel execution.
  • Parallel statement queuing is controlled by the setting for parallel_servers_target, which defaults to 4 * the default DOP. In layman’s terms, this means that as soon as the equivalent of 4 SQL statements are running in parallel, all subsequent parallel attempts will queue until the running processes complete.
  • A related parameter, parallel_max_servers, is used to limit the maximum number of parallel servers per instance. If parallel_servers_target is less than parallel_max_servers, parallel statement queuing can occur – if not, it will not because the parallel_servers_target limit will be reached before Auto DOP queuing logic kicks in.
  • In-memory parallel execution, enabled with Auto DOP = AUTO, will effectively disable direct path reads. In Oracle Exadata environments, this means that the cell offload and Smart Scan features will not be used. This is very important, as this is one of the most powerful aspects of Exadata. On Exadata, if  you want to take advantage of Auto DOP, you should do this:

o Set parallel_degree_policy to LIMITED

o Color all your tables with a DEFAULT DOP

o Set “_parallel_statement_queuing”=TRUE

  • When parallel_degree_policy is set to LIMITED, only operations on tables with a DEFAULT DOP will be evaluated for Auto DOP.
  • The parallel_min_time_threshold parameter is used to specify the minimum amount of estimated time a query will need to take, according to the optimizer, before Auto DOP decides to parallelize it.  The default is either 10 or 30 seconds, depending on which documentation you refer to – we’ll try to prove this out later in this document.  The thinking behind this is simple – Oracle knows that the act of starting an Oracle process is an expensive operation, and why start a bunch of PQ servers if the query is doing to be a short-lived one?  This is one of the ways Auto DOP works to not over-parallelize execution when it doesn’t need to.
One of the interesting things you’ll see with systems running Auto DOP is some interesting per-statement DOP run-time values. Instead of seeing everything with a default DOP or a per-segment DOP, you’ll see DOP variations, many of which are lower than you’d expect. This is the optimizer’s way of saying “hey,  I know how many PQ servers I need, let me only start as many as a I need instead of maximizing to the limit of the system’s parallel_degree_limit”.
Again, Oracle introduced Auto DOP in 11gR2 to make the determination of a query’s DOP “automatic”, instead of based on trial-and-error and often attempted without an understanding of the underlying hardware configuration. Many times, I’ve seen bits of custom and Oracle packaged software code with hints like the below:
SELECT /*+ parallel (x,4) */ …
SELECT /*+ parallel (20) */ …
In these, the degree of parallelism seems to have been sized more out of guesswork than anything else. And sometimes, I’m just as guilty as developers sometimes and often put in a PQ hint in based on a gut feel for how much horsepower I think it needs. Auto DOP takes care of this guesswork and automatically determines your DOP.
Because of the varied and interesting impact of Auto DOP, I’ll do a variety of test cases in this section and attempt to shed some light on how it works in practice.  In the tests below, I’ll use the SOE.ORDERS and SOE.ORDER_ITEMS tables, each of which are hash-partitioned with all indexes invisible to motivate full table scans.
Setting parallel_degree_policy=AUTO 
Let’s first enable Automatic DOP:
We’ll also validate the DOP on our tables:
Before starting our test case, let’s validate parallel_max_servers and parallel_degree_limit:
Recall from our Environment Comparison earlier than in this XRAC environment, a parallel_degree_limit=CPU means that the parallel degree limit is 8 on each instance in the 2-node cluster. Now let’s run a full-scan test and see what we can learn:
From the above, we can see that no parallel execution took place – how can this be?  Let’s run an explain plan on this and see if we can learn anything:
The explain plan tells us that automatic DOP was in fact engaged, but determined that the computed DOP was 1 due to “parallel threshold”. This could very well indicate that the optimizer decided against parallelism because it estimated the query time to run in less than parallel_min_time_threshold. To validate this, let’s trace our parallel query using a “_px_trace” event and see what we can learn from the trace file:
The trace file above shows nothing conclusive, which is what we’d expect in hindsight because no operation(s) was parallelized. So let’s do this – let’s triple the size of our table and do the same type of test and see if we can get it to parallelize. Again, we’re going under the assumption that parallel execution is not being used due to not exceeding the parallel_min_time_threshold.
The above shows us two things:
  • The query wasn’t parallelized – if it were, our SQL Monitor report would have picked it up, since all parallel operations are eligible for SQL Monitor.  This is a good thing and for now, backs up our suspicion that things weren’t parallelized because the optimizer assumed it would run in less than parallel_min_time_threshold.
  • The query consumed less than 5 CPU seconds or 5 IO seconds of elapsed time, since we can tell from above that the last statement being monitored was the DBMS_STATS execution
So now we’ll balloon row-counts in our test table:
We’ve now got 112 million rows, so ideally Oracle will parallelize a full-scan on this:
As we can see above, the query ran in about 33 seconds and Auto DOP calculated a degree of parallelism of 2 for the query – you can tell this by counting the PQ servers launched for the query.  Since the default DOP is 8, we would have expected a maximum of 16 PQ servers, but the optimizer decided to only launch 2. With calculated DOP of 2 and parallel_servers_target=32 for this environment, we would then expect that the 17th concurrent execution of this query would actually start to queue with Auto DOP. We’ll get into this later.
Here’s what an EXPLAIN PLAN looks like for this:
The very bottom line indicates what DOP Auto DOP calculated, and as you can see it matches what our SQL Monitor trace showed.
Before going on to the next section, let’s see what the difference in query time is if we switched Auto DOP off and re-ran the query, allowing it to fully-parallelize:
We can see from the above that with 16 PQ servers, the query completed in 27.25 seconds.  This was about 18% faster than with 2 PQ servers, and is a very interesting difference because:
  • It was faster, but …
  • It wasn’t that much faster, and consumed much more overall system resources (DB Time).  The 2-PQ server run calculated by Auto DOP consumed 66 seconds of database time 7.48 seconds of CPU time, where the MANUAL run with 16 PQ servers consumed 210 seconds of database time and 11 CPU seconds.
DOP with Specific parallel clauses on Table & Auto DOP
In this section, we’ll set parallel_degree_policy=AUTO and do two tests, one with the parallel degree set to 1 (noparallel) for our table, and one with a DOP of 6.  According to Oracle documentation and in reference to the test case in the previous section, we should still see a DOP of 2 and Auto DOP overriding the default settings:
In both cases, we see Oracle choose a DOP of 2 regardless of what the segment’s parallel configuration was, which is what we’d expect.
Setting parallel_degree_policy=LIMITED 
Here, we’ll use the same test case with but with parallel_degree_policy=LIMITED.  We should see this:
  • If DEGREE=DEFAULT, Auto DOP will take effect
  • If DEGREE <> DEFAULT, statement will be parallelized according to its parallel degree
Let’s start by changing our initialization parameter and setting the DOP for our table to 6:
If we run our test query, we see this:
The above output confirms that a table’s specific DOP will override the effect of Auto DOP and the query was parallelized with degree 6. Now let’s set the default DOP on the table and test:
The takeaway from this section is this:
  • A specific, non-DEFAULT DOP for a table will be the chosen DOP when parallel_degree_policy=LIMITED
  • Auto DOP will calculate its own DOP when the DOP for a table is DEFAULT
  • You may want to consider setting parallel_degree_policy=LIMITED, set all tables with a DOP of DEFAULT by default, and then override this with specific degrees of parallelism in cases in which you know you can get and need to get better mileage.
  • You can still benefit from parallel statement queuing by setting “_parallel_statement_queueing”=TRUE
Investigating parallel_min_time_threshold
A few sections ago, we found that we needed to populate our test table with a ton rows in order for Auto DOP to give us a “non-1” DOP due to our parallel threshold. Let’s dial parallel_min_time_threshold down to a low number (in seconds) to see how we can influence behavior. I’ll set parallel_degree_policy back to AUTO in this section and also create a second table, ORDER_ITEMS_NOTASBIG that’s got about 10 million rows.
I’ll use EXPLAIN PLAN in this section because it’s quicker and also because it shows the “automatic DOP” note at the bottom.  First, let’s test with the default setting for parallel_min_time_threshold:
Now we’ll reduce our threshold and test:
As you can see, Auto DOP computed a DOP of 1 for this query after reducing the setting of parallel_min_time_threshold to 2.
Parallel Statement Queuing
In addition to automatically calculating the DOP of queries, Auto DOP also enables a feature called parallel statement queuing, which will begin queuing parallel operations when the computed (or adjusted) value of parallel_servers_target is reached. In our test system, parallel_servers_target is automatically set to 32 per the formulas outlined earlier in this document. In order to make this section meaningful, I’m going to reduce parallel_servers_target to 4 and run our test query multiple times against ORDER_ITEMS_BIGGER, which we know Auto DOP will compute a DOP of 2 for:
Then I’ll run a simple script that will fire off 8 concurrent queries full-scanning our table, passing an argument to each script with the spooled output file:
After running the queries, we’ll have 8 different files with SQL Monitor output to examine, trial[1-8].lst.  Let’s take a look at one of these output files, and I’ll delete the global output section:
From the above, we can see several wait activity detail lines with the wait event “resmgr; pq queued”. This is an indication that parallel queuing took place for the query. Also note that in this example, our overall elapsed time went from about 33 seconds to 3 minutes and 26 seconds.  So did all our queries queue? The answer in this case is yes, based on the output files. So let’s run a query by itself with our reduced setting for parallel_servers_target and see what we can learn:
As we can see above, we did not experience parallel statement queuing and calculated DOP of 2 was able to be attained as-is.
Let’s now try reducing parallel_servers_target to 1 and see if queuing takes place, or if Auto DOP computes the DOP to 1:
In the above, since Auto DOP would have calculated a DOP of 2 anyway, Oracle was able to use a PQ servers on each node in our cluster, so the test was sort of inconclusive.
We’ve tested so far with parallel_degree_policy=AUTO, but not let’s test with parallel_degree_policy=LIMITED and “_parallel_statement_queueing”=TRUE.  In this test, we’ll run our same script that runs 8 of these simultaneously, after setting parallel_servers_target back to 4:
Upon examining the output from SQL Monitor, all 8 executions showed evidence of parallel statement queuing, as in the earlier test case.
What if a table is decorated with a non-DEFAULT DOP, parallel_degree_policy=AUTO, and we’ve got concurrency levels that go beyond  parallel_servers_target? In other words, does queuing only work if tables have a DEFAULT DOP? Let’s test:
Now let’s fire off 8 concurrent, DOP=10 scans against this table, or try. When done, I’ll look inside the output file it spooled and as we can see below, parallel statement queuing did in fact take place. This was expected, and is a good thing.
In summary:
  • Parallel statement queuing will cause parallel execution operations to queue, as advertised, with parallel_degree_policy=AUTO.
  • With parallel_degree_policy=LIMITED, we can still achieve parallel statement queuing by setting “parallel_statement_queuing”=TRUE.
In-memory parallel execution
Traditionally, blocks requested via parallel query bypassed Oracle’s buffer cache and were read directly from disk into the PQ server’s PGA using a direct path read mechanism. This design was done to avoid the pain of saturating the buffer cache with large chunks of data. With Auto DOP and parallel_degree_policy=AUTO, Oracle can choose to read PQ-scanned blocks into the buffer cache, which in theory would help performance in cases when multiple sessions commonly request the same blocks, in parallel.   The trick to getting this to work is:
  • Make sure parallel_degree_policy=AUTO
  • Make sure the objects being scanned are large enough to receive a non-serial Auto DOP-computed DOP. If the amount of data is too small, Oracle won’t consider in-memory parallel execution
  • Make sure your buffer cache is large enough to hold the blocks
On my test system (XRAC), I can’t scale a buffer cache large enough to hold blocks that Auto DOP will consider parallelizing, so let’s test this on Exadata. The downside to in-memory parallel execution on Exadata is that it means no serial direct reads, and without serial direct reads, Smart Scan is not possible. This effectively takes the most powerful feature of Exadata out of play.
It’s still kind of difficult to get in-memory parallel execution to take place on Exadata. The following needs to be considered:
  • The table needs to be small enough to fit in a local buffer cache or the aggregate buffer caches across both instances in our quarter rack.
  • The query needs to run in a long enough time to be over the parallel_min_time_threshold setting.
For our test, we disabled indexes on SOE.ORDER_ITEMS, set the parallel degree DEFAULT, and ran a select against it:
As you can see, the query was NOT offload-eligible, too 1 minute and 32 seconds, and executed with 2 PQ slaves. Let’s run the same query again and see if we can benefit from in-memory parallel execution:
As we can see, the query time went from 1:30 to a little over 2 seconds, a pretty significant reduction. This is in-memory parallel execution at work.
Let’s set parallel_degree_limit to LIMITED and re-test, twice:
As we can see, with parallel_degree_limit=LIMITED, in-memory parallel execution is disabled and Smart Scan operations can continue.
So our summary Exadata is this:
  • Don’t set parallel_degree_limit=AUTO, it will sometimes bypass serial direct reads
  • Set parallel_degree_limit=LIMITED and “_parallel_statement_queuing”=TRUE.
Tracing Auto DOP
Tracing Auto DOP can be done the same way we’ve done it before, by enabling “_px_trace”.  Let’s do this for a test query and see if the trace file has anything Auto DOP-specific. I’ve doubled the size of my test table for this test to see if I can get a higher DOP:
An explain plan now shows a DOP of 6:
So let’s run a test query doing a full-scan against this larger table and see what we can learn from the trace file – make sure you flush the shared pool first so all the information is written to the trace file. The first sections shows this (output abbreviated):
Each of these kkscsid_pdm_eval lines represent recursive queries executed to determine information about the objects being scanned. Auto DOP uses this information to determine the DOP. When finished with these, the trace file will show the requested number of PQ threads, as indicated by reqthreads below:
This tells us that the number of PQ threads had already been established before writing to the trace file, so to understand how the optimizer selected a DOP of 6, we need to look at a SQL compiler trace (i.e., event 10053):
If we find the trace file with the “SQLCOMP” string in it that does not belong to one of the PQ slaves, we’ve found our trace file to analyze.  Near the top of the trace file, you’ll see the parallel degree limit of 8:
The normal SQL Compiler tracing stuff shows up after this, and eventually you’ll get to a section in the trace file under “AUTO DOP PLANS EVALUATION”, which in our case looks like this:
You can see that the “AutoDOP: kkopqAdjustDop” line shows a re-calculated DOP of 6 after “affinity normalization”.  I’m trying to work out how Oracle selected the DOP of 6, because the SQL Compiler trace didn’t show anything specific that I can see that evaluates other potential DOPs.  This is something I’ll want to spend more time trying to understand and hopefully, I’ll have an update soon.
IO Calibration and Auto DOP
In order for Automatic DOP to work, IO calibration needs to be done.  If not, you’ll get errors like this on an explain plan or execution plan:
automatic DOP: skipped because of IO calibrate statistics are missing
On non-Exadata systems, use dbms_resource_manager.calibrate_io as shown below:
With Exadata systems, since all the systems are the same engineered systems, you can do this instead. The same type of thing should be able to be leveraged on ODA, but probably with different numbers.
delete from resource_io_calibrate$;
insert into resource_io_calibrate$
values(current_timestamp, current_timestamp, 0, 0, 200, 0, 0);