Settings that Impact Parallel Execution in Oracle 11gR2

What is Parallel Execution?
At a very high level, parallel execution is a means for Oracle to “divide and conquer” the process of fetching data from Oracle tables.
Parallel execution in Oracle allows a single database operation to apply multiple CPU and IO resources (i.e., multiple processes) that work together to retrieve the results. Parallel execution is often desirable when:
  • A large amount of data needs to be retrieved.
  • Multiple CPU and IO resources are available on the database instance infrastructure to support additional processes.
  • Your system has performance requirements that can benefit from multiple processes working in concert to achieve faster query run-time.
You can use parallel execution for many operations in Oracle, including table scans, index fast full scans, and partitioned index scan access operations.  Additionally, parallel execution can work for various join methods, DDL, DML, parallel query, and so forth.
At a very high level, here’s how it works:
1. A SQL statement is issued that qualifies for parallel execution – this can be due to hints, degree of parallelism (DOP) settings on the underlying objects, or instance configuration settings.  I’ll go into this in detail in a bit.
2. The user session takes the role of the “query coordinator”.  The job of the query coordinator is to spawn and coordinate parallel servers.
3. The parallel servers will be given sets of data to “operate on”, called granules.  In other words, the QC (query coordinator) delegates which granules, or sets of database blocks, each parallel server is responsible for retrieving or operating on.
4. Each parallel server will issue the necessary IO calls to obtain the data, in parallel when possible.  When finished, results are fed back to the QC for additional portions of execution.
5. The QC then sends the results to the user.
In the context of traditional Oracle consumer/producer roles, the parallel servers are producers – they produce rows that the consumer (QC, usually) will consume.
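To make this a bit more concrete, while a parallel statement is running you can watch the QC and its parallel servers from another session by querying V$PX_SESSION. A minimal sketch (column list trimmed for readability):
  SELECT qcsid,        -- session ID of the query coordinator
         sid,          -- session ID of the parallel server (or the QC itself)
         server_group, -- parallel server group
         server_set,   -- server set within the group
         degree,       -- actual DOP used
         req_degree    -- DOP that was requested
  FROM   v$px_session
  ORDER  BY qcsid, server_group, server_set, sid;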
For the sake of reference, parallelism works the same on Exadata as it does on non-Exadata 11gR2. With Exadata, which you’ve presumably cost-justified based on your database’s performance requirements, business criticality, or some other reason, you want to get the most performance out of your investment, so exploiting the benefits of parallel query is important.
Instance Initialization Parameters that Impact Parallel Execution
Running a query or operation in parallel can improve performance, and the improvement is typically a factor of how many parallel servers the operation uses and needs. This is known as the “degree of parallelism” for an operation, or “DOP”. Getting the DOP right is critical to ensuring that a query runs quickly, uses the resources it needs but no more than it should, and performs consistently without impacting concurrency. Finding this mix can be difficult, as it depends on the underlying hardware infrastructure, the size and nature of the segments being accessed, system load and concurrency, and other factors. Oracle provides a number of initialization parameters that impact parallel execution, and in this section I’ll review what these are, what they default to, how they’re calculated, and when it may be beneficial to adjust them from their default values.
Parameter: cpu_count
cpu_count has been around forever and is a measure of the number of processors on which an instance runs. It takes hyperthreading into consideration; for example, if you’ve a 2-socket server with 6 cores per socket (12 cores total), each core running 2 hyperthreads, you’ll see cpu_count set to 2 x 6 x 2 = 24.
Parameter: parallel_adaptive_multiuser
parallel_adaptive_multiuser has been around since pre-11gR2, and its purpose is to provide a method for reducing the number of PQ slaves under system load. With parallel_adaptive_multiuser, Oracle can downgrade the DOP of an operation under load, which may or may not yield the intended result.  It is not suitable when Auto DOP is in use, and I’ll cover Auto DOP in detail in a different blog post.
Parameter: parallel_automatic_tuning
parallel_automatic_tuning has been around since pre-11gR2 and is used to enable automatic DOP calculations on objects defined with parallel parameters.  The default is FALSE, and it should remain FALSE most of the time, and certainly if Auto DOP is in use.
Parameter: parallel_degree_limit
parallel_degree_limit is new to 11g and sets the upper limit on the DOP for a single SQL statement.  It defaults to “CPU”, which means its setting is based on information derived from system CPU settings.  The calculation for this is:
parallel_degree_limit = (cpu_count) * (parallel_threads_per_cpu) * (active instance count in RAC cluster)
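For example, for the 2-node RAC DBM environment in the comparison table later in this post (cpu_count=24, parallel_threads_per_cpu=2), this works out to 24 * 2 * 2 = 96.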
Parameter: parallel_degree_policy
parallel_degree_policy is new to 11g and is used to enable or disable Automatic DOP. I’ll cover this in more detail later, but for now, let’s review the settings and see what they mean:
  • MANUAL = Auto DOP is turned off. DOP is determined by SQL hints or parallel degree parameters on the underlying objects.
  • AUTO = Auto DOP is turned on. Oracle automatically determines DOP based on segment statistics, and additionally, Oracle will enable parallel statement queuing and in-memory parallel execution.
  • LIMITED = Auto DOP is enabled only for objects on which the parallelism is set to DEFAULT. In-memory parallel execution and parallel statement queuing are disabled (unless “_parallel_statement_queuing”=TRUE, which I’ll cover in a different blog post in detail).
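As a quick, minimal illustration, the policy can be changed dynamically with ALTER SYSTEM (no restart needed):
  -- enable Auto DOP only for objects whose parallel degree is DEFAULT
  ALTER SYSTEM SET parallel_degree_policy = LIMITED;

  -- or enable full Auto DOP, parallel statement queuing, and in-memory parallel execution
  ALTER SYSTEM SET parallel_degree_policy = AUTO;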
Parameter: parallel_execution_message_size
parallel_execution_message_size is not new to 11g and determines the size, in bytes, of the messages used to exchange data between parallel query slaves and the query coordinator. These buffers are taken out of the shared pool, and the default should be 16k (16384), but I’ve noticed that in databases upgraded from a prior version it’s sometimes set to 2152.
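If you find a small post-upgrade value like 2152, something along these lines would put it back to 16k; note that this is a static parameter, so the change goes to the spfile and takes effect on the next instance restart:
  ALTER SYSTEM SET parallel_execution_message_size = 16384 SCOPE=SPFILE;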
Parameter: parallel_force_local
parallel_force_local is new to 11g and specifies whether the parallel slaves for a given SQL statement are all spawned on the same node in a RAC cluster or can span instances. The default is FALSE and should probably be kept as-is unless there are compelling reasons not to allow other instances to spin up parallel servers. And even if there are, it’s probably a better idea to control this with a resource manager plan.
Parameter: parallel_instance_group
parallel_instance_group is an old parameter used to restrict PQ processes to a specific instance or group of instances in a RAC cluster.  The default is null, and should probably be left unset – use a resource plan to control this if you need to.
Parameter: parallel_max_servers
parallel_max_servers is used to specify the maximum number of parallel query servers that can be spawned in an instance – in other words, this is the high-water mark.  The default is derived from the following formula:
parallel_max_servers = (cpu_count) * (parallel_threads_per_cpu) * (concurrent_users) * 5, where:
  • If memory_target or sga_target is specified in the spfile or pfile, concurrent_users = 4
  • If pga_aggregate_target is specified in the spfile or pfile and neither memory_target nor sga_target is, concurrent_users = 2
  • If none of the above is specified, concurrent_users = 1
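As a worked example, the DBM environment in the comparison table below has cpu_count=24, parallel_threads_per_cpu=2, and sga_target set (so concurrent_users = 4), which gives 24 * 2 * 4 * 5 = 960, the computed value referenced in the next paragraph.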
On Exadata, the templates that ACS used to deploy databases specify a non-standard setting of 128 for parallel_max_servers, which is considerably lower than the computed value of 960 (assuming sga_target is also set during the database build via the template). The theory behind this lower setting is predicated on the way Exadata is designed – with Smart Scan processing, Smart Flash Cache, and an optimal hardware configuration with a high-speed InfiniBand interconnect, applications running on Exadata are typically not IO-bound the way they’ve often been in the past. Faster IO means that PQ servers are able to produce data much more quickly for the PQ consumers, placing the balance of the workload for parallel operations more on the CPUs than on the IO subsystem, so a default value of 960 for parallel_max_servers could certainly saturate the compute node processors. This being said, however, this is one parameter worth additional research and testing depending on your workload.  With Auto DOP, there are also implications for this parameter’s setting with respect to parallel statement queuing – you want queuing to take place before parallel_max_servers is reached.
In situations with a single processor and memory_target set, the computation should be 1 * 2 * 4 * 5 = 40, but I’ve seen it default to 10. This seems to go against the formula Oracle publishes, so let’s see if it’s related to the cpu_count being only 1. The official Oracle documentation (http://docs.oracle.com/cd/E18283_01/server.112/e16541/parallel005.htm#CIHHJFFC) seems not to account for the “4 *” multiplier either – this could be because the memory_target is low or because the cpu_count is low.  Another explanation could be that if cpu_count=1, the parallel_threads_per_cpu part of the formula is discarded. I like this explanation a little bit because it fits in well with parallel_servers_target.
Parameter: parallel_min_percent
parallel_min_percent is a pre-11gR2 parameter that specifies the minimum percentage of the requested parallel servers that must be available in order for a statement to run in parallel.  It defaults to 0 and should be left at 0.
Parameter: parallel_min_servers
parallel_min_servers specifies the minimum number of PQ servers to keep running, instance-wide.  It defaults to 0.  If you want to pre-spawn PQ servers, setting this will fire them up when the instance starts and avoid the overhead of process creation, but most folks leave it at the default.  In theory, if you do almost all of your queries in parallel, setting this to a sensible value can eliminate some overhead.
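If you do decide to pre-spawn servers, it’s a simple dynamic change (the value 16 here is purely illustrative):
  ALTER SYSTEM SET parallel_min_servers = 16;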
Parameter: parallel_min_time_threshold
parallel_min_time_threshold is an Auto DOP-related parameter that specifies the minimum amount of time the CBO must predict a serial operation will run before it qualifies for Auto DOP. This parameter defaults to AUTO, which according to the documentation implies 30 seconds. So what this means is this:
If parallel_degree_policy = AUTO (or if it’s set to LIMITED and the parallelism on the table is DEFAULT), then the optimizer will first evaluate whether the serial execution of the statement will take longer than 30 seconds. If so, Auto DOP will kick in. If not, the statement will run serially. I’ve tested setting this to a lower value (1, 5, 10, etc.) and have had some success getting Auto DOP to kick in, but have not been able to do so with HCC-compressed tables – yet. Later in this document I’ll provide some additional research and test cases.
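If you want to experiment with a lower threshold, the parameter is dynamic; for example, to consider any statement the optimizer estimates will run longer than 5 seconds (an illustrative value):
  ALTER SYSTEM SET parallel_min_time_threshold = 5;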
Parameter: parallel_servers_target
parallel_servers_target is an Auto DOP-related parameter that specifies the maximum number of parallel server processes that can run on a database before being subject to parallel statement queuing.  It’s only relevant with Auto DOP, and the formula is defined as:
parallel_servers_target = 4 * (default DOP)
                        = 4 * (cpu_count) * (parallel_threads_per_cpu) * (active instances in the cluster)
This formula holds true unless:
  • parallel_max_servers is set to a non-default value. In this case, parallel_servers_target = cpu_count * parallel_max_servers.
  • cpu_count=1. In this case, parallel_servers_target = cpu_count * 4.
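As a sanity check against the comparison table below: DBM works out to 4 * 24 * 2 * 2 = 384 and XRAC to 4 * 2 * 2 * 2 = 32, both of which match the observed values, while TST1, with cpu_count=1, shows 4 (1 * 4) per the second exception above.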
Parameter: parallel_threads_per_cpu
parallel_threads_per_cpu represents the number of concurrent processes a CPU can support.  Generally it’s set to 2, but it is OS-dependent.
Parameter: “_parallel_statement_queuing”
“_parallel_statement_queuing” is an Auto DOP parameter that can be used to leverage parallel statement queuing in the event you set parallel_degree_policy to LIMITED. In the Auto DOP section below, I’ll go into the details of why you may want to do this.
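A minimal sketch of that combination looks like this (hidden parameters should only be changed with guidance from Oracle Support, and behavior can vary by patch level):
  ALTER SYSTEM SET parallel_degree_policy = LIMITED;
  ALTER SYSTEM SET "_parallel_statement_queuing" = TRUE;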
Parameter: “_parallel_cluster_cache_policy”
“_parallel_cluster_cache_policy” is an Auto DOP parameter that enables in-memory parallel execution.  It defaults to ADAPTIVE if Auto DOP is not configured and LOADED if it is. You’ll likely never need to change this, because if you want to use in-memory parallel execution, you’d just set parallel_degree_policy to AUTO.
Sample Environment Comparisons
Below, I’ll show the values for multiple 11gR2 environments for the various parallel execution influencing parameters outlined above. In the table below, I’ll denote in red situations in which the parameter is set to something other than the default. The four environments I’ll examine are described below:
  • DBM: 2-node RAC cluster on Oracle Exadata X2-2 Quarter Rack
  • XRAC: 2-node RAC cluster on OEL, running on VMware
  • TST1: Single-node 11gR2 on Red Hat/Fedora Release 9
  • PROD: Single-node 11gR2 on AIX 6.1
Parameter                             DBM        XRAC       TST1       PROD
cpu_count                             24         2          1          16
parallel_adaptive_multiuser           FALSE      TRUE       TRUE       TRUE
parallel_automatic_tuning             FALSE      FALSE      FALSE      FALSE
parallel_degree_limit                 96         8          2          32
parallel_degree_policy                LIMITED    MANUAL     MANUAL     MANUAL
parallel_execution_message_size       16384      16384      16384      2152
parallel_force_local                  FALSE      FALSE      FALSE      FALSE
parallel_instance_group               NULL       NULL       NULL       NULL
parallel_max_servers                  960        80         10         8[1]
parallel_min_percent                  0          0          0          0
parallel_min_servers                  0          0          0          0
parallel_min_time_threshold           AUTO       AUTO       AUTO       AUTO
parallel_servers_target               384        32         4          256
parallel_threads_per_cpu              2          2          2          2
“_parallel_cluster_cache_policy”      ADAPTIVE   ADAPTIVE   ADAPTIVE   ADAPTIVE
“_parallel_statement_queuing”         TRUE       FALSE      FALSE      FALSE
memory_target                         0          1.6 G      850 M      0
sga_target                            17 G       Defaulted  Defaulted  4.2 G
pga_aggregate_target                  8 G        Defaulted  Defaulted  1 G

[1] This was specifically set in the initialization parameter file and was not calculated

Controlling Degree of Parallelism
A successful implementation of parallelism usually boils down to setting the correct degree of parallelism, or DOP, for a SQL statement or operation. There are two contexts at which a DOP can be set and controlled – a default DOP and a per-segment DOP.
  • The default DOP = parallel_degree_limit = (cpu_count) * (parallel_threads_per_cpu) * (active instance count in RAC cluster).
  • A per-segment DOP can be specified on a table – look at DBA_TABLES.DEGREE.
You can specify a DOP for a table by issuing something like this (the SALES table here is just an example):
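  -- set a fixed DOP of 4 on the table
  ALTER TABLE sales PARALLEL 4;

  -- confirm the setting
  SELECT table_name, degree FROM dba_tables WHERE table_name = 'SALES';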
You can also set the parallel degree for a table to DEFAULT, which means that it’ll default to the system default DOP. This is required when parallel_degree_policy=LIMITED if you want your queries to parallelize (again, SALES is just an example table):
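  -- PARALLEL with no degree sets DBA_TABLES.DEGREE to DEFAULT
  ALTER TABLE sales PARALLEL;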
To validate these configurations, I’ll show a little example:
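In sketch form (with SALES being the example table from above), the checks look something like this:
  -- 1) capture cumulative parallel server statistics before the run
  SELECT statistic, value FROM v$pq_sysstat WHERE statistic LIKE 'Server%';

  -- 2) run a query against the parallel-enabled table
  SELECT COUNT(*) FROM sales;

  -- 3) capture the same statistics again and compare 'Servers Started' and 'Server Sessions'
  SELECT statistic, value FROM v$pq_sysstat WHERE statistic LIKE 'Server%';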
I won’t go into the execution plan details just yet, but note this:
  • The parallel degree is set to 4 for the table.
  • Prior to the run, we had 3576 parallel servers started, and the count was still 3576 after completion. This simply means we had some parallel server processes already established and didn’t need to start new ones.
  • The “Server Sessions” prior to the run showed 32150, and after, 32156 – a difference of 6, not 4.  This is because there were other sessions, system-wide, starting up PQ servers.
There are better ways to measure how many parallel servers were used for the execution of a SQL statement, and I’ll save these for a different post.