What We Think We Know
Here are some facts that we know about Exadata:
In addition, we conducted a few tests with ASM disk groups of different AU sizes – one test with a 1MB AU_SIZE and one test with an 8MB AU_SIZE. The two sections below walk through these tests, followed by a summary at the bottom:
Tests with 1MB ASM AU Size
In this section we’ll perform tests for full-scanning a table when it’s stored in an ASM disk group with a 1MB ASM allocation unit size and compare the results with the same table stored in a tablespace residing on an ASM disk group with a 4MB AU size. The control case was done with SYSTEM.MYOBJ, which has the following characteristics:
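As a rough sketch, the segment-level size characteristics could be pulled from DBA_SEGMENTS with something like the query below; this is an assumption about the method, not the exact statement used for the original listing.

-- Hypothetical query: report the control table's block and byte counts.
SELECT segment_name, blocks, bytes/1024/1024 AS size_mb
FROM   dba_segments
WHERE  owner = 'SYSTEM'
AND    segment_name = 'MYOBJ';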
Let’s create a tablespace in the AU1MDATA_CM01 ASM disk group:
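The exact DDL isn’t reproduced here; a minimal sketch, assuming a locally managed tablespace with a single 10GB datafile (the AU1M_DATA tablespace name and the sizing are placeholders, while AU1MDATA_CM01 comes from the test setup), would look like:

-- Hypothetical DDL: tablespace on the 1MB-AU disk group (sizing assumed).
CREATE TABLESPACE au1m_data
  DATAFILE '+AU1MDATA_CM01' SIZE 10G
  EXTENT MANAGEMENT LOCAL
  SEGMENT SPACE MANAGEMENT AUTO;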
Now we’ll create the copy of MYOBJ:
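A minimal sketch of the copy, assuming a simple CREATE TABLE ... AS SELECT into the new tablespace (the MYOBJ_1M table name and AU1M_DATA tablespace name are placeholders carried over from the sketch above):

-- Hypothetical copy of the control table into the 1MB-AU tablespace.
CREATE TABLE system.myobj_1m
  TABLESPACE au1m_data
  AS SELECT * FROM system.myobj;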
Let’s run our test:
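A sketch of the kind of timed full scan used for the comparison; the hint and table names are assumptions, and the point is simply to force full scans that are eligible for Smart Scan on both copies:

-- Hypothetical test: time full scans of the 1MB-AU copy and the 4MB-AU control table.
SET TIMING ON
SELECT /*+ FULL(t) */ COUNT(*) FROM system.myobj_1m t;
SELECT /*+ FULL(t) */ COUNT(*) FROM system.myobj t;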
As we can see from the above test:
Tests with 8MB ASM AU Size
In this section we’ll perform tests for full-scanning a table when it’s stored in an ASM disk group with an 8MB ASM allocation unit size and compare the results with the same table stored in a tablespace residing on an ASM disk group with a 4MB AU size. The control case was done with SYSTEM.MYOBJ, which has the following characteristics:
Let’s create a tablespace in the AU8MDATA_CM01 ASM disk group:
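As before, the exact DDL isn’t reproduced; a sketch mirroring the 1MB case (datafile sizing and the AU8M_DATA name are assumptions, AU8MDATA_CM01 comes from the test setup):

-- Hypothetical DDL: tablespace on the 8MB-AU disk group (sizing assumed).
CREATE TABLESPACE au8m_data
  DATAFILE '+AU8MDATA_CM01' SIZE 10G
  EXTENT MANAGEMENT LOCAL
  SEGMENT SPACE MANAGEMENT AUTO;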
Now we’ll create our test table:
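Again as a sketch, assuming the same CREATE TABLE ... AS SELECT approach (MYOBJ_8M and AU8M_DATA are placeholder names):

-- Hypothetical copy of the control table into the 8MB-AU tablespace.
CREATE TABLE system.myobj_8m
  TABLESPACE au8m_data
  AS SELECT * FROM system.myobj;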
And finally, run our test:
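The test mirrors the 1MB case, with a timed full scan of the 8MB-AU copy alongside the 4MB-AU control table (the query shape is an assumption):

-- Hypothetical timed full scans for the 8MB-AU comparison.
SET TIMING ON
SELECT /*+ FULL(t) */ COUNT(*) FROM system.myobj_8m t;
SELECT /*+ FULL(t) */ COUNT(*) FROM system.myobj t;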
As we can see from the above test:
Since this test and the previous one with the 1MB ASM allocation unit size showed roughly the same results, and considering that the 8MB AU size would seem likely to yield better response, we’re going to do some additional testing with multiple runs (5 each) to see whether the results hold up beyond a single sample:
We can see from this that the timings per test indeed indicate that things are more efficient (i.e., run faster) with a 4MB AU size, but let’s try to figure out exactly why. Using V$SESSION_EVENT, we can see that the “cell smart table scan” wait event was responsible for most of the wait time, so let’s compare test results:
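A sketch of how the wait data could be pulled for each test session; the query below against V$SESSION_EVENT is an assumption about the method, shown only to illustrate the comparison.

-- Hypothetical query: Smart Scan wait counts and time for the current session.
SELECT event,
       total_waits,
       time_waited_micro / 1e6 AS seconds_waited
FROM   v$session_event
WHERE  sid   = SYS_CONTEXT('USERENV', 'SID')
AND    event = 'cell smart table scan';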
As we can see above, the number and total time of “cell smart table scan” waits were larger for the latter two tests, with the 1MB and 8MB AU sizes. They retrieved roughly the same number of bytes over the interconnect, and the number of bytes and blocks per segment is very nearly the same. If we do some math on the above table, we can see that the number of waits increased for the 1MB and 8MB ASM AU sizes, but the total time didn’t increase linearly with the number of waits, so we could suspect that alignment-boundary issues are the cause of the additional waits.
Based on this, we can conclude that more work is required to satisfy cell IO requests when ASM disk groups don’t use a 4MB AU size. So setting the ASM disk group AU_SIZE attribute to 4MB will clearly yield a better overall result, in this case about a 50% time savings.
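For reference, AU_SIZE is fixed at disk group creation time, so the 4MB AU has to be specified up front. The statement below is a sketch only; the disk group name, disk discovery string, and compatibility attribute values are assumptions, not the settings from this environment.

-- Hypothetical example: creating an Exadata disk group with a 4MB allocation unit.
CREATE DISKGROUP data_cm01 EXTERNAL REDUNDANCY
  DISK 'o/*/DATA_CD_*_cell01'
  ATTRIBUTE 'compatible.asm'          = '11.2.0.0.0',
            'compatible.rdbms'        = '11.2.0.0.0',
            'cell.smart_scan_capable' = 'TRUE',
            'au_size'                 = '4M';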
But Why? More Research …
Knowing that data is stored in 1MB storage regions on Exadata grid disks seems to imply, to me, that CELLSRV would initiate four parallel 1MB IO requests on the storage servers for Smart Scan operations, in order to align with Oracle’s mandate of a 4MB AU_SIZE setting. If we can prove this to be the case, then any other setting for AU_SIZE would be wasteful (although you could argue that you could get marginal gains for sequentially scanning large segments with a larger AU_SIZE).
Let’s take a look at the cellsrv process on one of our Cell server nodes:
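Something along these lines run on the cell would do it; the exact command used isn’t shown, so treat this as an assumption.

# Hypothetical check: find the cellsrv process and its arguments on the storage cell.
ps -ef | grep cellsrv | grep -v grep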
The last process is the cellsrv process, with process ID 11122. The first argument to cellsrv is the number of threads it spawns, in our case, 100. If you run “strace -f -p 11122” and count the distinct PID values, you’ll see the number come out to 100. Sample output of the strace:
And since I saved this output to a text file using “script”, we can get the number of distinct processes that submitted IO requests during my strace tracing interval, like so:
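The exact command isn’t reproduced here; assuming strace -f prefixes each line from a traced thread with a “[pid NNNN]” tag and the capture file name is a placeholder, a count could be obtained with something like:

# Hypothetical post-processing: count the distinct thread PIDs that appear in the capture.
grep -oE '\[pid [0-9]+\]' cellsrv_trace.txt | sort -u | wc -l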
The fact that the trace on cellsrv has calls to io_submit and io_getevents indicates that asynchronous IO calls are being made. In the strace output, the io_getevents calls should show a collection of 1MB IO requests, if indeed we had IO requests that extended beyond a single storage region. An excerpt is provided below:
As you can see from the above, several of the io_getevents calls perform single 1MB reads, as indicated by the “= 1” at the end of the line, but one of the cellsrv threads did seven 1MB reads. Let’s table this investigation for a bit and ask this: do all IO requests do reads in chunks of 1MB? Check out the below:
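One way to pick out the non-1MB completions is to filter the io_getevents lines for result sizes other than 1048576 bytes; the command below is a sketch that assumes the same capture file as before and a strace version that labels the completion size as “res=”.

# Hypothetical filter: show completed IOs whose result size is not 1MB (1048576 bytes).
grep io_getevents cellsrv_trace.txt | grep 'res=' | grep -v 'res=1048576'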
This shows us that several calls did 512-byte reads, some did 4KB reads, and so on, so some IO calls are not in 1MB chunks. Based on this, I’m surmising that these are not true disk IO reads issued to satisfy queries against data in the ASM disk groups. In fact, if we traced cellsrv with no activity at all against our cell server, we’d confirm that these IOs are for non-data-access purposes:
Let’s get back on track. From the tracing that was done while actual IO was being performed, let’s determine how many asynchronous IO calls acted on more than a single 1MB chunk of data (which maps, again, to the size of our storage region).
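Since the trailing “= N” on each io_getevents line is the number of completed requests the call reaped, grouping on that return value gives a quick tally; again, this is a sketch against the assumed capture file.

# Hypothetical tally: group io_getevents calls by the number of IO completions returned.
grep 'io_getevents' cellsrv_trace.txt | awk '{print $NF}' | sort -n | uniq -c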
As we can see from the excerpt above, the number of 1MB IO requests submitted by the cellsrv threads ranged from one to several, but certainly not consistently in units of four. So the assumption that one thread issues four 1MB requests, or that multiple threads issue a total of four to satisfy a chunk of data, doesn’t seem to be true, or at the very least, doesn’t seem to be provable from the cellsrv process thread traces.
There’s one more way to check this 1MB IO size. We know we have 12 drives in each Exadata cell, so let’s dig into some of the details:
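A couple of CellCLI listings along these lines would show the physical disks and the grid disks carved from them; the attribute lists are assumptions about what’s relevant rather than the exact commands from the original investigation.

# Hypothetical CellCLI checks from the storage cell: physical disks and grid disk layout.
cellcli -e "list physicaldisk attributes name, diskType, physicalSize"
cellcli -e "list griddisk attributes name, size, offset"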
Conclusion
From talking with Oracle analysts and researching some of the internals of how ASM allocates extents, here is a summary of why 4MB allocation unit sizes are ideal:
So in summary, 4MB is the optimal “stripe size” for an Exadata storage cell, considering the capabilities of the underlying disks, the fact that IO is performed in units of 1MB, and the desire to be able to do relatively large sequential scans to satisfy cell multiblock physical read or cell smart scan operations.