Why Does Exadata Require a 4MB Allocation Unit Size (AU_SIZE)?

Oracle publishes that Exadata IO works best with a 4MB allocation unit size for ASM disk groups, so ASM disk groups should be created with AU_SIZE=4M.  Further, Oracle also recommends that segment extent sizes be a multiple of 4MB if they’re to be accessed via Smart Scan.  Here, we’ll attempt to demonstrate why this is the case.

What We Think We Know

Here are some facts that we know about Exadata:

  • Data is stored on Grid Disks in 1MB storage regions.
  • When you perform a Smart Scan-eligible query, the IO requests to the underlying Grid Disks will be in units of 1MB, and we can validate this by running a “list activerequest” command from CellCLI while a Smart Scan operation is running (a sketch of such a command follows this list):

 

 

  • We know that CELLSRV is a multi-threaded server process running on the storage cells.
  • Several database initialization parameters that appear to be IO-related on Exadata are set to 4194304 (4MB), including “_db_block_table_scan_buffer_size”.
  • “_db_block_table_scan_buffer_size” is set to 4MB for non-Exadata 11.2 databases as well, so it would appear that this is an 11gR2 default, not necessarily an Exadata-specific one.
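
As a sketch of the validation mentioned in the list above, a CellCLI command along these lines can be run while a Smart Scan is active; the attribute and filter names shown here are from memory and may vary by cell software version, so treat this as an illustration rather than the exact command used:

CellCLI> list activerequest attributes ioType, ioBytes, dbName, objectNumber where ioType = 'predicate pushing'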

In addition to this, we conducted a few tests with ASM disk groups of different AU sizes: one test with a 1MB AU_SIZE and one test with an 8MB AU_SIZE.  See the two sections below and the summary at the bottom:

Tests with 1MB ASM AU Size

In this section we’ll perform tests for full-scanning a table when it’s stored in an ASM disk group with a 1MB ASM allocation unit size and compare with the same table stored in a tablespace residing on an ASM disk group with a 4MB AU size.  The control case was done with SYSTEM.MYOBJ, which has the following characteristics:

  • Table: SYSTEM.MYOBJ
  • Size (GB): 14.8
  • Blocks: 1,940,352
  • Tablespace Name: USERS
  • ASM Disk Group: DATA_CM01
  • ASM AU Size: 4MB

 

Let’s create a tablespace in the AU1MDATA_CM01 ASM disk group:
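
The DDL is along these lines; the tablespace name, datafile size, and storage clauses below are illustrative assumptions rather than the exact statement used:

CREATE TABLESPACE au1m_test
  DATAFILE '+AU1MDATA_CM01' SIZE 20G
  EXTENT MANAGEMENT LOCAL AUTOALLOCATE
  SEGMENT SPACE MANAGEMENT AUTO;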

 

 

Now we’ll create the copy of MYOBJ:
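
A parallel CREATE TABLE ... AS SELECT is the simplest way to build the copy; the table name and parallel degree here are assumptions:

CREATE TABLE system.myobj_1m
  TABLESPACE au1m_test
  NOLOGGING PARALLEL 8
  AS SELECT * FROM system.myobj;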

 

 

Let’s run our test:
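
A sketch of the test looks like this; the hint and the statistics chosen are assumptions, and any full scan that qualifies for Smart Scan would do:

SET TIMING ON

SELECT /*+ full(m) */ COUNT(*)
FROM   system.myobj_1m m;

SELECT sn.name, ROUND(ms.value/1048576) mb
FROM   v$mystat ms, v$statname sn
WHERE  ms.statistic# = sn.statistic#
AND    sn.name IN ('cell physical IO bytes eligible for predicate offload',
                   'cell physical IO interconnect bytes returned by smart scan');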

 

 

As we can see from the above test:

  • The query execution time went from 4.32 seconds to 7.58 seconds with a 1MB AU size.
  • Roughly the same number of bytes were received over the interconnect, which was to be expected.
  • The cell scan efficiency remained the same, so Smart Scan was definitely saving time.
  • A 1MB AU size required more physical IOs and was slower than a 4MB AU size in this test.

 

Tests with 8MB ASM AU Size

In this section we’ll perform tests for full-scanning a table when it’s stored in an ASM disk group with an 8MB ASM allocation unit size and compare with the same table stored in a tablespace residing on an ASM disk group with a 4MB AU size.  The control case was done with SYSTEM.MYOBJ, which has the following characteristics:

  • Table: SYSTEM.MYOBJ
  • Size (GB): 14.8
  • Blocks: 1,940,352
  • Tablespace Name: USERS
  • ASM Disk Group: DATA_CM01
  • ASM AU Size: 4MB

 

Let’s create a tablespace in the AU8MDATA_CM01 ASM disk group:

 

 

Now we’ll create our test table:

 

 

And finally, run our test:
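
These steps mirror the 1MB case, with the AU8MDATA_CM01 disk group substituted; a condensed sketch, with object names again assumed:

CREATE TABLESPACE au8m_test
  DATAFILE '+AU8MDATA_CM01' SIZE 20G;

CREATE TABLE system.myobj_8m
  TABLESPACE au8m_test
  NOLOGGING PARALLEL 8
  AS SELECT * FROM system.myobj;

SET TIMING ON
SELECT /*+ full(m) */ COUNT(*) FROM system.myobj_8m m;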

 

 

As we can see from the above test:

  • The query execution time went from 4.32 seconds to 7.92 seconds with an 8 MB AU size.
  • Approximately the same number of bytes was returned.
  • Using a larger AU size (8MB) didn’t improve scan times; in fact, it degraded them to roughly the same level as the 1MB AU size.  This suggests that a 4MB AU size is indeed ideal on Exadata.

Since this test and the previous one with the 1MB ASM allocation unit size showed roughly the same results, and considering that one might intuitively expect the larger 8MB AU size to yield a better response, we’re going to do some additional testing with multiple runs (5 each) to see whether the results hold beyond a single sample:
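
A simple way to collect the repeated timings is to run each full scan five times from SQL*Plus with timing enabled and note the elapsed time of each run; the table names below are the assumed names from the earlier steps:

SET TIMING ON
-- run each of these five times and record the elapsed time
SELECT /*+ full(m) */ COUNT(*) FROM system.myobj    m;  -- 4MB AU (control)
SELECT /*+ full(m) */ COUNT(*) FROM system.myobj_1m m;  -- 1MB AU copy
SELECT /*+ full(m) */ COUNT(*) FROM system.myobj_8m m;  -- 8MB AU copy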

 

 

We can see from this that the timings per test indeed indicate that things are more efficient (i.e., run faster) with a 4MB AU size, but let’s try to figure out exactly why.  Using V$SESSION_EVENT, we can see that the “cell smart table scan” wait event was the event responsible for most of the wait time, so let’s compare test results:
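
The wait profile itself comes from a query along these lines, run against each test session (here using the current session’s SID):

SELECT event, total_waits, ROUND(time_waited_micro/1000000, 2) seconds_waited
FROM   v$session_event
WHERE  sid = SYS_CONTEXT('userenv', 'sid')
ORDER  BY time_waited_micro DESC;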

 

 

As we can see above, the number and total time of “cell smart table scan” waits were larger with the 1MB and 8MB AU sizes than with the 4MB AU size.  All three tests retrieved roughly the same number of bytes over the interconnect, and the number of bytes and blocks per segment is very nearly the same.  If we do some math on the above table, we can see that the number of waits increased for the 1MB and 8MB ASM AU sizes, but the total time didn’t increase linearly with the number of waits, so we could suspect that alignment-boundary issues are the cause of the additional waits.

Based on this, we can conclude that more work is required to satisfy cell IO requests when ASM disk groups don’t use a 4MB AU size.  Setting the disk group’s AU_SIZE attribute to 4MB clearly yields a better overall result, in this case about a 50% time savings.

But Why?  More Research …

Knowing that data is stored in 1MB storage regions on Exadata Grid Disks seems to imply, to me, that CELLSRV would initiate four parallel 1MB IO requests on the storage servers for Smart Scan operations, in order to align with Oracle’s recommended 4MB AU_SIZE setting.  If we can prove this to be the case, then any other setting for AU_SIZE would be wasteful (although you could argue for marginal gains when sequentially scanning large segments with a larger AU_SIZE).

Let’s take a look at the cellsrv process on one of our Cell server nodes:
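
A simple process listing on the cell shows it; the prompt below assumes the first cell in this environment:

[root@cm01cel01 ~]# ps -ef | grep cellsrv | grep -v grep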

 

 

The last process is the cellsrv process, process ID 11122.  The first argument to cellsrv is the number of threads it spawns, in our case 100.  If you run “strace -f -p 11122” and count the distinct PID values, you’ll see that the number equates to 100.  Sample output of the strace:

 

 

And since I saved this to a text file using “script”, we can get the number of distinct processes that submitted IO requests during my strace tracing interval, like so:
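
The capture and the count look something like this; the trace file name is an assumption, and with strace -f each line is prefixed with the PID of the thread that made the call, so adjust the awk field if your strace version formats that prefix differently:

[root@cm01cel01 ~]# script /tmp/cellsrv_trace.txt
[root@cm01cel01 ~]# strace -f -p 11122
  (press Ctrl-C once the Smart Scan has run for a while, then type exit to close the script session)
[root@cm01cel01 ~]# grep io_submit /tmp/cellsrv_trace.txt | awk '{print $2}' | tr -d ']' | sort -u | wc -l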

 

 

The fact that the trace on cellsrv has calls to io_submit and io_getevents indicates that asynchronous IO calls are being made.  In the strace output, the io_getevents calls should show a collection of 1MB IO requests, if indeed we had IO requests that extended beyond a single storage region.  An excerpt is provided below:

 

 

As you can see from the above, several of the io_getevents calls performed single 1MB reads, as indicated by the “ = 1” at the end of the line, but one of the cellsrv threads did seven 1MB reads.  Let’s table this investigation for a bit and ask this: do all IO requests do reads in chunks of 1MB?  Check out the below:
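
One way to check is to tally the byte counts that strace reports inside the io_getevents completion structures; the exact formatting varies by strace version, so the comma-delimited pattern below is an assumption to adjust against your own trace file (grep -c counts matching lines, which is close enough for a rough tally):

[root@cm01cel01 ~]# for sz in 512 4096 8192 1048576; do
>   echo -n "${sz}-byte completions: "
>   grep io_getevents /tmp/cellsrv_trace.txt | grep -c ", $sz,"
> done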

 

 

This shows us that several calls did 512-byte reads, some did 4KB reads, and so on; some IO calls are not in 1MB chunks, and based on this, I’m surmising that these are not true disk IO reads issued to satisfy queries against data in ASM disk groups.  In fact, if we traced cellsrv with no activity at all against our cell server, we’d confirm that these IOs are for non-data-access purposes:

 

 

 

Let’s get back on track. From our tracing that was done when actual IO was being performed, let’s determine how many asynchronous IO calls acted on more than a single 1MB chunk of data (which maps, again, to the size of our storage region).
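
Since each io_getevents line ends with “= N”, where N is the number of completed requests returned, a quick tally of those return values shows how the completions were batched (trace file name assumed from above; unfinished calls are filtered out because they have no return value on the same line):

[root@cm01cel01 ~]# grep io_getevents /tmp/cellsrv_trace.txt | grep -v unfinished | awk '{print $NF}' | sort -n | uniq -c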

 

 

As we can see from the excerpt above, the number of 1MB IO requests completed by the cellsrv threads ranged from one to several, but certainly not consistently in units of four.  So the assumption that one thread issues four 1MB requests, or that multiple threads issue a total of four to satisfy a chunk of data, doesn’t seem to be true, or at the very least doesn’t seem to be provable from cellsrv process thread traces.

 

One more way to check this 1MB IO size: we know we have 12 drives in each Exadata cell, so let’s dig into some of the details:

 

CellCLI> list lun attributes diskType ,deviceName, lunSize,isSystemLun where disktype = HardDisk
HardDisk /dev/sda 557.861328125G TRUE
HardDisk /dev/sdb 557.861328125G TRUE
HardDisk /dev/sdc 557.861328125G FALSE
HardDisk /dev/sdd 557.861328125G FALSE
HardDisk /dev/sde 557.861328125G FALSE
HardDisk /dev/sdf 557.861328125G FALSE
HardDisk /dev/sdg 557.861328125G FALSE
HardDisk /dev/sdh 557.861328125G FALSE
HardDisk /dev/sdi 557.861328125G FALSE
HardDisk /dev/sdj 557.861328125G FALSE
HardDisk /dev/sdk 557.861328125G FALSE
HardDisk /dev/sdl 557.861328125G FALSE

We can see that the first two are “system LUNs” and the remaining ten are not.  This means the first two drives have a system area on them, which is where the storage server’s OS resides as well as the Exadata storage server metadata repository.  Let’s look at another CellCLI command’s output to see the device mappings:
CellCLI> list celldisk attributes name,deviceName,devicePartition,diskType,size where diskType = HardDisk
CD_00_cm01cel01 /dev/sda /dev/sda3 HardDisk 528.734375G
CD_01_cm01cel01 /dev/sdb /dev/sdb3 HardDisk 528.734375G
CD_02_cm01cel01 /dev/sdc /dev/sdc HardDisk 557.859375G
CD_03_cm01cel01 /dev/sdd /dev/sdd HardDisk 557.859375G
CD_04_cm01cel01 /dev/sde /dev/sde HardDisk 557.859375G
CD_05_cm01cel01 /dev/sdf /dev/sdf HardDisk 557.859375G
CD_06_cm01cel01 /dev/sdg /dev/sdg HardDisk 557.859375G
CD_07_cm01cel01 /dev/sdh /dev/sdh HardDisk 557.859375G
CD_08_cm01cel01 /dev/sdi /dev/sdi HardDisk 557.859375G
CD_09_cm01cel01 /dev/sdj /dev/sdj HardDisk 557.859375G
CD_10_cm01cel01 /dev/sdk /dev/sdk HardDisk 557.859375G
CD_11_cm01cel01 /dev/sdl /dev/sdl HardDisk 557.859375G

The first two cell disks present Linux partitions on /dev/sda3 and /dev/sdb3, respectively.  Based on /etc/fstab, though, we can tell we’re using software RAID mirroring via mdadm, by the device names:
[root@cm01cel01 ~]# cat /etc/fstab
/dev/md6           /                       ext3    defaults,usrquota,grpquota        1 1
tmpfs                   /dev/shm                tmpfs   defaults        0 0
devpts                  /dev/pts                devpts  gid=5,mode=620  0 0
sysfs                   /sys                    sysfs   defaults        0 0
proc                    /proc                   proc    defaults        0 0
/dev/md2              swap                    swap    defaults        0 0
/dev/md8 /opt/oracle ext3 defaults,nodev 1 1
/dev/md4 /boot ext3 defaults,nodev 1 1
/dev/md11 /var/log/oracle ext3 defaults,nodev 1 1

If we select, say, /dev/md11 and run mdadm against it, we can see that /dev/md11 is mirrored across /dev/sda11 and /dev/sdb11:
[root@cm01cel01 ~]# mdadm --misc -D /dev/md11
/dev/md11:
        Version : 0.90
  Creation Time : Mon Feb 21 13:06:29 2011
     Raid Level : raid1
     Array Size : 2433728 (2.32 GiB 2.49 GB)
  Used Dev Size : 2433728 (2.32 GiB 2.49 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 11
    Persistence : Superblock is persistent
    Update Time : Sun Mar  4 19:11:49 2012
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0
           UUID : 9d76d724:5a2e31a1:fa34e9e7:a875f020
         Events : 0.76
    Number   Major   Minor   RaidDevice State
       0       8       11        0      active sync   /dev/sda11
       1       8       27        1      active sync   /dev/sdb11

But enough about the OS and system LUNs.  We know we’re using an LSI MegaRAID controller inside each storage cell by checking the output of lsscsi -v (output truncated):
[root@cm01cel01 ~]# lsscsi -v
[0:2:0:0]    disk    LSI      MR9261-8i        2.12  /dev/sda
  dir: /sys/bus/scsi/devices/0:2:0:0  [/sys/devices/pci0000:00/0000:00:05.0/0000:13:00.0/host0/target0:2:0/0:2:0:0]
[0:2:1:0]    disk    LSI      MR9261-8i        2.12  /dev/sdb
  dir: /sys/bus/scsi/devices/0:2:1:0  [/sys/devices/pci0000:00/0000:00:05.0/0000:13:00.0/host0/target0:2:1/0:2:1:0]
[0:2:2:0]    disk    LSI      MR9261-8i        2.12  /dev/sdc
  dir: /sys/bus/scsi/devices/0:2:2:0  [/sys/devices/pci0000:00/0000:00:05.0/0000:13:00.0/host0/target0:2:2/0:2:2:0]
[0:2:3:0]    disk    LSI      MR9261-8i        2.12  /dev/sdd
  dir: /sys/bus/scsi/devices/0:2:3:0  [/sys/devices/pci0000:00/0000:00:05.0/0000:13:00.0/host0/target0:2:3/0:2:3:0]
[0:2:4:0]    disk    LSI      MR9261-8i        2.12  /dev/sde
  dir: /sys/bus/scsi/devices/0:2:4:0  [/sys/devices/pci0000:00/0000:00:05.0/0000:13:00.0/host0/target0:2:4/0:2:4:0]

So we can also use MegaCLI64 to query disk details:
[root@cm01cel01 ~]# /opt/MegaRAID/MegaCli/MegaCli64 -ShowSummary -aALL | more
System
        OS Name (IP Address)       : Not Recognized
        OS Version                 : Not Recognized
        Driver Version             : Not Recognized
        CLI Version                : 8.00.23
Hardware
        Controller
                 ProductName       : LSI MegaRAID SAS 9261-8i(Bus 0, Dev 0)
                 SAS Address       : 500605b002f4aac0
                 FW Package Version: 12.12.0-0048
                 Status            : Optimal
        BBU
                 BBU Type          : Unknown
                 Status            : Healthy
        Enclosure
                 Product Id        : HYDE12
                 Type              : SES
                 Status            : OK
                 Product Id        : SGPIO
                 Type              : SGPIO
                 Status            : OK
        PD
                Connector          : Port 0 – 3<Internal><Encl Pos 0 >: Slot 11
                Vendor Id          : SEAGATE
                Product Id         : ST360057SSUN600G
                State              : Online
                Disk Type          : SAS,Hard Disk Device
                Capacity           : 557.861 GB
                Power State        : Active
                Connector          : Port 0 – 3<Internal><Encl Pos 0 >: Slot 10
                Vendor Id          : SEAGATE
                Product Id         : ST360057SSUN600G
                State              : Online
                Disk Type          : SAS,Hard Disk Device
                Capacity           : 557.861 GB
                Power State        : Active
                Connector          : Port 0 – 3<Internal><Encl Pos 0 >: Slot 9
                Vendor Id          : SEAGATE
                Product Id         : ST360057SSUN600G
                State              : Online
                Disk Type          : SAS,Hard Disk Device
                Capacity           : 557.861 GB
                Power State        : Active

Now we can query additional disk details by supplying additional arguments:
[root@cm01cel01 ~]# /opt/MegaRAID/MegaCli/MegaCli64 -cfgDsply -aALL | egrep '(DISK|tripe)'
Number of DISK GROUPS: 12
DISK GROUP: 0
Stripe Size         : 1.0 MB
DISK GROUP: 1
Stripe Size         : 1.0 MB
DISK GROUP: 2
Stripe Size         : 1.0 MB
DISK GROUP: 3
Stripe Size         : 1.0 MB
DISK GROUP: 4
Stripe Size         : 1.0 MB
DISK GROUP: 5
Stripe Size         : 1.0 MB
DISK GROUP: 6
Stripe Size         : 1.0 MB
DISK GROUP: 7
Stripe Size         : 1.0 MB
DISK GROUP: 8
Stripe Size         : 1.0 MB
DISK GROUP: 9
Stripe Size         : 1.0 MB
DISK GROUP: 10
Stripe Size         : 1.0 MB
DISK GROUP: 11
Stripe Size         : 1.0 MB

As we can see, each disk has a 1MB stripe size.  This is why we see 1MB IO requests when tracing cell server processes.  What does this, by itself, imply in relation to Oracle’s 4MB AU size?  Nothing directly; it’s all about the geometry of the drives and the ideal IO size to drive maximum bandwidth.

Conclusion

Based on conversations with Oracle analysts and some research into how ASM allocates extents, here is a summary of why a 4MB allocation unit size is ideal:

  • The AU size governs how much data Oracle/ASM will write on one disk before going to the next disk in the ASM disk group that contains the object.
  • Based on the physical mechanics of the disk drives in an Exadata storage cell, the disks are able to sequentially scan data in chunks of 4MB more efficiently than in any smaller size.
  • Depending on the size of the sequential IO, we could theoretically achieve more IO bandwidth (MBPS) with a larger AU_SIZE, which would imply that more data is written to one disk before moving to the next disk in the ASM disk group, but beyond 4MB the overall time savings and MBPS delivered are very nominal, if not detrimental.

So in summary, 4MB is the optimal “stripe size” for an Exadata storage cell, considering the capabilities of the underlying disks, the fact that IO is performed in units of 1MB, and the desire to perform relatively large sequential scans to satisfy cell multiblock physical read or cell smart scan operations.