Where Do Our Extents Reside on ASM Grid Disks in Exadata?

We had quite a bit of discussion about where our extents reside on our Exadata Grid Disks.  The question was posed – if Exadata farms out IO requests to each cell in our storage grid, how does it know which extents reside on which disks? How do extents get placed on grid disks? And is Exadata intelligent enough to know exactly where to look for the extents, or will it simply search every cell for matching extents and let the result set determine where the data resides?
This question boils down to how ASM allocates extents according to its redundancy level and how these get balanced on the underlying grid disks that make up the ASM disk groups.
But first, let’s check cell disk metrics from cellcli:




These cellcli requests show the IO, for both small (under 128 KB) and large (over 128 KB) reads, across all cell disks on all three storage servers.  As you can see, the numbers are pretty close to each other across the cells.  You’ll see larger variations for small reads, especially on the first two cell disks, for two reasons.  First, the first two disks have less space available for “data” because of their 29 GB system area reservation.  Second, some reads are satisfied from flash cache, and a Smart Flash Cache hit means no disk IO occurs.

So to answer the question “do Exadata read requests know ahead of time where to find the data on full scans”, the answer is no.  The iDB messages are sent to CELLSRV and read requests are issued evenly across all cells.  These read IOs may or may not be satisfied proportionally by the ASM extents in the underlying Grid Disks, but the storage system won’t know this ahead of time without scanning them.  Once it has scanned them, it could place these extents in Flash Cache, so later requests for the same data would not necessarily “go to disk” in even proportions across all cells and grid disks.

But let’s look at our ASM extent layouts and understand this in a little more detail.  First, let’s query our disk groups from ASM and check for imbalances:



As we can see, for DATA_CM01 and RECO_CM01, we’re achieving nearly 100% extent-allocation balance, which is what we’d expect (unless ASM had a bug).  Note the disk group redundancy – it’s “NORMAL” for all disk groups.
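
One way to put a number on “balance” is to compare the fullest and the emptiest disk in a disk group. The sketch below is not the query used above; it is a minimal illustration that assumes you have per-disk total and free megabytes (the kind of figures V$ASM_DISK reports as TOTAL_MB and FREE_MB). The function name and the sample figures are made up for the example.

```python
def percent_imbalance(disks):
    """Spread in used fraction between the fullest and emptiest disk,
    expressed as a percentage of the fullest disk's used fraction.

    disks: list of (total_mb, free_mb) tuples, one per ASM disk.
    """
    used_fractions = [(total - free) / total for total, free in disks]
    fullest = max(used_fractions)
    emptiest = min(used_fractions)
    if fullest == 0:
        return 0.0  # nothing allocated anywhere, so nothing to be imbalanced
    return 100.0 * (fullest - emptiest) / fullest


# Hypothetical figures: three equally sized disks with nearly equal usage.
sample = [(1000, 500), (1000, 502), (1000, 499)]
print(f"imbalance: {percent_imbalance(sample):.2f}%")
```

A value near zero corresponds to the nearly 100% balance described above; a large value would suggest a rebalance has not completed or that the disks differ in size.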

Now let’s check failure group partnerships:



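An interesting thing to note here is that there are 3 failure groups per disk group, even though we created the disk group with “Normal redundancy”.  Normal redundancy sets the number of copies of each extent (two), not the number of failure groups; on Exadata each storage cell is its own failure group, so three cells give three failure groups.  Checking our ASM alert log shows that we did in fact create the disk groups with normal redundancy.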



Looking at our ASM disk details, with disk numbers and failure groups:



Let’s keep these disk numbers in mind, as they’ll be important later in this section…

One thing we can say already though, from our queries above, is that on a quarter rack we’ve got three failure groups.  So each ASM extent will be mirrored on one of 8 partner disks on the other cells.  But in the event of a cell failure, there needs to be sufficient free space on the surviving cells to hold new mirror copies of the extents that lived on the failed cell, which means that you can’t fill the ASM disk groups up as much as you can on a half rack or full rack.  On a half rack or full rack, the failed cell’s extents are spread over more surviving storage servers, so a smaller share of the total capacity has to be held in reserve.
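
To see why a quarter rack has to hold back proportionally more space, here is a rough back-of-the-envelope sketch. It assumes equally sized cells with data spread evenly across them, so losing one of N cells means 1/N of the capacity must be kept free. The cell counts for half and full racks (7 and 14) are an assumption about the rack generation, and the function name is hypothetical; the figure ASM actually enforces is whatever V$ASM_DISKGROUP reports.

```python
def reserve_fraction(cell_count):
    """Fraction of raw capacity to keep free so that the extents on one
    failed cell can be re-mirrored onto the surviving cells.

    Simplified model: equal-sized cells, evenly spread data.
    """
    if cell_count < 2:
        raise ValueError("mirroring across cells needs at least two cells")
    return 1.0 / cell_count


# Assumed storage cell counts per rack size (3 matches the quarter rack
# in this post; 7 and 14 are assumptions for half and full racks).
RACK_CELL_COUNTS = {"quarter": 3, "half": 7, "full": 14}

for rack, cells in RACK_CELL_COUNTS.items():
    print(f"{rack:>7} rack: {cells:2d} cells, reserve about {reserve_fraction(cells):.1%}")
```

Under this model a quarter rack gives up roughly a third of its capacity to survive a cell failure, while a full rack gives up well under a tenth.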

If we look at V$ASM_DISKGROUP we can see how much free space must be available on each disk group to survive loss of a cell:



So let’s take a look at one of our data files from the SOE tablespace, and determine the ASM extent maps.  First, let’s find a file ID:



So we’ve got files 274 and 300 for the SOE tablespace.



We’ve got 36 disks that make up the DATA_CM01 disk group, as we expected based on our output above and the fact that we’re on a quarter rack.  Let’s find how many partners each disk has:



I’ve left off some of the output above, but you can see that each disk has 8 partners to which it could write mirrored extents.  It won’t necessarily use all of them, but it has the flexibility to.

Let’s check the extent map for one of our SOE tablespace files, file #300.  We’ll just check the first 10 virtual extents:



This shows two extent copies per virtual extent: copy 0 is the primary extent and copy 1 is its mirror in a different failure group.  Virtual extent 0, for example, resides on disk 33, AU 16063.  Disk 33 is on cell disk 5 on cm01cel01.  Its mirrored extent is on disk 8, which resides on cell disk 6 on cm01cel03.  If you were to check the extent copies for each of the extents, you’d find that the mirror always resides on a different cell, and further, that the number of primary extents and mirror copies is balanced pretty much perfectly across all three storage cells.

Also, if you added up the number of primary extents per file and multiplied by the AU size, you’d get the size of the file.  In the case of file #300 in our XBM database, there are 2877 primary extents, and thus 2877 4 MB AUs.
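
The two properties described here (every mirror on a different cell, and copies spread evenly across cells) are easy to check once the extent map has been pulled out of the database. The sketch below assumes the rows have already been fetched into tuples; only the two rows for virtual extent 0 come from the text, the remaining rows are hypothetical, and the function names are invented for the example.

```python
from collections import Counter

# (virtual_extent, copy_number, disk_number, cell)
# Only virtual extent 0 is taken from the text; the rest is hypothetical.
EXTENT_COPIES = [
    (0, 0, 33, "cm01cel01"),
    (0, 1, 8, "cm01cel03"),
    (1, 0, 20, "cm01cel02"),  # hypothetical
    (1, 1, 30, "cm01cel01"),  # hypothetical
    (2, 0, 5, "cm01cel03"),   # hypothetical
    (2, 1, 15, "cm01cel02"),  # hypothetical
]


def mirrors_on_separate_cells(copies):
    """True if no virtual extent has two copies on the same cell."""
    cells_by_extent = {}
    for extent, _copy, _disk, cell in copies:
        cells_by_extent.setdefault(extent, []).append(cell)
    return all(len(set(cells)) == len(cells) for cells in cells_by_extent.values())


def copies_per_cell(copies, copy_number):
    """Count how many copies with the given copy number land on each cell."""
    return Counter(cell for _ext, copy, _disk, cell in copies if copy == copy_number)


print("mirrors on separate cells:", mirrors_on_separate_cells(EXTENT_COPIES))
print("primary copies per cell:", dict(copies_per_cell(EXTENT_COPIES, 0)))
print("mirror copies per cell:", dict(copies_per_cell(EXTENT_COPIES, 1)))
```

With the real extent map for a file, roughly equal counts per cell for both copy 0 and copy 1 would be the balance described above.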

Multiplying 2877 by 4 MB gives 11,508 MB, or roughly 11.2 GB.  And if we check in the XBM database:

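The extent-count arithmetic for file #300 can be reproduced in a few lines. This is only a worked calculation using the two figures from the text (2877 primary extents, 4 MB allocation units) and assumes one AU per extent, as stated above; the names are illustrative.

```python
AU_SIZE_MB = 4          # allocation unit size from the text
PRIMARY_EXTENTS = 2877  # primary extent count for file #300 from the text


def file_size_mb(primary_extents, au_size_mb):
    """File size implied by the extent count, assuming one AU per extent."""
    return primary_extents * au_size_mb


size_mb = file_size_mb(PRIMARY_EXTENTS, AU_SIZE_MB)
size_gb = size_mb / 1024
print(f"{size_mb} MB = {size_gb:.3f} GB")  # prints "11508 MB = 11.238 GB"
```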

So, in summary:

  • Cell IO statistics show that IO is balanced across cells.
  • Queries against X$KFFXP confirm this by showing that primary and mirror extents are balanced across the 3 physical cells.