Exadata Smart Flash Cache: Centroid Webinar Follow-Up

Centroid’s October 25th, 2011 webinar that covered Exadata Smart Flash Cache generated a number of questions. In this blog I’ll do my best to answer these questions, and I’ll list the questions exactly as posed during the webinar.
Question: How do I increase the effectiveness of Smart Flash Cache?
This is sort of an open-ended question, and it really depends on what is meant by “increased effectiveness”. By default, Smart Flash Cache will automatically cache segments it finds to be “suitable” for caching based on table size and access method (i.e., read via Smart Scan or single-block reads). So as long as your SQL load profile matches candidates that are well suited for Smart Flash Cache, it should just work. Period. This being said, there are several ways you can determine what sort of benefit you’re seeing from Smart Flash Cache, and I’ll go into this below.

First, let’s do a simple script to see which sessions are benefiting from Smart Flash Cache. The query below shows any sessions that have a non-zero value for “cell flash cache read hits”:
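A minimal sketch of such a query, assuming the usual V$SESSION, V$SESSTAT, and V$STATNAME joins (adjust the column list to taste):

    -- sessions with a non-zero "cell flash cache read hits" statistic
    select s.sid,
           s.username,
           st.value as flash_cache_read_hits
    from   v$session  s,
           v$sesstat  st,
           v$statname sn
    where  st.sid        = s.sid
    and    st.statistic# = sn.statistic#
    and    sn.name       = 'cell flash cache read hits'
    and    st.value      > 0
    order  by st.value desc;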

You can also look at historical data from AWR to generate a report showing cell flash cache read hits summed per day – disregard the first row due to the lag function and the nature of the data:
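A sketch of that sort of AWR query, assuming access to DBA_HIST_SYSSTAT and DBA_HIST_SNAPSHOT (the statistic is cumulative and resets on instance restart, hence the caveat above):

    -- daily "cell flash cache read hits" from AWR
    select snap_day,
           hits - lag(hits) over (order by snap_day) as daily_hits
    from  (select trunc(sn.end_interval_time) as snap_day,
                  max(st.value)               as hits
           from   dba_hist_sysstat  st,
                  dba_hist_snapshot sn
           where  st.snap_id         = sn.snap_id
           and    st.dbid            = sn.dbid
           and    st.instance_number = sn.instance_number
           and    st.stat_name       = 'cell flash cache read hits'
           group  by trunc(sn.end_interval_time))
    order  by snap_day;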

 

The examples above show you how to measure cell flash cache read hits per session and system-wide, but this doesn’t really give you a clue as to how “efficient” Smart Flash Cache is. The storage cell does, however, maintain multiple Flash Cache related metrics that can be used to gauge efficiency, and these all have an objectType metadata attribute of ‘FLASHCACHE’. See below:
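From CELLCLI on a storage cell, something along these lines lists them (exact metric names vary a bit by cell software version):

    CellCLI> list metricdefinition where objectType = 'FLASHCACHE' detail
    CellCLI> list metriccurrent where objectType = 'FLASHCACHE'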

 

There are a few ways to analyze this data if you’re interested in Smart Flash Cache efficiency:

  • Comparing IOs satisfied in Flash Cache (FC_IO_BY_R) vs. IOs satisfied from normal hard drives (CD_IO_BY_R_SM + CD_IO_BY_R_LG)
  • Comparing IOs satisfied by Flash Cache (FC_IO_BY_R) to the sum of FC_IO_BY_R and the IO that did not find all of its data in Flash Cache (FC_IO_BY_R_MISS)

Let’s look at the first example:
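The cumulative megabytes read from Flash Cache can be pulled with something like:

    CellCLI> list metriccurrent FC_IO_BY_R detail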

 

From the above, we can see that we’ve cumulatively satisfied 23,592 MB from Flash Cache. Now:
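The corresponding cell disk read metrics can be listed with something like this (the pattern also matches the per-second variants, which you can ignore for this calculation):

    CellCLI> list metriccurrent where name like 'CD_IO_BY_R_.*'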

 

If you add up the numbers for small and large IO read requests from Cell disks, we get 4,705,467 MB of IO satisfied from cell disk access, so our Flash Cache efficiency is a dismal (23,592/4,705,467) = 0.5%. Keep in mind, this is on an R&D Exadata environment, so your mileage will almost definitely be better!

Now let’s look at the second scenario – let’s calculate the efficiency for reads in which the entire IO request was satisfied in Flash Cache to the total of Flash Cache hits + Flash Cache misses.

To put into formula:

Efficiency = FC_IO_BY_R / (FC_IO_BY_R + FC_IO_BY_R_MISS)

= 23,592 / (23,592 + 17,092) = 58%

Another aspect to look out for is whether your large segments, which could potentially be read via full-table scans and satisfied via Smart Scan, are being loaded into Flash Cache and consequently providing performance benefits on subsequent reads. A potential strategy for identifying this type of efficiency could be:

  • Find the top segments by IO using DBA_HIST_SEG_STAT
  • Identify the OBJECT_ID for these objects from DBA_OBJECTS
  • List cell statistics for the OBJECT_IDs in question using “list flashcachecontent” in CELLCLI.

Let’s walk through an example. First, let’s grab a list of OBJ#’s from DBA_HIST_SEG_STAT:
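A sketch of such a query, using the physical read delta columns in DBA_HIST_SEG_STAT:

    -- top 10 segments by physical reads recorded in AWR
    select *
    from  (select obj#,
                  dataobj#,
                  sum(physical_reads_delta) as phys_reads
           from   dba_hist_seg_stat
           group  by obj#, dataobj#
           order  by phys_reads desc)
    where rownum <= 10;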

The actual object name is not really important for the sake of this example, but you could obviously join to DBA_OBJECTS or SYS.OBJ$ to get the object name. For now, let’s focus on the Top 3: objects 15891, 477, and 476. Now, let’s query our cell server metrics:
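On the storage cells, a listing along these lines does the trick (repeat for 477 and 476, or wildcard the filter):

    CellCLI> list flashcachecontent where objectNumber = 15891 attributes objectNumber, cachedSize, hitCount, missCount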

As you can see, we do have data for each of our Top 3 segments cached in Flash Cache, which is a good thing!

I’ve covered a couple of topics here, but let’s summarize. To improve Smart Flash Cache efficiency:

  • Use your PCI flash entirely for Smart Flash Cache unless and until you prove you should do otherwise. Factors that may lead you to do otherwise include a) long-term evidence that you’re not using your entire flash storage, or b) specific IO requirements on particular segments or redo logs that motivate you to use flash-based Grid Disks
  • “Pin” segments to Flash Cache using CELL_FLASH_CACHE=KEEP sparingly and on an as-needed basis. For every segment you keep, you’re reducing valuable flash storage and reducing the potential benefit of Smart Flash Cache. The same argument holds true for your database buffer cache, in our opinion
  • Monitor Flash Cache metrics from CELLCLI regularly to get a feel for how much it’s actually helping you, and if you have spare flash storage and see large, full-scanned objects missing from “flashcachecontent”, consider keeping them.

Question: How is Smart Flash Cache different from Cache currently available in System and Buffer Cache? What is the difference between the two?

Smart Flash Cache is an extra layer of cache that resides on the storage tier – think of it as analogous to the cache that exists on most mid-range and enterprise-class storage arrays. Except it’s generally better, for a few reasons. First, it’s engineered specifically for Oracle, which means that only data an Oracle database would typically consider “valuable” to cache (i.e., likely to be requested again) is cached. Disk array cache is typically an inflexible, non-adjustable, on-or-off type of cache that basically accepts and stores all types of data, including blocks read (and written) for backups, Data Pump exports and imports, etc. Second, there’s a lot of it – 5.4 TB on an Exadata Full Rack. And while this number alone probably doesn’t offset the financial investment of an Exadata Database Machine when compared to traditional storage arrays, all the other software features of Exadata certainly strengthen the financial argument. Smart Flash Cache thus provides a very attractive, very powerful set of performance features.

But let’s talk about the database buffer cache and the “system” cache, which I’m assuming means the file-system buffer cache. Oracle’s database buffer cache provides a shared memory area through which blocks pass to satisfy Oracle single-block read and write requests. Access to the buffer cache is typically very fast, but there’s a finite amount of buffer cache that can be configured in an Oracle environment, as specified with the DB_CACHE_SIZE initialization parameter.

There are several cases in which IO requests cannot be satisfied in the buffer cache:

  • When aggregate IO requests across all users saturate available space in the configured buffer cache
  • When the nature of the IO is such that blocks are aged out more frequently than the demand for blocks
  • When blocks are read via the serial direct read mechanism, which is the default for serial full scans of large segments in 11g and the default for parallel query in all recent versions of Oracle
  • After the instance is bounced

Smart Flash Cache can address these scenarios by providing an extra, very large, intelligent cache on the storage grid to cushion the potential blow of not being able to find a buffer in the local or remote (RAC) buffer cache. And like the buffer cache, it avoids the need to perform a physical disk IO. As demonstrated in the webinar, blocks/segments resident in Smart Flash Cache remain valid across instance bounces, which can help provide a consistent performance experience for users in the event of an instance outage and/or maintenance.

So is Smart Flash Cache just as good as the buffer cache, and why do we need a buffer cache at all? Well, the answer is no – accessing a block from the buffer cache is nearly always faster, except possibly in situations where the overhead of inter-instance block shipping causes interconnect delays. And with Exadata and InfiniBand serving as the fabric for the RAC interconnect, that is really a “RAC vs. no RAC” argument, not a “buffer cache vs. Smart Flash Cache” argument.

Let’s put this into perspective by showing a list of the average IO service times from different types of storage or memory:

  • L2 CPU cache: ~1 nanosecond
  • Virtual memory: ~1 microsecond – this is where the database buffer cache plays, more or less
  • NUMA far memory: ~10 microseconds – this could also be where the buffer cache plays
  • Flash memory (PCI): ~0.01 milliseconds – this is where Smart Flash Cache plays
  • Flash memory (networked): ~0.1 milliseconds – this is where disk array flash plays (EMC FastCache, etc.)
  • Disk IO: ~5-10 milliseconds

From the numbers above, clearly a PCI flash solution provides better access times than disk access, but accessing a buffer from the buffer cache will be a couple of orders of magnitude faster – so don’t configure your buffer cache down to small values just because you’ve got a really fast Smart Flash Cache solution on the Exadata storage servers – especially if your application has some OLTP-ish behavior.

To answer the question about how Smart Flash Cache differs from system cache, or file-system buffer cache – in principle they achieve the same goal: adding an additional caching layer on top of the database buffer cache. However, with Oracle ASM, IO is all direct anyway and bypasses any system buffer cache, so it’s really not a good comparison. Further, with ASM on Exadata, the actual IO calls happen on the storage servers, so any system cache on the compute nodes is largely irrelevant. But if we want to compare buffered IO on non-ASM Oracle systems to Smart Flash Cache, the former is an unintelligent, dynamic cache that can provide performance gains for some types of multi-block reads, but it is not generally a feature that can be relied on for consistent performance, and it often presents a layer of overhead with respect to Oracle databases.

Question: Can other storage arrays be connected to the Exadata Database Machine via the InfiniBand or the Cisco switch?

In short, the answer is yes, with some caveats. Here’s a quick little summary:

  • You can connect the Exadata database servers (compute nodes) to external storage arrays via iSCSI or NFS protocols and theoretically use this storage for database files.
  • If the external storage array has InfiniBand support, you could in theory access it over your IB network. I personally haven’t tried this.
  • You can store database files on external arrays if you’re able to connect via iSCSI or NFS, and while it’s discouraged because it upsets the overall balanced architecture of Exadata, it can be a way to migrate data into an Exadata environment.
  • FCoE is not supported – I haven’t tried it, but it’s clearly documented in the Exadata Owner’s Guide.
  • You cannot connect 3rd party switches to the InfiniBand network. Again, this is documented under the restrictions section of the Owner’s Guide.

Question: Is LIBCELL available on all 11gR2 database releases (even non-Exadata)?

The libcell library that supports iDB communications does exist on non-Exadata systems, but the code is only used if you’re on Exadata. The libcell11.so library is linked with the oracle kernel in both the RDBMS and ASM homes, so let’s walk through this in both an Exadata environment and a non-Exadata Oracle 11gR2 environment.

First, let’s look at $ORACLE_HOME/rdbms/lib/env*mk in the non-Exadata environment:
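A simple grep is enough here – I’m assuming the stock env_rdbms.mk file name, which matches the env*mk pattern above:

    $ grep -i libsage $ORACLE_HOME/rdbms/lib/env_rdbms.mk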

Now let’s look at the same on Exadata:

What the above shows are the “LIBSAGENAME” link directives for the binaries compiled under these homes, i.e., “oracle”. For those curious about what “SAGE” means, it’s the historical Oracle code name for the development initiative that eventually led to Exadata. The output shows that in both cases, binaries are linked with a “cell$(RDBMS_VERSION)” library.

So if we go to $ORACLE_HOME/lib, we see this:
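For example, on either system:

    $ ls -l $ORACLE_HOME/lib/libcell*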

 

 

We can see that in both cases, there are libcell11.so libraries in the 11.2 ORACLE_HOME, and since the makefile link directives utilize these libraries, we can infer that both Exadata and non-Exadata installations are linked with libcell.

We can double-check this by running strings on, or doing a binary edit of, the “oracle” executable in each case:
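For instance, a quick strings pass (or ldd, if you prefer) against the binary:

    $ strings $ORACLE_HOME/bin/oracle | grep -i libcell
    $ ldd $ORACLE_HOME/bin/oracle | grep -i libcell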

 

 

Let’s take it a step further. On my Exadata, let’s see if libcell11.so is in use:
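One way to check, assuming lsof is installed (the same check works on the non-Exadata host below):

    $ lsof | grep libcell11.so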

 

Now let’s check our non-Exadata 11.2 environment:

 

In both cases, we can see libcell11.so is being used by oracle processes, so we can confirm that libcell is linked with the oracle kernel for both Exadata and non-Exadata systems. This being the case, it stands to reason that code exists in the libcell11.so library that determines whether the system is an Exadata system or not.

On our Exadata compute-node machine, if we take a peek using bvi at libcell11.so, we see some stuff like this:
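If you’d rather not fire up a binary editor, a strings pass shows similar references – here I’m just searching for the cell configuration file names:

    $ strings $ORACLE_HOME/lib/libcell11.so | egrep -i 'cellinit|cellip'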

 

Knowing that Exadata systems rely on configurations in /etc/oracle/cell/network-config/cellinit.ora and /etc/oracle/cell/network-config/cellip.ora to identify the IB storage network and the storage cell nodes, respectively, it appears as if the ossconf.o object file is likely part of libcell11.so in Exadata systems and probably not in non-Exadata systems.

Doing a search for this in our Exadata system shows:
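A search along these lines will turn it up:

    $ find $ORACLE_HOME -name "*ossconf*" 2>/dev/null
    $ ar -t $ORACLE_HOME/lib/libcell11.a | grep ossconf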

From the above, it appears as if ossconf.o is part of libcell11.a, an archive from which libcell11.so is constructed. Looking inside ossconf.o, we see several interesting things:
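To poke at it, you can extract the object from the archive and dump its printable strings, roughly like this:

    $ cd /tmp
    $ ar -x $ORACLE_HOME/lib/libcell11.a ossconf.o
    $ strings ossconf.o | egrep -i 'cellinit|cellip'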

None of this exists on the non-Exadata 11gR2 systems I’ve got access to. And while I haven’t been able to trace exactly how this code is written, the observations I’ve been able to make seem to indicate:

  • The libcell library, and specifically libcell11.so, exists in both Exadata and non-Exadata 11.2.0.2 environments (I’ve checked a non-Exadata platform on AIX 6.1 and OEL 5.6)
  • The oracle binaries are linked with libcell11.so
  • In Exadata environments, cellinit.ora and cellip.ora files exist and they do not in non-Exadata environments
  • From what I can tell, ossconf.o is an object file inside libcell11.a on Exadata environments. I have not found this on non-Exadata environments, but this could simply be because Exadata patches don’t exist or haven’t been applied to these environments
  • So in short – either the simple existence of the cellinit.ora and cellip.ora files causes code to be executed for iDB in Exadata and not in non-Exadata, or libcell11.so on Exadata systems has extra code bits from ossconf.o built in as part of it.
  • Regardless, Oracle documentation clearly states that libcell provides mechanisms to speak iDB – this much we know for certain.

Question: Is LIBCELL also present on the Exadata Storage Server nodes?

Yes, libcell does exist on the Exadata storage server nodes. It’s used by CELLSRV to process iDB messages:
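A quick way to see this on a cell, assuming lsof is available (the cellsrv binary path and process details vary by cell software version):

    # show libcell mapped into the running cellsrv process
    $ lsof -p $(pgrep -x cellsrv | head -1) | grep -i libcell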

Question: Can a cell disk span multiple LUNS and/or multiple physical disks?

No. The hierarchy goes like this:

Physical Disk => LUN => Cell Disk => Grid Disk => <ASM Disk Group>

 

But it’s important to understand how database extents are allocated within tablespaces in ASM disk groups in order to appreciate the Exadata storage server design. Cell disks are simply the usable storage within a LUN that’s available for Grid Disks, and Grid Disks are what ASM disk groups are built upon. If an administrator configures Grid Disks and ASM disk groups “appropriately”, you’ll generally get the ideal, balanced storage configuration. And it’s worth mentioning that “appropriately” is the easiest and simplest way for an administrator to do things on Exadata – you don’t have to think too hard to get it right, whereas in most other storage array scenarios, if you don’t think real hard you’ll almost certainly get it wrong.

Let’s take a look at some CELLCLI listings to show how it works. First, let’s see our physical disks:
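In CELLCLI on the cell, for example (filtering out the flash disks):

    CellCLI> list physicaldisk where diskType = 'HardDisk'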

As you can see, we’ve got 12 drives, as advertised. Now let’s look at the LUNs on these physical disks – I’m omitting flash disks in these examples, though they would otherwise show up too:
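Something like:

    CellCLI> list lun where diskType = 'HardDisk'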

Now, for our cell disks:
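Likewise:

    CellCLI> list celldisk where diskType = 'HardDisk'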

By now you can see that there’s a one-to-one mapping between cell disks, LUNs, and physical disks. If we look at one of these in detail, we can see how it’s mapped to a LUN and what the byte offsets are – in this case we’ve created this cell disk as an interleaving cell disk:
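For example (the cell disk name here is from my cell – yours will differ):

    CellCLI> list celldisk CD_03_cm01cel01 detail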

Now let’s look at our Grid Disks:
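For example:

    CellCLI> list griddisk attributes name, cellDisk, size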

If you look at the pattern of the grid disk names, you’ll be able to see that there are 3 Grid Disks on each cell disk. So for example, on the CD_03 cell disk, you’ll see DATA_CD_03_cm01cel01, DBFS_CD_03_cm01cel01, and RECO_CD_03_cm01cel01. Each one of these Grid Disks is built on specific tracks in the Cell Disk.

In the above, I’m only looking at one storage cell – if you examined all 3 cells in my quarter rack, you’d see identical Grid Disk configurations:
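One way to check is dcli with a group file of cell hostnames – the cell_group file name and celladmin user here are illustrative:

    $ dcli -g cell_group -l celladmin cellcli -e "list griddisk attributes name, size"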

If you’re wondering how I’ve been able to create such a great balance of Grid Disk configurations across all cells, here’s how:
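The command takes this general form – the size is whatever fits your layout:

    CellCLI> create griddisk all harddisk prefix=DATA, size=<size per cell disk>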

The above command creates a Grid Disk on all Cell Disks in a storage cell, each with the same size. And of course, you could use dcli to do this across all storage cells simultaneously.

The final step is to create an ASM disk group on these Grid Disks, which is done like this:
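From the ASM instance, the statement looks roughly like this (the attribute values are illustrative, though a 4 MB AU is the norm on Exadata):

    SQL> create diskgroup DATA_CM01
           normal redundancy
           disk 'o/*/DATA*'
           attribute 'compatible.asm'          = '11.2.0.0.0',
                     'compatible.rdbms'        = '11.2.0.0.0',
                     'cell.smart_scan_capable' = 'TRUE',
                     'au_size'                 = '4M';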

 

In the above, we’ve basically created a disk group called DATA_CM01 with normal redundancy and striped it across all storage server nodes, as indicated by the ‘o/*’ part of the disk string. “o” means use the InfiniBand network and reference cellip.ora, and “*” means use all IB IP addresses listed in cellip.ora. The “DATA*” wildcard instructs ASM to use all Grid Disks whose names start with “DATA”.

At the end of the day, you get an ASM disk group like this:
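For example, a quick count from the ASM instance:

    SQL> select g.name, count(*) as disk_count
         from   v$asm_diskgroup g, v$asm_disk d
         where  d.group_number = g.group_number
         and    g.name = 'DATA_CM01'
         group  by g.name;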

 

This is showing that our DATA_CM01 ASM disk group is comprised of 36 disks, or 12 (the number of disks in a cell) times 3. And of course, when we create tablespaces and segments in this disk group, they’ll be spread in 4 MB allocation-unit-sized chunks across the Grid Disks.

Question: Is data cached on the Smart Flash Cache on the 1st, 2nd, or 3rd read request? Is it after a specific number of initial reads or writes?

Data is cached in Smart Flash Cache after the first read (or write) and under the assumption that the segment qualified for Smart Flash Cache. Data for a given segment will be aged out via DML, forced (hinted) eviction, or via an LRU mechanism employed by CELLSRV.

Question: Is data stored in Flash Cache compressed?

Data is compressed in Flash Cache if the segments are defined as compressed, whether via OLTP compression or Hybrid Columnar Compression. If data is requested via Smart Scan, whether it lives in Flash Cache or physical disks, it’ll be decompressed on the storage tier before being returned over the InfiniBand interconnect. If it’s accessed without Smart Scan (i.e., single-block reads), it’ll be transmitted over the interconnect as compressed and decompressed on the database server(s).

Question: If you set Cell_Flash_Cache = NONE for the table, will it override the tablespace settings if Cell_Flash_Cache=DEFAULT or KEEP for the tablespace?

During the webinar I replied that the segment-specific setting overrides the tablespace setting, but let’s validate this:
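The test amounts to attempting something like the following, which is rejected – the tablespace DEFAULT STORAGE clause simply doesn’t accept a CELL_FLASH_CACHE option (the datafile clause is illustrative):

    SQL> create tablespace fc_test
           datafile '+DATA_CM01' size 100m
           default storage (cell_flash_cache keep);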

This seems to confirm that you can’t specify default CELL_FLASH_CACHE parameters at the tablespace level. So my reply that the segment-level CELL_FLASH_CACHE setting would override it was not entirely accurate; in fact, the only place you can set this is in the segment’s STORAGE clause.

Question: Why does LUN to ASM have two extra layers (cell disk and grid disk)? Why would you not map LUN to ASM directly?

Not being the guy who designed Exadata, I can’t respond with 100% authority on this but I can understand why it’s done the way it is.

  • A physical disk is just a physical disk, and like any other physical disk in a storage system, you can’t access it “directly”; some sort of abstraction interface is required to allow programs to read, write, destroy, format, etc.
  • With Exadata, a LUN is basically that interface. And there are 2 main ways the physical disks are accessed – one via database IO operations, and one from the storage server’s operating environment (CELLSRV, RS, MS, etc.). If you recall from the webinar, there’s a small chunk of mirrored storage on the first two drives in a storage cell on which the OS, the metadata repository, cell server metrics, and so on reside. Oracle doesn’t want an administrator to be able to overwrite or destroy this area of storage, so they instead make a cell disk on top of it.
  • A cell disk is the “configurable”, usable chunk of storage on a LUN, or physical disk. The only things you can do with cell disks are basically drop and create them, and in doing so, Oracle protects us from modifying the system area by aligning the boundaries of the cell disk outside the reserved system area space. Think about the consequences and support nightmares if a DBA had to consider byte boundaries and alignment – allowing cell disk boundaries to infringe on the system area could leave you with a pretty severely damaged storage server.
  • So a cell disk is basically the entity on which an administrator has access to make changes. In the EMC, Hitachi, IBM, etc. worlds, this is roughly analogous to your LUN. And since it’s tied directly to a physical disk and doesn’t span multiple physical disks, the concept of Grid Disks comes into play.
  • Grid Disks are where administrators can become relatively creative on Exadata, but it’s also where they should probably be conservative, use good naming conventions, and always think about the end-state ASM disk groups that’ll be built on them. Think of a set of Grid Disks, spanning all drives on all storage servers, as a “meta-LUN” if you’re an EMC guy. And to complete the meta-LUN analogy, a properly wildcarded ASM disk group disk string will yield what you want.

Question: Traditionally, if data is in SGA, no disk access is needed. In Exadata, if the data is already in SGA, does it still scan Flash Cache? Which cache is the first layer?

If the requested blocks are in the buffer cache, the IO will be satisfied in either the local or remote (RAC instance) buffer cache – no physical IO is required. From the Oracle process’s perspective, a physical IO is any IO that isn’t satisfied in the buffer cache, irrespective of whether the data resides in Flash Cache or on a physical disk.

Question: When cell offloading is heavily used and serial_direct_read is turned on, will this impact the effectiveness of the Smart Flash Cache?

It certainly could, depending on the size of the segment(s) being accessed, but not always. A serial direct read, from the Oracle instance’s perspective, bypasses the buffer cache for block reads and instead reads into the PGA of the user’s process. Whether the data resides in Flash Cache or physical disk, or a combination of both, ultimately a serial direct read (which is required for Smart Scan) will use PGA memory for buffer handling. If the table being scanned is not eligible for Smart Flash Cache due to segment size, CELL_FLASH_CACHE keep/evict/none parameters, etc., it won’t be loaded into Flash Cache. So if you’ve got capacity to store the Smart Scanned segment in Flash Cache, you can improve overall IO read times by altering CELL_FLASH_CACHE parameters and forcing it to cache. And similarly, for “smaller” segments that are still large enough to be Smart Scan eligible, Flash Cache can and will be used.

Question: What are the system requirements for creating smart flash cache? Can it be done on a laptop?

Exadata Smart Flash Cache is an Exadata feature, so that is the requirement – you can’t do it on a laptop. Database Smart Flash Cache is a different but similar feature. Database Smart Flash Cache, described in http://www.oracle.com/technetwork/articles/systems-hardware-architecture/oracle-db-smart-flash-cache-175588.pdf, is a feature supported on Oracle Enterprise Linux and Solaris 11 Express for Oracle 11gR2 that extends the database buffer cache onto flash storage. You can use this if you add PCI flash cards to your database tier nodes.
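For reference, Database Smart Flash Cache is enabled with two initialization parameters plus an instance restart – the device path and size below are purely illustrative:

    SQL> alter system set db_flash_cache_file = '/dev/sdf1' scope=spfile;
    SQL> alter system set db_flash_cache_size = 64G scope=spfile;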

Question: What if the Smart Flash Cache gets corrupted?

I haven’t come across any corruption-related bugs specific to Smart Flash Cache, but if you’ve got corruption in your physical segments, the IO will fail before being loaded into Smart Flash Cache. In this case, you’ll see a statistic called “flash cache insert skip: corrupt” incremented, and this would typically accompany an Oracle error message on the read.

If, by “corrupt”, you mean that Smart Flash Cache is populated with the wrong data with respect to your workload profile, you can always “drop flashcache all” and “create flashcache all”.