First, let’s do a simple script to see which sessions are benefiting from Smart Flash Cache. The query below shows any sessions that have a non-zero value for “cell flash cache read hits”:
You can also look at historical data from AWR to generate a report showing cell flash cache read hits summed per day – disregard the first row due to the lag function and the nature of the data:
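Why the first row has to be thrown away can be sketched outside the database. The snippet below is a minimal Python illustration, not the AWR query itself. It assumes the statistic is recorded as a cumulative counter per snapshot day, the values are made up, and the names `snapshots` and `daily_deltas` are hypothetical.

```python
from datetime import date

# Hypothetical cumulative values of "cell flash cache read hits",
# one (day, cumulative_value) pair per day. The numbers are made up.
snapshots = [
    (date(2011, 6, 1), 1000),
    (date(2011, 6, 2), 1450),
    (date(2011, 6, 3), 2100),
]


def daily_deltas(rows):
    """Return (day, hits_that_day) pairs, mimicking value - LAG(value)."""
    deltas = []
    previous = None  # LAG() has no prior row for the first entry
    for day, value in rows:
        if previous is not None:
            deltas.append((day, value - previous))
        previous = value
    return deltas


for day, hits in daily_deltas(snapshots):
    print(day, hits)
```

The first day never appears in the output because there is no earlier value to subtract from it, which is the same reason the first row of a lag-based report should be ignored.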
The examples above show you how to measure cell flash cache read hits per session and system-wide, but this doesn’t really tell you how “efficient” Smart Flash Cache is. The storage cell, however, maintains multiple Flash Cache related metrics that can be used to gauge efficiency, and these all have an objectType metadata attribute of ‘FLASHCACHE’. See below:
There are a few ways to analyze this data if you’re interested in Smart Flash Cache efficiency:
Let’s look at the first example:
From the above, we can see that we’ve cumulatively satisfied 23,592 MB from Flash Cache. Now:
If you add up the numbers for small and large IO read requests from Cell disks, we get 4,705,467 MB of IO satisfied from cell disk access, so our Flash Cache efficiency is a dismal (23,592/4,705,467) = 0.5%. Keep in mind, this is on an R&D Exadata environment, so your mileage will almost definitely be better!
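The arithmetic above can be reproduced with a few lines of Python. This is only a sketch using the two totals quoted in the text; the function name `flash_cache_share` is hypothetical and not part of any Oracle tool.

```python
def flash_cache_share(flash_cache_mb, cell_disk_mb):
    """Ratio of MB read from Flash Cache to MB read from cell disks."""
    if cell_disk_mb == 0:
        raise ValueError("cell disk read volume must be non-zero")
    return flash_cache_mb / cell_disk_mb


# Totals quoted in the text above, in MB.
flash_cache_mb = 23_592
cell_disk_mb = 4_705_467

share = flash_cache_share(flash_cache_mb, cell_disk_mb)
print(f"{share:.1%}")
```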
Now let’s look at the second scenario – let’s calculate the efficiency as the ratio of reads in which the entire IO request was satisfied in Flash Cache to the total of Flash Cache hits + Flash Cache misses.
To put into formula:
Efficiency = (FC_IO_BY_R) / ((FC_IO_BY_R) + (FC_IO_BY_R_MISS))
= (23,592) / ((23,592) + (17,092)) = 58%
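As a quick check of the formula, here is a small Python sketch. It uses the FC_IO_BY_R value of 23,592 MB reported earlier and the FC_IO_BY_R_MISS value of 17,092 MB; the function name `hit_efficiency` is hypothetical.

```python
def hit_efficiency(hit_mb, miss_mb):
    """Flash Cache hits as a fraction of hits plus misses."""
    total = hit_mb + miss_mb
    if total == 0:
        return 0.0  # no Flash Cache activity recorded yet
    return hit_mb / total


fc_io_by_r = 23_592       # MB read from Flash Cache
fc_io_by_r_miss = 17_092  # MB that missed Flash Cache

efficiency = hit_efficiency(fc_io_by_r, fc_io_by_r_miss)
print(f"{efficiency:.0%}")
```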
Another aspect to look out for is whether your large segments, which potentially could be read via full-table scans and satisfied via Smart Scan, are being loaded into Flash Cache and consequently providing performance benefits from subsequent reads. A potential strategy for identifying this type of efficiency metric could be:
Let’s walk through an example. First, let’s grab a list of OBJ#’s from DBA_HIST_SEG_STAT:
The actual object name is not really important for the sake of this example, but you could obviously join to DBA_OBJECTS or SYS.OBJ$ to get the object name. For now, let’s focus on the Top 3: objects 15891, 477, and 476. Now, let’s query our cell server metrics:
As you can see, we do have data for each of our Top 3 segments cached in Flash Cache, which is a good thing!
I’ve covered a couple of topics here, but let’s summarize. To improve Smart Flash Cache efficiency:
Question: How is Smart Flash Cache different from Cache currently available in System and Buffer Cache? What is the difference between the two?
Smart Flash Cache is an extra layer of cache that resides on the storage tier – think of it as analogous to cache that exists on most mid-range and enterprise class storage arrays. Except it’s generally better for a few reasons – first, it’s engineered specifically for Oracle, which means that only data that an Oracle database would typically consider as “valuable” via caching (i.e., likely to be requested again) is cached. Disk array cache is typically an inflexible, non-adjustable, on-or-off type of cache that basically accepts and stores all types of data, including blocks read (and written) for backups, Data Pump exports and imports, etc. Second, there’s a lot of it – 5.4 TB on an Exadata Full Rack. And while this number alone probably doesn’t offset the financial investment of an Exadata Database Machine when compared to traditional storage arrays, all the other software features of Exadata certainly carry a strong financial argument for Exadata. Smart Flash Cache thus provides a very attractive, very powerful set of performance features.
But let’s talk about the database buffer cache and the “system” cache, which I’m assuming means file-system buffer cache. Oracle’s database buffer cache provides a shared memory area through which all single-block read requests pass blocks to satisfy Oracle read and write requests. Access to the buffer cache is typically very fast, but there’s a finite amount of buffer cache that can be configured in an Oracle environment, as specified with the DB_CACHE_SIZE initialization parameter.
There are a couple cases in which IO requests cannot be satisfied in the buffer cache:
Smart Flash Cache can address these scenarios by providing an extra, very large, intelligent cache on the storage grid to cushion the potential blow of not being able to find a buffer in the local or remote (RAC) buffer cache. And like the buffer cache, this will avoid the need to perform a physical IO. As demonstrated in the webinar, blocks/segments resident in Smart Flash Cache remain valid spanning instance bounces, which can help provide a streamlined performance experience for the users in the event of instance outage and/or maintenance.
So is Smart Flash Cache just as good as the buffer cache, and why do we need a buffer cache at all? Well, the answer is no – accessing a block from the buffer cache is nearly always faster except possibly in situations where the overhead of inter-instance block shipping causes interconnect delays, but with Exadata and the InfiniBand interconnect serving as the fabric for the RAC interconnect, the argument is really a “RAC vs. No RAC” argument, not a “buffer cache vs. Smart Flash Cache” argument.
Let’s put this into perspective by showing a list of the average IO service times from different types of storage or memory:
From the numbers above, clearly a PCI flash solution provides better access times than disk access, but accessing from a buffer cache will be a couple of orders of magnitude faster – so don’t configure your buffer cache down to small values just because you’ve got a really fast Smart Flash Cache solution on the Exadata storage servers – especially if your application has some OLTP-ish behavior.
To answer the question about how Smart Flash Cache differs from system cache, or file-system buffer cache – in principle they achieve the same goal: adding additional caching mechanisms on top of the database buffer cache. However, with Oracle ASM, IO is all direct anyway and bypasses any system buffer cache, so it’s really not a good comparison. Further, with ASM on Exadata, the actual IO calls happen on the storage servers, so any system cache on the compute nodes is largely irrelevant. But if we want to compare buffered IO on non-ASM Oracle systems to Smart Flash Cache, the former is an unintelligent, dynamic cache that can provide performance gains for some types of multi-block reads, but it is not generally a performance feature that can be relied on to provide consistent performance, and many times it presents a layer of overhead with respect to Oracle databases.
Question: Can other storage arrays be connected to the Exadata Database Machine via the InfiniBand or the Cisco switch?
In short, the answer is yes, with some caveats. Here’s a quick little summary:
Question: Is LIBCELL available on all 11gR2 database releases (even non-Exadata)?
The libcell libraries that support iDB communications do exist on non-Exadata systems, but the code is only used if you’re on Exadata. The libcell11.so library is linked with the oracle kernel in both the RDBMS and ASM homes, so let’s walk through this in both an Exadata environment and a non-Exadata Oracle 11gR2 environment.
First, let’s look at $ORACLE_HOME/rdbms/lib/env*mk in the non-Exadata environment:
Now let’s look at the same on Exadata:
What the above is doing is showing “LIBSAGENAME” link directives for the objects that are compiled under these homes; i.e., “oracle”. For those curious about what “SAGE” means, it’s the historical Oracle code name for the development initiatives that eventually led to Exadata. The above output shows that in both cases, binaries are linked with a “cell$(RDBMS_VERSION)” library.
So if we go to $ORACLE_HOME/lib, we see this:
We can see that in both cases, there are libcell11.so libraries that exist in the 11.2 ORACLE_HOME, and since the makefile link directives utilize these libraries, we can infer that both Exadata and non-Exadata installations are linked with libcell.
We can double-check this by stringing out or doing a binary edit on the “oracle” executable in each case:
Let’s take it a step further. On my Exadata, let’s see if libcell11.so is in use:
Now let’s check our non-Exadata 11.2 environment:
In both cases, we can see libcell11.so is being used by oracle processes, so we can confirm that libcell is linked with the oracle kernel for both Exadata and non-Exadata systems. This being the case, it stands to reason that code exists in the libcell11.so library that determines whether the system is an Exadata system or not.
On our Exadata compute-node machine, if we take a peek using bvi at libcell11.so, we see some stuff like this:
Knowing that Exadata systems rely on configurations in /etc/oracle/cell/network-config/cellinit.ora and /etc/oracle/cell/network-config/cellip.ora to identify the IB storage network and storage cell nodes, respectively, it appears as if the ossconf.o object file is likely part of libcell11.so in Exadata systems and probably not in non-Exadata systems.
Doing a search for this in our Exadata system shows:
From the above, it appears as if ossconf.o is part of libcell11.a, an archive from which libcell11.so is constructed. Looking inside ossconf.o, we see several interesting things:
None of this exists on the non-Exadata 11gR2 systems I’ve got access to. And while I haven’t been able to trace exactly how this code is written, the assumptions I can make seem to indicate:
Question: Is LIBCELL also present on the Exadata Storage Server nodes?
Yes, libcell does exist on the Exadata storage server nodes. It’s used by CELLSRV to process iDB messages:
Question: Can a cell disk span multiple LUNS and/or multiple physical disks?
No. The hierarchy goes like this:
Physical Disk => LUN => Cell Disk => Grid Disk => <ASM Disk Group>
But it’s important to understand how database extents are allocated within tablespaces in ASM disk groups in order to appreciate the Exadata storage server design. Cell disks are simply the usable storage within a LUN that’s available for Grid Disks, and Grid Disks are what ASM disk groups are built upon. If an administrator configures Grid Disks and ASM disk groups “appropriately”, you’ll generally get the ideal, balanced storage configuration. And it’s worth mentioning that “appropriately” is the easiest and simplest way for an administrator to do things on Exadata – you don’t have to think too hard to get it right, whereas in most other storage array scenarios, if you don’t think real hard you’ll almost certainly get it wrong.
Let’s take a look at some CELLCLI listings to show how it works. First, let’s see our physical disks:
As you can see, we’ve got 12 drives, as advertised. Now let’s look at the LUNs on these physical disks – I’m omitting flash disks in these examples but they will show up:
Now, for our cell disks:
By now you can see that there’s a one-to-one mapping between cell disks, LUNs, and physical disks. If we look at one of these in detail, we can see how it’s mapped to a LUN and what the byte offsets are – in this case we’ve created this cell disk as an interleaving cell disk:
Now let’s look at our Grid Disks:
If you look at the pattern of the grid disk names, you’ll be able to see that there are 3 Grid Disks on each cell disk. So for example, on the CD_03 cell disk, you’ll see DATA_CD_03_cm01cel01, DBFS_CD_03_cm01cel01, and RECO_CD_03_cm01cel01. Each one of these Grid Disks is built on specific tracks in the Cell Disk.
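The naming pattern can be made explicit with a short Python sketch that groups grid disk names by the cell disk they sit on. It assumes every grid disk name has the form `<PREFIX>_<cell disk name>`, as in the three names quoted above; the function name `group_by_cell_disk` is hypothetical.

```python
from collections import defaultdict

# The three grid disk names quoted in the text for cell disk CD_03.
grid_disks = [
    "DATA_CD_03_cm01cel01",
    "DBFS_CD_03_cm01cel01",
    "RECO_CD_03_cm01cel01",
]


def group_by_cell_disk(names):
    """Map each cell disk name to the grid disk prefixes built on it."""
    groups = defaultdict(list)
    for name in names:
        # Assumed layout: "DATA_CD_03_cm01cel01" -> "DATA", "CD_03_cm01cel01"
        prefix, cell_disk = name.split("_", 1)
        groups[cell_disk].append(prefix)
    return dict(groups)


print(group_by_cell_disk(grid_disks))
```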
In the above, I’m only looking at one storage cell – if you examine all 3 cells in my quarter rack, you’d see identical Grid Disk configurations:
If you’re wondering how I’ve been able to create such a great balance of Grid Disk configurations across all cells, here’s how:
The above command creates a Grid Disk on all Cell Disks in a storage cell, each with the same size. And of course, you could use dcli to do this across all storage cells simultaneously.
The final step is to create an ASM disk group on these Grid Disks, which is done like this:
In the above, we’ve basically created a disk group called DATA_CM01 with normal redundancy and striped it across all storage server nodes, as indicated by the ‘o/*’ part of the disk string. “o” means use InfiniBand network and reference cellip.ora, and “*” means to use all IB IP addresses listed in cellip.ora. The “DATA*” wildcard instructs to use all Grid Disks that start with “DATA”.
At the end of the day, you get an ASM disk group like this:
This is showing that our DATA_CM01 ASM disk group is comprised of 36 disks, or 12 (the number of disks in a cell) times 3 (the number of cells). And of course, when we create tablespaces and segments in this disk group, they’ll be spread in 4MB allocation unit-sized chunks across the Grid Disks.
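To see where the 36 comes from, here is a rough Python sketch that builds a list of grid disk names and applies a “DATA*” wildcard to it. It only illustrates the wildcard and the arithmetic; it is not how ASM discovers disks. The cell names cm01cel02 and cm01cel03, the CD_00 to CD_11 numbering, and the function name `select_grid_disks` are assumptions made for the example.

```python
from fnmatch import fnmatchcase

# Assumptions for illustration: three cells named cm01cel01..cm01cel03,
# twelve cell disks per cell numbered CD_00..CD_11, and DATA, DBFS and
# RECO grid disks on every cell disk.
cells = [f"cm01cel{n:02d}" for n in range(1, 4)]
grid_disks = [
    f"{prefix}_CD_{disk:02d}_{cell}"
    for cell in cells
    for disk in range(12)
    for prefix in ("DATA", "DBFS", "RECO")
]


def select_grid_disks(names, pattern):
    """Return the grid disk names that match a shell-style wildcard."""
    return [name for name in names if fnmatchcase(name, pattern)]


data_disks = select_grid_disks(grid_disks, "DATA*")
print(len(grid_disks), len(data_disks))
```

With these assumptions the wildcard picks one DATA grid disk per cell disk, so 12 cell disks on each of 3 cells yields the 36 disks seen in the disk group.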
Question: Is data cached on the Smart Flash Cache on the 1st, 2nd, or 3rd read request? Is it after a specific number of initial reads or writes?
Data is cached in Smart Flash Cache after the first read (or write), provided the segment qualifies for Smart Flash Cache. Data for a given segment will be aged out via DML, forced (hinted) eviction, or via an LRU mechanism employed by CELLSRV.
Question: Is data stored in Flash Cache compressed?
Data is compressed in Flash Cache if the segments are defined as compressed, whether via OLTP compression or Hybrid Columnar Compression. If data is requested via Smart Scan, whether it lives in Flash Cache or physical disks, it’ll be decompressed on the storage tier before being returned over the InfiniBand interconnect. If it’s accessed without Smart Scan (i.e., single-block reads), it’ll be transmitted over the interconnect as compressed and decompressed on the database server(s).
Question: If you set Cell_Flash_Cache = NONE for the table, will it override the tablespace settings if Cell_Flash_Cache=DEFAULT or KEEP for the tablespace?
During the webinar I replied that the segment-specific setting overrides the tablespace setting, but let’s validate this:
This seems to confirm that you can’t specify default CELL_FLASH_CACHE parameters at the tablespace level. So my reply that the segment-level CELL_FLASH_CACHE setting would override it was not entirely accurate; in fact, the only way you can set this is at the storage layer.
Question: Why does LUN to ASM have two extra layers (cell disk and grid disk)? Why would you not map LUN to ASM directly?
Not being the guy who designed Exadata, I can’t respond with 100% authority on this but I can understand why it’s done the way it is.
Question: Traditionally, if data is in SGA, no disk access is needed. In Exadata, if the data is already in SGA, does it still scan Flash Cache? Which cache is the first layer?
If the requested blocks are in the buffer cache, the IO will be satisfied in either the local or remote (RAC instance) buffer cache – no physical IO is required. From the Oracle process’s perspective, a physical IO is any IO that isn’t satisfied in the buffer cache, irrespective of whether the data resides in Flash Cache or on a physical disk.
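The distinction can be summarized in a small Python sketch. It is a conceptual model only, not Oracle’s code path: the caches are plain sets of made-up block identifiers, and the function name `classify_read` is hypothetical.

```python
def classify_read(block_id, buffer_cache, flash_cache):
    """Classify a block read the way the database instance would count it."""
    if block_id in buffer_cache:
        # Found in the local or remote buffer cache: no physical IO.
        return "logical read"
    if block_id in flash_cache:
        # Still a physical IO to the instance, served from Flash Cache.
        return "physical read (flash cache)"
    # Also a physical IO, served from the physical disks.
    return "physical read (disk)"


buffer_cache = {101, 102}
flash_cache = {102, 103}

for block in (101, 103, 104):
    print(block, classify_read(block, buffer_cache, flash_cache))
```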
Question: When cell offloading is heavily used and serial_direct_read is turned on, will this impact the effectiveness of the Smart Flash Cache?
It certainly could, depending on the size of the segment(s) being accessed, but not always. A serial direct read, from the Oracle instance’s perspective, bypasses the buffer cache for block reads and instead reads into the PGA of the user’s process. Whether the data resides in Flash Cache or physical disk, or a combination of both, ultimately a serial direct read (which is required for Smart Scan) will use PGA memory for buffer handling. If the table being scanned is not eligible for Smart Flash Cache due to segment size, CELL_FLASH_CACHE keep/evict/none parameters, etc., it won’t be loaded into Flash Cache. So if you’ve got capacity to store the Smart Scanned segment in Flash Cache, you can improve overall IO read times by altering CELL_FLASH_CACHE parameters and forcing it to cache. And similarly, for “smaller” segments that are still large enough to be Smart Scan eligible, Flash Cache can and will be used.
Question: What are the system requirements for creating smart flash cache? Can it be done on a laptop?
Exadata Smart Flash Cache is an Exadata feature, so Exadata is the requirement. You can’t do it on a laptop. Database Smart Flash Cache is a different but similar feature. Database Smart Flash Cache, described in http://www.oracle.com/technetwork/articles/systems-hardware-architecture/oracle-db-smart-flash-cache-175588.pdf, is a feature supported on Oracle Enterprise Linux and Solaris 11 Express for Oracle 11gR2. You can use this if you add PCI flash cards to your database tier nodes.
Question: What if the Smart Flash Cache gets corrupted?
I haven’t come across any corruption-related bugs specific to Smart Flash Cache, but if you’ve got corruption in your physical segments, the IO will fail before being loaded into Smart Flash Cache. In this case, you’ll see a statistic called “flash cache insert skip: corrupt” be incremented, and this would typically accompany an Oracle error message on the read.
If, by “corrupt”, you mean that Smart Flash Cache is populated with the wrong data with respect to your profile, you can always “drop flashcache all” and “create flashcache all”.