Exadata Smart Scan Predictate and Column Filtering

The purpose of this post is to demonstrate Smart Scan processing on Centroid’s Exadata X2-2 Quarter Rack. Smart Scan processing is one of the components of Exadata Cell Offload Processing whose goal is to offload processing to storage cells instead of performing block IO requests on the database tier. The goal of Smart Scan processing is to do all the “heavy lifting”, IO-wise, in the storage cell and only send back data to the compute nodes that is actually needed and requested. This reduces interconnect traffic between the storage tier (storage cells) and database tier, reduces the clutter in the database buffer cache, improves query performance, and improves concurrency for multiple full-scan types of queries happening at once. With Smart Scan processing, row filtering, column filtering, some join processing, and other functions are performed in the Exadata cells. Results are returned to the foreground session’s PGA instead of the shared buffer cache.
With Smart Scan processing:
  1. User issues a SQL statement
  2. Optimizer determines whether SQL statement is offloadable based on predicate(s).
  3. An iDB command is constructed and sent to Exadata cells
  4. SQL is processed in Exadata cells
  5. Filtered result set is returned from disks (or FastCache or Storage Index region) to cell server
  6. A consolidated result set is built from all Exadata cells
  7. Rows are returned to compute nodes.

 

Smart Scan Column and Predicate Filtering in Action

The examples in the previous section did several tests in which cell offload and Smart Scan processing was measured by selecting all rows from the table, or a “count(*)”. Let’s see what the Smart Scan impact is when selecting a subset of columns from a table. In the example below, we’ll use the SOE.ORDERS table and make all indices invisible.

 

Above, we can see that the second query essentially performed the same magnitude of work. Let’s do another test with “select distinct” and a column with nulls:

 

 

Again, same results. Let’s put this into a simple anonymous PL/SQL block and avoid the “select count(*)” part of it.

 

 

Again, the results are similar. Now, we’ll try with a non-partitioned table with some null column values, using MYOBJ (a copy of DBA_OBJECTS loaded with 150 million rows):

 

 

 

As we can see above, the Smart Scan savings was more than in the example against SOE.ORDERS, but we didn’t see much of a change in “select *” vs. “select generated” in the query.

Based on the examples so far in this section, it doesn’t look as if column filtering is doing anything. Let’s try an example with using a predicate on our SOE.ORDERS table:

 

 

 

In this example, we’re passing far less data back to the database tier and thus getting a better Smart Scan efficiency. One more example:

 

 

 

In this example above, we’re selecting based on a predicate that we know what return any rows. Smart Scan processing is taking place. In the first example, the amount of Smart Scan MB returned from Exadata cells is roughly 3,800Mb, yielding a Smart Scan efficiency of 99.8%. In the second example, we’ve put in the same predicate but instead select two additional columns, ORDER_ID and ORDER_STATUS. In this test, roughly 4,900Mb of data was returned via Smart Scan processing, giving a Smart Scan efficiency of 99.74%.

Based on these test results, we can see that row filtering certainly plays a very large role in Smart Scan efficiency, but clearly column filtering also plays a role. Like anything, your mileage will likely vary based on the SQL statements being issue, the column selection list, and predicate information.

Since I’m not entirely happy with the result of the test so far (I expected a larger scan efficiency delta to be seen), I’m going to add a very wide on MYOBJ and do some additional testing.

Below, I’m going to show the Smart Scan behavior with column filtering when selecting two columns from a large table (SYSTEM.MYOBJ):

 

 

Now, I’ll do the same query while selecting just one column, the OBJECT_NAME column:

 

 

In the examples above, I’m querying MYOBJ, which looks like this:

 

 

As you can see, BIGCOL is a column of data type VARCHAR2(4000). The tests above returned 18,513 rows, so if every row had BIGCOL fully-populated, we should be doing about (18513 * 4000) bytes less of IO. But since I know that I’m only populating 255 bytes of data…

 

 

…our IO savings should be about (18,513 * 255) =~ 4.5Mb.

In our second query, we benefited from the use of a storage index, which helped the run-time, but shouldn’t impact the number of bytes returned over the InfiniBand interconnect. If you look at the numbers from the queries above, we see 7,730,380 bytes returned from the first query and 1,230,720 returned from the second query – a savings of about 6.2Mb. Which is a greater savings than what we actually expected by NOT selecting the larger column, an indication that column filtering is playing a pretty big role.

But let’s look at some different tests against these tables and see the storage index impact in both:

 

 

 

Here, the difference in bytes over the InfiniBand interconnect between the first query and second is 1,176,032 – 651,744 = 524,288 bytes. Each query returned 2048 rows. The row length of our BIGCOL column is 255 bytes for all 2,048 rows:

 

 

Further, 524,288/2,048 = 256 bytes per row saved when NOT selecting the BIGCOL column. So this pretty much shows the impact of column filtering wit Smart Scan.

Summary

Exadata Smart Scan processing is a means to improve query response time and reduce the amount of data passed from the storage cell servers to the database tier nodes. In the tests in this post, predicate-filtering has a much greater impact on cell offload efficiency than column filtering. In other words, the number of bytes returned over the InterConnect is what determines offload efficiency, and in our tests, reducing or changing the number of columns selected did had an impact on interconnect bytes, but not nearly as much as reducing the number of rows returned.