Using DBFS and External Tables

What is DBFS?

DBFS stands for Database File System, and it’s Oracle’s method for housing an operating system-like file-system within an Oracle relational database. It has its roots in iFS, but its functionality and capabilities have been extended. Here are some reasons why DBFS is a good thing:

  • When you store files in DBFS, they can participate in the same RMAN backup strategy/deployment as the rest of your database files.
  • Operations against DBFS files can be parallelized using parallel query for optimal performance.
  • As such, external tables on DBFS file-systems are probably the highest-performance way to bulk load data into your database.
  • In Exadata (and elsewhere), DBFS file-systems can reside in ASM disk groups, so you get the performance and availability features of ASM within your DBFS file-system.

I’m sure there are more excellent features of DBFS, but the list above is particularly relevant to the Exadata database machine.

Setting up DBFS on Exadata

Create the DBFS Tablespace

The first step is to create your DBFS tablespace. The steps below walk through the process:

  1. Determine how much space is available in our DBFS_DG ASM disk group
  2. Create a tablespace in the DBFS_DG ASM disk group
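For step 1, a query against V$ASM_DISKGROUP along these lines shows the free space:

  -- how much space is left in the DBFS_DG disk group?
  SELECT name, total_mb, free_mb, ROUND(free_mb/1024) AS free_gb
  FROM   v$asm_diskgroup
  WHERE  name = 'DBFS_DG';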

 

From this, we can see we’ve got an 872 GB chunk of free space in our DBFS_DG ASM disk group. I’ll create a tablespace, for starters, at 100 GB:
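A bigfile tablespace on the DBFS_DG disk group does the trick; the name DBFS_TBS below is simply the one I’ll use in the rest of the examples:

  CREATE BIGFILE TABLESPACE dbfs_tbs
    DATAFILE '+DBFS_DG' SIZE 100G
    SEGMENT SPACE MANAGEMENT AUTO;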

 

 

Create the DBFS Oracle user

Next we’ll create a database user, DBFS:
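Something like the following is enough; the password is a placeholder, and DBFS_ROLE is the role the DBFS installation provides for working with DBFS stores:

  CREATE USER dbfs IDENTIFIED BY dbfs_pwd   -- placeholder password
    DEFAULT TABLESPACE dbfs_tbs
    QUOTA UNLIMITED ON dbfs_tbs;

  GRANT create session, create table, create procedure, dbfs_role TO dbfs;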

 

Perform additional OS configuration

The steps below are required to enable the oracle OS user to mount a DBFS file-system. I’ll do these on both database tier nodes in our Exadata database machine:
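The exact commands vary a little by image version, but the usual FUSE prerequisites for dbfs_client look roughly like this (run as root; /dbfs is the mount point I’ll assume below):

  # add the oracle user to the fuse group (create the group first if it doesn't exist)
  usermod -a -G fuse oracle

  # let non-root users pass the allow_other mount option
  echo user_allow_other >> /etc/fuse.conf
  chmod 644 /etc/fuse.conf

  # create and hand over the mount point
  mkdir -p /dbfs
  chown oracle:dba /dbfs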

 

 

Create the DBFS store
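Oracle ships a script for this. Connected to the database as the DBFS user, it takes the tablespace name and a file-system name; FS1 is just the store name I’m using here:

  -- connect as the DBFS user, then:
  @$ORACLE_HOME/rdbms/admin/dbfs_create_filesystem.sql dbfs_tbs FS1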

 

Mount DBFS

The first thing we need to do is create a text file containing the DBFS password. I’ve created this as /home/oracle/dbfs/passwd.txt on both nodes; note that this file is only required when mounting the file-system, not for normal use of DBFS.
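Something simple like this is all it takes (the password itself is obviously a placeholder):

  echo "dbfs_pwd" > /home/oracle/dbfs/passwd.txt
  chmod 600 /home/oracle/dbfs/passwd.txt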

 

Now we mount the DBFS file-system:
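With dbfs_client, the mount looks something like this; the /dbfs mount point and the dbfs@fsdb01 connect string are the values I’ve assumed for these examples, so adjust to taste:

  # mount the DBFS store at /dbfs, running dbfs_client in the background
  cd /home/oracle/dbfs
  nohup dbfs_client dbfs@fsdb01 -o allow_other,direct_io /dbfs < passwd.txt &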

 

 

After this, let’s check it:
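A quick look at the mount point (again, /dbfs here) is enough to confirm it; the store shows up as a directory named after the file-system:

  df -h /dbfs
  ls -l /dbfs/FS1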

 

 

When complete with cm01dbm01, we’ll do the same mount on cm01dbm02:

 

 

Configure the target database

After we’ve got our DBFS file-system created, there are a few things we need to do in order to use it:

  1. Create an Oracle directory pointing to the DBFS mount.
  2. Grant permissions on the directory to whoever needs to use it.
  3. Create required external tables (more to come on this).
  4. Use bigfile tablespaces (already done).
  5. Use 8 MB initial extents for large segments.  You can do this by setting INITIAL to 8 MB or by using the CELL_PARTITION_LARGE_EXTENTS parameter.
  6. Use unlimited quotas for the loading database user to bypass quota management.
  7. Use a parallel clause to set the default degree of parallelism for target tables.

Let’s walk through these.
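In SQL, the database-side pieces come down to something like this; the SOE schema, the SOE_DATA tablespace, the /dbfs/FS1 path, and a DOP of 16 are simply the values I’m using for the examples that follow, and items 3 through 5 show up later in the table DDL:

  -- 1 & 2: directory on the DBFS mount, plus grants to the loading schema
  CREATE OR REPLACE DIRECTORY dbfs_dir AS '/dbfs/FS1';
  GRANT READ, WRITE ON DIRECTORY dbfs_dir TO soe;

  -- 6: unlimited quota for the loading user
  ALTER USER soe QUOTA UNLIMITED ON soe_data;

  -- 7: default degree of parallelism on a target table (once it exists)
  ALTER TABLE soe.cust PARALLEL 16;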

 

 

 

 

Recommendations for Data Loads

  1. Use external tables
  2. Parallel direct path load for high performance
  3. Do in-flight processing with SQL: perform transformations and sort data as part of the direct path load
  4. No need to re-stage the data in an intermediate table
  5. Use input file processing if applicable
  6. You can also use SQL*Loader, but it will be slower and not as scalable.

Sample: External Table

In this section, I’ll show how to create an external table based on a comma-separated text file. We’ll use this external table as our staging table and, from it, create an operational table. I’ll use our SOE.CUSTOMER table to generate a CSV file first:
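The spool itself is plain SQL*Plus. The column list below is abbreviated (EMAIL_ADDRESS, CREDIT_LIMIT, and ACCOUNT_MGR_ID are the fields the later transformations care about), and the file lands on the DBFS mount:

  set heading off pagesize 0 feedback off termout off trimspool on linesize 500
  spool /dbfs/FS1/cust.csv
  select customer_id    || ',' || cust_first_name || ',' ||
         cust_last_name || ',' || email_address   || ',' ||
         credit_limit   || ',' || account_mgr_id
  from   soe.customer          -- the SOE customer table; adjust names to your schema
  where  rownum <= 1000000;
  spool off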

 

 

This will spool a 1,000,000-row file into our DBFS file-system. I’m going to “amplify” it to 10 million rows to better demonstrate load performance:
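A simple shell loop on the DBFS mount is enough; nine appends of the original file give us ten copies, i.e. 10 million rows (the file names match the ones used in the rest of the examples):

  cd /dbfs/FS1
  cp cust.csv cust_ext.csv
  for i in $(seq 1 9); do cat cust.csv >> cust_ext.csv; done
  wc -l cust_ext.csv      # expect 10,000,000 lines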

 

 

OK, now let’s create our external table and validate the row counts:
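A minimal version of the external table, matching the abbreviated column list used for the CSV, looks like this (DBFS_DIR is the directory created earlier):

  CREATE TABLE cust_ext (
    customer_id     NUMBER,
    cust_first_name VARCHAR2(40),
    cust_last_name  VARCHAR2(40),
    email_address   VARCHAR2(60),
    credit_limit    NUMBER,
    account_mgr_id  NUMBER
  )
  ORGANIZATION EXTERNAL (
    TYPE oracle_loader
    DEFAULT DIRECTORY dbfs_dir
    ACCESS PARAMETERS (
      RECORDS DELIMITED BY NEWLINE
      FIELDS TERMINATED BY ','
      MISSING FIELD VALUES ARE NULL
    )
    LOCATION ('cust_ext.csv')
  )
  REJECT LIMIT UNLIMITED
  PARALLEL;

  SELECT COUNT(*) FROM cust_ext;   -- should report 10,000,000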

 

 

Loading a “real” table from our external table

Below, let’s follow our test case and insert values from our external table into a real table. First, we need the target table itself:
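A minimal target table, mirroring the external table and following the earlier recommendations (8 MB initial extent, NOLOGGING, a default degree of parallelism), might look like this:

  CREATE TABLE cust (
    customer_id     NUMBER,
    cust_first_name VARCHAR2(40),
    cust_last_name  VARCHAR2(40),
    email_address   VARCHAR2(60),
    credit_limit    NUMBER,
    account_mgr_id  NUMBER
  )
  TABLESPACE soe_data        -- assumed target tablespace
  STORAGE (INITIAL 8M)
  NOLOGGING
  PARALLEL 16;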

 

 

Now we’ll do a direct path insert:
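The insert itself is a parallel, direct path (APPEND) insert; parallel DML has to be enabled at the session level first:

  ALTER SESSION ENABLE PARALLEL DML;

  INSERT /*+ APPEND PARALLEL(c, 16) */ INTO cust c
  SELECT /*+ PARALLEL(e, 16) */ *
  FROM   cust_ext e;

  COMMIT;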

 

 

We loaded 10 million rows in under 13 seconds. Not too bad. Let’s do a parallel CTAS (PCTAS):
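The parallel CTAS version is a one-liner by comparison (CUST_CTAS is just a throwaway name for the copy):

  CREATE TABLE cust_ctas
    NOLOGGING PARALLEL 16
  AS
  SELECT /*+ PARALLEL(e, 16) */ * FROM cust_ext e;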

 

Using SQL*Loader to Compare Times

Now that we’ve seen how much time it took to load via an external table, let’s compare with direct path SQL*Loader:
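A control file and command line along these lines does the equivalent load into a fresh copy of the target table; the column list matches the abbreviated one used above, and the password is a placeholder:

  -- cust.ctl
  LOAD DATA
  INFILE '/dbfs/FS1/cust_ext.csv'
  APPEND INTO TABLE cust
  FIELDS TERMINATED BY ','
  (customer_id, cust_first_name, cust_last_name, email_address, credit_limit, account_mgr_id)

  # then, from the shell:
  $ sqlldr soe/secret control=cust.ctl direct=true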

 

It’s difficult to tell from the above, but the job completed in a couple of minutes. That’s longer than using parallel direct path insert, but still not too bad.

 

Load and Transform the “Old Way”

Developers typically write data load programs, and many do things this way:

  1. Load a flat file into a staging table.
  2. Open a cursor on the staging table and fetch the rows from the cursor.
  3. Perform validations on the data to support business requirements.
  4. Insert cleansed data into production tables.
  5. Insert exception data into exception tables.

In the example below, we’re going to write a program that does just this. Specifically, it will:

  • Load data from cust_ext.csv via SQL*Loader into CUST_STAGE.
  • Index CUST_STAGE appropriately to support the transform program (below).
  • With PL/SQL, open a cursor on CUST_STAGE and loop through all rows.
  • Inside the loop of the cursor, we’re going to perform these transformations:
    • Lower-case all the letters in the EMAIL_ADDRESS field.
    • Set the CREDIT_LIMIT to the current value of CREDIT_LIMIT + 10,000 for all customers having an ACCOUNT_MGR_ID = 150.
  • After validation/transformation is done, we’ll insert rows into CUST_FINAL if and only if the row does not already exist.
  • If the program determines that the inserted row could be a duplicate, we’ll insert it into CUST_DUPE. So at the end of the program, we’ll have 3 tables:
    • CUST_STAGE is the staging table, loaded via SQL*Loader, and contains the unprocessed data.
    • CUST_FINAL contains a cleansed version of the customers.
    • CUST_DUPE contains all duplicate values.

Let’s run our example:
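A condensed sketch of this kind of program, using the abbreviated tables from earlier (and assuming CUSTOMER_ID is what defines a duplicate), looks like this:

  -- CUST_STAGE is loaded with SQL*Loader using a control file like the one above
  -- (APPEND INTO TABLE cust_stage); CUST_FINAL and CUST_DUPE are empty copies of it
  CREATE TABLE cust_final AS SELECT * FROM cust_stage WHERE 1 = 0;
  CREATE TABLE cust_dupe  AS SELECT * FROM cust_stage WHERE 1 = 0;
  CREATE INDEX cust_stage_n1 ON cust_stage (customer_id);   -- per the list above
  CREATE INDEX cust_final_n1 ON cust_final (customer_id);   -- supports the per-row existence check

  DECLARE
    CURSOR c_stage IS SELECT * FROM cust_stage;
    r        cust_stage%ROWTYPE;
    l_exists PLS_INTEGER;
  BEGIN
    OPEN c_stage;
    LOOP
      FETCH c_stage INTO r;
      EXIT WHEN c_stage%NOTFOUND;

      -- transformations described above
      r.email_address := LOWER(r.email_address);
      IF r.account_mgr_id = 150 THEN
        r.credit_limit := r.credit_limit + 10000;
      END IF;

      -- row-by-row duplicate check, then route to CUST_FINAL or CUST_DUPE
      SELECT COUNT(*) INTO l_exists
      FROM   cust_final
      WHERE  customer_id = r.customer_id;

      IF l_exists = 0 THEN
        INSERT INTO cust_final VALUES r;
      ELSE
        INSERT INTO cust_dupe VALUES r;
      END IF;
    END LOOP;
    CLOSE c_stage;
    COMMIT;
  END;
  /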

 

 

 

 

As you can see, the main PL/SQL job ran in 25.5 minutes, and end-to-end the execution was a little under 27 minutes.

 

Load and Transform with External Tables and SQL

Now that we know how long the old way took, let’s try it a new way, using SQL and external tables. Below, I’m using an analytic function, ROW_NUMBER(), to get the unique/distinct values, which is what the “old style” program did:
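Sticking with the abbreviated tables from earlier (and, again, treating CUSTOMER_ID as what defines a duplicate), the whole load-and-transform becomes one parallel, direct path insert:

  -- assumes CUST_FINAL is empty (re-created or truncated) for this run
  ALTER SESSION ENABLE PARALLEL DML;

  INSERT /*+ APPEND PARALLEL(f, 16) */ INTO cust_final f
  SELECT customer_id, cust_first_name, cust_last_name,
         LOWER(email_address)                    AS email_address,
         CASE WHEN account_mgr_id = 150
              THEN credit_limit + 10000
              ELSE credit_limit END              AS credit_limit,
         account_mgr_id
  FROM  (SELECT e.*,
                ROW_NUMBER() OVER (PARTITION BY customer_id
                                   ORDER BY customer_id) AS rn
         FROM   cust_ext e)
  WHERE  rn = 1;

  COMMIT;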

 

 

As you can see, the job ran in a little over 8 seconds. If we still wanted to populate our CUST_DUPE table, it would look like this:
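It’s the same query shape, keeping the rows the ROW_NUMBER() filter threw away (rn > 1):

  INSERT /*+ APPEND */ INTO cust_dupe
  SELECT customer_id, cust_first_name, cust_last_name,
         email_address, credit_limit, account_mgr_id
  FROM  (SELECT e.*,
                ROW_NUMBER() OVER (PARTITION BY customer_id
                                   ORDER BY customer_id) AS rn
         FROM   cust_ext e)
  WHERE  rn > 1;

  COMMIT;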

 

 

So end-to-end, the ELT job took about 30 seconds with SQL and external tables, compared to over 25 minutes with PL/SQL. Clearly, using SQL is, in this case, a much more efficient way to do things. As a developer, if you think it’s impossible to do conditional transformation logic with SQL instead of PL/SQL, it’s time to update your skills: you can do a great deal of work with analytic functions, SQL CASE expressions, and so forth.