Archive for the ‘Database Administration’ Category
EMC RecoverPoint Behavior under Heavy Load
I’m sure this is well-known and documented behavior to RecoverPoint appliance experts, but we recently found what we think to be a very nice feature in EMC’s RecoverPoint behavior that perhaps some folks new to the technology are unaware of …
Today we did a large (1Tb) backup of an assortment of Oracle database files on HP-UX file-systems to a set of ASM disk groups. The ASM disk groups were the source LUNs in a RecoverPoint remote replication consistency group. Our RMAN backup to ASM disk groups was done using 6 channels, and the source ASM LUNs resided in a RAID5 Enterprise Flash Drive Raid Group.
As the RMAN backup job allocated channels and wrote backup copies to the target LUNs, we noticed that the RecoverPoint appliances essentially “suspended” transmission of data with a helpful message about the source volumes being too busy. As channels de-allocated and the RMAN job switched from tablespace to tablespace (it was a tablespace at a time RMAN “backup as copy” operation), the RecoverPoint data transfer would kick back in, then suspend itself again as the drives became busy.
Granted, we configured our consistency groups with pretty much default settings, didn’t set any non-standard or overly aggressive RPOs, and essentially were “out of the box” with a 100Gb-sized RecoverPoint journal size. The nice implication of this behavior is this - in the event we need to do “large” data migration or storage-related activities on our source RP volumes, it looks like we’ll never have to worry about suspending our data transfer manually (through the RP administrative web interface or otherwise) - it all just happens by itself.
One less thing to worry about …
Cloning Oracle Databases with EMC SnapView and RecoverPoint
Many are familiar with the steps required to clone Oracle database using “rman duplicate” or “hot backup” cloning . Many are also familiar with steps required to create EMC SnapView Clones or SnapView Snapshots, either with the Navisphere web interface or CLI. In this post, I’ll outline steps required to build consistent, usable Oracle database “clones” within the framework of the following environment/architecture:
- Oracle 11.1.0.7, HP-UX 11iV2
- EMC CLARiiON CX4 storage arrays at production and DR site
- EMC RecoverPoint appliances at production and DR site
- Source database (production) uses Oracle ASM for its storage
- Requirement is to replicate production data from production storage array to remote “DR” array.
- Requirement is to use this replicated data at the DR site as the source for both SnapView Clones or SnapView Snapshots
- Requirement is to Clone or Snap from DR Replica LUNs, re-create an Oracle control file to build a new database, recover this target database, open with resetlogs, and use it
- The LUN numbers and names for the LUNs that comprise or will comprise the RecoverPoint consistency group
- Ensure (or assume) the LUNs are in a storage group and zoned to the production host
- LUN numbers/names of Replica LUNs (i.e., LUNs in the RecoverPoint Consistency Group)
- LUN numbers/names for all to-be SnapView Clones. When using SnapView Clones, the number and size of Replica LUNs needs to match that on the Clone LUNs for each clone group you’ll be creating
- Sufficient LUNs carved into the Reserve LUN pool to hold snapshot data
- How many SnapView Clones will I need? (will govern how many LUNs to build on the DR array, and place in the primary host storage group)
- How many SnapView Snapshots will I need? (this information, combined with source database size, will help size reserve LUN pool)
- What will the shelf-life be for my snapshots?
- How much DML/DDL will occur in my snapshot instances over time?
- Implement a standard ASM diskgroup naming convention (i.e., PROD_DG1, DEV_DG1, etc)
- Implement strategy for consistent symbolic linking of O/S files to the ASM devices that will be defined in the ASM disk group. For example, if ASM diskgroup PROD_DG1 is designed to use /dev/rdsk/c57t0d0, which is LUN1 on the production CX4 storage array, we should symbolically link /asm/disk1 to /dev/rdsk/c57t0d0 and build the ASM diskgroup with the “/asm/disk1″ string
- Set asm_diskstring in both production and DR server ASM instance to the same thing, with wild-cards. For example, “/asm*/disk*”
- Map Source to Replica LUN numbers
- Map Replica to Clone Group LUN for each Clone Group, and ensure “target” clone LUNs are added to the right storage group
- Implement a strategy for snapshot LUN number conventions. For example, if you will expect to build 3 different snapshots on the Replica LUNs, you can start LUN numbering on the first set at LUN 3000, the second set at LUN 4000, the third set at LUN 5000.
- prodhost = production HP-UX host
- drhost = DR HP-UX host
- PROD = production database name
- CLN1 = 1st clone of PROD using SnapView clones
- CLN2 = 2nd clone of PROD using SnapView clones
- SNP1 = 1st snapshot of PROD using SnapView snapshots
- SNP2 = 2nd snapshot of PROD using SnapView snapshots
- rpa1 = host name of RecoverPoint appliance’s admin interface
- cx4-dr = DNS name of DR CLARiiON CX4, used for NaviSphere
- EMC PowerPath is installed and configured on both prodhost and drhost
- 3 ASM Diskgroups: PROD_DG1, PROD_DG2, and PROD_DG3, all replicating in the RP Consistency group and all used as sources to Clones/Snapshots
- Ensure RPA is transmitting data from primary storage array to DR array (and ensure the consistency groups are setup and functional)
- Ensure an ASM instance is running on drhost
- Obtain LUN numbers to use for the CLN1 clone group from NaviSphere
- Make sure the LUNs are in the proper storage group, zoned to drhost, and visible via EMC PowerPath
- Run “powermt display dev=all” as root and search contents for the Replica LUN and Clone LUN names/numbers.
- Consider primary (PROD) ASM device to HP-UX device mappings and ensure you’ve got it documented. For sake of example:
- Create symbolic link from /asm_c1/disk1 to the target Clone LUN that will be synced from the Replica LUN mapped to the primary LUN for PROD_DG1
- Repeat for /asm_c1/disk2 and /asm_c1/disk3.
- Create SnapView clone on the 3 LUN. Below, assume the Replica LUNs are 1, 2, and 3
# naviseccli -h cx4-dr -Scope 0 -User admin -Password <Nav pwd> snapview -createclonegroup -name lun1CloneGrp_1 -luns 1 -o
# naviseccli -h cx4-dr -Scope 0 -User admin -Password <Nav pwd> snapview -createclonegroup -name lun2CloneGrp_1 -luns 2 -o
# naviseccli -h cx4-dr -Scope 0 -User admin -Password <Nav pwd> snapview -createclonegroup -name lun3CloneGrp_1 -luns 3 -o
- Add target LUNs to clone group and begin synchronizing data. When you create a clone group (above) and specify the “-luns” clause, the LUN number following the “-luns” argument is the source LUN for the clone, which in this case is the Replica LUN on the DR storage array. The following will create a clone group for LUN 11 (mapped to Replica LUN 1), LUN 12 (mapped to Replica LUN 2), and LUN 13 (mapped to Replica LUN 3)
# naviseccli -h cx4-dr -Scope 0 -User admin -Password <Nav pwd> snapview -addclone -name lun1CloneGrp_1 -luns 11 -syncrate high
# naviseccli -h cx4-dr -Scope 0 -User admin -Password <Nav pwd> snapview -addclone -name lun1CloneGrp_2 -luns 12 -syncrate high
# naviseccli -h cx4-dr -Scope 0 -User admin -Password <Nav pwd> snapview -addclone -name lun1CloneGrp_3 -luns 13 -syncrate high
- Wait for clone synchronization to complete. You can use the below to monitor this based on the clone group configurations above:
# naviseccli -h cx4-dr -Scope 0 -User admin -Password <Nav pwd> snapview -listclone -name lun1CloneGrp_1 | egrep ‘(^Name|^CloneState|^CloneCon|^Percent)’
# naviseccli -h cx4-dr -Scope 0 -User admin -Password <Nav pwd> snapview -listclone -name lun1CloneGrp_2 | egrep ‘(^Name|^CloneState|^CloneCon|^Percent)’
# naviseccli -h cx4-dr -Scope 0 -User admin -Password <Nav pwd> snapview -listclone -name lun1CloneGrp_3 | egrep ‘(^Name|^CloneState|^CloneCon|^Percent)’
- Put source database (PROD) in backup mode. First though, grab the max first_change# from V$ARCHIVED_LOG to show the earliest archived redo log we’ll need at a later step …
- Enable “Image Access” on the RecoverPoint Appliance (RPA). This is required to put the source of the clones, which are the RPA Replica LUNs, in a consistent state. If you omit this step you’ll get to the end of this, try to recover your database, and will be left with the only option to recover all the way up through the most current redo log on the primary site - something we don’t want to do … To enable image access through the RPA CLI:
# ssh admin@rpa1 ‘enable_image_access group=<your RP consistency group> copy=<name of copy site> image=latest’
- Fracture your SnapView clone
- Disable image access to RPA
- End backup mode on source
- Modify ASM diskgroup name. On ASM 11gR2, we can use “renamedg” from asmcmd, but since our test is on 11gR1, we need to use kfed to modify the header block of the ASM devices. First, do a “kfed read” on all devices that comprise the target ASM diskgroups you want to mount. Direct this to a text file, edit the file and search for string “grpname”. Change the diskgroup from “PROD_DG” to “CLN1_DG” and save the file. Then, use “kfed merge” to modify the disk.
- Mount ASM diskgroups
- Generate backup controlfile from source environment, edit and save so you have a “CREATE CONTROLFILE” script to use on your CLN1 database
- Build controlfile for CLN1
- At this point, in order for CLN1 to be recoverable, you need the archive log preceding the “begin backup” and archive log after the “end backup” in a place where CLN1 can see them. I use RMAN to copy these archivelogs to a location CLN1 can “see”
- Login to SQL*Plus with CLN1 set and set LOG_ARCHIVE_DEST_1 to the location you’ve copied the source archive logs to.
- Issue a “recover database using backup controlfile”
- Specify the archive logs copied from 3 steps ago, and cancel after the last one
- Open with RESETLOGS
- Add TEMP files
- Do whatever other post-cloning needs to be done
- Put source database in backup mode and note latest archive log
- Enable image access on RPA Replica LUNs (see previous section)
- Start Snapview Session. In the below example, the “-lun 1 2 3″ creates a snapshot session on Replca LUNs 1, 2, and 3
- Create Snapshots
- Activate Snapshots
- Add Snapshot LUNs to EMC Storage group. Assuming storage group is SG_drhost and snapshot LUNs will be named, 3000, 3001, and 3002 respectively. It’s good to map out which snapshot LUN numbers you want to use ahead of time
- Find and fix host (HP-UX) devices so they’re usable. Since we’re added a new set of LUNs to our storage group (3000, 3001, and 3002), we need to do the following on HP-UX for them to be visible and mountable:
# /sbin/init.d/agent stop
# ioscan -fnCdisk
# insf
# /sbin/init.d/agent start
#/sbin/powermt check force dev=all
# /sbin/powermt config
# /sbin/powermt save
Then do a “powermt display dev=all” and search for snp1_1, snp1_2, and snp1_3. Once you find these find the HP-UX device for these and symbolically link /asm_s1/disk1, /asm_s1/disk2, and /asm_s3/disk3 to these
- Disable image access on RPA (see previous section)
- End backup mode on source (see previous section
- Modify ASM block header on target SNP1 (see previous section). Use devices /asm_s1/disk1, /asm_s1/disk2, and /asm_s1/disk3 based on previous steps
- Create ASM diskgroups for SNP1 (see previous section). Reference above devices
- Mount ASM diskgroups for SNP1 (see previous section)
- Generate script to create controlfile (see previous section)
- Build controlfile for SNP1
- Find and backup needed archive logs to destination SNP1 can see (see previous section)
- Recover SNP1 (see previous section)
- Open SNP1 with RESETLOGS and add temp files
- You obviously won’t have to drop/dismount/create/mount ASM disk groups
- You won’t have to modify ASM block headers
- No need to symbolically link to HP-UX device names
- PROD is a production database running on prodhost
- CLN1 is an Oracle copy of production running on drhost and will be a complete SnapView clone of production
- CLN2 is an Oracle copy production running on drhost and will be a complete SnapView clone of production
- SNP1 is an Oracle copy production running on drhost and will be a SnapView snapshot of production
- SNP1 is an Oracle copy production running on drhost and will be a SnapView snapshot of production
- Ensure clone LUNs are in the proper CX4 Storage Group
- # /sbin/init.d/agent stop
- # ioscan -fnCdisk
- # insf
- # /sbin/init.d/agent start
- # /sbin/powermt display dev=all
- Examine output of PowerPath command above and note device names. For sake of example, we’ll focus on the first LUN, /u01, which we want to mount as /u01_cln1. The device for this is /dev/dsk/c80t0d1 (again, for example)
- # vgchgid /dev/dsk/c80t0d1
- # mkdir /dev/vgcln1
- # mknod /dev/vgcln1/group c 64 0×100000 — this “0×100000″ should be unique, check /dev/vg*/group*)
- # vgimport /dev/vgcln1 /dev/dsk/c80t0d1
- # vgchange -a y /dev/vgcln1
- # fsck /dev/vgcln1/lvol1
- # mkdir /u01_cln1
- # mount -o delaylog /dev/vgcln1/lvol1 /u01_cln1
Large SGA on 32-bit Redhat Linux
Many DBAs are familiar with the SGA size limitations on 32-bit platforms. This post shows how to allocate a 3Gb buffer cache on 32-bit RHAS 3.
- Step 1: Mount /dev/shm to as type ramfs. Edit /etc/fstab and add an entry like this:
none /dev/shm ramfs defaults,size=4G 0 0
- Step 2: do a “mount -a” to mount /dev/shm. I use ramfs instead of tmpfs because it doesn’t use swap; tmpfs does. With ramfs memory allocation will also grow dynamically, whereas when using tmpfs it will not.
- Step 3: As root, assuming your database is owned by the Linux account oracle, whose primary group is “dba”, do this:
# chown oracle:dba /dev/shm
- Step 4: Add the following to /etc/security/limits.conf to increase maximum memory lock parameters. Ensure “oracle” user has his environment sourced to establish these settings.
oracle soft memlock 3145728
oracle hard memlock 3145728
- Step 5: Login as oracle and do “ulimit -a”; validate that memlock is set to the above values
- Step 6: Edit /etc/sysctl.conf and add/change the following. When complete, do “sysctl -w” to activate changes into the Linux kernel. The “vm.hugetbl_pool” setting below is set based on the output of a script provided here (http://download-uk.oracle.com/docs/cd/B28359_01/server.111/b32009/appi_vlm.htm) - run this AFTER Oracle is started at the completion of this document to get a realistic value for vm.hugetbl_pool and adjust accordingly
kernel.sem = 1000 32000 100 150
kernel.shmmax = 4294967295
kernel.shmall = 4194304
net.ipv4.ip_local_port_range = 1024 65000
vm.pagecache = 10 20 30
kernel.shmmni=4096
vm.hugetlb_pool=4096
- Step 7: Unset db_cache_size, db_xk_cache_size, sga_target, sga_max_size, memory_target init.ora parameters and manually set shared_pool_size to appropriate value. You can use “show sga” to determine this
- Step 8: Set use_indirect_data_buffers=true
- Step 9: Set db_block_buffers such that the product of db_block_buffers and db_block_size = 3G
- Step 10: In oracle’s .profile/.bash_profile, set DISABLE_MAP_LOCK=1. This is required to avoid unnecessarily long connect times for databases that are connected to frequently
export DISABLE_MAP_LOCK=1
- Step 11: Stop oracle, source environment, ensure O/S limits are correct (ulimit -a), ensure /dev/shm is owned by oracle (ls -al /dev/shm), ensure DISABLE_MAP_LOCK=1, and then start Oracle
- Step 12: Add the following to /etc/rc2.d/S99local:
mount /dev/shm
chown oracle:dba /dev/shm
Test. Enjoy the benefits of a large cache. Test across reboots to ensure /dev/shm is mounted correctly.
Using DBMS_SQLDIAG
As a DBA, have you ever had an issue you suspected may be a bug and have been asked to generate “test data” for Oracle development? We ran into this at a client recently and found a cool 11g utility (at least I *think* it’s new :)) to get Oracle Support the data they need to reproduce the issue in-house.
The situation had to do with optimizer_features_enable being set to 11.1.0.7 in a database recently upgraded form 9.2.0.5 in an Oracle eBusiness Suite environment (11.5.10.2) that utilized Oracle Reports heavily. Our problem was specific to a custom Report that had optimizer hints embedded in the main query. With optimizer_features_enable set to 11.1.0.7, the report retrieved no rows. With it set to 9.2.0.x or 10.2.0.x, it returned the proper number of rows. Based on this data discrepancy, we felt it safe to set optimizer_features_enable < 11.1.0.7 across the board, as we didn’t know the scope of the issue.
During our work on an SR with Oracle, they asked for us to send the query, all the versions and optimizer settings, as well as an export of all the tables involved in the query with exported optimizer statistics. Our problem was several-fold; first, the query was complex and had many embedded views, so we didn’t really want to spend time deconstructing it to get a comprehensive list of tables to export. Second, the underlying tables were very large (hundreds of millions of rows for a few of them) and we didn’t have disk space or quite frankly, time, to export all the tables in their entirety.
Enter DBMS_SQLDIAG …
Using DBMS_SQLDIAG and 11g Data Pump features, we were able to quickly generate a complete test case to export a subset of the rows from all the impacted tables, very quickly and with minimal disk space requirements. Here’s what we did:
- Grabbed the offending query from a TKPROF’d trace file
- Used this syntax to generate a test case:
declare
tc_out clob;
begin
dbms_sqldiag.export_sql_testcase(directory=>’<directory>’,
sql_text=>’<SQL Text>’,
testcase => tc_out,
exportdata=>TRUE,
samplingpercent=>1);
end;
/
In the above example, note the following:
- <directory> is a valid directory - check DBA_DIRECTORIES
- <SQL Text> is the SQL statement from the TKPROF output
- exportdata=>TRUE tells DBMS_SQLDIAG to export the data from the base tables
- samplingpercent=>1 tells Data Pump to use a 1% sampling size. This was important to limit the number of rows
After executing, a number of XML, log, and Data Pump export dumps are generated to <directory> and available to upload to the SR!
One additional step we took was to export table statistics for all the tables involved in the query. For this, we looked in the log file for all tables export and used DBMS_STATS.EXPORT_TABLE_STATS to export segment statistics.
The obvious benefit here is that it enabled us to continue working on the SR, but some other possible applications of DBMS_SQLDIAG could be for internal testing purposes, testing functionality/performance across versions of Oracle without a complicated upgrade, regression testing, and so forth.
Connecting an Oracle Database with a DB2 Database
On a recent project we had a business and technology reason to connect Oracle EBS running 11g to a DB2 database. After a few weeks of researching for the best practice and approach we decided on the following implementation steps.
- Download a DB2 driver, such as from DataTek, and download to your application server in any storage directory.
- Create a directory called “YM” under your custom application top. example: $CUSTOM_TOP/java/YM. (Assumes you have already created a $CUSTOM_TOP and it has a directory called “java”.
- Copy all files from the storage directory to $CUSTOM_TOP/java/YM
- Add 3 entries to s_adovar_classpath and s_adovar_afclasspath to point to the files in $CUSTOM_TOP/java/YM.
- Make sure you run auto-config on your updated environment.
- Bounce the application and database tier.
- Perform any select, insert, update and or delete from the Oracle EBS application via java program to the DB2 database. The java program will reference the DB2 connection from a jdbc connect string such as: “jdbc:oracle:db2://servername.com:port;databasename=xxxxxx; User=xxxx; Password=xxxx”
Future Proofing Oracle EBS: Tip 1
Most organizations leave their mission critical systems vulnerable to security threats and performance degradation. Recent studies show that organizational constraints are at the core of this issue. These constraints range from organizational policies, budgets and staff availability and even competency.
26 percent of organizations apply critical security and performance patches as they are released, 20 percent will eventually apply 6 to 9 months later while another 20 percent have no requirement at all to apply them. That leaves many companies data unprotected.
There are several ways to combat these constraints. One way is gain support from your organization. This can come in the form of support from executives, security / audit team and other critical policy makers. By establishing new and or updating existing security policies, and as a critical component, patching, these guidelines and timeliness of patching removes some of these constraints. A second way to protect and improve data integrity is to use technology. By keeping current on latest software, technology can help you reduce time and budget constraints of getting current applications and databases tested and allow your team to proactively assess the impact to your enterprise.
Centroid believes in 2 degrees of continuous change instead of wide arcs of re-activeness. These small degrees of change allows for continuous improvement across the organization. Whether it be through technology or process slight proactive adjustments make the enterprise perform at a higher degree then you can ever accomplish by large scale reactive measures.
Centroid is in the business of Future Proofing Oracle environments. In fact, Centroid is the leader in Future Proofing your Business and Technology and no one does it better. No one.
Oracle EBS 12.1.1 Upgrade from 12.0.6 - Lessons Learned
So you’re going to patch from 12.0.6 to 12.1.1? It’s basically like any other major upgrade with Oracle’s e-Business Suite; tons of pieces to upgrade, patch, validate, and a high potential for running into problems. The documentation for all the “gotchas” is all over the place, but Metalink document 752619.1 is the best place to start and has all the major steps listed. In general, the following is required:
- Confirm database version
- Upgrade 10.1.3 home to 10.1.3.4
- Upgrade 10.1.2 home to 10.1.2.3
- Apply necessary pre-req patches
- Apply maintenance pack (7303030)
- Apply post-maintenance pack patches
- Do all the post-upgrade steps
Below are some of the lessons learned in an environment in which we patched 12.0.6 to 12.1.1 and along the way, upgraded the 10.2.0.4 database to 11.1.0.7.
Oracle XDB core-dumps (look for ORA-31114)
- Follow Metalink document 760611.1
- We’re not sure exactly why this happened, but something in the XDB upgrade from 10.2.0.4 to 11.1.0.7 failed
AZ tables/objects invalid and workers fail on 7303030 (azR12active.sql)
- Result of XDB core-dump issue above
- Follow Metalink document 832459.1
Worker fails on EGOSILDU.sql
- Per Metalink document 847687.1, this is a bug patched with 8485942. This patch, 8485942, isn’t available as of 10/10/2009
- Per the recommendation in 847687.1, skipped worked and continued
Relink fails on RAXTRX
- Choose to continue. This is due to bug
- Look in 841113.1 for patches. These are the ones I found that fixed this, plus some other issues found during testing …
8239041
7721420
7660309
7673650
8393427
8414254
8414069
Failures during 7303030 compiling multiple forms and libraries
- Choose to continue. Manually recompile using adadmin after the patch, and if things still fail look for patches
Issues applying 10.1.3.4 IAS_ORACLE_HOME patch (7272722)
- Receive failures trying to start OPMN while at 83% complete on initial screen
- Environment was cloned and adcfgclone’d successfully from a 12.0.6 instance
- adcfgclone.pl will update all the right stuff in $ORA_CONFIG_HOME correctly, but several files in $IAS_ORACLE_HOME will not be updated with the correct environment-specific values
- Change hard-coded paths in the following:
$IAS_ORACLE_HOME/inventory/Clone/clone.xml
$IAS_ORACLE_HOME/config/ias.properties
$IAS_ORACLE_HOME/opmn/bin/opmnctl
$IAS_ORACLE_HOME/Apache/Apache/bin/iasobf
- Make sure you can run $IAS_ORACLE_HOME/opmn/bin/opmnctl startall|stopall|status
- Resume 7272722
Failures with 7303030 doing FndXdfCmp
- I’m pretty sure this was a site-specific issue related to not cloning cleanly
- If the adworkxxx.log file looks like this …
java.lang.reflect.InvocationTargetException
at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at oracle.apps.ad.worker.AdJavaWorker.invokeUtility(AdJavaWorker.java:501)
at oracle.apps.ad.worker.AdJavaWorker.main(AdJavaWorker.java:356)
Caused by: java.lang.NoSuchFieldError: mTypeData
at oracle.apps.fnd.odf2.FndInitType.doCmpType(FndInitType.java:156)
at oracle.apps.fnd.odf2.FndXdfCmp.doOdfCmp(FndXdfCmp.java:98)
at oracle.apps.fnd.odf2.FndXdfCmp.applyXDF(FndXdfCmp.java:186)
- … hopefully you’ve got an existing 12.1.1 instance to refer to
- Look in $JAVA_TOP/oracle/apps/odf2 for class files containing mTypeData. You should see 3 class files
- Restore anything missing from a clean instance
- Generate all JAR Files
- Kill patch
- Resume
Patch takes a very long time on adobjcmp.sql (11.1.0.7)
- This will take a long time, it’s compiling all invalid objects and there will be 50,000+
- Look for lots of time on “resmgr:cpu quantum” wait event
- Do “alter system set resouce_manager_plan=””
- Bounce the database when everything is done
Are you trying to use RapidWiz to Upgrade to 12.1.1 from 12.0.6?
- Don’t. This is for an 11.5.x upgrade
More to come as I find them, I’m sure …
What you can do with ASH: Top Resource Consuming SQL
The following SQL shows you the top resource-consuming pieces of SQL in your instance:
select ash.SQL_ID ,
sum(decode(ash.session_state,'ON CPU',1,0)) "CPU",
sum(decode(ash.session_state,'WAITING',1,0)) -
sum(decode(ash.session_state,'WAITING', decode(en.wait_class, 'User I/O',1,0),0)) "WAIT" ,
sum(decode(ash.session_state,'WAITING', decode(en.wait_class, 'User I/O',1,0),0)) "IO" ,
sum(decode(ash.session_state,'ON CPU',1,1)) "TOTAL"
from v$active_session_history ash,
v$event_name en
where SQL_ID is not NULL
and en.event#=ash.event#
group by sql_id
order by sum(decode(session_state,'ON CPU',1,1)) desc
“Oracle is slow, can you see if anything is going on?”
As DBAs, we’re faced with this question all the time. In order to quickly supply an accurate answer to this question, an experienced Oracle DBA needs to have a few tools in his belt - and I’m not talking about any special software or monitoring solutions, I’m talking simply SQL*Plus scripts and access to a database account with access to the V$ views.
Here’s what I do when someone asks me this question:
Step #1:
Take a look at V$SESSION_WAIT. This will show you details about sessions currently and actively waiting on named Oracle wait events. More often than not, if things are “slow”, a session or sessions is waiting on an instrumented Oracle wait event. The query I use to do this (works on 9i-11g) is below:
select A.sid,
decode(A.event,'null event','CPU Exec',A.event) WaitEvent,
decode(A.event,'slave wait','N/A',
'PX Deq: Execution Msg','N/A',
'PX Deq Credit: send blk','N/A',
'latch free','N/A',
'enqueue',
chr(bitand(A.p1,-16777216)/16777215)||chr(bitand(A.p1,16711680)/65535),
'file open','-1',to_char(A.p1)) p1,
decode(A.event,'enqueue',decode(mod(A.p1,16),'6','ROW-LOCK','4','ITL','3',
'FK?','OTHER'),
'file open',
-1,
A.p2) p2,
decode(A.event,'latch free','N/A','enqueue',null,'PX qref latch','-1',
'buffer busy waits',to_char(A.p3), A.p3) p3,
decode(A.state,'WAITING','WTG',
'WAITED UNKNOWN TIME','UNK',
'WAITED SHORT TIME','WST',
'WAITED KNOWN TIME','WKT') wait_type,
decode(A.state,'WAITING',A.seconds_in_wait,
'WAITED UNKNOWN TIME',-999,
'WAITED SHORT TIME',A.wait_time,
'WAITED KNOWN TIME',A.WAIT_TIME) wt,
round((last_call_et/60),2) lc,
substr(nvl(b.module,b.program),1,15) pgm
from v$session_wait A,
v$session B
where A.event not in ('Queue Monitor Slave Wait','wait for unread message on broadcast channel','Queue Monitor Wait','jobq slave wait','queue messages','SQL*Net message to client','Null event','rdbms ipc message','i/o slave wait','io done')
and A.event <> 'pipe get'
and A.event not like '%akeup%'
and A.event not like 'Streams AQ%'
and A.state in ('WAITING','WAITED KNOWN TIME')
and A.sid=B.sid
and B.status='ACTIVE'
order by 1
/
Sample output is below (you’ll have to set column headings and other SQL*plus formatting options, but you get the point):
Sid Wait Event P1 P2 P3 Typ Time last call What
------ ------------------------------ ---------- ---------- ---------- --- -------- ---------------
4518 gc buffer busy 24 38019 65537 WTG 0 .00
4519 gc buffer busy 24 38019 65537 WTG 0 .00 XXVG_INV_PICKLI
4680 gc buffer busy 24 38019 65537 WTG 0 .00
4830 gc buffer busy 24 38019 65537 WTG 0 .00 FNDRSSUB
4886 smon timer 300 0 0 WTG 29 18447.47 oracle@usplsvpe
4887 control file parallel write 2 4 2 WTG 0 18447.47 oracle@usplsvpe
4893 gcs remote message 24 0 0 WTG 0 18447.47 oracle@usplsvpe
4895 gcs remote message 24 0 0 WTG 0 18447.47 oracle@usplsvpe
4896 ges remote message 64 0 0 WTG 152 18447.47 oracle@usplsvpe
4899 DIAG idle wait 1 1 200 WTG 1106848 18447.47 oracle@usplsvpe
4900 pmon timer 300 0 0 WTG 680 18447.47 oracle@usplsvpe
In this output, you’ll see a handful of sessions waiting on “gc buffer busy” wait events. At this point, it’s time for the Oracle DBA to study up on what the wait events mean; in this case, sessions are waiting on RAC-related global buffer busy waits, which means that blocks are being used and are pinned in another instance’s cache. I won’t go into a description on what all the wait events mean here - you can look them up at any of the following URLs:
http://download.oracle.com/docs/cd/B19306_01/server.102/b14237/waitevents.htm#REFRN101
http://download.oracle.com/docs/cd/B19306_01/server.102/b14237/waitevents003.htm#BGGIBDJI
http://download.oracle.com/docs/cd/B19306_01/server.102/b14211/instance_tune.htm#i22670
http://www.adp-gmbh.ch/ora/tuning/event.html
http://metalink.oracle.com/metalink/plsql/ml2_documents.showDocument?p_database_id=NOT&p_id=34405.1
http://metalink.oracle.com/metalink/plsql/ml2_documents.showDocument?p_database_id=NOT&p_id=62172.1
At this point, you know who’s waiting on what and you can use the output to look for anomalies for the current environment. A couple of things to note:
- There are a handful of common wait events in any “busy” Oracle environment; specifically, “db file sequential read”, “db file scattered read”, latch-related, enqueue-related (locks), etc. You should be familiar with what types of waits are “normal” for a given system.
- You should become familiar with the relative quantity of each type of wait for each system at various times during the day. For example, at client A, with a new implementation, low transaction volume, not many users, you may never see more than a handful of I/O-related waits at any given time. At this client, if you see several dozen sessions waiting on the same type or class of wait event, it’s probably a cause for concern. At a different client, it may be typical to see 20 or 30 I/O-related waits at any given time. Bottom line is this - you need to have familiarity with the system you’re monitoring.
- Any DBA with his salt should become study Oracle’s wait interface and become familiar with what each of the major wait events means
- You can use Centroid’s “CCEO Infra Wait Interface.ppt” document as a quick reference on the wait interface
Step #2:
Grab the SQL for the sessions that show up repeatedly and frequently in the output from the above query. Note the SID (Session Identifier) and use it as input to the following script:
select
t.sql_fulltext ,
t.buffer_gets, t.disk_reads,t.executions
from v$session s,
v$sql t
where s.sql_address =t.address and s.sql_hash_value =t.hash_value
and s.sid = &&1
/
Format the out of this if you plan on running an execution plan on it.
If you want additional detail about the session(s) from V$SESSION_WAIT, you can query V$SESSION.
Step #3:
If the SQL statements extracted from the previous step are waiting on I/O-related or contention-related waits, you should grab an execution plan/explain plan by taking the formatted SQL and plugging into the below script:
set lines 120
explain plan for
<< insert SQL here >>
select * from table(dbms_xplan.display(null, null,'all'));
Step #4:
If the slowness is related to, for example, locks (enqueue waits), find out who the lock holder(s) is by querying V$LOCK or DBA_WAITERS and make a judgement call as to whether to kill the session(s) holding the lock, communicate with the end-user, etc.
Step #5:
Fix it. This could be a quick-fix (resolving a lock), or more likely will take some time to assess. If your cause of slowness is I/O-related waits, for example, you need to determine whether the SQL is optimized, whether indexes will help, whether concurrency patterns are abnormal (i.e., are there 50 simultaneous executions of a batch job that should only be running serially?), etc. SQL optimization is a science in itself that requires knowledge of the underlying data structures and data volumes, as well as an understanding of Oracle’s optimizer.
Step #6:
What if V$SESSION_WAIT doesn’t tell you anything meaningful? This is when you should consult ASH (Active Session History) views to give you time breakdown details:
select
decode(nvl(to_char(s.sid),-1),-1,'DISCONNECTED','CONNECTED')
"STATUS",
topsession.sid "SID",
topsession.program "PROGRAM",
max(topsession.CPU) "CPU",
max(topsession.WAIT) "WAITING",
max(topsession.IO) "IO",
max(topsession.TOTAL) "TOTAL"
from (
select * from (
select
ash.session_id sid,
ash.session_serial# serial#,
ash.user_id user_id,
ash.program,
sum(decode(ash.session_state,'ON CPU',1,0)) "CPU",
sum(decode(ash.session_state,'WAITING',1,0)) -
sum(decode(ash.session_state,'WAITING',
decode(wait_class,'User I/O',1, 0 ), 0)) "WAIT" ,
sum(decode(ash.session_state,'WAITING',
decode(wait_class,'User I/O',1, 0 ), 0)) "IO" ,
sum(decode(session_state,'ON CPU',1,1)) "TOTAL"
from v$active_session_history ash
group by session_id,user_id,session_serial#,program
order by sum(decode(session_state,'ON CPU',1,1)) desc
) where rownum < 10
) topsession,
v$session s,
all_users u
where
u.user_id =topsession.user_id and
/* outer join to v$session because the session might be disconnected */
topsession.sid = s.sid (+) and
topsession.serial# = s.serial# (+)
group by topsession.sid, topsession.serial#,
topsession.user_id, topsession.program, s.username,
s.sid,s.paddr,u.username
order by max(topsession.TOTAL) desc
/
The output may look like this:
STATUS Sid PROGRAM CPU WAITING IO TOTAL
------------ ----- ------------------------- ----------- ---------- ----- ------
DISCONNECTED 4518 11220 9 83 11312
CONNECTED 4584 9620 25 50 9695
DISCONNECTED 4683 das@usplsvpba002.verigy.n 5598 258 735 6591
et (TNS V1-V3)
CONNECTED 4888 oracle@usplsvped002.verig 483 956 0 1439
y.net (LGWR)
CONNECTED 4897 oracle@usplsvped002.verig 7 1119 0 1126
y.net (LMON)
DISCONNECTED 4614 158 75 552 785
DISCONNECTED 4698 sqlservr.exe 52 20 695 767
DISCONNECTED 4491 sqlservr.exe 102 35 496 633
DISCONNECTED 4698 sqlservr.exe 14 9 578 601
You can use the methods in Steps 2-3 above to get details about the sessions above.
Step #7:
If nothing stands out at this point, consult system logs and Oracle alert logs, as well as O/S performance tools (sar, top, glance, etc)
Oracle 10g ASH: What is it Anyway?
As a DBA, how many times have you been asked… “what happened at 3:00am today, everything slowed down?” Try answering this question with V$SESSION_WAIT – you can’t. Try answering by looking at AWR, Automatic Workload Repository, or Statpack reports – you may be able to make some guesses, but they’d only be guesses. Try answering with data in the alert log or data in O/S logs – you’ll only be able to answer this question if you get very lucky. Try answering by looking at Oracle Enterprise Manager Grid Control 10g. Well actually, this gets you on the right track, because Grid Control will show you ASH, Activity Session History data!
ASH is built into the kernel in Oracle 10g and MMON and MMNL do the work! ASH collects data every second from V$SESSION, V$SESSTAT, V$SESSION_WAIT, etc, and populates V$ACTIVE_SESSION_HISTORY. An hour of data is kept in memory for easy access, just query V$ACTIVE_SESSION_HISTORY (think of it as V$SESSION_WAIT plus some more stuff, but longer lasting). Data is flushed to disk every hour and externalized via views WRH$_<ASH name>. The collection interval and volume self-adjusts based on load. ASH will collect more aggressively when the system is busy, less so when not.
In short, ASH gives you a way to use all your “right now” tuning methods for time periods in the past. When someone asks you what happened at X:00AM when you were fast asleep, you have an option now. Give it a try!
Look for deeper dives into ASH in future posts. Stay tuned!











