Monitoring Exadata Cell Servers with Metrics

Cell Server Metrics

Metrics are recorded observations of run-time properties and internal instrumentation values for the storage cell and its components, such as cell disks and grid disks. You can list the metric definitions available for a given object type from CellCLI:

CellCLI> list metricdefinition where objectType='CELLDISK' <detail>
CellCLI> list metricdefinition where objectType='GRIDDISK' <detail>

The optional “detail” clause (shown in angle brackets above) makes the listing show every attribute of each metric definition, such as its description, metricType, objectType, and unit, rather than only its name. There is a wide range of metric definitions for each object type, so let’s start with the grid disk metrics and see what is available to monitor:

CellCLI> list metricdefinition where objectType='GRIDDISK' detail;
name: GD_IO_BY_R_LG
description: "Number of megabytes read in large blocks from a grid disk"
metricType: Cumulative
objectType: GRIDDISK
unit: MB

name: GD_IO_BY_R_LG_SEC
description: "Number of megabytes read in large blocks per second from a grid disk"
metricType: Rate
objectType: GRIDDISK
unit: MB/sec

name: GD_IO_BY_R_SM
description: "Number of megabytes read in small blocks from a grid disk"
metricType: Cumulative
objectType: GRIDDISK
unit: MB

name: GD_IO_BY_R_SM_SEC
description: "Number of megabytes read in small blocks per second from a grid disk"
metricType: Rate
objectType: GRIDDISK
unit: MB/sec

name: GD_IO_BY_W_LG
description: "Number of megabytes written in large blocks to a grid disk"
metricType: Cumulative
objectType: GRIDDISK
unit: MB

name: GD_IO_BY_W_LG_SEC
description: "Number of megabytes written in large blocks per second to a grid disk"
metricType: Rate
objectType: GRIDDISK
unit: MB/sec

name: GD_IO_BY_W_SM
description: "Number of megabytes written in small blocks to a grid disk"
metricType: Cumulative
objectType: GRIDDISK
unit: MB

name: GD_IO_BY_W_SM_SEC
description: "Number of megabytes written in small blocks per second to a grid disk"
metricType: Rate
objectType: GRIDDISK
unit: MB/sec

name: GD_IO_ERRS
description: "Number of IO errors on a grid disk"
metricType: Cumulative
objectType: GRIDDISK
unit: Number

name: GD_IO_ERRS_MIN
description: "Number of IO errors on a grid disk per minute"
metricType: Rate
objectType: GRIDDISK
unit: /min

name: GD_IO_RQ_R_LG
description: "Number of requests to read large blocks from a grid disk"
metricType: Cumulative
objectType: GRIDDISK
unit: "IO requests"

name: GD_IO_RQ_R_LG_SEC
description: "Number of requests to read large blocks per second from a grid disk"
metricType: Rate
objectType: GRIDDISK
unit: IO/sec

name: GD_IO_RQ_R_SM
description: "Number of requests to read small blocks from a grid disk"
metricType: Cumulative
objectType: GRIDDISK
unit: "IO requests"

name: GD_IO_RQ_R_SM_SEC
description: "Number of requests to read small blocks per second from a grid disk"
metricType: Rate
objectType: GRIDDISK
unit: IO/sec

name: GD_IO_RQ_W_LG
description: "Number of requests to write large blocks to a grid disk"
metricType: Cumulative
objectType: GRIDDISK
unit: "IO requests"

name: GD_IO_RQ_W_LG_SEC
description: "Number of requests to write large blocks per second to a grid disk"
metricType: Rate
objectType: GRIDDISK
unit: IO/sec

name: GD_IO_RQ_W_SM
description: "Number of requests to write small blocks to a grid disk"
metricType: Cumulative
objectType: GRIDDISK
unit: "IO requests"

name: GD_IO_RQ_W_SM_SEC
description: "Number of requests to write small blocks per second to a grid disk"
metricType: Rate
objectType: GRIDDISK
unit: IO/sec

CellCLI>

As the metricType attribute in the listing above shows, the grid disk metrics are either cumulative (running totals) or rates (values per second or per minute); your monitoring needs should dictate which type you display. Rather than go through every monitoring case, the table below lists some common scenarios you may wish to report on. The table mostly uses rate metrics, but in most cases you can get the cumulative value by dropping the “_SEC” suffix from the metric name.
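To make the naming relationship concrete, here is a minimal Python sketch that parses "detail" output into records and pairs each Rate metric with its cumulative counterpart by stripping the "_SEC" suffix. The helper names (parse_metric_definitions, cumulative_counterpart) are hypothetical, and the sketch assumes each attribute sits on a single "attribute: value" line with records separated by blank lines.

```python
# Three records copied from the grid disk listing above.
SAMPLE = """\
name: GD_IO_BY_R_LG
description: "Number of megabytes read in large blocks from a grid disk"
metricType: Cumulative
objectType: GRIDDISK
unit: MB

name: GD_IO_BY_R_LG_SEC
description: "Number of megabytes read in large blocks per second from a grid disk"
metricType: Rate
objectType: GRIDDISK
unit: MB/sec

name: GD_IO_ERRS_MIN
description: "Number of IO errors on a grid disk per minute"
metricType: Rate
objectType: GRIDDISK
unit: /min
"""


def parse_metric_definitions(text):
    """Split blank-line-separated 'attribute: value' records into dicts."""
    records, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if current:
                records.append(current)
                current = {}
            continue
        key, _, value = line.partition(":")
        current[key.strip()] = value.strip().strip('"')
    if current:
        records.append(current)
    return records


def cumulative_counterpart(rate_name, known_names):
    """Return the cumulative metric for a *_SEC rate metric, or None."""
    if rate_name.endswith("_SEC"):
        candidate = rate_name[: -len("_SEC")]
        if candidate in known_names:
            return candidate
    return None


definitions = parse_metric_definitions(SAMPLE)
names = {d["name"] for d in definitions}
for d in definitions:
    if d["metricType"] == "Rate":
        print(d["name"], "->", cumulative_counterpart(d["name"], names))
```

GD_IO_ERRS_MIN has no match here because it ends in "_MIN" rather than "_SEC", which is one reason the suffix rule only holds in most cases.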

| Monitoring Requirement | objectType | CellCLI Command |
| --- | --- | --- |
| Cell CPU utilization | Cell | list metriccurrent where name='CL_CPUT'; |
| Cell memory utilization | Cell | list metriccurrent where name='CL_MEMUT'; |
| Cell temperature | Cell | list metriccurrent where name='CL_TEMP'; |
| Total IO packets received/second | Cell | list metriccurrent where name='N_NIC_RCV_SEC'; |
| Total IO packets transmitted/second | Cell | list metriccurrent where name='N_NIC_TRANS_SEC'; |
| MB read in large blocks/sec | Cell Disk | list metriccurrent where name='CD_IO_BY_R_LG_SEC'; |
| MB written in large blocks/sec | Cell Disk | list metriccurrent where name='CD_IO_BY_W_LG_SEC'; |
| MB read in small blocks/sec | Cell Disk | list metriccurrent where name='CD_IO_BY_R_SM_SEC'; |
| MB written in small blocks/sec | Cell Disk | list metriccurrent where name='CD_IO_BY_W_SM_SEC'; |
| Avg IO load of cell disk | Cell Disk | list metriccurrent where name='CD_IO_LOAD'; |
| Number of large read requests/second to cell disk | Cell Disk | list metriccurrent where name='CD_IO_RQ_R_LG_SEC'; |
| Number of large write requests/second to cell disk | Cell Disk | list metriccurrent where name='CD_IO_RQ_W_LG_SEC'; |
| Number of small read requests/second to cell disk | Cell Disk | list metriccurrent where name='CD_IO_RQ_R_SM_SEC'; |
| Number of small write requests/second to cell disk | Cell Disk | list metriccurrent where name='CD_IO_RQ_W_SM_SEC'; |
| Avg latency of large reads from cell disk | Cell Disk | list metriccurrent where name='CD_IO_TM_R_LG_RQ'; |
| Avg latency of large writes to cell disk | Cell Disk | list metriccurrent where name='CD_IO_TM_W_LG_RQ'; |
| Avg latency of small reads from cell disk | Cell Disk | list metriccurrent where name='CD_IO_TM_R_SM_RQ'; |
| Avg latency of small writes to cell disk | Cell Disk | list metriccurrent where name='CD_IO_TM_W_SM_RQ'; |
| MB read in large blocks/sec | Grid Disk | list metriccurrent where name='GD_IO_BY_R_LG_SEC'; |
| MB written in large blocks/sec | Grid Disk | list metriccurrent where name='GD_IO_BY_W_LG_SEC'; |
| MB read in small blocks/sec | Grid Disk | list metriccurrent where name='GD_IO_BY_R_SM_SEC'; |
| MB written in small blocks/sec | Grid Disk | list metriccurrent where name='GD_IO_BY_W_SM_SEC'; |
| Number of large read requests/second to grid disk | Grid Disk | list metriccurrent where name='GD_IO_RQ_R_LG_SEC'; |
| Number of large write requests/second to grid disk | Grid Disk | list metriccurrent where name='GD_IO_RQ_W_LG_SEC'; |
| Number of small read requests/second to grid disk | Grid Disk | list metriccurrent where name='GD_IO_RQ_R_SM_SEC'; |
| Number of small write requests/second to grid disk | Grid Disk | list metriccurrent where name='GD_IO_RQ_W_SM_SEC'; |
| Number of MB/sec pushed out of FlashCache due to being 80% full | FlashCache | list metriccurrent where name='FC_BYKEEP_OVERWR_SEC'; |
| Number of MB used for ‘keep’ objects in FlashCache | FlashCache | list metriccurrent where name='FC_BYKEEP_USED'; |
| Number of MB used in FlashCache | FlashCache | list metriccurrent where name='FC_BY_USED'; |
| Number of MB/sec read from FlashCache | FlashCache | list metriccurrent where name='FC_IO_BY_R_SEC'; |
| Number of MB/sec written to FlashCache | FlashCache | list metriccurrent where name='FC_IO_BY_W_SEC'; |
| Number of reads/sec satisfied from FlashCache | FlashCache | list metriccurrent where name='FC_IO_RQ_R_SEC'; |
| Number of IO requests/second that resulted in FlashCache being populated | FlashCache | list metriccurrent where name='FC_IO_RQ_W_SEC'; |
| MB/sec received from host | Host Interconnect | list metriccurrent where name='N_MB_RECEIVED_SEC'; |
| MB/sec sent to host | Host Interconnect | list metriccurrent where name='N_MB_SENT_SEC'; |

Examples

Below, output is truncated in each example to save space.

Cell server CPU utilization:

[cellmon]$ dcli -g ../cell_group cellcli -e \
list metriccurrent where name='CL_CPUT';
cm01cel01: CL_CPUT cm01cel01 0.2 %
cm01cel02: CL_CPUT cm01cel02 0.2 %
cm01cel03: CL_CPUT cm01cel03 0.7 %
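Output in this shape is easy to post-process. The following Python sketch parses each dcli line into a record and flags cells above a CPU threshold. The MetricSample type, the parse_dcli_line helper, and the 0.5% threshold are hypothetical choices for illustration, and the sketch assumes each sample arrives on a single line in the form "cell: METRIC object value unit".

```python
from typing import NamedTuple


class MetricSample(NamedTuple):
    cell: str
    metric: str
    obj: str
    value: float
    unit: str


def parse_dcli_line(line):
    """Parse one 'cell: METRIC object value [unit]' line of dcli output."""
    cell, _, rest = line.partition(":")
    fields = rest.split()
    metric, obj, raw_value = fields[0], fields[1], fields[2]
    unit = " ".join(fields[3:])
    # CellCLI prints thousands separators in large values, so strip commas.
    value = float(raw_value.replace(",", ""))
    return MetricSample(cell.strip(), metric, obj, value, unit)


# Output copied from the CL_CPUT example above.
OUTPUT = """\
cm01cel01: CL_CPUT cm01cel01 0.2 %
cm01cel02: CL_CPUT cm01cel02 0.2 %
cm01cel03: CL_CPUT cm01cel03 0.7 %
"""

CPU_THRESHOLD = 0.5  # hypothetical alert threshold, in percent

samples = [parse_dcli_line(line) for line in OUTPUT.splitlines() if line.strip()]
busy = [s.cell for s in samples if s.value > CPU_THRESHOLD]
print(busy)  # ['cm01cel03']
```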

Average IO load of cell disks:

[cellmon]$ dcli -g ../cell_group cellcli -e \
list metriccurrent where name='CD_IO_LOAD'
cm01cel01: CD_IO_LOAD CD_00_cm01cel01 1
cm01cel01: CD_IO_LOAD CD_01_cm01cel01 1
cm01cel01: CD_IO_LOAD CD_02_cm01cel01 1
cm01cel01: CD_IO_LOAD CD_03_cm01cel01 1
cm01cel01: CD_IO_LOAD CD_04_cm01cel01 1
cm01cel01: CD_IO_LOAD CD_05_cm01cel01 1
cm01cel01: CD_IO_LOAD CD_06_cm01cel01 0
cm01cel01: CD_IO_LOAD CD_07_cm01cel01 1
cm01cel01: CD_IO_LOAD CD_08_cm01cel01 1
cm01cel01: CD_IO_LOAD CD_09_cm01cel01 1
cm01cel01: CD_IO_LOAD CD_10_cm01cel01 1
cm01cel01: CD_IO_LOAD CD_11_cm01cel01 1
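With one line per cell disk, a per-cell summary is often easier to read than the raw listing. This Python sketch averages CD_IO_LOAD across the disks of each cell. The function name average_by_cell is hypothetical, and the sketch assumes the numeric value is the third whitespace-separated field after the cell prefix.

```python
from collections import defaultdict

# Output copied from the CD_IO_LOAD example above.
OUTPUT = """\
cm01cel01: CD_IO_LOAD CD_00_cm01cel01 1
cm01cel01: CD_IO_LOAD CD_01_cm01cel01 1
cm01cel01: CD_IO_LOAD CD_02_cm01cel01 1
cm01cel01: CD_IO_LOAD CD_03_cm01cel01 1
cm01cel01: CD_IO_LOAD CD_04_cm01cel01 1
cm01cel01: CD_IO_LOAD CD_05_cm01cel01 1
cm01cel01: CD_IO_LOAD CD_06_cm01cel01 0
cm01cel01: CD_IO_LOAD CD_07_cm01cel01 1
cm01cel01: CD_IO_LOAD CD_08_cm01cel01 1
cm01cel01: CD_IO_LOAD CD_09_cm01cel01 1
cm01cel01: CD_IO_LOAD CD_10_cm01cel01 1
cm01cel01: CD_IO_LOAD CD_11_cm01cel01 1
"""


def average_by_cell(output):
    """Return {cell name: mean metric value across that cell's disks}."""
    values = defaultdict(list)
    for line in output.splitlines():
        if not line.strip():
            continue
        cell, _, rest = line.partition(":")
        _metric, _disk, raw_value = rest.split()[:3]
        values[cell.strip()].append(float(raw_value.replace(",", "")))
    return {cell: sum(v) / len(v) for cell, v in values.items()}


averages = average_by_cell(OUTPUT)
for cell, avg in averages.items():
    print(f"{cell}: average CD_IO_LOAD {avg:.2f}")
```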

Average latency of large reads from cell disks:

[cellmon]$ dcli -g ../cell_group cellcli -e \
list metriccurrent where name='CD_IO_TM_R_LG_RQ'
cm01cel01: CD_IO_TM_R_LG_RQ CD_00_cm01cel01 0.0 us/request
cm01cel01: CD_IO_TM_R_LG_RQ CD_01_cm01cel01 0.0 us/request
cm01cel01: CD_IO_TM_R_LG_RQ CD_02_cm01cel01 0.0 us/request
cm01cel01: CD_IO_TM_R_LG_RQ CD_03_cm01cel01 0.0 us/request
cm01cel01: CD_IO_TM_R_LG_RQ CD_04_cm01cel01 0.0 us/request
cm01cel01: CD_IO_TM_R_LG_RQ CD_05_cm01cel01 0.0 us/request
cm01cel01: CD_IO_TM_R_LG_RQ CD_06_cm01cel01 0.0 us/request
cm01cel01: CD_IO_TM_R_LG_RQ CD_07_cm01cel01 0.0 us/request
cm01cel01: CD_IO_TM_R_LG_RQ CD_08_cm01cel01 0.0 us/request
cm01cel01: CD_IO_TM_R_LG_RQ CD_09_cm01cel01 0.0 us/request

Number of large read requests/second to grid disks:

[cellmon]$ dcli -g ../cell_group cellcli -e \
list metriccurrent where name='GD_IO_RQ_R_LG_SEC'
cm01cel01: GD_IO_RQ_R_LG_SEC DATA_CD_00_cm01cel01 0.0 IO/sec
cm01cel01: GD_IO_RQ_R_LG_SEC DATA_CD_01_cm01cel01 0.0 IO/sec
cm01cel01: GD_IO_RQ_R_LG_SEC DATA_CD_02_cm01cel01 0.0 IO/sec
cm01cel01: GD_IO_RQ_R_LG_SEC DATA_CD_03_cm01cel01 0.0 IO/sec
cm01cel01: GD_IO_RQ_R_LG_SEC DATA_CD_04_cm01cel01 0.0 IO/sec
cm01cel01: GD_IO_RQ_R_LG_SEC DATA_CD_05_cm01cel01 0.0 IO/sec
cm01cel01: GD_IO_RQ_R_LG_SEC DATA_CD_06_cm01cel01 0.0 IO/sec
cm01cel01: GD_IO_RQ_R_LG_SEC DATA_CD_07_cm01cel01 0.0 IO/sec
cm01cel01: GD_IO_RQ_R_LG_SEC DATA_CD_08_cm01cel01 0.0 IO/sec
cm01cel01: GD_IO_RQ_R_LG_SEC DATA_CD_09_cm01cel01 0.0 IO/sec
cm01cel01: GD_IO_RQ_R_LG_SEC DATA_CD_10_cm01cel01 0.0 IO/sec

Cumulative number of large read requests to grid disks:

[cellmon]$ dcli -g ../cell_group cellcli -e \
list metriccurrent where name='GD_IO_RQ_R_LG'
cm01cel01: GD_IO_RQ_R_LG DATA_CD_00_cm01cel01 58,973 IO requests
cm01cel01: GD_IO_RQ_R_LG DATA_CD_01_cm01cel01 58,293 IO requests
cm01cel01: GD_IO_RQ_R_LG DATA_CD_02_cm01cel01 58,120 IO requests
cm01cel01: GD_IO_RQ_R_LG DATA_CD_03_cm01cel01 58,243 IO requests
cm01cel01: GD_IO_RQ_R_LG DATA_CD_04_cm01cel01 58,844 IO requests
cm01cel01: GD_IO_RQ_R_LG DATA_CD_05_cm01cel01 58,973 IO requests
cm01cel01: GD_IO_RQ_R_LG DATA_CD_06_cm01cel01 58,491 IO requests
cm01cel01: GD_IO_RQ_R_LG DATA_CD_07_cm01cel01 58,326 IO requests
cm01cel01: GD_IO_RQ_R_LG DATA_CD_08_cm01cel01 58,405 IO requests
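Cumulative counters become more useful when you compare two samples. The Python sketch below derives an average request rate over an interval of your own choosing from two readings of GD_IO_RQ_R_LG. The helper names are hypothetical, the second sample and the 60-second interval are invented purely for illustration, and the sketch assumes each sample is on a single line and that the counter did not reset between readings.

```python
def parse_counts(output):
    """Map grid disk name -> cumulative request count from dcli output."""
    counts = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        _cell, _, rest = line.partition(":")
        fields = rest.split()
        # fields: metric name, grid disk name, value, then the unit words
        counts[fields[1]] = int(fields[2].replace(",", ""))
    return counts


def request_rates(before, after, interval_seconds):
    """Average requests/second per disk between two cumulative samples."""
    return {
        disk: (after[disk] - before[disk]) / interval_seconds
        for disk in before
        if disk in after
    }


# First sample: two lines taken from the example above.
FIRST = """\
cm01cel01: GD_IO_RQ_R_LG DATA_CD_00_cm01cel01 58,973 IO requests
cm01cel01: GD_IO_RQ_R_LG DATA_CD_01_cm01cel01 58,293 IO requests
"""

# Second sample: hypothetical values, assumed to be taken 60 seconds later.
SECOND = """\
cm01cel01: GD_IO_RQ_R_LG DATA_CD_00_cm01cel01 59,573 IO requests
cm01cel01: GD_IO_RQ_R_LG DATA_CD_01_cm01cel01 58,293 IO requests
"""

rates = request_rates(parse_counts(FIRST), parse_counts(SECOND), 60)
print(rates)
# {'DATA_CD_00_cm01cel01': 10.0, 'DATA_CD_01_cm01cel01': 0.0}
```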


Summary

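Monitoring Exadata storage cells with metrics provides insight into the performance and availability of each component. The metricdefinition listing shows what can be measured for each object type, and metriccurrent, run across all cells with dcli, shows the current values.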