Objective 4.1 – Utilize Advanced vSphere Performance Monitoring Tools

Configure esxtop/resxtop custom profiles

These two tools are used to monitor and gather performance data from an ESXi host.

  • esxtop – This gives real-time CPU, memory, disk and network data for hosts and virtual machines. You can run esxtop from a direct connection to a host’s CLI.
  • resxtop – This is a remote version of esxtop. It is included as part of vCLI and is present on the vMA (vSphere Management Assistant). resxtop has three modes of operation: Interactive, Batch and Replay.

Example esxtop output:

 esxtop01

There are a lot of options available to you when using esxtop. When running the tool, pressing ‘h’ will show you what options are available:

Interactive commands are:
esxtop02

So, as an example, you can press ‘d’ to view statistics for the disk adapters:

esxtop03

From that screen you can press ‘f’ to select which fields you wish to view:

esxtop04

Once you have set it up to display the values you are interested in you can press ‘W’ (capital W, otherwise you will sort the output by writes) to save the config to a file. You can accept the default or type your own path as necessary.

esxtop05

Next time you need to view those specific stats in esxtop you can open it, referring to the file you saved:

esxtop06
esxtop07

Determine use cases for and apply esxtop/resxtop Interactive, Batch and Replay modes

esxtop/resxtop has three modes in which it can be run:

  • Interactive – This is the default mode, and the one used in the example above. By default statistics are collected at a 5 second interval, although this can be changed.
  • Batch – This mode is used for collecting statistics for a period of time, for later analysis. Once the stats have been collected, they can be analysed using Excel, Perfmon or ESXplot amongst other tools.
  • Replay – This mode allows you to replay data collected using the vm-support tool. You cannot replay data collected in batch mode interactively.

Collecting Data using Batch Mode

There are a few options to set when doing a batch mode collection – the various esxtop options are shown here:

esxtop08

The example below, run from an ESXi host, will collect stats at 5 second intervals, 20 times. The -a switch indicates that you want to collect all stats:

esxtop09

Rather than collect all stats, you can specify a configuration file such as the one I created earlier. Batch mode will then only capture the stats specified in the file.

esxtop10
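As a rough illustration of the two batch invocations described above, the sketch below assembles the esxtop argument list in Python. It relies on the standard esxtop switches (-b for batch mode, -d for the delay in seconds, -n for the number of iterations, -a for all stats, -c for a configuration file). The function names and the config path are hypothetical and exist only for this example.

```python
from typing import List, Optional


def esxtop_batch_args(delay_s: int, iterations: int,
                      config_file: Optional[str] = None) -> List[str]:
    """Build an esxtop batch-mode argument list.

    Without a configuration file, -a captures every counter.
    With one, -c restricts the capture to the fields saved in that file.
    """
    args = ["esxtop", "-b", "-d", str(delay_s), "-n", str(iterations)]
    if config_file is None:
        args.append("-a")
    else:
        args += ["-c", config_file]
    return args


def capture_seconds(delay_s: int, iterations: int) -> int:
    """Approximate length of the capture: delay multiplied by sample count."""
    return delay_s * iterations


# 5 second interval, 20 samples, all stats.
print(" ".join(esxtop_batch_args(5, 20)))
# Same capture limited to a saved profile (the path is a made-up example).
print(" ".join(esxtop_batch_args(5, 20, "/tmp/disk-profile.rc")))
print(capture_seconds(5, 20), "seconds of data")
```

Batch mode writes CSV to standard output, so in practice the command is normally redirected to a file on the host.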

Be aware that these captures can grow quickly, so ensure you have enough local disk space if running for extended periods of time.

esxtop11

With this in mind, it’s possible to compress the output file as it is collected:

esxtop12

The zipped file is significantly smaller than the unzipped version:

esxtop13
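Once the capture has been copied off the host, it can be read without unzipping it first. The following is a minimal sketch, assuming the compressed file is an ordinary gzip stream containing the batch-mode CSV. The function name is hypothetical.

```python
import csv
import gzip
from typing import Dict, List


def read_batch_capture(path: str) -> List[Dict[str, str]]:
    """Read an esxtop batch CSV, handling plain and gzip-compressed files.

    Each returned row maps a counter name (taken from the CSV header line)
    to its value for one sample, still as a string.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", newline="") as handle:
        return list(csv.DictReader(handle))
```

A capture with all stats enabled has a very wide header row, so it is usually worth picking out only the columns of interest after loading.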

Use vscsiStats to gather storage performance data 

You can use the vscsiStats tool to gather storage performance data for VMFS and NFS datastores. The tool is now available on a default install of ESXi, and is located in the /usr/sbin directory. Running the tool, without any arguments, will display the usage options:

vscsi01

The first step to collecting data is to get the world id for a virtual machine you wish to monitor. You can list the world ids by running ‘vscsiStats -l’, as shown below:

vscsi02

In this example we can see there are two running virtual machines on this host, one with 11 disks (my vCenter VCSA) and one with a single virtual disk (my vSphere Management Assistant). We can start a collection on the second VM’s disk by running:

vscsi03

The collection will run for 30 minutes unless it is stopped before then. It can be stopped by running:

vscsi04

If you want to extend the collection time, you can run the start command again whilst the collection is already running (but you have to do this every 30 minutes).

To reset the statistics without stopping the collection run:

vscsi05

To view the data, at any point whilst the collection is running, use the -p switch:

vscsi06

This will output the histogram data to the console. Alternatively you can choose to output the data to .csv by running:

vscsi07
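The vscsiStats steps above (list, start, stop, reset, print, CSV output) can be summarised as a small command builder. This is only a sketch: it uses the switches shown in the tool's usage output (-l, -s, -x, -r, -p, -c and -w for the world group ID), while the function name and the world group ID 12345 are hypothetical.

```python
from typing import List, Optional

# Switches for the actions that take no extra argument of their own.
_ACTION_FLAGS = {"start": "-s", "stop": "-x", "reset": "-r"}


def vscsistats_args(action: str, world_group_id: Optional[int] = None,
                    histogram: str = "all",
                    csv_output: bool = False) -> List[str]:
    """Build a vscsiStats argument list for one of the steps above."""
    args = ["vscsiStats"]
    if action == "list":
        return args + ["-l"]
    if action in _ACTION_FLAGS:
        args.append(_ACTION_FLAGS[action])
    elif action == "print":
        args += ["-p", histogram]
        if csv_output:
            args.append("-c")  # comma-separated output instead of histograms
    else:
        raise ValueError(f"unknown action: {action}")
    if world_group_id is not None:
        args += ["-w", str(world_group_id)]
    return args


# 12345 is a made-up world group ID taken from a hypothetical 'vscsiStats -l'.
for step in ("start", "reset", "stop"):
    print(" ".join(vscsistats_args(step, 12345)))
print(" ".join(vscsistats_args("print", 12345, "latency", csv_output=True)))
```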

Using the ‘-p all’ option will display all collected data in several histograms. The following metrics are represented:

  • seekDistance – The distance in logical block numbers (LBN) that the disk head must travel to read or write a block. If most of your seek distances are very small (less than 1), the workload is sequential in nature. If the seek distances are varied, the degree of randomness is roughly proportional to the distance travelled.
  • ioLength – The size of the I/O.
  • outstandingIOs – The number of I/Os in flight, which gives you an idea of any queuing that is occurring.
  • latency – The time taken for the I/O round trip.
  • interarrival – The amount of time in microseconds between virtual machine disk commands.
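To make the seekDistance interpretation concrete, the sketch below estimates how sequential a workload is from a seek distance histogram. It assumes the histogram has already been parsed into a mapping of bucket limit to I/O count. The bucket limits and counts in the example are invented, and the parsing of real vscsiStats output is not shown.

```python
from typing import Dict


def sequential_fraction(histogram: Dict[int, int], near: int = 1) -> float:
    """Fraction of I/Os that fall in buckets whose limit is within `near` LBNs.

    A value close to 1.0 suggests a mostly sequential workload. A low value
    suggests the I/O is spread over larger seek distances, i.e. more random.
    """
    total = sum(histogram.values())
    if total == 0:
        return 0.0
    close = sum(count for limit, count in histogram.items()
                if abs(limit) <= near)
    return close / total


# Invented sample: most I/Os land in the buckets around zero.
sample = {-1000: 5, -1: 10, 0: 0, 1: 80, 1000: 5}
print(f"{sequential_fraction(sample):.0%} of I/Os are near-sequential")
```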

Rather than display all metrics in the output, you can choose to only display the histograms related to the metric you are interested in by substituting ‘all’ for the name of the metric. For example:

vscsi08

Use esxtop/resxtop to collect performance data

Metrics to pay attention to are:

  • CMDS/s – This is the total number of commands per second, which includes IOPS and other SCSI commands (e.g. reservations and locks). Generally speaking CMDS/s = IOPS unless there are a lot of other SCSI operations/metadata operations such as reservations.
  • DAVG/cmd – This is the average response time in milliseconds per command being sent to the storage device.
  • KAVG/cmd – This is the average amount of time in milliseconds that a command spends in the VMkernel.
  • GAVG/cmd – This is the response time as experienced by the Guest OS. This is calculated by adding together the DAVG and the KAVG values.

As a general rule DAVG/cmd, KAVG/cmd and GAVG/cmd should not exceed 10 milliseconds (ms) for sustained lengths of time.
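The relationship between the three latency counters, and the 10 ms rule of thumb, can be sketched as follows. How many consecutive samples count as "sustained" is not defined in the text, so the default of three samples is purely an assumption for illustration, as are the function names.

```python
from typing import Sequence


def gavg_ms(davg_ms: float, kavg_ms: float) -> float:
    """Guest-observed latency: device latency plus VMkernel latency."""
    return davg_ms + kavg_ms


def sustained_over_limit(samples_ms: Sequence[float],
                         limit_ms: float = 10.0,
                         sustained: int = 3) -> bool:
    """True if `sustained` consecutive samples all exceed `limit_ms`."""
    run = 0
    for value in samples_ms:
        run = run + 1 if value > limit_ms else 0
        if run >= sustained:
            return True
    return False


# A single spike is ignored, a run of slow samples is flagged.
print(gavg_ms(8.0, 0.5))
print(sustained_over_limit([5.0, 12.0, 13.0, 14.0, 6.0]))
print(sustained_over_limit([12.0, 5.0, 12.0, 5.0]))
```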

There are also the following throughput metrics to be aware of:

  • READS/s – Number of read commands issued per second
  • WRITES/s – Number of write commands issued per second
  • MBREAD/s – Megabytes read per second
  • MBWRTN/s – Megabytes written per second

The sum of reads and writes equals IOPS, which is the most common benchmark when monitoring and troubleshooting storage performance. These metrics can be monitored at the HBA or Virtual Machine level.
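As a quick worked example of the throughput counters, the sketch below derives IOPS from READS/s and WRITES/s, and an average I/O size from the two MB/s counters. The average I/O size is not an esxtop counter; it is derived here only for illustration, and the sample figures are invented.

```python
def iops(reads_per_s: float, writes_per_s: float) -> float:
    """IOPS is the sum of read and write commands per second."""
    return reads_per_s + writes_per_s


def avg_io_size_kb(reads_per_s: float, writes_per_s: float,
                   mbread_per_s: float, mbwrtn_per_s: float) -> float:
    """Average I/O size in KB, derived from throughput divided by IOPS."""
    total_iops = iops(reads_per_s, writes_per_s)
    if total_iops == 0:
        return 0.0
    return (mbread_per_s + mbwrtn_per_s) * 1024 / total_iops


# Invented sample: 300 reads/s and 100 writes/s moving 25 MB/s in total.
print(iops(300, 100))
print(avg_io_size_kb(300, 100, 12.5, 12.5))
```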

Given esxtop/resxtop output, identify relative performance data for capacity planning purposes

esxtop17

CPU load average:

Use the CPU load average on the first line to determine the amount of use for all physical CPUs on the ESX Server. The load averages are displayed for one, five and fifteen-minute intervals.

A load average of 1.00 means that the ESX Server machine’s physical CPUs are fully utilized, and a load average of 0.5 means they are half utilized. On the other hand, a load average of 2.00 means that you either need to increase the number of CPUs or decrease the number of virtual machines running on the ESX Server machine because the system as a whole is overloaded.
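The load average reading described above amounts to a simple conversion, sketched here for the three intervals. The function name and the sample values are made up for the example.

```python
from typing import Tuple


def interpret_load(load_average: float) -> Tuple[float, bool]:
    """Return (utilisation as a percentage, whether the host is overloaded).

    1.00 means the physical CPUs are fully utilised, so anything above
    1.00 is treated as overloaded.
    """
    return load_average * 100.0, load_average > 1.0


# Invented one, five and fifteen minute load averages.
for label, value in (("1 min", 0.5), ("5 min", 1.0), ("15 min", 2.0)):
    percent, overloaded = interpret_load(value)
    print(f"{label}: {percent:.0f}% {'OVERLOADED' if overloaded else 'ok'}")
```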

PCPU UTIL(%)

The PCPU line shows the percentage of individual physical CPU use, for CPU0 and CPU1 respectively on a dual-processor machine. The last value is the average percentage for all of the physical CPUs.

As a rule of thumb, 80.00% is a desirable usage percentage, but bear in mind that different organisations have varying standards with respect to how close to capacity they run their servers. 90% should be considered a warning that the CPUs are approaching an overloaded condition.

You can enter the interactive ‘c’ command to toggle the display of the PCPU line. If hyperthreading is enabled, the LCPU line appears whenever the PCPU line is displayed. The LCPU line shows the logical CPU use.

“CORE UTIL(%)” (only displayed when hyper-threading is enabled)

The percentage of CPU cycles per core when at least one of the PCPUs in this core is unhalted, and its average over all cores. It’s the reverse of the “CORE IDLE” percentage, which is the percentage of CPU cycles when both PCPUs in this core are halted.

If hyper-threading is used, take the average “CORE UTIL(%)” directly. Otherwise (hyper-threading is unavailable or disabled) a PCPU is a core, so you can just use the average “PCPU UTIL(%)”. When working from esxtop batch output, this means using the core utilisation counter if it is present and falling back to the PCPU utilisation counter if it is not.
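The choice between the two counters can be sketched as a lookup over the header row of a batch capture. The two counter names below are an assumption about how esxtop labels these columns in batch output, so check them against the header of a real capture before relying on them. The host name esx01 and the function name are hypothetical.

```python
from typing import Iterable, Optional

# Assumed batch-mode counter names; verify against a real capture header.
CORE_UTIL = r"Physical Cpu(_Total)\% Core Util Time"
PCPU_UTIL = r"Physical Cpu(_Total)\% Util Time"


def utilisation_column(headers: Iterable[str]) -> Optional[str]:
    """Pick the column to use for average CPU utilisation.

    Prefer the core utilisation counter, which should only be present when
    hyper-threading is in use, and fall back to the PCPU counter otherwise.
    """
    names = list(headers)
    for suffix in (CORE_UTIL, PCPU_UTIL):
        for name in names:
            if name.endswith(suffix):
                return name
    return None


# Hypothetical header names from a host called esx01.
with_ht = [r"\\esx01\Physical Cpu(_Total)\% Util Time",
           r"\\esx01\Physical Cpu(_Total)\% Core Util Time"]
print(utilisation_column(with_ht))
print(utilisation_column(with_ht[:1]))
```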

PCPU USED(%)

While “PCPU UTIL(%)” indicates how much time a PCPU was busy (unhalted) in the last duration, “PCPU USED(%)” shows the amount of “effective work” that has been done by this PCPU. The value of “PCPU USED(%)” can be different from “PCPU UTIL(%)” mainly for the following two reasons: hyper-threading and power management.

CCPU(%)

Percentages of total CPU time as reported by the ESX Service Console, “us” is for percentage user time, “sy” is for percentage system time, “id” is for percentage idle time and “wa” is for percentage wait time. “cs/sec” is for the context switches per second recorded by the ESX Service Console.

vm %RDY

A world in a run queue is waiting for the CPU scheduler to let it run on a PCPU. %RDY is the percentage of time the world spent waiting in this way. The %RDY value for a VM is the sum of the %RDY values of all its vCPUs. Some examples:

  • The max %RDY value of a 1vCPU VM is 100%
  • The max %RDY value of a 4vCPU VM is 400%

The recommended %RDY thresholds are:

  • %RDY 10 for 1 vCPU
  • %RDY 20 for 2 vCPU
  • %RDY 40 for 4 vCPU
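The thresholds above all work out to 10% per vCPU, so a VM-level %RDY value can be normalised before it is compared. A minimal sketch, with hypothetical function names and invented sample values:

```python
def rdy_per_vcpu(vm_rdy_pct: float, vcpus: int) -> float:
    """Average %RDY per vCPU, given the VM-level (summed) %RDY value."""
    return vm_rdy_pct / vcpus


def rdy_exceeds_threshold(vm_rdy_pct: float, vcpus: int,
                          per_vcpu_limit: float = 10.0) -> bool:
    """True if the VM-level %RDY is above the limit scaled by vCPU count."""
    return vm_rdy_pct > per_vcpu_limit * vcpus


# A 4 vCPU VM showing 45% is over its 40% threshold.
print(rdy_per_vcpu(45.0, 4), rdy_exceeds_threshold(45.0, 4))
# A 2 vCPU VM showing 15% is under its 20% threshold.
print(rdy_per_vcpu(15.0, 2), rdy_exceeds_threshold(15.0, 2))
```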

To determine whether the poor performance is due to a CPU constraint:

  • Examine the load average on the first line of the esxtop command output.
  • Examine the %RDY field for the percentage of time that the virtual machine was ready but could not be scheduled to run on a physical CPU.
  • Make sure the virtual machine is not constrained by a CPU limit set on itself.
  • Make sure that the virtual machine is not constrained by its resource pool.