Tuesday, March 31, 2020

Performance tools in Linux

top

The top command in Linux displays the running processes on the system.  It is used extensively for monitoring the load on a server.






Uptime and Load averages:  
top - 20:55:50 up 176 days, 7:38, 1 user, load average: 1.39, 0.95, 0.76

The fields display:

  • current time
  • the time the system has been up
  • number of users logged in
  • load average over the last 1 minute, 5 minutes and 15 minutes respectively

This uptime display can be toggled with the 'l' command.
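
As a rough illustration, the three load averages can be pulled out of the summary line with a few lines of Python. This is only a sketch: the function name parse_load_averages is made up for this example, and it assumes the line has the format shown above.

```python
import re

def parse_load_averages(summary_line):
    """Return the 1-, 5- and 15-minute load averages from a top/uptime summary line."""
    match = re.search(
        r"load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)", summary_line
    )
    if match is None:
        raise ValueError("no load average found in line")
    return tuple(float(value) for value in match.groups())

# Sample line in the format printed by top (values taken from the example above).
line = "top - 20:55:50 up 176 days, 7:38, 1 user, load average: 1.39, 0.95, 0.76"
one_min, five_min, fifteen_min = parse_load_averages(line)
print(one_min, five_min, fifteen_min)
```

On a live system, Python's os.getloadavg() returns the same three values without any parsing.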


Tasks:

Tasks: 288 total, 1 running, 287 sleeping, 0 stopped, 0 zombie

Shows a summary of the tasks (processes) on the system: the total number of processes, and how many of them are running, sleeping, stopped or in the zombie state.  The task and CPU summary can be toggled with the 't' command.


CPU states:

Cpu(s): 0.1%us, 0.1%sy, 0.0%ni, 99.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st

Next, the CPU state is shown as the percentage of CPU time spent in each mode:



  • us, user: CPU time spent running un-niced user processes
  • sy, system: CPU time spent running kernel code
  • ni, niced: CPU time spent running niced user processes
  • id, idle: CPU time spent idle
  • wa, I/O wait: CPU time spent waiting for I/O completion
  • hi: CPU time spent servicing hardware interrupts
  • si: CPU time spent servicing software interrupts
  • st: CPU time stolen from this VM by the hypervisor
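
These percentages come from tick counters that the kernel exposes on the aggregate "cpu" line of /proc/stat. The sketch below shows roughly how two samples of that line turn into percentages. The helper names and the sample counter values are made up for illustration, and this is not top's actual implementation.

```python
# Order of the first eight counters on the "cpu" line of /proc/stat:
# user, nice, system, idle, iowait, irq, softirq, steal.
# They are labelled here with the abbreviations top uses.
FIELDS = ("us", "ni", "sy", "id", "wa", "hi", "si", "st")

def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a dict of tick counters."""
    parts = line.split()
    if parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(FIELDS, (int(value) for value in parts[1:9])))

def cpu_percentages(before, after):
    """Return the share of CPU time spent in each mode between two samples."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total == 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * delta / total for name, delta in deltas.items()}

# Two hypothetical samples taken a few seconds apart (made-up numbers).
before = parse_cpu_line("cpu  1000 20 300 8000 50 0 10 0 0 0")
after = parse_cpu_line("cpu  1010 20 310 8970 60 0 10 0 0 0")
percentages = cpu_percentages(before, after)
print(percentages["us"], percentages["sy"], percentages["id"], percentages["wa"])
```

On a real system the two samples would be read from /proc/stat with a short sleep in between.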

Memory usage:
Mem: 16465148k total, 5679640k used, 10785508k free, 261452k buffers
Swap: 18481148k total, 0k used, 18481148k free, 1254932k cached

The memory usage section is similar to the output of the free command.  The first line shows details for physical memory.  The second line shows the swap space.  Note that the cached figure at the end of the Swap line is page cache held in physical memory, not swap.
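
Because buffers and cache can be reclaimed by the kernel when applications need memory, the "used" figure overstates what applications really hold. The following sketch does the arithmetic with the numbers from the example above. The function name is hypothetical, and the result is only an estimate (newer kernels report a more accurate MemAvailable value in /proc/meminfo).

```python
def memory_summary(used_kb, free_kb, buffers_kb, cached_kb):
    """Estimate application memory use from the figures in top's Mem/Swap lines."""
    total_kb = used_kb + free_kb
    reclaimable_kb = buffers_kb + cached_kb
    used_by_apps_kb = used_kb - reclaimable_kb
    return {
        "total_kb": total_kb,
        "used_by_apps_kb": used_by_apps_kb,
        "available_kb": free_kb + reclaimable_kb,
        "used_by_apps_percent": 100.0 * used_by_apps_kb / total_kb,
    }

# Values copied from the sample top output above (all in kB).
summary = memory_summary(
    used_kb=5679640, free_kb=10785508, buffers_kb=261452, cached_kb=1254932
)
print(summary["used_by_apps_kb"], summary["available_kb"])
print(round(summary["used_by_apps_percent"], 1))
```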


Fields/Columns:




The processes are shown in columns.  


PID:

The process ID, which uniquely identifies the process.

USER:

The effective username of the owner of the process.

PR:

The scheduling priority of the process.  

NI:

The nice value of the process.  A lower (more negative) value means a higher priority.

VIRT:

The amount of virtual memory used by the process.

RES:

The resident memory size.  Resident memory is the amount of non-swapped physical memory a task is using.

SHR:

SHR is the shared memory used by the process.

S:

This is the process status.  It can have one of the following values:

  • D: uninterruptible sleep
  • R: running
  • S: sleeping
  • T: traced or stopped
  • Z: zombie

%CPU:

The percentage of CPU time the task has used since the last screen update.

%MEM:

Percentage of available physical memory used by the process.

TIME+:

The total CPU time the task has used since it started, with a precision of up to a hundredth of a second.

COMMAND:

The command which was used to start the process.

iostat


The iostat command is used for monitoring system input/output device loading by observing the time the devices are active in relation to their average transfer rates.  iostat generates reports that can be used to change the system configuration to better balance the input/output load between physical disks.


server.us.company.com: / >

server.us.company.com: / > iostat
Linux 2.6.39-400.109.5.el5uek (server.us.company.com)   09/27/2014

avg-cpu:  %user   %nice %system %iowait  %steal   %idle

           0.11    0.02    0.03    0.03    0.00   99.81

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn

sda               1.30         0.11        14.27    1628194  218384238
sda1              0.00         0.00         0.00       3146         30
sda2              1.30         0.11        14.27    1624648  218384208
dm-0              1.79         0.11        14.27    1622738  218384208
dm-1              0.00         0.00         0.00       1472          0

server.us.company.com: / >

server.us.company.com: / >

The first section contains the CPU report:



  • %user: shows the percentage of CPU utilization that occurs while executing at the user (application) level
  • %nice: shows the percentage of CPU utilization that occurs while executing at the user level with nice priority
  • %system: shows the percentage of CPU utilization that occurs while executing at the system (kernel) level
  • %iowait: shows the percentage of the time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request
  • %steal: shows the percentage of time spent in involuntary wait by the virtual CPU or CPUs while the hypervisor was servicing another virtual processor
  • %idle: shows the percentage of time that the CPU or CPUs were idle and the system did not have an outstanding disk I/O request

The second section contains the device utilization report:



  • Device: device/partition name as listed in the /dev directory
  • tps: shows the number of transfers (I/O requests) per second that were issued to the device.  A higher tps means the device is busier
  • Blk_read/s: shows the amount of data read from the device, expressed in blocks per second
  • Blk_wrtn/s: shows the amount of data written to the device, expressed in blocks per second
  • Blk_read: shows the total number of blocks read
  • Blk_wrtn: shows the total number of blocks written
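
To relate block counts to kilobytes, here is a minimal sketch that assumes a block is a 512-byte sector, which is consistent with the sample figures in this post (the Blk_read value of 1628194 for sda corresponds to the 814097 kB_read reported by the iostat -k run shown later). The constant and function names are made up for this example.

```python
BLOCK_SIZE_BYTES = 512  # assumption: one iostat block is one 512-byte sector

def blocks_to_kb(blocks):
    """Convert a block count (or a blocks-per-second rate) to kilobytes."""
    return blocks * BLOCK_SIZE_BYTES / 1024

# Values taken from the sda row of the sample iostat output above.
print(blocks_to_kb(1628194))  # total blocks read, in kB
print(blocks_to_kb(14.27))    # write rate, in kB per second
```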


The figures above are in blocks, not bytes.  You can use the -k option to display them in kilobytes instead, for ease of readability.  Combined with an interval and a count, the following example prints the CPU and device reports four times, with a gap of three seconds between reports.  The first report shows averages since boot; each later report covers only the preceding three-second interval:


server.us.company.com: / >

server.us.company.com: / >
server.us.company.com: / > iostat -k 3 4
Linux 2.6.39-400.109.5.el5uek (server.us.company.com)   09/27/2014

avg-cpu:  %user   %nice %system %iowait  %steal   %idle

           0.11    0.02    0.03    0.03    0.00   99.81

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn

sda               1.30         0.05         7.14     814097  109311967
sda1              0.00         0.00         0.00       1573         15
sda2              1.30         0.05         7.14     812324  109311952
dm-0              1.79         0.05         7.14     811369  109311952
dm-1              0.00         0.00         0.00        736          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle

           0.08    0.00    0.08    0.17    0.00   99.67

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn

sda               6.67         0.00        32.00          0         96
sda1              0.00         0.00         0.00          0          0
sda2              6.67         0.00        32.00          0         96
dm-0              8.00         0.00        32.00          0         96
dm-1              0.00         0.00         0.00          0          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle

           0.08    0.00    0.13    0.00    0.00   99.79

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn

sda               2.67         0.00        10.67          0         32
sda1              0.00         0.00         0.00          0          0
sda2              2.67         0.00        10.67          0         32
dm-0              2.67         0.00        10.67          0         32
dm-1              0.00         0.00         0.00          0          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle

           0.04    0.00    0.08    0.04    0.00   99.83

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn

sda               0.67         0.00         5.33          0         16
sda1              0.00         0.00         0.00          0          0
sda2              0.67         0.00         5.33          0         16
dm-0              1.33         0.00         5.33          0         16
dm-1              0.00         0.00         0.00          0          0

server.us.company.com: / >

server.us.company.com: / >
server.us.company.com: / >
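
When iostat output like this is captured to a file, it is often convenient to split it into one record per report so that the since-boot report can be skipped. The parser below is a hedged sketch that assumes the older output layout shown in this post (a "Device:" header followed by one row per device); newer iostat versions print different columns. The function name is hypothetical.

```python
def parse_iostat_reports(output):
    """Split iostat output into a list of {device: {column: value}} reports."""
    reports = []
    columns = None
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Device:":
            columns = fields[1:]      # start of a new device report
            reports.append({})
        elif fields[0] == "avg-cpu:":
            columns = None            # CPU section: following row is not a device
        elif columns is not None and len(fields) == len(columns) + 1:
            values = (float(value) for value in fields[1:])
            reports[-1][fields[0]] = dict(zip(columns, values))
    return reports

# Shortened copy of the first two reports from the transcript above.
SAMPLE = """\
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.11    0.02    0.03    0.03    0.00   99.81

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               1.30         0.05         7.14     814097  109311967
dm-0              1.79         0.05         7.14     811369  109311952

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.08    0.00    0.08    0.17    0.00   99.67

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               6.67         0.00        32.00          0         96
dm-0              8.00         0.00        32.00          0         96
"""

reports = parse_iostat_reports(SAMPLE)
# Skip the first report, which holds averages since boot.
for report in reports[1:]:
    print(report["sda"]["kB_wrtn/s"])
```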

uptime

server.us.company.com: / >

server.us.company.com: / > uptime
 19:20:13 up 178 days,  6:03,  1 user,  load average: 3.73, 7.98, 0.50
server.us.company.com: / >

The uptime command displays how long the server/system has been up and running since the last reboot.  

The 19:20:13 shows the current time in 24-hour format.
The 178 days and 6:03 says that the system has been running for 178 days, 6 hours and 3 minutes.
The total number of users logged in is 1.
What is the load average?  What do the three numbers in the uptime output represent?
Each number is an exponentially damped (weighted) moving average of the number of runnable processes, computed over the last 1, 5 and 15 minutes respectively.
On single-CPU machines that are CPU-bound, one can think of load average as a percentage of system utilization during the respective time period.  For systems with multiple CPUs, the number needs to be divided by the number of processors in order to get a percentage.
For example, a load average of "3.73 7.98 0.50" on a single-CPU system can be interpreted as:
During the last minute, the CPU was overloaded by 273% on average (1 CPU with 3.73 runnable processes, so that 2.73 processes were waiting for their turn).  During the last 5 minutes, it was overloaded by 698% on average.  During the last 15 minutes, it was idle half of the time on average.  This means that this CPU could have handled all of the work scheduled for the last minute if it were 3.73 times as fast, or if there were 4 (3.73 rounded up) times as many CPUs, but that over the last 15 minutes it was twice as fast as necessary to prevent runnable processes from waiting for their turn.

Conversely, in a system with four CPUs, a load average of 3.73 would indicate that there were, on average, 3.73 processes ready to run, and each one could be scheduled into a CPU.
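
The arithmetic above can be sketched in Python. The first function divides a load average by the number of CPUs. The second shows one update step of an exponentially damped moving average. It is a floating-point illustration only: the function names are made up, and the 5-second sampling interval is an assumption about how the kernel samples the run queue, not something taken from the output above.

```python
import math

def load_per_cpu(load_average, cpu_count):
    """Normalise a load average by the number of CPUs (1.0 means fully busy)."""
    return load_average / cpu_count

def ema_step(previous_load, runnable_now, window_seconds, sample_seconds=5.0):
    """One update of an exponentially damped moving average of the run queue."""
    decay = math.exp(-sample_seconds / window_seconds)
    return previous_load * decay + runnable_now * (1.0 - decay)

# The example from the text: a 1-minute load of 3.73 on one CPU and on four CPUs.
print(load_per_cpu(3.73, 1))
print(load_per_cpu(3.73, 4))

# Starting from an idle system, keep 2 processes runnable for one minute
# (12 samples of 5 seconds) and watch the 1-minute average climb towards 2.
load = 0.0
for _ in range(12):
    load = ema_step(load, runnable_now=2, window_seconds=60)
print(round(load, 2))
```

The slow climb explains why a short burst of activity shows up strongly in the 1-minute figure but barely moves the 15-minute one.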



mpstat


mpstat is used for monitoring CPU utilization.  This tool is more useful when there are multiple CPUs.

server.us.company.com: / >

server.us.company.com: / > mpstat
Linux 2.6.39-400.109.5.el5uek (server.us.company.com)   09/28/2014

07:30:28 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s

07:30:28 PM  all    0.11    0.02    0.03    0.03    0.00    0.00    0.00   99.81    232.23
server.us.company.com: / >

In this output, 



  • 07:30:28 PM:  the time at which mpstat was run
  • all:  means all CPUs
  • %user:  shows the percentage of CPU utilization that occurs while executing at the user level (application)
  • %nice:  shows the percentage of CPU utilization that occurs while executing at the user level with nice priority
  • %sys:  shows the percentage of CPU utilization that occurs while executing at the system level (kernel)
  • %iowait:  shows the percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request
  • %irq:  shows the percentage of time spent by the CPU(s) to service hardware interrupts
  • %soft:  shows the percentage of time spent by the CPU(s) to service software interrupts
  • %steal:  shows the percentage of time spent in involuntary wait by the virtual CPU(s) while the hypervisor was servicing another virtual processor
  • %idle:  shows the percentage of time that the CPU(s) were idle and the system did not have an outstanding disk I/O request
  • intr/s:  shows the total number of interrupts received per second by the CPU(s)


A useful way of checking all the CPUs for their utilization:

server.us.company.com: / >
server.us.company.com: / > mpstat -P ALL
Linux 2.6.39-400.109.5.el5uek (server.us.company.com)   09/28/2014

08:00:23 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s

08:00:23 PM  all    0.11    0.02    0.03    0.03    0.00    0.00    0.00   99.81    232.31
08:00:23 PM    0    0.07    0.01    0.03    0.03    0.00    0.00    0.00   99.87      0.00
08:00:23 PM    1    0.08    0.01    0.03    0.08    0.00    0.00    0.00   99.80      0.00
08:00:23 PM    2    0.12    0.00    0.03    0.04    0.00    0.00    0.00   99.81      0.00
08:00:23 PM    3    0.32    0.12    0.04    0.00    0.00    0.00    0.00   99.51      0.00
08:00:23 PM    4    0.08    0.01    0.02    0.08    0.00    0.00    0.00   99.82      0.00
08:00:23 PM    5    0.07    0.01    0.02    0.02    0.00    0.00    0.00   99.89      0.00
08:00:23 PM    6    0.07    0.00    0.02    0.00    0.00    0.00    0.00   99.90      0.00
08:00:23 PM    7    0.07    0.00    0.02    0.00    0.00    0.00    0.00   99.91      0.00
server.us.company.com: / >
server.us.company.com: / >
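
With many CPUs it can help to rank them by how busy they are. The sketch below parses rows in the layout shown above (12-hour timestamp, CPU number, then the percentage columns) and sorts the CPUs by 100 minus %idle. The function name is hypothetical, and the column positions are an assumption that only holds for this output format.

```python
def busiest_cpus(mpstat_output, limit=3):
    """Return (cpu, busy_percent) pairs, busiest first, from mpstat -P ALL output."""
    usage = {}
    for line in mpstat_output.splitlines():
        fields = line.split()
        # Data rows: time, AM/PM, cpu, %user, %nice, %sys, %iowait,
        # %irq, %soft, %steal, %idle, intr/s  (12 fields in total).
        if len(fields) == 12 and fields[2].isdigit():
            idle = float(fields[10])
            usage[int(fields[2])] = round(100.0 - idle, 2)
    ranked = sorted(usage.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]

# A few rows copied from the sample output above.
SAMPLE = """\
08:00:23 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
08:00:23 PM  all    0.11    0.02    0.03    0.03    0.00    0.00    0.00   99.81    232.31
08:00:23 PM    0    0.07    0.01    0.03    0.03    0.00    0.00    0.00   99.87      0.00
08:00:23 PM    3    0.32    0.12    0.04    0.00    0.00    0.00    0.00   99.51      0.00
08:00:23 PM    7    0.07    0.00    0.02    0.00    0.00    0.00    0.00   99.91      0.00
"""

ranking = busiest_cpus(SAMPLE)
print(ranking)
```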


vmstat

The amount of physical memory (RAM) is finite, so only a certain number of applications can be loaded at once.  Without some relief mechanism, trying to load too many applications would end with the system refusing to start anything new until some of the running applications were closed.


To avoid this situation, the operating system uses a concept called virtual memory.  It finds areas of memory that have not been used recently by an application and copies them to the hard disk (swap space), thereby freeing up memory so that more applications can run.


vmstat reports virtual memory statistics.  It covers the system's memory, swap and processor utilization in real time.  In the example below, vmstat prints five reports at five-second intervals; the first line shows averages since the last reboot.


server.us.company.com: / >

server.us.company.com: / > vmstat 5 5
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 10785136 261548 1255252    0    0     0     1    0    0  0  0 100  0  0
 0  0      0 10785136 261548 1255252    0    0     0     3 1042 2206  0  0 100  0  0
 0  0      0 10785136 261548 1255252    0    0     0    17 1033 2209  0  0 100  0  0
 0  0      0 10785128 261548 1255252    0    0     0    25  981 2162  0  0 100  0  0
 0  0      0 10785128 261548 1255252    0    0     0    11  954 2127  0  0 100  0  0
server.us.company.com: / >

Procs:

r:  the number of runnable processes (running or waiting for access to the processor)
b:  the number of processes in uninterruptible sleep, typically blocked waiting for I/O

Memory:

swpd:  shows how much virtual memory has been swapped out to a swap file or disk
free:  shows the amount of idle (unallocated) memory
buff:  shows how much memory is used as buffers
cache:  shows how much memory is used as page cache, which the kernel can reclaim if an application needs it

Swap:

Swap shows how much memory is moved to and from swap space per second.
si:  how much memory is swapped in from disk to real memory per second
so:  how much memory is swapped out from real memory to disk per second

I/O:

The I/O shows the amount of block device activity per second, in blocks read and blocks written.
bi:  the number of blocks received from a block device per second
bo:  the number of blocks sent to a block device per second

System:

Shows the number of system operations per second.
in:  the number of interrupts per second
cs:  the number of context switches per second

CPU:

Shows how CPU time is divided, as percentages of total CPU time.


  • us:  time spent running non-kernel (user) code
  • sy:  time spent running kernel code
  • id:  time spent idle
  • wa:  time spent waiting for I/O operations to complete
  • st:  time stolen from a virtual machine by the hypervisor
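
To act on these columns in a script, the output can be parsed into one record per sample. The sketch below assumes the 17-column layout shown above. The function names are made up for this example, and the swapping check is only a simple heuristic.

```python
def parse_vmstat(output):
    """Parse vmstat output into a list of dicts keyed by column name."""
    rows = []
    columns = None
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "r" and "swpd" in fields:
            columns = fields  # the column-name line: r b swpd free ...
        elif (columns is not None and len(fields) == len(columns)
              and all(field.isdigit() for field in fields)):
            rows.append(dict(zip(columns, (int(field) for field in fields))))
    return rows

def is_swapping(rows):
    """True if any sample after the first (since-boot) one shows swap traffic."""
    return any(row["si"] > 0 or row["so"] > 0 for row in rows[1:])

# First three samples copied from the vmstat output above.
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 10785136 261548 1255252    0    0     0     1    0    0  0  0 100  0  0
 0  0      0 10785136 261548 1255252    0    0     0     3 1042 2206  0  0 100  0  0
 0  0      0 10785136 261548 1255252    0    0     0    17 1033 2209  0  0 100  0  0
"""

rows = parse_vmstat(SAMPLE)
print(len(rows), rows[1]["cs"], is_swapping(rows))
```

Sustained non-zero si and so values are usually the first sign that the system is short of physical memory.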









free



ping



nicstat



dstat



sar



netstat



pidstat



strace



tcpdump



blktrace



iotop



slabtop



sysctl



/proc



btrace



perf



dtrace



SystemTap



lsof



pcstat



ftrace




stap

ktap



ebpf



lttng



tiptop



swapon



ltrace


ss



iptraf



snmpget



lldptool



sysdig



rdmsr






