mysterious high load

We recently had an issue where a server started reporting high load for no apparent reason. Running top on the server showed no process hogging the CPU. The only other thing it could be was IO wait (the kernel waiting for an IO read/write operation to complete), which most commonly relates to disk operations. _When IO is slow, processes take longer to run and tend to pile up on each other, causing the overall load to rise._
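One way to confirm that IO wait rather than CPU use is behind the load is to sample the aggregate `cpu` line of `/proc/stat` twice and see what share of the elapsed time went to iowait. The following is only a minimal sketch, assuming a Linux host; the function names `cpu_times`, `iowait_percent` and `sample_iowait` are made up for illustration and are not part of any tool we used.

```python
import time

# Order of the first eight counters on the aggregate "cpu" line of /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def cpu_times(stat_text):
    """Return the aggregate CPU counters from /proc/stat text as a dict."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            return dict(zip(FIELDS, values))
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two samples."""
    deltas = {key: after[key] - before[key] for key in before}
    total = sum(deltas.values())
    return 100.0 * deltas["iowait"] / total if total else 0.0


def sample_iowait(interval=1.0, path="/proc/stat"):
    """Take two samples `interval` seconds apart and return the iowait share."""
    with open(path) as f:
        before = cpu_times(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = cpu_times(f.read())
    return iowait_percent(before, after)


if __name__ == "__main__":
    print(f"iowait: {sample_iowait():.1f}%")
```

A consistently high iowait share alongside low user and system time matches the picture described above: processes are waiting on IO rather than competing for CPU.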

Why had IO suddenly become slow? We mount filesystems via NFS from a couple of Netapp devices, so we naturally went looking at the Netapps to see whether something was up there, but found nothing abnormal.

Eventually we turned to the network: we ran ping tests and checked speed/duplex on the NICs, and all looked fine. Then somebody had the bright idea of running tshark to analyse the traffic on the interface that connects to the storage, which revealed the following:

6.846868 ->  TCP [TCP Retransmission] [TCP segment of a reassembled PDU]  
6.847041 ->  TCP [TCP Retransmission] [TCP segment of a reassembled PDU]  
6.851126 ->  NFS [TCP Fast Retransmission] V3 GETATTR Reply (Call In 1916) Regular File mode:0644 uid:27973 gid:100  
6.862840 ->  NFS [TCP Fast Retransmission] V3 ACCESS Reply (Call In 2011) ; V3 ACCESS Reply (Call In 2013)  
6.863705 ->  NFS [TCP Fast Retransmission] V3 READ Reply (Call In 1944) Len:4096

This showed us that TCP retransmissions were happening somewhere between our server and the Netapp. A simple ping test hadn’t picked this up because the problem only occurred when a significant volume of traffic was passing through. Tests run from other parts of the network to and from the storage device had also missed it because that traffic took a different path.
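Eyeballing a capture works for a handful of packets, but a tally makes it easier to judge how bad things are. As a rough sketch, assuming tshark's one-line summary output like the lines shown earlier, the bracketed retransmission flags can be counted as follows; the helper name `count_retransmissions` is hypothetical.

```python
import re
import sys
from collections import Counter

# Matches the bracketed TCP analysis flags in tshark's summary column,
# e.g. "[TCP Retransmission]" or "[TCP Fast Retransmission]".
RETRANS = re.compile(r"\[TCP (?:Fast |Spurious )?Retransmission\]")


def count_retransmissions(lines):
    """Count summary lines per retransmission flag."""
    counts = Counter()
    for line in lines:
        match = RETRANS.search(line)
        if match:
            # Drop the surrounding brackets to get a readable label.
            counts[match.group(0).strip("[]")] += 1
    return counts


if __name__ == "__main__":
    # Read tshark summary lines from standard input and print a tally.
    for label, n in count_retransmissions(sys.stdin).most_common():
        print(f"{n:6d}  {label}")
```

Piping the capture output into the script gives a per-flag count. Comparing that count during quiet and busy periods would show the load-dependent behaviour that a ping test misses.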

It turned out that one of the Ethernet switches in between was trying to balance traffic (layer 2) between several uplinks and failing (due to a software bug, we suspect).

Lessons Learnt