24 Nov 2012, 09:11

mysterious high load

We had an issue recently where a server started reporting high load for no apparent reason. Running top on the server revealed that no process was hogging the CPU. The only other thing it could be was IO wait (the kernel waiting for IO read/write operations to complete), which most commonly relates to disk operations. When IO is slow, processes take longer to run and tend to pile up on each other, causing the overall load to rise.
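A quick way to confirm that iowait is the culprit, as a sketch assuming a Linux box with /proc mounted: read the cumulative iowait time straight from /proc/stat and see whether it is climbing (the "wa" column in top or vmstat shows the same thing).

```shell
# Field 6 of the aggregate "cpu" line in /proc/stat is cumulative
# iowait time in jiffies; sample it twice to see whether it is climbing.
awk '/^cpu /{print "iowait jiffies:", $6}' /proc/stat
sleep 2
awk '/^cpu /{print "iowait jiffies:", $6}' /proc/stat
# A fast-growing delta while user/system time stays flat means processes
# are stuck waiting on IO rather than burning CPU.
```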

Why had IO suddenly become slow? We mount filesystems via NFS from a couple of Netapp devices, so we naturally went looking at the Netapp to see whether something was up there…nothing abnormal found.

Eventually we turned to the network: we ran ping tests and checked speed/duplex on the NICs - all looked fine. Then somebody had the bright idea of running tshark to analyse the traffic on the interface that connects to the storage:

 tshark -i eth1 -R tcp.analysis.retransmission

Which revealed the following:

6.846868  10.36.0.1 -> 10.36.0.6  TCP [TCP Retransmission] [TCP segment of a reassembled PDU]  
6.847041  10.36.0.1 -> 10.36.0.6  TCP [TCP Retransmission] [TCP segment of a reassembled PDU]  
6.851126  10.36.0.1 -> 10.36.0.6  NFS [TCP Fast Retransmission] V3 GETATTR Reply (Call In 1916) Regular File mode:0644 uid:27973 gid:100  
6.862840  10.36.0.1 -> 10.36.0.6  NFS [TCP Fast Retransmission] V3 ACCESS Reply (Call In 2011) ; V3 ACCESS Reply (Call In 2013)  
6.863705  10.36.0.1 -> 10.36.0.6  NFS [TCP Fast Retransmission] V3 READ Reply (Call In 1944) Len:4096

This showed us that TCP retransmissions were happening somewhere between our server and the Netapp. A simple ping test hadn't picked this up because the problem only appeared when a significant volume of traffic was passing through. Tests run from other parts of the network to/from the storage device had also failed to pick up any issue, as that traffic was taking a different path.
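As an aside, the kernel keeps cumulative TCP retransmission counters, so you can get a hint of this kind of problem even without a packet capture. A minimal sketch (Linux; /proc/net/snmp holds a "Tcp:" header row of counter names followed by a "Tcp:" row of values, and the awk pairs them up):

```shell
# Pull the cumulative RetransSegs counter out of /proc/net/snmp by
# matching the Tcp: header row against the Tcp: values row.
awk '/^Tcp:/ {
    if (!seen) { for (i = 1; i <= NF; i++) name[i] = $i; seen = 1 }
    else { for (i = 1; i <= NF; i++)
               if (name[i] == "RetransSegs") print "RetransSegs:", $i }
}' /proc/net/snmp
# "netstat -s | grep -i retrans" reports the same counters in a
# friendlier form, if net-tools is installed.
```

Sampling the counter twice and diffing tells you whether retransmissions are ongoing rather than ancient history.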

It turned out that one of the Ethernet switches in between was trying to balance traffic (layer 2) across several uplinks and failing (due to a software bug, we suspect).

Lessons Learnt

  • storage protocols such as NFS will not necessarily report problems in the underlying network (TCP retransmissions are transparent to higher-level protocols)
  • don’t stop at ping, use more advanced tools to get the complete picture
  • take extra care when designing and implementing the network path between servers and storage
  • make sure your network guys are monitoring layer 2 properly! :)
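To put that last point into practice on the server side, retransmission monitoring can be as crude as a cron-style watchdog. A hypothetical sketch - the interval, threshold, and message wording are all made up for illustration, not something we actually deployed:

```shell
#!/bin/sh
# Hypothetical watchdog: sample the kernel's cumulative RetransSegs
# counter twice and warn if the delta over the interval looks abnormal.
INTERVAL=5      # seconds between samples (illustrative)
THRESHOLD=100   # retransmissions per interval worth shouting about (illustrative)

retrans_segs() {
    # The last "Tcp:" line in /proc/net/snmp is the values row;
    # its 13th field is RetransSegs.
    awk '/^Tcp:/ { v = $13 } END { print v }' /proc/net/snmp
}

before=$(retrans_segs)
sleep "$INTERVAL"
after=$(retrans_segs)

if [ $((after - before)) -gt "$THRESHOLD" ]; then
    echo "WARNING: $((after - before)) TCP retransmissions in ${INTERVAL}s"
else
    echo "OK: $((after - before)) TCP retransmissions in ${INTERVAL}s"
fi
```

Hooked into your monitoring system of choice, this would have flagged the flaky switch path long before anyone noticed the load average.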