18 June 2012

A tale from the Troubleshooting Department - on the Floor

I was recently involved in troubleshooting a problem for a customer after a migration of a database and an application server to newer hardware and operating system.

The problem was costing the customer a lot of money for every minute it persisted - it had now lasted a couple of days and had reduced the floor to at best 40% of normal throughput. The situation was becoming untenable and, naturally, the customer was getting desperate.

The customer has a conveyor belt with various manual and automated picking lines feeding orders (stock) into totes on the belt. The picking lines are each controlled by a small embedded PC.

The IT infrastructure part of the system - the application server, the embedded PCs and an Oracle database - is optimised to calculate when the orders coming down the feeding lines are due to hit the belt, so that 1) each order can be synchronized with the arrival of a tote and, crucially, 2) the conveyor belt can keep running continuously.

The problem manifested itself as spurious interruptions on a single automated feeding line, which caused the belt to stop for long periods of time and resulted in an estimated 60% capacity decrease.

A couple of days after the migration and upgrade, which had been carried out by others, a colleague and I were sent on-site by our company to have a look. When we arrived we found the place abuzz with activity - the customer, the belt vendor, the IT infrastructure provider and a few others were working out options to try. A lot of troubleshooting effort had gone into identifying and resolving the issue over the previous few days, and there were no major leads as to the root cause other than some potential network-related problems.

On the back of concerns that the new server was a lot faster than the old one, plans were devised to reduce the CPU speed, the number of CPU cores, the memory and so on for the database - but to no avail.

At this point they were running out of options, and my colleague and I were finally given access to have a look at the problem. Another colleague, who had looked at the problem overnight, mentioned that he had noticed the Oracle wait event "SQL*Net message from client" often hitting 65 seconds when running a 10046 trace - so we started there.
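For the record, that kind of trace can be switched on for a suspect database session roughly as follows - a sketch only, assuming sqlplus is available on the server and using placeholder SID/SERIAL# values (123/456); DBMS_MONITOR is just one of several ways to get the equivalent of a 10046 trace with wait events:

    # Enable extended SQL trace (wait events included) for the suspect session.
    # The session_id/serial_num values (123/456) are placeholders - look the
    # real ones up in V$SESSION first.
    echo "exec dbms_monitor.session_trace_enable(123, 456, waits => TRUE, binds => FALSE)" | sqlplus -s / as sysdba

    # ...let the picking line run for a while, then switch tracing off again:
    echo "exec dbms_monitor.session_trace_disable(123, 456)" | sqlplus -s / as sysdba

The resulting trace file in the database's trace directory then shows each wait, including how long every "SQL*Net message from client" wait lasted.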

With the belt vendor's help we managed to get onto the network switch with Wireshark, capturing traffic between the troubled picking line PC (client) and the Oracle database server while at the same time running a 10046 trace. What we found was interesting. The client would send a SQL*Net request to the server and the server in turn would respond. Occasionally there would be a problem: a few hundred milliseconds after the initial request the client would repeat its request, as if it had heard nothing from the database server - and the database server in turn would respond promptly with a TCP duplicate acknowledgement. 65 seconds later the server would follow up with a retransmission, again as if the client had heard nothing. Then things would move on again. We could correlate this pattern with the picking line queue filling up at the same point. Some time later, after requesting access to the picking line PC, we noticed that running "ifconfig -a" continuously would also show an increase in the "frames" counter around the time that we saw packet loss from a TCP perspective. The "frames" counter counts received (layer 2) frames that are corrupted in some way or only partially received.
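The monitoring setup itself was nothing exotic; roughly along these lines, with made-up addresses and interface names, and the standard Oracle listener port 1521 assumed:

    # On a box that sees the mirrored switch port: capture the SQL*Net traffic
    # between the picking line PC (10.0.0.5 - a placeholder) and the database
    # listener, for later analysis in Wireshark. Needs to run as root.
    tcpdump -i eth0 -w pickline.pcap host 10.0.0.5 and port 1521

    # Meanwhile, on the picking line PC: print the receive error counters once
    # a second and watch whether the frame-error count creeps up.
    while true; do
        date
        ifconfig eth0 | grep -i frame
        sleep 1
    done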

After a while of monitoring this we decided to go on the floor, standing next to the picking line, to see if we could spot any abnormalities and to determine if there were any events on the floor to correlate with. We brought a wirelessly connected laptop with an SSH connection to the picking line PC.

Bear in mind that at this point there was a lot of scepticism over the findings, because the picking line and the PC had not been changed (well, after the problem appeared the PC was swapped out in case it was the culprit, but the issue persisted) - and no belt changes had been made. It was only the database and application servers that had been upgraded and migrated to new hardware, including a new network switch.

After about 10 minutes there we noticed a pattern - the smoking gun. We realised that there was an electrical motor at the belt and near the PC (actually, it turned out that there were two electrical motors, but one was hidden under the belt).

Every time the motors were running (always in unison) the "frames" counter would increase sporadically, but every time they were stopped there were no issues. We also noted, after a while, that there was network traffic regardless of whether the motors were running - so it was not a simple case of no motor running = no network traffic = no problem, or vice versa. The point was that we had a feedback loop: the motors would interrupt the network traffic flow, which in turn caused the picking line PC to stop the belt (motors) - until the database server ultimately retransmitted the SQL*Net packet while the motors were not running. Then the picking line PC would start the belt again. And so on and so forth, ad nauseam.

Understandably, there was still a fair bit of scepticism over the findings - after all, nothing had changed on the floor. Our suggestion was therefore that while the new server had not caused the problem it may have made it visible - perhaps by treating loss of frames/packets (traffic flow) over the network differently compared to the old server.

The IT infrastructure provider therefore set off to run three experiments as isolated test cases:
1) Take two laptops and the same network cable, strew the cable across the motors and do file transfers while the motors were turned on and off. This test proved inconclusive; while they experienced repeated frame loss during the transfers, they could not reliably tie it to the motors running - or not.
2) Put the old server on the network and do file transfers from the old server to the picking line PC while the motor was switched on and off
3) Do file transfers from the new server to the picking line PC while the motor was switched on and off

The last two tests proved very conclusive. Copies from the old server slowed down while the motor was running, but traffic still came through at a reduced rate. Not so for the new server - it would stall the file transfer for up to... you got it, 65 seconds!
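The file transfer tests themselves are trivial to repeat; roughly something like this, run once from the old server and once from the new one (scp, the file size and the host name are placeholders of mine - any bulk copy and a stopwatch would do):

    # Create a ~100 MB test file and copy it to the picking line PC ten times,
    # timing each transfer; a copy that stalls for ~65 seconds immediately
    # stands out as an outlier in the "real" times.
    dd if=/dev/zero of=/tmp/testfile bs=1M count=100
    for i in 1 2 3 4 5 6 7 8 9 10; do
        /usr/bin/time -p scp /tmp/testfile pickline-pc:/tmp/ 2>&1 | grep '^real'
    done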

At this point the involved parties started to accept the findings. They were further verified by a senior techie from the belt vendor, who flew over from Germany that same night with two new picking line PCs - and, crucially, an oscilloscope. Replacing the picking line PC didn't make a difference, and when he fired up the oscilloscope it quickly became apparent that there was a lot of electromagnetic interference (EMI) in the vicinity of the electrical motors while they were running.

Now we knew the root cause - and it was well after 0:00 at night. The customer really needed to get the lines running by 8:00 that morning or they would most definitely lose customers and reputation. Various options were discussed and some tried, but none of them worked out or they were not feasible to implement before an 8:00 start.

Our final contribution came just before 4:00 when, in our sleep-deprived state of mind, we suggested that if we lowered the network speed to the picking line PC from 100 Mbps full duplex to 10 Mbps full duplex we might stretch the bits (the signal) further over the wire and thereby make them less susceptible to noise. It was quickly agreed to try it out; we changed it on the picking line PC while the IT infrastructure provider changed it on the network switch.
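On a Linux-based picking line PC that change can be made with something like the following (ethtool and the interface name eth0 are assumptions on my part; the switch port has to be configured to match, otherwise you just trade the problem for a duplex mismatch):

    # Force the NIC down to 10 Mbps full duplex with auto-negotiation off.
    ethtool -s eth0 speed 10 duplex full autoneg off

    # Verify what the link is actually running at.
    ethtool eth0 | grep -E 'Speed|Duplex'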

And... The problem went away! An hour later and after a couple of cups of tea the problem still hadn't manifested itself again and everyone - not least the customer - breathed a sigh of relief!

Of course, this was just a workaround. There was still a latent problem involving multiple factors: the UTP cabling (as opposed to STP), the cable passing within 15 cm of the motors, the lack of any noise filtering and, finally, the picking line PC sitting less than 1 meter from the motor all added up to an underlying flaw in the overall design. But at least the speed drop allowed the customer to resume operation at 8:00 that morning - and hopefully not lose too many customers.

Boot note: I will say that I am very happy to have been involved in this troubleshooting experience - and humbled by it too. As an IT worker it is such a rare opportunity these days to troubleshoot "on the floor", and debugging issues at this level is an equally rare experience. It also serves as a good reminder never to make assumptions when troubleshooting - keep an open mind, use the process of elimination, work with isolated test cases where you can - and "think outside of the box!" is all I can say ;-)

2 comments:

horia.berca said...

Thanks for sharing this! Nice experience indeed.
This reminds me of the roots: before everything else, an IT guy is an engineer.

Morten Jensen said...

Just a quick follow-up to my blog...

It would appear that IBM now has a patch for the 64-second retransmit AIX kernel problem.

Check out:

Minimum Software Versions and Patches Required to Support Oracle Products on IBM Power Systems [ID 282036.1]

Specifically:

IV20595 The first retransmit packet may be sent after 64 seconds, (Doc Number=5795)

http://www-01.ibm.com/support/docview.wss?uid=isg1IV20595
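
For anyone on AIX wanting to check whether the fix is already installed, something like the following should do it (an instfix query against the APAR number - standard AIX, but verify on your own system):

    # Report whether the fileset updates containing the fix for APAR IV20595
    # are installed on this AIX system.
    instfix -ik IV20595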