In the last 24 hours we have all been told of a major disruption in the air traffic control system in this country. The word is that a "communications failure" at an FAA facility in Atlanta was at fault for numerous flight delays across the country, particularly in the Eastern US. Further information indicates that a network failure between Atlanta and a facility in Utah caused the issue.
This situation reminds me of an old adage in the networking business: "80% of all network failures are caused by changes made by humans." After nearly 30 years of observation, I can validate the saying. Interestingly enough, one can easily test the 80% rule by putting an embargo on network changes during a time of low staffing, say the end-of-year holiday period. Year after year of doing this, that period had the fewest outages of the entire year. With no one around and no one making any changes in the network, things just ran like they were supposed to.
What does all this mean? If there is a period of critical operation, minimize changes to your environment during that period. The last thing you want to do, say if you are a candy maker, is to make a major system change during the period leading up to Halloween. (Hershey did an ERP upgrade during this time, did not produce enough candy to meet the Halloween demand, and thus had the worst-performing quarter in its history.)
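For anyone who wants to enforce such an embargo rather than rely on memory, here is a minimal sketch of a change-freeze check that a deployment script could call before pushing a change. The window dates, the `in_freeze` and `may_deploy` names, and the emergency override are all assumptions made up for illustration, not anything taken from a real change-management tool.

```python
from datetime import date

# Hypothetical freeze windows (start and end inclusive).
# The dates are illustrative only; use your own critical periods.
FREEZE_WINDOWS = [
    (date(2024, 10, 1), date(2024, 10, 31)),   # e.g. the run-up to Halloween
    (date(2024, 12, 20), date(2025, 1, 2)),    # e.g. the end-of-year holidays
]


def in_freeze(day, windows=FREEZE_WINDOWS):
    """Return True if `day` falls inside any freeze window."""
    return any(start <= day <= end for start, end in windows)


def may_deploy(day, emergency=False, windows=FREEZE_WINDOWS):
    """Allow a change outside freeze windows, or when flagged as an emergency."""
    return emergency or not in_freeze(day, windows)


if __name__ == "__main__":
    today = date.today()
    print("Change allowed today:", may_deploy(today))
```

The emergency flag is there because a total embargo is rarely practical; the point is to make a change during a critical period a deliberate exception rather than business as usual.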
My guess is that we will find out that the recent FAA outage was the direct result of a network change that went in untested, or poorly tested at best. The other possibility is that the redundant links were never tested, so no one knew whether they would work when needed.
Stay tuned on this one. I have an 80% chance of being right again.