In my last post I said that human error is the cause of major IT outages 80% of the time.  This goes for enterprises, carriers and the government.
Today it was reported that a major outage of the FAA NADIN network was caused by human error that "resulted in the wrong configuration data being loaded onto the switch".  This is not an uncommon error.  Noteworthy in the release of information was that the configuration error took place on an IPX 9000 packet switch.  Does anyone other than me know what that is?
Secondary to the outage was the failure of the backup.  The backup provisions for NADIN call for a system in Utah to pick up the flight plan processing load from the failed system in Atlanta.  The problem was that the queue that built up caused so many delays and re-inputs from the airlines that the system in Utah could not keep up.  As I mentioned earlier, it is wise to test backup systems to see what will happen when they are called upon to do their duty.  Clearly that was not done here.  Now the FAA is talking about adding a third system to the mix for backup.  How much do you want to bet that routing issues will be the next thing to plague this system, and that none of the systems will pick up the load when needed?
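To make the Utah problem concrete, here is a rough sketch of why a backup that handles normal traffic comfortably can still fall over once a backlog exists.  Everything in it is an assumption for illustration: the function name, the rates and the idea that re-inputs grow in proportion to the waiting queue are mine, not FAA figures or anything from the published reports.

```python
def simulate_failover(backlog, arrivals_per_min, capacity_per_min,
                      resubmit_rate, minutes):
    """Return the queue length at the end of each minute after failover.

    All parameters are hypothetical. resubmit_rate is the assumed fraction
    of waiting items that get re-entered each minute because of delays.
    """
    queue = float(backlog)
    history = []
    for _ in range(minutes):
        # Offered load = normal arrivals + re-inputs caused by the delay.
        offered = arrivals_per_min + resubmit_rate * queue
        # The backup can clear at most capacity_per_min items per minute.
        queue = max(0.0, queue + offered - capacity_per_min)
        history.append(queue)
    return history


if __name__ == "__main__":
    # Same backup, same normal traffic, two different starting backlogs.
    small = simulate_failover(100, 100, 120, 0.1, 30)
    large = simulate_failover(500, 100, 120, 0.1, 30)
    print("small backlog after 30 min:", round(small[-1]))
    print("large backlog after 30 min:", round(large[-1]))
```

With these made-up numbers the backup has 20% headroom over normal traffic, so a small backlog drains, but a large one generates re-inputs faster than the spare capacity can absorb and the queue keeps growing.  That is the kind of behavior a failover test under realistic load is meant to expose.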
The FAA's antiquated management and procurement practices make it hard for the agency to get things right.  The good news is that in this outage planes were held on the ground.
The 80% rule still applies, but it can be overcome with smart planning and execution.
 
 
1 comment:
It's an old piece of Novell gear.