Delta Down and Cloud Outages...What Happened?

On Aug. 8, 2016, Delta Airlines experienced a six-hour outage, which some analysts and journalists falsely attributed to flaws in aging technology, namely TPF running on mainframe systems.

This false trope failed to identify the real culprits: exposures that affect all IT solutions, in public and private clouds, and even more so in non-mainframe environments. IT executives should not alter their views of particular technologies, or their decision-making, based on biases and inaccurate reporting. Nor should business or IT executives assume that outages will become history if they move to the cloud.

Normally I do not write about a single system outage at a single company, but the Delta downtime became a cause célèbre.

On Aug. 8 at 2:38 a.m. EDT, a power outage hit the Delta data center, causing a global system failure that lasted six hours before business could begin to return to normal. Tens of thousands of passengers were stranded around the world, and all systems – check-in, flight scheduling and departures, airport screens, reservations, websites, etc. – were affected by the meltdown.

Getting all parts of the airline back to normal from the glitch, and all passengers to their ultimate destinations, actually took days, as hundreds of Delta flights and flight crews were out of position after the recovery.

So who's to blame?

The True Story

The initial report that a power outage was the culprit was partially correct. According to Delta's COO, "a critical power control module at our Technology Command Center malfunctioned, causing a surge to the transformer and a loss of power. When this happened, critical systems and network equipment didn't switch over to backups. Other systems did. And now we're seeing instability in these systems." What the executive did not mention was that it all started when Delta's IT staff attempted a routine switch to its backup generator, which resulted in a spike that caused a fire in an Automatic Transfer Switch (ATS).

Thus, in effect, what Delta and its users experienced was the result of a two-step failure. First, the ATS fire and subsequent shutdown meant that a server farm of about 500 servers also shut down abnormally.

Second, Delta's staff then executed its standard failover process, switching over to the backup IT systems. But this process also failed, as critical systems and network equipment did not switch over to backup power. It was determined after the fact that about 300 of the approximately 7,000 data center components (of which the TPF mainframes were a very small fraction) had not been configured correctly for the available backup power and therefore remained offline.

Even before the details of the problem were made public, it was apparent that the power outage impacted only Delta: two separate power grids feed the site, and one provider, the Atlanta utility Georgia Power, stated that it was not responsible for the failure and had not received notifications of any outages in its territory.

In fact, Delta's passenger service system (PSS), like all major PSSs, is theoretically configured with no single point of failure – from the power supply through all equipment components and databases. But in Delta's case, either the redundancy was lacking or the backup ATS failed to kick in as expected.

Outage Track Records

Because this is not the first PSS meltdown this year, a number of individuals have attacked the TPF operating system and mainframes as aging technology unable to keep up with the demands of the 21st century.

However, neither outage (Delta's nor Southwest's) was caused by the mainframe systems. Southwest Airlines' downtime was the result of a faulty network router.

The availability of TPF systems, which run many of the major airline reservation systems and a number of banking systems, is amongst the best in the world.

Most of these mainframes, which handle up to 60,000 transactions per second, are up 100 percent of the time for years on end, while others have five minutes or less of downtime a year (99.999 percent availability). It is the exception that does not meet that rigorous standard.
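To put those figures in perspective, here is a minimal sketch (in Python, using only the availability percentages quoted in this article) that converts an availability target into an annual downtime budget:

    # Convert an availability percentage into an annual downtime budget.
    MINUTES_PER_YEAR = 365.25 * 24 * 60  # about 525,960 minutes

    def annual_downtime_minutes(availability_pct):
        """Minutes of allowable downtime per year at a given availability."""
        return MINUTES_PER_YEAR * (1 - availability_pct / 100)

    for pct in (99.5, 99.9, 99.99, 99.999):
        print(f"{pct:>7}% available -> about {annual_downtime_minutes(pct):.1f} minutes of downtime per year")

    # Approximate results: 99.5% -> 2,629.8 min (~44 hours); 99.9% -> 526.0 min;
    # 99.99% -> 52.6 min; 99.999% -> 5.3 min (the "five minutes or less" cited above).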

That is not to say the user enjoys that level of uptime since there are many more components (usually thousands or tens of thousands) involved in the end-to-end experience.

As most system architects know, the more components involved in a system the greater the probability of failure.
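To see why, consider a deliberately simplified series model in which the end-to-end service is up only when every component is up; the component counts and the assumption of independent, identically reliable components are illustrative only, not a description of Delta's actual topology:

    # Series model: the end-to-end service is up only if every component is up.
    def chain_availability(component_availability, n_components):
        return component_availability ** n_components

    per_component = 0.99999            # every component at "five nines"
    for n in (10, 100, 1000, 7000):    # 7,000 echoes the data center figure above
        a = chain_availability(per_component, n)
        downtime_hours = (1 - a) * 365.25 * 24
        print(f"{n:>5} components -> {a * 100:.3f}% available, about {downtime_hours:.1f} hours down per year")

    # Even with every component at 99.999 percent, a 7,000-component chain is only
    # about 93.2 percent available (roughly 600 hours of expected downtime a year),
    # which is why redundancy and failover planning matter so much.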

I know some readers will say that this is not true for cloud environments, which take advantage of the latest technologies. While it may be true in theory that cloud instances can be orchestrated so that there are no outages, the reality falls short of that ideal, because there is more to these ecosystems than just the servers and software instances.

The chart below summarizes just some of the outages experienced this year alone in the cloud.

[Chart: the 10 biggest cloud outages of 2016. Source: CRN, July 27, 2016]

The reality remains that the larger and more complex the systems, the greater the probability of downtime – cloud or no cloud. Moreover, because of the complexity of large systems, it may take longer to identify and fix a problem.

The Bottom Line

Large enterprises tend to have large, complex system environments with thousands of components and a mix of networking, servers, and storage hardware.

Moreover, the ecosystem usually contains multiple versions of the same application, database, middleware, and operating system software, along with multiple generations of hardware. Keeping all of these elements patched, current, and in sync is a challenge at most companies.

Then add the communications networks, uninterruptible power supplies (UPSs), ATSs, and other switching gear needed to complete the picture – and then throw in the backup equipment needed to provide redundancy. No matter who you are, and no matter how big or small your IT environment, maintaining availability at 99.999 percent (or even above 99.5 percent) is a challenge, and it cannot happen without good operating and business continuity/disaster recovery (BC/DR) plans that are continually executed and tested.

Failure is inevitable, but it is IT's job to mitigate the impact on the business to acceptable levels given the funds available.

While there are some computer systems and software stacks that are intended to provide higher availability than others, achieving high availability goes beyond that and requires good operating, redundancy and BC/DR planning and processes that are assiduously followed. IT executives should work with the corporate and line of business executives to determine the optimum service level requirements (including recovery point and time objectives – RPOs and RTOs) and the associated capital and operating costs.

This will not eliminate downtime but it should set the right levels of expectation that can be used as a basis for establishing backup procedures that can be executed when systems are unavailable.
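To make that planning discussion concrete, here is a hedged sketch of how candidate disaster recovery approaches might be compared against agreed RPO and RTO targets; the tier names, recovery figures, and costs are hypothetical assumptions for illustration only, not benchmarks:

    # Hypothetical comparison of disaster recovery options against agreed targets.
    # All tier names, times, and costs below are illustrative assumptions.
    from dataclasses import dataclass

    @dataclass
    class DrTier:
        name: str
        rto_hours: float       # how long recovery is expected to take
        rpo_hours: float       # how much recent data could be lost
        annual_cost_usd: int   # rough capital plus operating cost per year

    TIERS = [
        DrTier("Offsite backup and restore", rto_hours=48.0, rpo_hours=24.0, annual_cost_usd=50_000),
        DrTier("Warm standby site", rto_hours=4.0, rpo_hours=1.0, annual_cost_usd=400_000),
        DrTier("Active-active data centers", rto_hours=0.1, rpo_hours=0.0, annual_cost_usd=1_500_000),
    ]

    def tiers_meeting_targets(rto_target_hours, rpo_target_hours):
        """Return the options that satisfy the business's agreed RTO and RPO."""
        return [t for t in TIERS
                if t.rto_hours <= rto_target_hours and t.rpo_hours <= rpo_target_hours]

    # Example: the business agrees it can tolerate 4 hours of downtime (RTO)
    # and 1 hour of data loss (RPO).
    for tier in tiers_meeting_targets(4.0, 1.0):
        print(f"{tier.name}: meets the targets at roughly ${tier.annual_cost_usd:,} per year")

The point of such an exercise is not the specific numbers but forcing the business and IT to agree, in advance, on what levels of downtime and data loss they are willing to pay to avoid.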

Related articles:

Death by Cloud, the Explosion of Instances and Mitigation

3 Things to Consider When Moving to the Cloud

The Tail Wags the Dog- Death by Cloud- Part 2 [Video]

