Learning from the British Airways Outage

June 2, 2017

British Airways is still reeling from its catastrophic IT outage over the holiday weekend. As we approach the 7-day mark since the airline left thousands of travelers around the world in limbo, questions remain about exactly what went wrong. A power surge has been blamed for crashing BA’s global IT system and overwhelming the existing backups, but for a provider with millions of users, downtime on this scale was simply unacceptable.

For travelers, British Airways’ website and app are essential tools for their trips around the globe, so this outage will have long-term negative consequences for the BA brand. Thousands of people were affected for hours during one of the busiest travel periods in the UK calendar, a bank holiday that coincided with a mid-term break for schools. Flights were grounded and terminals filled with unhappy customers, as the company was unable to process any of the key information needed to get people in the air.

At peak times, companies need to ensure that their website is prepared and can handle large amounts of traffic. That takes both technical readiness and response readiness. In the event of an outage, marketing teams should have a proactive strategy in place to promptly tell customers why the service degradation has occurred and what measures are being taken to resolve it.

BA might consider itself in good company now, as it became the latest in a long line of companies to suffer high-profile outages recently. But just like AWS, WhatsApp, Ryanair and Twitter, BA should have been expecting the unexpected. In today’s connected world, anticipation is key.

Large organizations will have backup and failover systems in place to handle issues with anything from power supply and storage to databases and servers; that goes without saying. What separates the organizations that are “always up” from those that fail is that they test constantly to maintain functionality and have a dynamic plan for switching to failover systems to avoid any unexpected outages, something BA seemingly failed to do.
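To make the idea of a failover test concrete, here is a minimal sketch of a drill that deliberately takes the primary system down and checks that the standby picks up the traffic. The names `Backend`, `FailoverRouter` and `run_failover_drill` are hypothetical, and the backends are in-memory stand-ins; nothing here reflects how BA's systems actually work.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    """Hypothetical stand-in for a real system (database, data center, power feed)."""
    name: str
    healthy: bool = True

    def handle(self, request: str) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


class FailoverRouter:
    """Sends requests to the primary and falls back to the standby on failure."""

    def __init__(self, primary: Backend, standby: Backend) -> None:
        self.primary = primary
        self.standby = standby

    def handle(self, request: str) -> str:
        try:
            return self.primary.handle(request)
        except ConnectionError:
            # Primary failed: let the standby serve the request instead.
            return self.standby.handle(request)


def run_failover_drill(router: FailoverRouter) -> bool:
    """Take the primary down on purpose and report whether the standby took over."""
    router.primary.healthy = False
    try:
        reply = router.handle("drill-request")
        return reply.startswith(router.standby.name)
    except ConnectionError:
        # Both systems were unavailable: the drill has found a real gap.
        return False
    finally:
        router.primary.healthy = True  # restore the primary after the drill
```

Run on a schedule rather than once, a drill like this would surface a broken standby before a real incident does.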

Restarting systems under heavy load is a typical disaster-testing scenario, yet it is often overlooked, and that oversight adds extra downtime on top of the initial incident. In this case, BA’s failure to restart has had catastrophic results for travelers around the world. Moving forward, expect other major travel providers to learn from the incident by fine-tuning their rapid-response options and implementing more thorough testing to avoid being grounded themselves.
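As a rough illustration of why a restart under load deserves its own test, the toy simulation below estimates how long a restarted system needs to clear its backlog when capacity only returns gradually. The function `ticks_to_recover`, its parameters and the linear warm-up model are all assumptions made for this sketch, not figures or behavior from the BA incident.

```python
from typing import Optional


def ticks_to_recover(backlog: int, arrivals_per_tick: int,
                     full_capacity: int, warmup_ticks: int,
                     max_ticks: int = 10_000) -> Optional[int]:
    """Simulate a restart where capacity ramps linearly up to full_capacity.

    Returns the tick at which the queue first empties, or None if the
    queue has not drained after max_ticks.
    """
    if warmup_ticks < 1:
        raise ValueError("warmup_ticks must be at least 1")
    queue = backlog
    for tick in range(1, max_ticks + 1):
        # Assumed model: a freshly restarted system serves only a fraction
        # of its normal load until it has warmed up.
        capacity = full_capacity * min(tick, warmup_ticks) // warmup_ticks
        # New requests keep arriving while the backlog is being worked off.
        queue = max(0, queue + arrivals_per_tick - capacity)
        if queue == 0:
            return tick
    return None


# Same backlog and traffic, with and without a warm-up period.
instant = ticks_to_recover(backlog=100, arrivals_per_tick=10,
                           full_capacity=20, warmup_ticks=1)
warm = ticks_to_recover(backlog=100, arrivals_per_tick=10,
                        full_capacity=20, warmup_ticks=4)
```

In this model, a slow warm-up lengthens recovery, and if arrivals match or exceed full capacity the backlog never clears at all, which is the kind of result a restart test is meant to reveal before an outage does.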

Apica Product Team