Last week, the International Working Group on Cloud Computing Resiliency (IWGCR) released a study that calculated the economic impact of cloud outages. The research, which has tracked downtime in the market since 2007, had some startling results: in the past five years, 13 well-known cloud services have racked up 568 hours of downtime, at an economic cost of more than $71.7 million.
That downtime works out to an average of 7.5 hours of unavailability per year. There is good news, though: even with those numbers, the availability rate still sits at roughly 99.9 percent. But it's clear that the days of five nines (99.999 percent uptime), long seen as the standard in the telco world, are gone.
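To put those figures in context, here's a quick back-of-the-envelope calculation (a minimal Python sketch; the 8,760-hour year and the rounding are my own working assumptions, not figures from the study) showing how hours of downtime map to availability percentages:

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours (ignoring leap years)

def availability(downtime_hours_per_year: float) -> float:
    """Fraction of the year a service is up, given its annual downtime."""
    return 1 - downtime_hours_per_year / HOURS_PER_YEAR

# ~7.5 hours of downtime per year lands just above "three nines"
print(f"{availability(7.5):.4%}")  # ~99.91%

# "Five nines" (99.999%) allows only about 5.3 minutes of downtime per year
print(f"{(1 - 0.99999) * HOURS_PER_YEAR * 60:.1f} minutes")  # ~5.3
```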
Needless to say, it’s time to look at what could be causing these issues and what we can do to fix them. Below is my take on the problem: the main causes as I see them, followed by a corresponding list of what we can do to prevent downtime.
Issues:
- Complications within the cloud stack and internal applications
- Hardware failure
- DDoS
- Human error
- Improper change control process
- Switch and routing issues
- Lack of proper proactive DR testing and capacity planning
Best practices:
- Test for cloud outages and develop a contingency and disaster recovery plan
- Implement a high-availability strategy that builds in redundancy across availability zones, cloud environments, and data centers, and test failover between them (see the sketch after this list)
- Employ regular performance testing to verify that performance stays consistent over time
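To make the redundancy point concrete, here's a minimal sketch (Python; the health-check URLs and the two-zone setup are hypothetical examples, not any particular provider's API) of a probe that checks the same service in two availability zones and flags when either copy stops responding:

```python
import urllib.request
import urllib.error

# Hypothetical health-check endpoints for the same service in two zones
ENDPOINTS = {
    "zone-a": "https://service-a.example.com/health",
    "zone-b": "https://service-b.example.com/health",
}

def check(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def run_probe() -> None:
    results = {zone: check(url) for zone, url in ENDPOINTS.items()}
    for zone, healthy in results.items():
        print(f"{zone}: {'up' if healthy else 'DOWN'}")
    if not any(results.values()):
        print("ALERT: all zones down, trigger the disaster recovery plan")
    elif not all(results.values()):
        print("WARNING: running without redundancy, investigate the failed zone")

if __name__ == "__main__":
    run_probe()
```

Run something like this on a schedule, ideally from a monitoring host outside the cloud you're testing, so you learn about a zone failure from your own probe rather than from your users.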
Have any more questions about preventing outages? Feel free to reach out to me via email.