Lessons learned from the AWS Outage

AWS
October 24, 2012

On Monday, Amazon Web Services, the leading provider of cloud services, suffered an outage, and as a result a long list of well-known and popular websites went dark. According to Amazon’s Service Health Dashboard, the outage started out as degraded performance of a small number of Elastic Block Store (EBS) volumes in the US-EAST-1 Region, then grew to include problems with the Relational Database Service and Elastic Beanstalk as well.

The only surprising thing about this AWS outage was that anyone was surprised by it. It wasn’t the first time AWS had a major outage or problems with this data center. If you remember, back in June a line of powerful thunderstorms knocked the power out at a major Amazon hosting center. The backup generator failed, then the software failed, and, well, you know the drill. A corollary of Murphy’s Law is that if multiple things can go wrong, they will all go wrong at once.

In both of these instances (and in all Amazon Web Services outages, in fact) some customers were knocked “off the air” while others continued running without a hiccup. You would think that companies would eventually learn to anticipate the inevitable AWS outages and take active steps to prepare for them. There are best practices and solutions for reducing vulnerability to an outage, but they’re rarely implemented. That’s because people don’t think that anything could happen to Amazon. Obviously, things do happen.

Instances like this are a learning opportunity if we take the time to think about why they happened and what could have been done to prevent them. Here are six lessons that I think we can learn from the Amazon Web Services outages.

Lesson 1 — Clouds are made of components that can fail. When people think of the cloud, they picture some amorphous and untouchable blob up in the sky. And while that’s a nice bit of marketing, it is not a useful model for operational planning. Be mindful of your cloud provider’s architecture and how it is built to handle the failure of a component or the blackout of a zone. Then anticipate that failures can happen at any point in the cloud infrastructure.

Lesson 2 — The stress of failure will trigger a cascade of other failures. After reading a description of the outage, you get the sense that it was just one thing after another. What started as a small issue affecting one Northern Virginia data center quickly spread, causing a chain reaction and outage that disrupted much of the Internet for several hours. Remember Murphy and his law?

Lesson 3 — Spikes matter. When a cloud fails, hundreds of customers are affected. As they try to recover, they stress the cloud provider’s infrastructure with a peak load that is guaranteed to cause even more problems. These recovery spikes feed on themselves: every time you reboot, it takes longer and longer. If you have ten servers doing that, that’s bad. If a thousand servers spike at once, that’s really bad. Something that would have taken five minutes to fix will now take five hours once you are caught in that cycle.
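One widely used way to soften a recovery spike like this is for each client to wait a randomized, exponentially growing interval before it retries, so that a thousand servers do not all reconnect in the same second. The sketch below shows the idea only. The function name, the one-second base, and the sixty-second cap are illustrative assumptions, not settings recommended by AWS or by this article.

```python
import random


def backoff_delays(attempts, base=1.0, cap=60.0, rng=None):
    """Return one randomized wait (in seconds) per retry attempt.

    Each wait is drawn uniformly from [0, min(cap, base * 2**attempt)].
    The growing ceiling backs clients off; the randomness keeps clients
    that failed together from retrying together.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0.0, ceiling))
    return delays


if __name__ == "__main__":
    # Hypothetical usage: the waits one client would use for 8 retries.
    for attempt, delay in enumerate(backoff_delays(8)):
        print(f"retry {attempt}: wait {delay:.2f}s")
```

The cap matters as much as the growth: without it, a long outage would push waits so high that clients stay away long after the service is healthy again.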

Lesson 4 — Cloud providers provide the tools to manage failure, but it is up to you to put your own failover plans in place. AWS, for example, is divided into regions, each made up of several availability zones. If a component in the Northern Virginia region goes down and takes everything around it down too, then (in theory) you should be able to move your workload and data to another region. That other region might be hosted, unaffected, in Ireland, and then you are up and running again. This is one of the big differences between the cloud and more traditional approaches to IT. It is up to the application (and by extension, the application’s designer) to manage its interaction with the cloud environment, up to and including failover. Most cloud providers offer tools and frameworks to support failover, but you are responsible for building that best practice into your system operations and into your applications.
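As a rough illustration of what "the application manages failover" can mean, the sketch below tries a list of regions in order and returns the first one that answers. It is a minimal sketch under stated assumptions: `call_with_failover`, `AllRegionsFailed`, and `fake_request` are hypothetical names invented here, and the simulated request stands in for whatever real call your application makes. The region identifiers are the AWS names for Northern Virginia and Ireland.

```python
from typing import Callable, Dict, Sequence, Tuple


class AllRegionsFailed(Exception):
    """Raised when every region in the list refused the request."""


def call_with_failover(
    regions: Sequence[str],
    request: Callable[[str], str],
) -> Tuple[str, str]:
    """Try each region in order; return (region, response) from the first success."""
    errors: Dict[str, Exception] = {}
    for region in regions:
        try:
            return region, request(region)
        except ConnectionError as exc:
            # Remember why this region failed, then move on to the next one.
            errors[region] = exc
    raise AllRegionsFailed(errors)


def fake_request(region: str) -> str:
    """Stand-in for a real service call; simulates an outage in us-east-1."""
    if region == "us-east-1":
        raise ConnectionError("simulated outage in " + region)
    return "ok from " + region


if __name__ == "__main__":
    print(call_with_failover(["us-east-1", "eu-west-1"], fake_request))
```

A real plan also has to cover the data: the second region is only useful if it already holds a recent copy of whatever the application needs.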

Lesson 5 — You need to put your failover plans through a full-blown load test. It’s not enough to have a strategy in place for failover. You have to test it under real-world conditions. Even the best-laid failover plans, once designed and implemented, might have hiccups when a real outage occurs. A full-blown cloud load test can help you see how long the failover process will take to kick in and what other dependencies might need to be sorted out. Obviously this isn’t easy. If it were, Reddit, Foursquare, Airbnb and others wouldn’t have been impacted by the AWS outage.
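To make "how long the failover process takes to kick in" measurable, a load test can record a timestamp and a pass/fail result for every request, and the longest unbroken run of failures then approximates the failover time. The sketch below assumes that sample format; `longest_outage` is a hypothetical helper, not part of any load-testing product.

```python
def longest_outage(samples):
    """Return the longest failure window, in seconds.

    `samples` is a list of (timestamp_seconds, ok) pairs sorted by time.
    A window runs from the first failed request to the next success
    (or to the last sample if the service never recovered).
    """
    longest = 0.0
    started = None
    for timestamp, ok in samples:
        if not ok and started is None:
            started = timestamp  # first failure of a new window
        elif ok and started is not None:
            longest = max(longest, timestamp - started)
            started = None
    if started is not None:
        # The test ended while requests were still failing.
        longest = max(longest, samples[-1][0] - started)
    return longest


if __name__ == "__main__":
    # Hypothetical results: failures from t=2s until recovery at t=5s.
    results = [(0.0, True), (1.0, True), (2.0, False), (3.0, False),
               (4.0, False), (5.0, True), (6.0, True)]
    print(longest_outage(results))
```

Running the same measurement while the system is under peak load, not just at idle, is what separates a load-tested failover plan from an untested one.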

Lesson 6 — Conduct fire drills. While a load test will confirm that your failover plan works as you expect, it will also give your team some real experience in executing the plan. Remember the fire drills you used to do in school? Fire drills help train students, teachers, and others to know exactly what they’re supposed to do and where they’re supposed to go in the event of an emergency. All the bugs in the process are worked out during the fire drill, and the more everybody does the drills, the more comfortable they are with what they need to do. And if a real emergency happens, everybody knows how to leave the building calmly. You want to do the same thing with your failover plan, and load testing can help you get there. Fire drills save lives and load tests save cloud apps.

Is your failure worth more than $28?

Amazon offers reimbursement to its customers based on the amount of downtime the customer experiences. The last time our Amazon Web Services went down, we got a $28 reimbursement. So my final lesson learned (I guess this makes for seven lessons) is this: the cost of downtime for your organization, in lost revenue, poor customer experience, and so on, is far, far greater than what you are paying your cloud provider. $28 is not going to save your day. You have to make sure that you have a failover solution that’s ready and working. Don’t wait for Amazon to solve this problem for you, because for Amazon it’s only a $28 problem.
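The comparison is simple arithmetic, and it is worth running with your own figures. In the sketch below, only the $28 credit comes from the experience described above. The hourly revenue and outage length are made-up placeholders, and `downtime_cost` is a hypothetical helper that ignores softer costs such as customer goodwill.

```python
def downtime_cost(revenue_per_hour, hours_down, provider_credit):
    """Return (lost_revenue, shortfall_after_credit) for one outage."""
    lost_revenue = revenue_per_hour * hours_down
    return lost_revenue, lost_revenue - provider_credit


if __name__ == "__main__":
    # Placeholder inputs: $1,000 of revenue per hour, five hours down.
    lost, shortfall = downtime_cost(1000.0, 5.0, 28.0)
    print(f"lost revenue: ${lost:,.2f}, still uncovered: ${shortfall:,.2f}")
```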

The biggest lesson learned from these AWS outages is that you need to configure properly and you need to train your people. These types of events will always happen, and when they do, you need to be trained ahead of time. Load testing itself is a good way to validate and train. That way when a real emergency occurs, your team can react in a calm, collected manner to a situation they’ve experienced dozens of times before.

Zaigam
