Earlier this month, Google Cloud services on the US east coast were hit by a significant outage lasting almost 24 hours. The incident demonstrates that occasionally a critical hardware failure - in this case, a fiber cable break - can undermine even the most thorough redundancy planning.
It’s also a good argument for a multi-cloud approach - that is to say, not putting all your public cloud assets in the hands of one provider.
Fiber optic cables linking Google Cloud servers in its US-East1 region were physically broken. Google responded by rerouting traffic to ensure that customers' services continued to operate until the affected fiber paths were repaired, but some customers experienced elevated latency for up to 24 hours.
While it’s unusual for multiple physical fiber breaks to occur simultaneously, it does happen, and for organizations relying on services operating out of a specific regional data center, the impact can be disastrous.
Can downtime be compared among providers?
There have been some efforts to measure and compare the reliability of cloud providers, but the challenge is that they don’t disclose disruptions in a consistent manner.
In fact, says Zeus Kerravala of ZK Research, some disclosures are confusing to the point where it’s difficult to glean any kind of meaningful conclusion.
According to Kerravala, of the three major providers - Azure, Google Cloud Platform (GCP) and Amazon Web Services (AWS) - Azure provides the least detail. GCP does a better job of providing detail at the service level but tends to be vague about regional information. AWS, meanwhile, has the most granular reporting, showing every service in every region.
There are more inconsistencies here. If an incident occurs that impacts three AWS services and each was unavailable for one hour, AWS would record three hours of downtime. But if Azure reports a one-hour outage that impacts five services in three regions, the status website might show just a single hour, rather than 15 hours of total downtime.
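The arithmetic behind those two reporting conventions can be sketched in a few lines. This is a hypothetical illustration, not anything the providers publish; the function names and data shapes are invented for clarity.

```python
# Two ways of tallying the same incident, reflecting the reporting
# conventions described above (per-service-hours vs. wall-clock hours).
# Each incident is (elapsed_hours, services_affected, regions_affected).

def service_hours(incidents):
    """Sum downtime across every affected service in every region -
    the granular, per-service convention attributed to AWS."""
    return sum(hours * services * regions
               for hours, services, regions in incidents)

def wall_clock_hours(incidents):
    """Count only elapsed incident time, regardless of how many
    services or regions were hit - the single-figure convention."""
    return sum(hours for hours, _, _ in incidents)

# Three AWS services each down for one hour in one region:
aws_style = [(1, 3, 1)]
print(service_hours(aws_style))    # 3 service-hours

# One 1-hour outage hitting 5 services across 3 regions:
azure_style = [(1, 5, 3)]
print(service_hours(azure_style))    # 15 service-hours
print(wall_clock_hours(azure_style)) # 1 hour on the status page
```

The same physical event thus produces a 1-hour or a 15-hour figure depending purely on which convention the status page uses, which is why raw totals from different providers aren't directly comparable.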
The confusion is further compounded by the historical downtime data that is available. At one time, all three major cloud vendors provided a one-year view into outages but Azure has moved to only a 90-day view.
What cost is downtime to your business?
So how can you get as close as possible to an apples-to-apples comparison? Kerravala’s own research found that from the beginning of 2018 through to May 2019, AWS recorded only 338 hours of downtime, followed closely by GCP at 361. Microsoft Azure, meanwhile, racked up a whopping 1,934 hours of self-reported downtime.
But Kerravala is quick to point out that this is an aggregation of the self-reported data from the vendors’ websites, which isn’t the “true” number, as regional information or service granularity is sometimes obscured.
Organizations choose cloud services based on a multitude of factors, from price to local availability to existing infrastructure to specific service requirements. Downtime is just another one of those considerations and should be weighed against your appetite for risk. If some downtime isn’t going to be a big problem, then other factors might sway your decision. If uptime is critical, then that will move to the top of the consideration pile.
Kerravala also warns that buyers should be aware that there is a big difference between service level agreements (SLAs) and downtime. “A cloud operator can promise anything they want, even provide a 100% SLA, but that just means they need to reimburse the business when a service isn’t available. Most IT leaders I have talked to say the few bucks they get back when a service is out is a mere fraction of what the outage actually cost them,” he said.