AWS Outage Exposes Achilles Heel: Central Control Plane
upstart writes:
Too many services depend not just on one cloud provider, but on one location:
Analysis: Amazon's US-EAST-1 region outage caused widespread chaos, taking websites and services offline even in Europe and raising some difficult questions. After all, cloud operations are supposed to have some built-in resiliency, right?
The problems began just after midnight US Pacific Time today when Amazon Web Services (AWS) noticed increased error rates and latencies for multiple services running within its home US-EAST-1 region.
Within a couple of hours, Amazon's techies had identified DNS as a potential root cause of the issue - specifically the resolution of the DynamoDB API endpoint in US-EAST-1 - and were working on a fix.
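To make the DNS failure mode concrete, here is a minimal Python sketch of what a client sees when a regional endpoint name stops resolving. Regional AWS endpoints follow the `service.region.amazonaws.com` pattern; the helper names (`regional_endpoint`, `resolves`, `first_resolvable_region`) and the idea of walking an ordered list of fallback regions are illustrative assumptions, not something described by Amazon or the article.

```python
import socket
from typing import Callable, Iterable, Optional


def regional_endpoint(service: str, region: str) -> str:
    """Build a regional hostname such as dynamodb.us-east-1.amazonaws.com."""
    return f"{service}.{region}.amazonaws.com"


def resolves(hostname: str, resolver: Callable = socket.getaddrinfo) -> bool:
    """Return True if the hostname resolves to at least one address.

    The resolver is injectable (hypothetical design choice) so the check
    can be exercised without touching the network.
    """
    try:
        return len(resolver(hostname, 443)) > 0
    except OSError:  # socket.gaierror (DNS failure) is a subclass of OSError
        return False


def first_resolvable_region(
    service: str,
    regions: Iterable[str],
    resolver: Callable = socket.getaddrinfo,
) -> Optional[str]:
    """Return the first region whose endpoint resolves, or None if none do."""
    for region in regions:
        if resolves(regional_endpoint(service, region), resolver):
            return region
    return None
```

Calling `first_resolvable_region("dynamodb", ["us-east-1", "eu-west-1"])` would skip a region whose endpoint name fails to resolve. This only helps if the application's data is actually available in the fallback region, and it does nothing for features that themselves depend on US-EAST-1.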
However, it was affecting other AWS services, including global services or features that rely on endpoints operating from AWS' original region, such as IAM (Identity and Access Management) updates and DynamoDB global tables.
While Amazon worked to fully resolve the problem, the issue was already causing widespread chaos to websites and online services beyond the Northern Virginia locale of US-EAST-1, and even outside of America's borders.
As The Register reported earlier, Amazon.com itself was down for a time, while the company's Alexa smart speakers and Ring doorbells stopped working. But the effects were also felt by messaging apps such as Signal and WhatsApp, while in the UK, Lloyds Bank and even government services such as tax agency HMRC were impacted.
According to a BBC report, outage monitor Downdetector indicated there had been more than 6.5 million reports globally, with upwards of 1,000 companies affected.
How could this happen? Amazon has a global footprint, and its infrastructure is split into regions: physical locations, each with a cluster of datacenters. Each region consists of a minimum of three isolated and physically separate availability zones (AZ), each with independent power and connected via redundant, ultra-low-latency networks.
Customers are encouraged to design their applications and services to run in multiple AZs to avoid being taken down by a failure in one of them.
Sadly, it seems that the entire edifice has an Achilles heel that can cause problems regardless of how much redundancy you design into your cloud-based operations, at least according to the experts we asked.
"The issue with AWS is that US East is the home of the common control plane for all of AWS locations except the federal government and European Sovereign Cloud. There was an issue some years ago when the problem was related to management of S3 policies that was felt globally," Omdia Chief Analyst Roy Illsley told us.
He explained that US-EAST-1 can cause global issues because many users and services default to using it since it was the first AWS region, even if they are in a different part of the world.
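One way to avoid landing in US-EAST-1 by accident is to refuse to fall back to any default region at all. The Python sketch below assumes the region comes from an explicit argument or from the `AWS_REGION` / `AWS_DEFAULT_REGION` environment variables that AWS tooling commonly reads; the `pick_region` function and the `RegionNotConfigured` exception are hypothetical names introduced for illustration.

```python
import os
from typing import Mapping, Optional


class RegionNotConfigured(Exception):
    """Raised when no region was chosen deliberately (hypothetical helper)."""


def pick_region(
    explicit: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Return a deliberately configured region, never a silent default.

    Order of precedence: explicit argument, then AWS_REGION, then
    AWS_DEFAULT_REGION. If none is set, fail loudly instead of quietly
    falling back to us-east-1.
    """
    region = explicit or env.get("AWS_REGION") or env.get("AWS_DEFAULT_REGION")
    if not region:
        raise RegionNotConfigured(
            "No AWS region configured; set one explicitly rather than "
            "relying on a default."
        )
    return region
```

A deployment in London, say, would call `pick_region("eu-west-2")` or set the environment variable, and a misconfigured host would fail at startup rather than quietly depending on Northern Virginia.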
Certain "global" AWS services or features are run from US-EAST-1 and are dependent on its endpoints, and this includes DynamoDB Global Tables and the Amazon CloudFront content delivery network (CDN), Illsley added.
Read more of this story at SoylentNews.