So, that was a trying few hours yesterday, huh? Amazon S3 starts seeing “increased error rates” in its Northern Virginia region, and the world starts claiming that the internet is broken.
In defense of the hysteria, though, the outage did bring down a whole lot of popular sites. Jordan Novet at VentureBeat compiled the largest list I have seen, although I’m sure there’s a longer one floating around somewhere. There were no doubt thousands of smaller companies and minor applications storing stuff directly on S3 or via third-party services like Heroku (including Revue, my newsletter provider) that went down as well. Here’s the list from VentureBeat:
The issues appear to be affecting Adobe’s services, Amazon’s Twitch, Atlassian’s Bitbucket and HipChat, Autodesk Live and Cloud Rendering, Buffer, Business Insider, Carto, Chef, Citrix, Clarifai, Codecademy, Coindesk, Convo, Coursera, Cracked, Docker, Elastic, Expedia, Expensify, FanDuel, FiftyThree, Flipboard, Flippa, Giphy, GitHub, GitLab, Google-owned Fabric, Greenhouse, Heroku, Home Chef, iFixit, IFTTT, Imgur, Ionic, isitdownrightnow.com, Jamf, JSTOR, Kickstarter, Lonely Planet, Mailchimp, Mapbox, Medium, Microsoft’s HockeyApp, the MIT Technology Review, MuckRock, New Relic, News Corp, OrderAhead, PagerDuty, Pantheon, Quora, Razer, Signal, Slack, Sprout Social, StatusPage (which Atlassian recently acquired), Travis CI, Trello, Twilio, Unbounce, the U.S. Securities and Exchange Commission (SEC), The Verge, Vermont Public Radio, VSCO, Wix, Xero, and Zendesk, among other things. Airbnb, Down Detector, Freshdesk, Pinterest, SendGrid, Snapchat’s Bitmoji, and Time Inc. are currently working slowly.
Apple is acknowledging issues with its App Stores, Apple Music, FaceTime, iCloud services, iTunes, Photos, and other services on its system status page, but it’s not clear they’re attributable to today’s S3 difficulties.
Parts of Amazon itself also seems to be facing technical problems at the moment. Ironically, it’s restricting AWS’ ability to show errors.
There are conflicting reports about whether Netflix went down, which may have something to do with geographic location. However, Netflix is often the poster child for smart AWS architecture during these outages (including in September 2015, when “increased error rates” took down the Amazon DynamoDB service for a while), illustrating the importance of building highly available services and planning for failure.
Some folks will use yesterday’s outage as an example of why companies shouldn’t use the cloud, or (rightly) why they should consider platforms other than or in addition to AWS. But it’s important to remember that AWS only amassed such a large number of users because it works so well overall. Frankly, many of the services affected wouldn’t even exist if not for AWS, and those that did might be down far more frequently if they were forced to rely on their own infrastructure.
For every paragon of software and infrastructure engineering like Facebook, there are several fail whales.
And while folks in Redmond and Mountain View might have been smiling ear to ear yesterday, most of them knew better than to get too cocky. Outages at competing cloud providers don’t cause nearly this large a stir because they’re not serving nearly as many popular applications. (Although, god, what would I do if Snapchat were down?!)
Those other cloud services are not perfect, either. A comment by a Googler on the Hacker News thread about S3 quickly resulted in a litany of complaints against its cloud services.
However, the good news for everyone is that cloud computing providers are getting better, cloud services are getting better and cloud-native architectures are getting better. With luck, we should see far fewer of these hiccups in the coming years, and a much smaller impact even when they do occur. In a few years, it would be reassuring to know that if our favorite services are down, it’s because some serious shit went down.