How a Microsoft Cloud Outage Hit Millions of Users Around the World
An anonymous reader shares Reuters' report from earlier this week:Microsoft Corp said on Wednesday it had recovered all of its cloud services after a networking outage took down its cloud platform Azure along with services such as Teams and Outlook used by millions around the globe. Azure's status page showed services were impacted in Americas, Europe, Asia Pacific, Middle East and Africa. Only services in China and its platform for governments were not hit. By late morning Azure said most customers should have seen services resume after a full recovery of the Microsoft Wide Area Network (WAN). An outage of Azure, which has 15 million corporate customers and over 500 million active users, according to Microsoft data, can impact multiple services and create a domino effect as almost all of the world's largest companies use the platform.... Microsoft did not disclose the number of users affected by the disruption, but data from outage tracking website Downdetector showed thousands of incidents across continents.... Azure's share of the cloud computing market rose to 30% in 2022, trailing Amazon's AWS, according to estimates from BofA Global Research.... During the outage, users faced problems in exchanging messages, joining calls or using any features of Teams application. Many users took to Twitter to share updates about the service disruption, with #MicrosoftTeams trending as a hashtag on the social media site.... Among the other services affected were Microsoft Exchange Online, SharePoint Online, OneDrive for Business, according to the company's status page. "I think there is a very big debate to be had on resiliency in the comms and cloud space and the critical applications," Symphony Chief Executive Brad Levy said. From Microsoft's [preliminary] post-incident review:We determined that a change made to the Microsoft Wide Area Network (WAN) impacted connectivity between clients on the internet to Azure, connectivity across regions, as well as cross-premises connectivity via ExpressRoute. As part of a planned change to update the IP address on a WAN router, a command given to the router caused it to send messages to all other routers in the WAN, which resulted in all of them recomputing their adjacency and forwarding tables. During this re-computation process, the routers were unable to correctly forward packets traversing them. The command that caused the issue has different behaviors on different network devices, and the command had not been vetted using our full qualification process on the router on which it was executed.... Due to the WAN impact, our automated systems for maintaining the health of the WAN were paused, including the systems for identifying and removing unhealthy devices, and the traffic engineering system for optimizing the flow of data across the network. Due to the pause in these systems, some paths in the network experienced increased packet loss from 09:35 UTC until those systems were manually restarted, restoring the WAN to optimal operating conditions. This recovery was completed at 12:43 UTC.
Read more of this story at Slashdot.