Article 6KMH6 Cloudflare Says It's Automated Empathy To Avoid Fixing Flaky Hardware Too Often

Cloudflare Says It's Automated Empathy To Avoid Fixing Flaky Hardware Too Often

by
msmash
from Slashdot on (#6KMH6)
The Register: Cloudflare has revealed a little about how it maintains the millions of boxes it operates around the world -- including the concept of an "error budget" that enacts "empathy embedded in automation." In a Tuesday post titled "Autonomous hardware diagnostics and recovery at scale," the internet-taming biz explains that it built fault-tolerant infrastructure that can continue operating with "little to no impact" on its services. But as explained by infrastructure engineering tech lead Jet Marsical and systems engineers Aakash Shah and Yilin Xiong, when servers did break the Data Center Operations team relied on manual processes to identify dead boxes. And those processes could take "hours for a single server alone, and [could] easily consume an engineer's entire day." Which does not work at hyperscale. Worse, dead servers would sometimes remain powered on, costing Cloudflare money without producing anything of value. Enter Phoenix -- a tool Cloudflare created to detect broken servers and automatically initiate workflows to get them fixed. Phoenix makes a "discovery run" every thirty minutes, during which it probes up to two datacenters known to house broken boxen. That pace of discovery means Phoenix can find dead machines across Cloudflare's network in no more than three days. If it spots machines already listed for repairs, it "takes care of ensuring that the Recovery phase is executed immediately."

twitter_icon_large.pngfacebook_icon_large.png

Read more of this story at Slashdot.

External Content
Source RSS or Atom Feed
Feed Location https://rss.slashdot.org/Slashdot/slashdotMain
Feed Title Slashdot
Feed Link https://slashdot.org/
Feed Copyright Copyright Slashdot Media. All Rights Reserved.
Reply 0 comments