As the Largest Computer Networks Continue To Grow, Some Engineers Fear that Their Smallest Components Could Prove To Be an Achilles' Heel
An anonymous reader shares a report: Imagine for a moment that the millions of computer chips inside the servers that power the largest data centers in the world had rare, almost undetectable flaws. And the only way to find the flaws was to throw those chips at giant computing problems that would have been unthinkable just a decade ago. As the tiny switches in computer chips have shrunk to the width of a few atoms, the reliability of chips has become another worry for the people who run the biggest networks in the world. Companies like Amazon, Facebook, Twitter and many other sites have experienced surprising outages over the last year. The outages have had several causes, like programming mistakes and congestion on the networks. But there is growing anxiety that as cloud-computing networks have become larger and more complex, they are still dependent, at the most basic level, on computer chips that are now less reliable and, in some cases, less predictable. In the past year, researchers at both Facebook and Google have published studies describing computer hardware failures whose causes have not been easy to identify. The problem, they argued, was not in the software -- it was somewhere in the computer hardware made by various companies. Google declined to comment on its study, while Facebook did not return requests for comment on its study. "They're seeing these silent errors, essentially coming from the underlying hardware," said Subhasish Mitra, a Stanford University electrical engineer who specializes in testing computer hardware. Increasingly, Dr. Mitra said, people believe that manufacturing defects are tied to these so-called silent errors that cannot be easily caught. Researchers worry that they are finding rare defects because they are trying to solve bigger and bigger computing problems, which stresses their systems in unexpected ways. Companies that run large data centers began reporting systematic problems more than a decade ago. 
In 2015, in the engineering publication IEEE Spectrum, a group of computer scientists who study hardware reliability at the University of Toronto reported that each year as many as 4 percent of Google's millions of computers had encountered errors that couldn't be detected and that caused them to shut down unexpectedly. In a microprocessor that has billions of transistors -- or a computer memory board composed of trillions of the tiny switches that can each store a 1 or 0 -- even the smallest error can disrupt systems that now routinely perform billions of calculations each second.