Silent Data Corruption in Multicore CPUs?
by business_kid from LinuxQuestions.org on (#5JQRK)
Here's a link with two references and a summary, but Google Scholar has plenty more:
https://hardware.slashdot.org/story/...re-modern-cpus
Basically, it seems that modern high-density fab is causing sporadic errors in CPUs and these are being noticed in big data centres.
As a hardware guy, I know it's a testing nightmare. Testing at max temperature and minimum voltage may help, but may not. These would typically be heavily cooled 250W or 280W packages, and temperature uniformity across the die can only be modeled, not measured. Also, the uniformity of doping could be an issue. "Doping" mixes pure silicon with a tiny percentage of atoms that have one more valence electron (negative, or n-type, doping) or one fewer (positive, or p-type, doping). Lastly, any manufacturing imperfection would do it. CPUs have an extremely low manufacturing pass rate anyhow.
What I can also imagine is the staggering amount of time required to decide core 59 is dodgy, but not 58 or 60. I'm interested in proposed solutions, because nobody seems to have any. I thought about options to disable cores, but once you find the suspect box, the cheapest practical thing is to replace the CPU or indeed the box.
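To give a feel for what "deciding core 59 is dodgy" involves, here is a minimal sketch of the idea, not a real screening tool. It assumes Linux (os.sched_setaffinity is Linux-only) and pins the process to one core at a time, runs the same deterministic calculation on each, and flags any core whose answer differs from the majority. The names workload, run_on_core, suspects and scan are made up for illustration. A real defect may only show up with particular instructions, data patterns, temperatures or voltages, so a toy workload like this would most likely miss it, which is exactly why the real thing takes so long.

```python
import hashlib
import os
from collections import Counter


def workload(seed: int = 12345, iterations: int = 20000) -> str:
    """Deterministic integer + hash workload: same inputs must give the same digest."""
    h = hashlib.sha256()
    x = seed
    for _ in range(iterations):
        # 64-bit linear congruential step, purely to keep the ALU busy.
        x = (x * 6364136223846793005 + 1442695040888963407) % (1 << 64)
        h.update(x.to_bytes(8, "little"))
    return h.hexdigest()


def run_on_core(core: int) -> str:
    """Pin this process to one core, run the workload, then restore affinity."""
    original = os.sched_getaffinity(0)
    try:
        os.sched_setaffinity(0, {core})
        return workload()
    finally:
        os.sched_setaffinity(0, original)


def suspects(results: dict[int, str]) -> list[int]:
    """Return the cores whose digest disagrees with the most common digest."""
    majority, _ = Counter(results.values()).most_common(1)[0]
    return sorted(core for core, digest in results.items() if digest != majority)


def scan(rounds: int = 3) -> list[int]:
    """Repeat the per-core comparison; errors are sporadic, so one pass proves little."""
    flagged = set()
    for _ in range(rounds):
        cores = sorted(os.sched_getaffinity(0))
        results = {core: run_on_core(core) for core in cores}
        flagged.update(suspects(results))
    return sorted(flagged)


if __name__ == "__main__":
    print("suspect cores:", scan())
```

On a healthy machine this should print an empty list. If a core did get flagged, Linux lets root take it offline by writing 0 to /sys/devices/system/cpu/cpuN/online, though as said above, swapping the CPU or the box is likely the cheaper fix.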