Article 6XTM1 Tesla details how it finds punishing defective cores on its million-core Dojo supercomputers — a single error can ruin a weeks-long AI training run

by Anton Shilov (ashilov@gmail.com), from Latest from Tom's Hardware (#6XTM1)
Tesla's Stress tool detects and disables faulty cores in Dojo wafer-scale processors, which power Dojo clusters with millions of cores, without interrupting AI training.
External Content
Source RSS or Atom Feed
Feed Location https://www.tomshardware.com/feeds/all
Feed Title Latest from Tom's Hardware
Feed Link https://www.tomshardware.com/