Characterizing Faults, Errors and Failures in Extreme-Scale Computing Systems

by Rich Brueckner, from High-Performance Computing News Analysis | insideHPC (#3W7R8)

Christian Engelmann from ORNL gave this talk at PASC18. "Building a reliable supercomputer that achieves the expected performance within a given cost budget and providing efficiency and correctness during operation in the presence of faults, errors, and failures requires a full understanding of the resilience problem. The Catalog project develops a fault taxonomy, catalog and models that capture the observed and inferred conditions in current supercomputers and extrapolates this knowledge to future-generation systems. To date, the Catalog project has analyzed billions of node hours of system logs from supercomputers at Oak Ridge National Laboratory and Argonne National Laboratory. This talk provides an overview of our findings and lessons learned."

