Video: Recent Results and Open Problems for Resilience at Scale
by Rich Brueckner from Inside HPC & AI News | High-Performance Computing & Artificial Intelligence

In this video from PASC18, Yves Robert from École normale supérieure de Lyon in France presents: Recent Results and Open Problems for Resilience at Scale. "The talk will address the following three questions: (i) fail-stop errors: checkpointing or replication or both? (ii) silent errors: application-specific detectors or plain old trustworthy replication? In terms of workflows: how to avoid checkpointing every task?"
The post Video: Recent Results and Open Problems for Resilience at Scale appeared first on insideHPC.