Varda: The Mysterious Fiber Bomb Problem: A Debugging Story
Over at the Sandstorm Blog, project founder Kenton Varda relates a debugging war story. Sandstorm web servers would mysteriously peg the CPU around once a week, slowing request processing to a crawl, seemingly at random. "Obviously, we needed to take a CPU profile while the bug was in progress. Of course, the bug only reproduced in production, therefore we'd have to take our profile in production. This ruled out any profiling technology that would harm performance at other times - so, no instrumented binaries. We'd need a sampling profiler that could run on an existing process on-demand. And it would have to understand both C++ and V8 Javascript. (This last requirement ruled out my personal favorite profiler, pprof from google-perftools.)

Luckily, it turns out there is a correct modern answer: Linux's "perf" tool. This is a sampling profiler that relies on Linux kernel APIs, thus not requiring loading any code into the target binary at all, at least for C/C++. And for Javascript, it turns out V8 has built-in support for generating a "perf map", which tells the tool how to map JITed code locations back to Javascript source: just pass the --perf_basic_prof_only_functions flag on the Node command-line. This flag is safe in production - it writes some data to disk over time, but we rebuild all our VMs weekly, so the files never get large enough to be a problem."
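
To make the "perf map" idea concrete, here is a minimal Python sketch of how a sampled address in JITed code can be mapped back to a function name. It assumes the plain-text layout perf conventionally reads for JITed code (one entry per line: hexadecimal start address, hexadecimal size, then the symbol name). The helper names `parse_perf_map` and `resolve` and the sample entries are invented for illustration; they are not taken from the Sandstorm post or from perf itself.

```python
import bisect
from typing import List, Optional, Tuple

# One map entry: (start address, size in bytes, symbol name).
Entry = Tuple[int, int, str]


def parse_perf_map(text: str) -> List[Entry]:
    """Parse perf-map text: each line is '<hex start> <hex size> <name>'."""
    entries: List[Entry] = []
    for line in text.splitlines():
        # The symbol name may contain spaces, so split at most twice.
        parts = line.strip().split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        start_hex, size_hex, name = parts
        try:
            entries.append((int(start_hex, 16), int(size_hex, 16), name))
        except ValueError:
            continue  # skip lines whose address or size is not hex
    entries.sort(key=lambda entry: entry[0])
    return entries


def resolve(entries: List[Entry], addr: int) -> Optional[str]:
    """Return the symbol whose [start, start + size) range contains addr."""
    starts = [entry[0] for entry in entries]
    # Find the last entry that starts at or before addr.
    index = bisect.bisect_right(starts, addr) - 1
    if index < 0:
        return None
    start, size, name = entries[index]
    return name if addr < start + size else None


# Invented sample data, for illustration only.
SAMPLE_MAP = """\
3ef414c0 398 handleRequest /srv/app/server.js:12
3ef41860 1a0 parseHeaders /srv/app/http.js:40
"""

if __name__ == "__main__":
    table = parse_perf_map(SAMPLE_MAP)
    print(resolve(table, 0x3EF414D0))
    print(resolve(table, 0x3EF4185C))
```

With the sample data, the first lookup falls inside the first entry and the second falls in the gap between the two entries, so it resolves to no symbol. A real profiler does this lookup for every sample it collects, which is why the map file has to keep growing as V8 compiles more functions.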