A machine-learning wishlist for hardware designers
Pete Warden (previously) is one of my favorite commentators on machine learning and computer science; yesterday he gave a keynote at the IEEE Custom Integrated Circuits Conference on the ways that hardware specialization could improve machine learning. His main point: though there's a wealth of hardware specialized for training models, we need more hardware optimized for running them.
I've saved what I expect may be my most controversial request until last. The typical design process I've seen from hardware teams is that they will look at some existing ML workloads, note that almost all of the time goes into just a few operations, and so design an accelerator that speeds up those critical-path ops.
This sounds fine in principle, but when an accelerator like that is integrated into a full system it often fails to live up to its potential. The problem is that even though most of the compute for almost all models does go into a handful of common operations, there are hundreds of others that often appear. Almost every model I see has some of these, and they're almost always different from network to network. A good example is 'non-max suppression' in MobileSSD and similar object detection models, where we need some very specific and custom operations to merge the many bounding boxes that are output by the model into just a few coherent final results. This doesn't require very much raw compute, but it does take a lot of logic, and is hard to express except as general C++ code. In a similar way, many audio networks have a feature generation preprocessing step that converts raw audio data into tensors to feed into the neural networks. Even more tricky are custom steps (like modified activation functions) that show up in the middle of networks. Almost none of these operations are compute intensive, but they aren't supported by specialized accelerators.
There are two common answers to this from hardware teams. The first is to fall back to a main application processor to implement these custom operations. If the accelerator is across a system bus from the main CPU this can involve a lot of latency as the two processors have to communicate and synchronize with each other. This latency can easily cancel out any speed advantages from using the accelerator in the first place. Alternatively, the team may direct users towards using 'blessed' models that will run entirely on the accelerator, avoiding any of the tricky custom operations. This can work for some cases, but the majority of the product teams I work with are struggling to train their models to the accuracy they require for their application, so they're usually using custom approaches to achieve the results they need. This makes asking them to switch to a new model and figure out how to achieve similar results within tighter constraints a big ask.
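To give a sense of the kind of logic Warden is describing, here's a minimal greedy non-max suppression sketch in C++ (my own illustration, not code from the talk; the struct, threshold, and helper names are assumptions). The work is sorting, branching, and bookkeeping rather than the big matrix multiplies an accelerator is built around, which is why it tends to fall back to a general-purpose CPU:

    // Hypothetical sketch of greedy non-max suppression: keep the
    // highest-scoring boxes, drop any box that overlaps a kept one too much.
    #include <algorithm>
    #include <vector>

    struct Box {
        float x1, y1, x2, y2;  // corners
        float score;           // detection confidence
    };

    // Intersection-over-union of two boxes.
    float IoU(const Box& a, const Box& b) {
        float ix1 = std::max(a.x1, b.x1), iy1 = std::max(a.y1, b.y1);
        float ix2 = std::min(a.x2, b.x2), iy2 = std::min(a.y2, b.y2);
        float iw = std::max(0.0f, ix2 - ix1), ih = std::max(0.0f, iy2 - iy1);
        float inter = iw * ih;
        float areaA = (a.x2 - a.x1) * (a.y2 - a.y1);
        float areaB = (b.x2 - b.x1) * (b.y2 - b.y1);
        return inter / (areaA + areaB - inter);
    }

    std::vector<Box> NonMaxSuppression(std::vector<Box> boxes, float iou_threshold) {
        // Sort candidates by confidence, best first.
        std::sort(boxes.begin(), boxes.end(),
                  [](const Box& a, const Box& b) { return a.score > b.score; });
        std::vector<Box> kept;
        for (const Box& candidate : boxes) {
            bool suppressed = false;
            for (const Box& k : kept) {
                if (IoU(candidate, k) > iou_threshold) { suppressed = true; break; }
            }
            if (!suppressed) kept.push_back(candidate);
        }
        return kept;
    }

Almost no raw compute, but plenty of data-dependent control flow -- exactly the sort of step that doesn't fit a fixed-function ML accelerator.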
What Machine Learning needs from Hardware [Pete Warden]
(via Four Short Links)