We Can Now Train Big Neural Networks on Small Devices
The gadgets around us are constantly learning about our lives. Smartwatches pick up on our vital signs to track our health. Home speakers listen to our conversations to recognize our voices. Smartphones play grammarian, watching what we write in order to fix our idiosyncratic typos. We appreciate these conveniences, but the information we share with our gadgets isn't always kept between us and our electronic minders. Machine learning can require heavy hardware, so "edge" devices like phones often send raw data to central servers, which then return trained algorithms. Some people would like that training to happen locally. A new AI training method expands the training capabilities of smaller devices, potentially helping to preserve privacy.
The most powerful machine-learning systems use neural networks, complex functions filled with tunable parameters. During training, a network receives an input (such as a set of pixels), generates an output (such as the label "cat"), compares its output with the correct answer, and adjusts its parameters to do better next time. To know how to tune each of those internal knobs, the network needs to remember the effect of each one, but they regularly number in the millions or even billions. That requires a lot of memory. Training a neural network can require hundreds of times the memory called upon when merely using one (also called "inference"). In the latter case, the memory is allowed to forget what each layer of the network did as soon as it passes information to the next layer.
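The memory gap between training and inference can be sketched in a few lines of plain Python. This toy example (invented for illustration, not taken from the paper) uses two one-variable "layers" to show why inference can discard each intermediate value immediately, while training must keep every layer's output around for the backward pass:

```python
# Toy illustration of training vs. inference memory. During inference,
# each layer's output can be forgotten once the next layer consumes it.
# During training, every intermediate value must be saved so the backward
# pass can compute gradients via the chain rule.

def inference(x, layers):
    # Only the current value is held at any moment.
    for f, _ in layers:
        x = f(x)
    return x

def training_forward(x, layers):
    # Every intermediate value is stored for the backward pass.
    saved = [x]
    for f, _ in layers:
        x = f(x)
        saved.append(x)
    return x, saved

def backward(saved, layers, grad_out):
    # Walk backward through the layers, consuming the stored values.
    g = grad_out
    for (f, df), a in zip(reversed(layers), reversed(saved[:-1])):
        g = g * df(a)  # chain rule: multiply by each layer's local derivative
    return g

# Two simple layers: double the input, then square it.
layers = [
    (lambda x: 2 * x, lambda x: 2.0),    # f(x) = 2x,  f'(x) = 2
    (lambda x: x * x, lambda x: 2 * x),  # f(x) = x^2, f'(x) = 2x
]

y = inference(3.0, layers)                  # 36.0, nothing retained
out, saved = training_forward(3.0, layers)  # retains 3 values, not 1
grad = backward(saved, layers, 1.0)         # d/dx (2x)^2 = 8x = 24.0 at x=3
```

In a real network those saved values number in the millions, which is exactly the memory pressure the tricks below attack.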
To reduce the memory demanded during the training phase, researchers have employed a few tricks. In one, called paging or offloading, the machine moves activations (the intermediate values each layer produces) from short-term memory to a slower but more abundant type of memory such as flash or an SD card, then brings them back when needed. In another, called rematerialization, the machine deletes the activations, then computes them again later. Previously, memory-reduction systems used one of those two tricks or, says Shishir Patil, a computer scientist at the University of California, Berkeley, and the lead author of the paper describing the innovation, they were combined "using heuristics" that are "suboptimal," often requiring a lot of energy. The innovation reported by Patil and his collaborators formalizes the combination of paging and rematerialization.
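The trade-off between the two tricks comes down to energy. A rough cost model (the numbers and cost functions here are invented for illustration, not drawn from the paper) makes the tension concrete: paging pays to move bytes to flash and back, while rematerialization pays to redo the computation.

```python
# Toy energy-cost model for one activation (illustrative constants only).

def paging_energy(size_bytes, energy_per_byte=1e-9):
    # Pay to write the activation out to flash and read it back.
    return 2 * size_bytes * energy_per_byte

def remat_energy(recompute_flops, energy_per_flop=1e-12):
    # Pay to recompute the activation from the previous layer.
    return recompute_flops * energy_per_flop

# A large activation that is cheap to recompute favors rematerialization:
big_but_cheap = paging_energy(4_000_000) > remat_energy(1_000_000)

# A small activation from an expensive layer favors paging:
small_but_costly = paging_energy(1_000) < remat_energy(10_000_000_000)
```

Neither choice dominates; which one wins depends on each activation's size and recompute cost, which is why combining them well is an optimization problem rather than a rule of thumb.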
"Taking these two techniques, combining them well into this optimization problem, and then solving it: that's really nice," says Jiasi Chen, a computer scientist at the University of California, Riverside, who works on edge computing but was not involved in the work.
In July, Patil presented his system, called POET (private optimal energy training), at the International Conference on Machine Learning, in Baltimore. He first gives POET a device's technical details and information about the architecture of a neural network he wants it to train. He specifies a memory budget and a time budget. He then asks it to create a training process that minimizes energy usage. The process might decide to page certain activations that would be inefficient to recompute but rematerialize others that are simple to redo but require a lot of memory to store.
One of the keys to the breakthrough was to define the problem as a mixed integer linear programming (MILP) puzzle, a set of constraints and relationships between variables. For each device and network architecture, POET plugs its variables into Patil's hand-crafted MILP program, then finds the optimal solution. "A main challenge is actually formulating that problem in a nice way so that you can input it into a solver," Chen says. "So, you capture all of the realistic system dynamics, like energy, latency, and memory."
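The flavor of the optimization can be shown with a brute-force stand-in for the MILP solver. This sketch is not POET's actual formulation (which uses a real MILP program and a far richer cost model); it simply shows the shape of the decision: for each activation, choose to keep it in RAM, page it, or rematerialize it, minimizing total energy while staying under a memory budget. All numbers are invented.

```python
from itertools import product

# Per-activation costs: (ram_bytes, paging_energy, remat_energy).
# Illustrative values only.
activations = [
    (400, 8.0, 1.0),   # big activation, cheap to recompute
    (100, 2.0, 50.0),  # small activation, expensive to recompute
    (300, 6.0, 6.0),   # middling either way
]

def optimize(acts, memory_budget):
    """Exhaustively try every keep/page/remat assignment and return the
    (energy, plan) pair with the lowest energy that fits in memory."""
    best = None
    for plan in product(("keep", "page", "remat"), repeat=len(acts)):
        # Only activations kept in RAM count against the budget.
        mem = sum(a[0] for a, c in zip(acts, plan) if c == "keep")
        if mem > memory_budget:
            continue
        energy = sum(a[1] if c == "page" else a[2] if c == "remat" else 0.0
                     for a, c in zip(acts, plan))
        if best is None or energy < best[0]:
            best = (energy, plan)
    return best

energy, plan = optimize(activations, memory_budget=400)
# Best plan: rematerialize the cheap-to-redo activation, keep the other two.
```

A real MILP solver reaches the same kind of answer without enumerating every combination, which matters when there are millions of decisions rather than three.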
The team tested POET on four different processors, whose RAM ranged from 32 KB to 8 GB. On each, the researchers trained three different neural network architectures: two types popular in image recognition (VGG16 and ResNet-18), plus a popular language-processing network (BERT). In many of the tests, the system could reduce memory usage by about 80 percent, without a big bump in energy use. Comparable methods couldn't do both at the same time. According to Patil, the study showed that BERT can now be trained on the smallest devices, which was previously impossible.
"When we started off, POET was mostly a cute idea," Patil says. Now, several companies have reached out about using it, and at least one large company has tried it in its smart speaker. One thing they like, Patil says, is that POET doesn't reduce network precision by "quantizing," or abbreviating, activations to save memory. So the teams that design networks don't have to coordinate with teams that implement them in order to negotiate trade-offs between precision and memory.
Patil notes other reasons to use POET besides privacy concerns. Some devices need to train networks locally because they have low or no Internet connection. These include devices used on farms, in submarines, or in space. Other setups can benefit from the innovation because data transmission requires too much energy. POET could also make large devices (Internet servers) more memory efficient and energy efficient. But as for keeping data private, Patil says, "I guess this is very timely, right?"