Deep learning practitioners have demonstrated scaling training across
various numbers of nodes.
Baidu distributed training to 40 GPU nodes; later that year,
UC Berkeley scaled training to 120 GPU nodes.
Their paper provided sufficient details for
other practitioners to build upon their work.
A few months later, Intel demonstrated scaling to 128 CPUs,
Google to 120 GPUs, and Amazon to 120 GPUs.
Most recently, and not shown in this slide,
Facebook demonstrated near-linear scaling to 256 GPUs,
reducing the time to train from several days to just one hour.
With very large batch sizes, the time to train becomes quite long,
making training slow and unable to reach the same accuracy.
Therefore, let's assume that we have a batch size of 1024.
How can we distribute the data across the nodes?
One option is to have 1024 nodes, each with a batch size of 1.
However, with this arrangement, the communication between the nodes becomes
a bottleneck, and the computation on each node is too small to hide it.
On the other hand, using 16 nodes, each with a batch size of 64, is more
reasonable, as most of the communication can be hidden by the computation.
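To make the trade-off concrete, here is a minimal sketch in plain Python;
the global batch size of 1024 and the two node counts are simply the figures
from the example above.

    # Minimal sketch: splitting a global batch of 1024 across worker nodes.
    # The two node counts are the configurations discussed above.
    GLOBAL_BATCH = 1024

    for num_nodes in (1024, 16):
        per_node_batch = GLOBAL_BATCH // num_nodes
        print(f"{num_nodes:5d} nodes -> per-node batch of {per_node_batch}")
        # With 1024 nodes, each node computes on a single sample, so the
        # gradient exchange dominates; with 16 nodes, each node has 64
        # samples of work, enough to hide most of the communication.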
Multi-node training on IntelCaffe, which uses data parallelism,
works in the following manner.
First, the data on a given node is forward-propagated through the network,
which in this case is composed of two layers, L1 and L2.
Then the L2 gradients are sent to the parameter server after that
layer has been back-propagated through.
Similarly, the L1 gradients are subsequently sent to the server after
L1 has been back-propagated through.
When the server receives the L2 gradients from all nodes, it applies an update
and broadcasts it to all the nodes, and likewise with the L1 gradients.
Nodes wait for
these updates before forward-propagating through the updated network.
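To illustrate this flow, here is a simplified, single-process sketch in Python
with NumPy. It is not IntelCaffe's actual implementation: the nodes and the
parameter server are simulated inside one script, purely to show the per-layer
push of gradients, the server-side averaging and update, and the broadcast
before the next forward pass.

    import numpy as np

    # Single-process simulation of the data-parallel flow described above.
    # This is not IntelCaffe's code: the "nodes" are plain function calls and
    # the parameter server is a loop, purely to illustrate the pattern.
    rng = np.random.default_rng(0)
    NUM_NODES, BATCH, D_IN, D_HID, D_OUT, LR = 4, 64, 8, 16, 1, 0.01

    # Shared two-layer model, L1 and L2 (weights only, biases omitted).
    params = {"L1": rng.normal(size=(D_IN, D_HID)) * 0.1,
              "L2": rng.normal(size=(D_HID, D_OUT)) * 0.1}

    def node_step(w, x, y):
        """Forward through L1 then L2, back-propagate through L2 then L1,
        and return the layer gradients in the order they become available."""
        h = x @ w["L1"]                    # forward L1
        pred = h @ w["L2"]                 # forward L2
        d_pred = 2 * (pred - y) / len(x)   # squared-error loss gradient
        g_L2 = h.T @ d_pred                # backward L2 -> sent first
        d_h = d_pred @ w["L2"].T
        g_L1 = x.T @ d_h                   # backward L1 -> sent afterwards
        return {"L2": g_L2, "L1": g_L1}

    for step in range(3):
        # Each node processes its own slice of data with the current weights.
        grads = [node_step(params,
                           rng.normal(size=(BATCH, D_IN)),
                           rng.normal(size=(BATCH, D_OUT)))
                 for _ in range(NUM_NODES)]

        # Parameter server: once a layer's gradients have arrived from all
        # nodes, average them, apply the update, and broadcast it.
        for layer in ("L2", "L1"):         # L2 gradients arrive before L1's
            avg = sum(g[layer] for g in grads) / NUM_NODES
            params[layer] -= LR * avg      # update, then broadcast

        # Nodes wait for the broadcast before the next forward pass; here
        # that is implicit because every node reads the shared `params`.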
Now that we have discussed how data and model parallelism work,
we will consider strategies for implementing gradient aggregation:
parameter server, reduction trees, rings, and butterfly.
One strategy for
communicating gradients is to appoint one node as the parameter server,
which computes the sum of the communicated gradients
and sends the updates to each of the workers.
However, there is a bottleneck in sending and
receiving all of the gradients with just one parameter server.
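A quick back-of-the-envelope estimate shows why. The model size and worker
counts below are assumptions for illustration, but the pattern holds in
general: the server's traffic grows linearly with the number of workers,
while each worker's traffic stays constant.

    # Back-of-the-envelope traffic for a single parameter server per step.
    # The model size and worker counts are illustrative assumptions.
    MODEL_BYTES = 250e6      # e.g. a ~60M-parameter model stored in float32

    for workers in (8, 16, 64, 256):
        server_traffic = 2 * workers * MODEL_BYTES  # receive + send per worker
        worker_traffic = 2 * MODEL_BYTES            # send grads, receive update
        print(f"{workers:4d} workers: server moves {server_traffic / 1e9:7.1f} GB,"
              f" each worker only {worker_traffic / 1e9:.1f} GB")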
Another strategy is an AllReduce tree.
In an AllReduce communication method, each worker produces one or
more data values that must be globally reduced.
Generally, a commutative, binary, element-wise operator is applied
to produce a single result value.
This single value must then be broadcast to all workers
before they can continue.
In an AllReduce tree, the local gradient information is
reduced across the entire network using a tree-based algorithm,
and the result is then broadcast to each individual node.
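Here is a minimal sketch of that pattern in Python, not a real MPI or NCCL
collective: local values are reduced pairwise up a binary tree with a
commutative element-wise operator (addition here), and the single result is
then copied back to every worker.

    import numpy as np

    # Minimal tree-based AllReduce sketch (not a real MPI/NCCL collective).
    def tree_reduce(values, op=np.add):
        """Pairwise reduction up a binary tree: log2(n) levels deep."""
        while len(values) > 1:
            next_level = [op(values[i], values[i + 1])
                          for i in range(0, len(values) - 1, 2)]
            if len(values) % 2:              # odd worker carried up unchanged
                next_level.append(values[-1])
            values = next_level
        return values[0]

    def tree_allreduce(worker_values):
        total = tree_reduce(worker_values)             # reduce up the tree
        return [total.copy() for _ in worker_values]   # broadcast back down

    # Example: 8 workers, each holding a local gradient vector.
    rng = np.random.default_rng(0)
    local = [rng.normal(size=4) for _ in range(8)]
    reduced = tree_allreduce(local)
    assert all(np.allclose(r, sum(local)) for r in reduced)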