0:15

Specifically, we will look at scaling challenges, data and model parallelism,

AllReduce approaches, and synchronous and asynchronous multi-node training and

their trade-offs.

Before we do, let's review gradient descent and stochastic gradient descent.

When using gradient descent to find a local minimum, we take steps proportional

to the negative of the gradient of the cost with respect to the weights.

However, there are three main issues with gradient descent.

First, in gradient descent, you have to run through all of the samples in your

training set to do a single update for a parameter in a particular iteration.
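As a rough sketch in NumPy (using a hypothetical least-squares objective, not one from the lecture), a full-batch gradient descent loop looks like this; note that every single update touches the whole training set:

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, epochs=100):
    """Full-batch gradient descent for a least-squares objective.

    Every parameter update requires a pass over ALL training samples.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / n  # gradient of the cost over the whole set
        w -= lr * grad                # step along the negative gradient
    return w

# Toy data where y = 2 * x, so the optimal weight is 2.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])
w = gradient_descent(X, y)
```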

0:54

Second, gradient descent gets stuck at saddle points,

which are very common in high-dimensional spaces.

AlexNet, for

example, has 60 million dimensions, making saddle points a significant issue.

Finally, gradient descent is thought to converge to sharp

minima more frequently.

This is problematic, as even slight variations between the training and

test objective functions can cause dramatic shifts in their values

when in a sharp minimum.

On the other hand, stochastic gradient descent solves many of these issues

by breaking the training data set into minibatches,

each of which is then used to compute a parameter update.

This algorithm is more exploratory than gradient descent,

converges to flatter minima,

does not get caught in saddle points as easily, and

is much more computationally efficient.

As seen here, in gradient descent, to make a weight update,

one must go through the entire training data set, which can be very slow.

On the other hand, stochastic gradient descent makes an update after each

minibatch, which has a dramatic effect on efficiency.
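A minimal minibatch SGD sketch of the same hypothetical least-squares problem; the only change from the full-batch loop is that each update uses a small random subset of the data:

```python
import numpy as np

def sgd(X, y, lr=0.1, epochs=50, batch_size=2, seed=0):
    """Minibatch SGD: one parameter update per minibatch, not per full pass."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)            # shuffle each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            # Gradient computed on the minibatch only, applied immediately.
            grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= lr * grad
    return w

# Same toy data as before: y = 2 * x, so the optimal weight is 2.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])
w = sgd(X, y)
```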

Nearly all, if not all, deep learning models are trained with some variant

of stochastic gradient descent, for these reasons.

We'll now transition to discussing scale challenges.

When using SGD, choosing the right batch size is very important.

2:20

For smaller batch sizes,

one does not efficiently utilize the computational resources.

Yet for larger batch sizes,

one can run into similar issues that we explored with gradient descent.

Very slow progress near saddle-like points, and getting stuck in sharp minima.

With very deep networks, trained on large data sets,

efficiently parallelizing networks across multiple servers becomes essential,

in order to minimize the time to train.

To this end, we will explore two algorithms, data parallelism and

model parallelism.

In model parallelism, we split the model weights among N nodes,

with each node working on the same minibatch.

With data parallelism, we use the same model with every node, but

feed it different parcels of data.

Such a method is better for networks with few weights, like GoogLeNet,

and is currently used in Intel Optimized Caffe.

Visually, we can see that with data parallelism,

the algorithm distributes the data between various nodes.

Each node independently tries to estimate the same parameters,

and they then exchange their estimates at each layer,

using a parameter server or an AllReduce method, as we will discuss,

to come up with the right estimate for the step.

While the minibatch is distributed over many nodes, three in this example,

one cannot simply increase the batch size by three times,

as the time to train increases with larger minibatches,

due to similar issues as with gradient descent.

With model parallelism, the algorithm sends the same data to all nodes, but

each node is responsible for estimating different parameters.

These nodes then exchange their estimates with each other,

to come up with the right estimates for all the parameters.
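A toy sketch of this idea, assuming a single hypothetical linear layer: the weight matrix is split row-wise across two "nodes", both of which see the same data, and their partial outputs are combined to recover the full result:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))  # full weights of one linear layer
x = rng.normal(size=4)       # the same minibatch input goes to every node

# Single node computes the full output.
full = W @ x

# Two "nodes" each hold half of the rows (output neurons), compute their
# partial outputs on the same data, then exchange/concatenate the results.
W_a, W_b = W[:3], W[3:]
combined = np.concatenate([W_a @ x, W_b @ x])
```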

Data parallelism is preferred when the number of weights is small,

which is true for linear topologies.

When updating weights on a single node using stochastic gradient descent,

we pass in x training examples, where x is the batch size,

and forward-propagate them through the network.

After computing the cost of all examples, we then compute the data gradient

from layer to layer, and update the model weights using the weight gradients.

When we have multiple nodes, 32 in this example, and

are using data parallelism, we partition the minibatch

into 32 subsections, and distribute them to 32 workers.

When back-propagating, each worker computes a sum of the weight gradient for

each subset of their batch.

The weight gradients are then summed across workers,

producing an identical numerical result to what one would find on a single node

training with the large batch size.
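This equivalence can be checked numerically; the following sketch (with made-up data and a least-squares weight gradient, not any particular network) splits one minibatch across four "workers" and compares the summed per-worker gradients against the single-node gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))   # one minibatch of 32 samples
y = rng.normal(size=32)
w = rng.normal(size=4)         # current weights (identical on all workers)

def weight_grad_sum(Xb, yb, w):
    # Sum (not mean) of per-sample least-squares weight gradients.
    return Xb.T @ (Xb @ w - yb)

# Single node: gradient summed over the whole minibatch.
single = weight_grad_sum(X, y, w)

# Data parallel: split the minibatch across 4 workers, sum local gradients.
parts = np.array_split(np.arange(32), 4)
distributed = sum(weight_grad_sum(X[p], y[p], w) for p in parts)
```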

5:16

Deep learning practitioners have demonstrated scaling across various numbers of

nodes.

Baidu distributed training to 40 GPU nodes; later that year,

UC Berkeley scaled training to 120 GPU nodes.

Their paper provided sufficient details for

other practitioners to build upon their work.

A few months later, Intel demonstrated scaling to 128 CPUs,

Google to 120 GPUs, Amazon to 120 GPUs.

Most recently, and not shown in this slide,

Facebook demonstrated near-linear scaling to 256 GPUs,

reducing the time to train from several days to just one hour.

With very large batch sizes, the time to train becomes quite large,

making training slow and unable to reach the same accuracy.

Therefore, let's assume that we have a batch size of 1024,

how can we distribute the data across nodes?

One option is to have 1024 nodes, each node with a batch size of 1.

However, with this arrangement, the communication between the nodes becomes

a bottleneck, and the computation itself on each node is too little.

On the other hand, using 16 nodes, each with a batch size of 64, is more

reasonable, as most of the communication is hidden in the computation.

Multi-node training on Intel Caffe, which uses data parallelism,

works in the following manner.

First, the data on a given node is forward-propagated through the network,

which in this case is composed of two layers, L1 and L2.

Then the L2 gradients are sent to the parameter server after that

layer has been back-propagated through.

Similarly, the L1 gradients are sent subsequently to the server after

L1 has been back-propagated through.

When the server receives the L2 gradients from all nodes, it then applies an update,

and broadcasts it to all the nodes, and likewise with the L1 gradients.

Nodes wait for

these updates before forward-propagating through the updated network.
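A simplified sketch of one such parameter-server step, with hypothetical layer names, shapes, and values rather than the actual Intel Caffe implementation:

```python
import numpy as np

def parameter_server_step(node_grads, weights, lr=0.01):
    """Server sums the gradients received from all nodes, applies one
    update per layer, and returns the weights to broadcast back."""
    summed = {name: sum(g[name] for g in node_grads) for name in weights}
    return {name: weights[name] - lr * summed[name] for name in weights}

# Two layers, L1 and L2, with made-up shapes and values.
weights = {"L1": np.ones(3), "L2": np.ones(2)}

# Gradients from two nodes; in the flow above, each node sends L2 first,
# since L2 is back-propagated through before L1.
node_grads = [
    {"L1": np.full(3, 0.5), "L2": np.full(2, 1.0)},
    {"L1": np.full(3, 0.5), "L2": np.full(2, 1.0)},
]
weights = parameter_server_step(node_grads, weights)
```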

Now that we have discussed how data and model parallelism work,

we will consider strategies for implementing gradient aggregation:

parameter servers, reduction trees, rings, and butterflies.

One strategy for

communicating gradients is to appoint one node as the parameter server,

which computes the sum of the communicated gradients

and sends the updates to each of the workers.

However, there is a bottleneck in sending and

receiving all of the gradients with just one parameter server.

Another strategy is an AllReduce tree.

An AllReduce communication method is where each worker produces one or

more data values that must be globally reduced.

Generally, we have a commutative, binary, element-wise operator

to produce a single result value.

And then this single value must be broadcast to all workers

before they can continue.

In an AllReduce tree, the local gradient information is

reduced across the entire network using a tree-based algorithm,

and the result is then broadcast to each individual node.

8:35

In this example, there are seven nodes, and

each has a gradient value between one and seven.

In this AllReduce tree example, the goal is to sum all of the values.

In the first step, 1, 2, and 5 are summed to 8, and 3, 4, and 6 are summed to 13.

Then these results, 8 and 13, are summed with 7 to make 28, in the next step.

Finally, 28 is then broadcasted to the rest of the network in just two steps.

In total, if N is the number of nodes, the time is a function of the log of N,

which makes it ideal when the number of nodes is a power of two minus one.
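The worked example can be traced directly in code, as a hand-unrolled sketch of this particular seven-node tree rather than a general implementation:

```python
values = [1, 2, 3, 4, 5, 6, 7]  # one local gradient value per node

# Reduce step 1: the two subtrees sum concurrently.
left = values[0] + values[1] + values[4]   # 1 + 2 + 5 = 8
right = values[2] + values[3] + values[5]  # 3 + 4 + 6 = 13

# Reduce step 2: the root combines both subtree sums with its own value.
root = left + right + values[6]            # 8 + 13 + 7 = 28

# Broadcast: the total travels back down the tree, so every node ends
# up holding the same reduced value.
result = [root] * len(values)
```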

9:19

AllReduce butterfly requires the least number of steps,

as it does not require separate reduce and broadcast steps.

In this example, there are four nodes and two steps; starting from the top,

we can see how the communication happens at each step.

As shown here in step one, nodes 1 and

2 are first summed concurrently with nodes 3 and 4,

resulting in the first two nodes holding the value 3 and

the next two nodes holding the value 7.

In step two, the nodes with 3 and 7 communicate,

resulting in all the nodes with a value of 10.

The complexity of this algorithm is also log base 2 of N, but

it requires half the steps of the AllReduce tree.
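A small simulation of a butterfly (recursive-doubling) AllReduce, a sketch in which each node exchanges partial sums with the partner whose index differs in one bit:

```python
def butterfly_allreduce(values):
    """Butterfly (recursive-doubling) AllReduce for a power-of-two node count.

    At step s, every node i exchanges its partial sum with partner
    i XOR 2**s, so after log2(N) steps each node holds the full sum,
    with no separate broadcast phase.
    """
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "node count must be a power of two"
    sums = list(values)
    step = 1
    while step < n:
        sums = [sums[i] + sums[i ^ step] for i in range(n)]
        step *= 2
    return sums

# The four-node example: step one pairs nodes (1,2) and (3,4),
# step two crosses the pairs, leaving every node with the total.
result = butterfly_allreduce([1, 2, 3, 4])
```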

A ring AllReduce algorithm is useful as the communication cost is constant,

independent of the number of nodes in the system, and

determined solely by the slowest connection.

10:16

Each node will only ever send data to its right neighbor, and

receive data from its left neighbor.

The algorithm proceeds in two steps.

First, a scatter-reduce to exchange data, and

then an all-gather to communicate the final result.
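These two phases can be simulated as follows; in this sketch, `data[i]` is node i's gradient, pre-split into one chunk per node, and every send goes only to the right neighbor:

```python
def ring_allreduce(data):
    """Ring AllReduce sketch: data[i] is node i's gradient, pre-split into
    one chunk per node. Every send goes only to the right neighbor."""
    n = len(data)
    chunks = [list(row) for row in data]

    # Scatter-reduce: n-1 steps; each node passes one chunk right, and the
    # receiver adds it into its own copy of that chunk.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, chunks[i][(i - step) % n])
                 for i in range(n)]
        for i, c, val in sends:
            chunks[(i + 1) % n][c] += val
    # Node i now owns the fully reduced chunk (i + 1) % n.

    # All-gather: n-1 more steps; completed chunks circulate around the
    # ring, overwriting the stale partial sums on each node.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, chunks[i][(i + 1 - step) % n])
                 for i in range(n)]
        for i, c, val in sends:
            chunks[(i + 1) % n][c] = val
    return chunks

# Three nodes, three chunks each; the column sums are 12, 15, 18.
result = ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```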

While we have looked at synchronous multi-node training,

we will now briefly discuss asynchronous multi-node training, and its trade-offs.

In order to accelerate the convergence of SGD, and improve the training speed,

asynchronous parallelization of stochastic gradient descent has been investigated.

While asynchronous SGD overcomes communication delays,

it comes with a myriad of problems.

The algorithm requires more tuning of hyperparameters such as momentum and

learning rate, and requires more epochs to train.

Furthermore, it does not match single node performance.

It's quite hard to debug, and has not been shown to scale and

retain accuracy on large models.

Therefore, while there do exist benefits of asynchronous SGD,

there exist many issues that have not yet been fully addressed.