0:02

In this video, we will speak about the simple Recurrent Neural Network

and how to use backpropagation to train it.

In the previous video, we spoke about a recurrent architecture.

In this architecture, we work with sequences, element by element.

Here, the x's are the elements of the input sequence, and at each time step,

we have some output or a prediction, y-hat.

We can use this architecture for a lot of different tasks.

For example, if we want to use it for language modeling,

then the x's should be word embeddings, and the output

of our model at each time step is a probability distribution over the vocabulary,

which says which word is more probable to be the next one in our sequence.

Of course, we can generate a word from this probability distribution and give

it to our model as an input at the next time step, so we can generate text.
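
The generation loop just described can be sketched as follows. This is a hedged illustration: `step` is a hypothetical stand-in for one forward step of a trained RNN (not a function from the video), and the toy model below just returns a uniform distribution.

```python
import random

# Sketch of the text-generation loop: sample a word from the output
# distribution, then feed it back as the next input.
def sample_text(step, h0, start_token, length, vocab):
    h, token, out = h0, start_token, []
    for _ in range(length):
        h, probs = step(h, token)                        # forward one step
        token = random.choices(vocab, weights=probs)[0]  # sample next word
        out.append(token)
    return out

# toy "model": ignores its inputs and returns a uniform distribution
vocab = ["the", "cat", "sat"]
def toy_step(h, token):
    return h, [1.0] * len(vocab)

words = sample_text(toy_step, None, "the", 5, vocab)
print(words)  # five words sampled from the vocabulary
```

With a real trained model, `step` would run the hidden-state update and output layer described below.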

We can also use a recurrent architecture to solve a POS tagging task.

In this task, we want to assign each word in the text a tag,

which is a part-of-speech label.

Here we have the same inputs, word embeddings,

but the output is different now.

Now, at each time step,

the output is a probability distribution over the set of

all possible POS tags, which says which POS tag is more probable for the current word.

Now we see that we can use this architecture for different tasks, and here,

in the middle of this architecture,

we see this box, MLP,

which can be quite different inside.

Let's use the simplest MLP that exists.

This MLP is just one layer of hidden neurons.

Then, we have a vector of hidden neurons in the hidden layer of our network,

and we can use these hidden neurons both

to calculate the output of our network, the predictions y-hat,

and to calculate the hidden neurons at the next time step.

On the slide, you can see the formulas for the forward pass in our network,

and the equations here are pretty similar to the ones you already saw in the course.

Here, again, you have some nonlinear functions

f of linear combinations of the inputs.

But here you need to remember that all the parameters are shared between

time steps, so we use the same weight matrices and the same bias vectors at each time step.
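
The shared-weight forward pass can be sketched in pure Python. The conventions here are assumptions consistent with the rest of the video (V maps input to hidden, W is the recurrent hidden-to-hidden matrix, U maps hidden to output, f = tanh, g = softmax); the slide's exact formulas may use different letters.

```python
import math

def matvec(M, v):
    """Plain matrix-vector product over nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(z):
    m = max(z)
    e = [math.exp(zi - m) for zi in z]
    s = sum(e)
    return [ei / s for ei in e]

def rnn_forward(xs, V, W, U, b_h, b_y):
    """The same V, W, U, b_h, b_y are reused at every time step."""
    h = [0.0] * len(W)                 # initial hidden state h_0
    hs, ys = [], []
    for x in xs:
        pre = [a + b + c for a, b, c in zip(matvec(V, x), matvec(W, h), b_h)]
        h = [math.tanh(p) for p in pre]         # h_t = f(V x_t + W h_{t-1} + b_h)
        y = softmax([a + b for a, b in zip(matvec(U, h), b_y)])  # y-hat_t
        hs.append(h)
        ys.append(y)
    return hs, ys

# toy sizes: input 2, hidden 2, output vocabulary 3
V = [[0.5, -0.3], [0.8, 0.2]]
W = [[0.1, 0.4], [-0.2, 0.3]]
U = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
b_h, b_y = [0.0, 0.0], [0.0, 0.0, 0.0]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hs, ys = rnn_forward(xs, V, W, U, b_h, b_y)
print(len(ys), sum(ys[0]))   # one distribution per step, each sums to 1
```

Note that only one set of parameters exists no matter how long the sequence is; the loop reuses them at every step.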

This type of visualization of a recurrent neural network is usually called the unfolded version.

And there is another version of the notation for recurrent neural networks,

a more compact one.

Actually, this compact visualization is more general

because it doesn't depend on the length of the input sequence,

while the unfolded version actually depends on it.

But in the compact version,

we have this recurrent connection,

which forms a loop in our scheme, and it's actually

not clear how to train a network with a loop.

That's why we will use the unfolded visualization when we speak about training.

Now, let's speak about the training of a Recurrent Neural Network.

Here we see a Recurrent Neural Network in the unfolded form.

To train it, we actually need some data.

Let's say that we have a training dataset, so for some training sequence we know

all the true labels y, and we make

some predictions y-hat; to understand how wrong we are, we

use some loss function L, but we use this loss function separately at each time step,

and then we sum up all of these losses to have a total loss.
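
Summing the per-step losses into one total loss is a one-liner. As an assumption (the video does not fix a particular loss), cross-entropy is used here, which fits the probability-distribution outputs discussed earlier.

```python
import math

def total_loss(y_true, y_hat):
    """Sum the per-time-step losses L_t = -log p(true label) into one total L."""
    return sum(-math.log(probs[label]) for label, probs in zip(y_true, y_hat))

# toy sequence of 3 steps, each output a distribution over 3 labels
y_hat = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.3, 0.5]]
y_true = [0, 1, 2]
L = total_loss(y_true, y_hat)
print(round(L, 4))  # -> 1.273
```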

Now, we have our usual neural network,

we have data, we have a loss; what should we do?

We need to backpropagate.

As usual in backpropagation,

we need to make a forward pass and a backward pass.

In the forward pass, we need to calculate all the elements in

our neural network, so we calculate the hidden elements,

then we make our predictions, and we calculate the loss.

In the backward pass,

we need to calculate the gradients of our loss function with respect to

all the parameters, so with respect to the weight matrices and the bias vectors.

In a usual feed-forward neural network,

we backpropagate through the different layers, so in the vertical direction,

but here we have sequences.

That's why we need to backpropagate not only through layers but

also through time, in the horizontal direction.

That's why the algorithm for training recurrent neural networks is

usually called backpropagation through time.

Additionally, we need to remember that all the parameters are shared between time steps.

Therefore, if we want to calculate the gradient

of our loss with respect to some parameter,

we actually need to sum up the gradients from all the time steps.

Let's, for example, calculate the gradient of

the loss function with respect to the weight matrix U.

As I already said, we need to sum up the gradients from all the time steps in

our sequence, and to calculate the gradient of the loss at a specific time step t,

we can simply use the chain rule.

As a result, we have this product, and the first element in this product

we can calculate if we know which loss function we used,

and the second element

we can actually also calculate very simply,

because our prediction depends on the matrix U in only one place.
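
Written out, the chain-rule step just described is (a reconstruction in the conventions used here, with y-hat_t depending on U only through the output layer; the slide's exact notation may differ):

```latex
\frac{\partial L}{\partial U}
  = \sum_{t} \frac{\partial L_t}{\partial U}
  = \sum_{t} \frac{\partial L_t}{\partial \hat{y}_t}\,
             \frac{\partial \hat{y}_t}{\partial U}
```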

Now let's consider a more difficult example.

Let's calculate the gradient of our loss function with respect to

the recurrent weight matrix W.

In the first step, we do the same as before.

We simply sum up the gradients from all the time steps in

our sequence, and actually we want to do the same as before in the next step too,

but if we look at the formula for the hidden units,

we see that there is more than one dependence

between our weight matrix W and the hidden units at

time step t, because the hidden units at the previous time step actually also depend on W.

So, here, we need to go deeper and use not the simple chain rule,

but the formula for the total derivative.

After applying this formula,

we end up with the following equation for the gradient of

our loss function with respect to W.
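
A reconstruction of that equation (treat the indexing as an assumption, with h_0 the initial hidden state):

```latex
\frac{\partial L}{\partial W}
  = \sum_{t} \frac{\partial L_t}{\partial \hat{y}_t}\,
             \frac{\partial \hat{y}_t}{\partial h_t}
    \sum_{k=1}^{t} \left( \prod_{i=k+1}^{t}
             \frac{\partial h_i}{\partial h_{i-1}} \right)
             \frac{\partial^{+} h_k}{\partial W}
```

Here the plus sign marks the "explicit" derivative of h_k with respect to W, taken while holding h_{k-1} fixed; the inner sum over k is the total-derivative part.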

In the last part of this equation,

you can see that first we take into

account the dependence between the hidden units at time step t and the

weight matrix W.

Then we take one step back and take into account the dependence

between the previous hidden units and the weight matrix W,

and then we take one step back, and so on and so on, until we

reach the beginning of our sequence.

As a result, here,

the last part of this equation is the sum of

the contributions from all the previous time steps to

the gradient at time step t.

And to calculate the contribution from time step k to the gradient at time step t,

we actually need to go from the hidden units at time step t

to the hidden units at time step k. In each element of the sum,

we have a product of the Jacobian matrices: the gradients of the hidden units

at one time step with respect to the hidden units at the previous time step.
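
This walk backwards in time can be made concrete with a scalar RNN, where every Jacobian is just a number. The setup here is an assumption, not the video's exact model: h_t = tanh(v*x_t + w*h_{t-1}), y_t = u*h_t, squared-error loss; the analytic gradient is checked against a finite-difference estimate.

```python
import math

def forward(xs, v, w, u, targets):
    h, hs = 0.0, [0.0]                 # h_0 = 0
    loss = 0.0
    for x, t in zip(xs, targets):
        h = math.tanh(v * x + w * h)   # shared parameters at every step
        hs.append(h)
        loss += 0.5 * (u * h - t) ** 2
    return hs, loss

def grad_w(xs, v, w, u, targets):
    """Sum over t; inside each t, sum the contribution of every earlier
    step k, extending the Jacobian product dh_i/dh_{i-1} = (1-h_i^2)*w."""
    hs, _ = forward(xs, v, w, u, targets)
    g = 0.0
    for t in range(1, len(hs)):
        dL_dh = (u * hs[t] - targets[t - 1]) * u   # dL_t/dy_t * dy_t/dh_t
        chain = 1.0                                # empty product for k = t
        for k in range(t, 0, -1):                  # walk back in time
            # explicit derivative of h_k w.r.t. w (holding h_{k-1} fixed)
            g += dL_dh * chain * (1 - hs[k] ** 2) * hs[k - 1]
            chain *= (1 - hs[k] ** 2) * w          # extend the product
    return g

xs, targets = [0.5, -1.0, 0.3], [0.1, 0.2, -0.4]
v, w, u = 0.7, 0.9, 1.1
analytic = grad_w(xs, v, w, u, targets)
eps = 1e-6   # central finite-difference check of the same gradient
numeric = (forward(xs, v, w + eps, u, targets)[1]
           - forward(xs, v, w - eps, u, targets)[1]) / (2 * eps)
print(abs(analytic - numeric))  # tiny: the two gradients agree
```

The inner loop is exactly the "one step back, and so on" walk from the transcript; with vector hidden states, `chain` would be a product of Jacobian matrices instead of scalars.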

Now, we have calculated the gradient of our loss function

with respect to two weight matrices, U and W.

When we calculated the gradient of our loss function with respect to U,

we needed to go backwards only in layers,

so in the vertical direction.

When we calculated the gradient with respect to W, we needed to go

backwards both in the vertical direction and in the horizontal direction,

so backwards in layers and in time.

Now we have the last weight matrix, V,

and here is a question for you.

Do we really need to go backwards in time to calculate the gradient of

our loss function with respect to the weight matrix V?

Yes, here we have the same situation as with

the recurrent weight matrix W, because the dependence

between the hidden units and the weight matrix V is actually not only in one place.

The hidden units from all the previous time steps also depend

on V, so we need to go backwards in time to calculate this gradient.
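
Under the same conventions as before, the gradient with respect to V has the same Jacobian-product structure as the one for W (again a reconstruction, not the slide verbatim; only the explicit derivative at step k changes):

```latex
\frac{\partial L}{\partial V}
  = \sum_{t} \frac{\partial L_t}{\partial \hat{y}_t}\,
             \frac{\partial \hat{y}_t}{\partial h_t}
    \sum_{k=1}^{t} \left( \prod_{i=k+1}^{t}
             \frac{\partial h_i}{\partial h_{i-1}} \right)
             \frac{\partial^{+} h_k}{\partial V}
```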

Let's summarize what you've learned in this video.

We now know what the simple Recurrent Neural Network is

and how to train it using backpropagation through time.

This backpropagation through time algorithm is actually just simple backpropagation,

but with a fancy name.

In the next video, we will discuss

the difficulties that arise during the training of Recurrent Neural Networks.
