All right. In this video,

we're going to introduce Recurrent Neural Networks in Keras,

and in particular talk about LSTMs.

As an application, we're going to classify sentiments from movie reviews.

So there's a few architectures available for Recurrent Neural Networks in Keras.

The first one is a class called SimpleRNN,

which is a basic, plain-vanilla recurrent neural network,

which suffers from problems like vanishing and exploding gradients.

So you see those very rarely used in practice.

The Gated Recurrent Unit (GRU), introduced in 2014,

certainly has its use cases.

Also, LSTMs,

long short-term memory models, introduced in 1997 by Hochreiter and Schmidhuber.

Those are the most popular Recurrent Neural Networks in Keras, and maybe overall.

So, oftentimes when people talk about using RNNs, they usually mean LSTMs.

We'll focus in this lecture exclusively on LSTMs.

Alright. How do we build LSTM layers with Keras?

There's a few things we have to specify.

Recall that with an LSTM layer,

you have two sets of weights.

First, a regular set of weights and then also a recurrent set of weights.

So in Keras, the latter is called the recurrent kernel.

And apart from specifying the units,

so the number of units, and the activation function for the layer,

we also have to specify everything recurrent.

So for instance we need a recurrent initializer, a recurrent activation and so on.

And if you want to specify a recurrent regularizer or constraint, you can do that too.

Something of note here is that an LSTM layer

allows you to specify dropout within the layer.

So you don't have to specify a Dropout layer separately,

but you can specify a dropout rate for the LSTM,

both for the regular set of weights,

specified by dropout, and for the recurrent weights, specified by recurrent_dropout.
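To make the keywords concrete, here is a minimal sketch of such an LSTM layer; the unit count and rates are made-up illustration values, not from the lecture:

```python
from tensorflow import keras

# An LSTM layer with both kinds of dropout built in:
# `dropout` applies to the regular (input) weights,
# `recurrent_dropout` to the recurrent weights.
layer = keras.layers.LSTM(
    units=32,                           # number of units
    activation="tanh",                  # activation function for the layer
    recurrent_activation="sigmoid",     # activation for the recurrent step
    recurrent_initializer="orthogonal", # initializer for the recurrent weights
    dropout=0.2,                        # dropout rate on the regular weights
    recurrent_dropout=0.2,              # dropout rate on the recurrent weights
)
```

No separate Dropout layer is needed; the rates live inside the LSTM itself.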

The last keyword we mention here is

return_sequences, which is set to False by default.

So, what that means is, if you set it to True,

then the return value of your LSTM won't be a simple vector but a matrix instead.

So, the way you run an LSTM is by applying it at different time steps.

Right? You go through different time steps and at the end the output is

one vector, as usual.

If you set return_sequences to true,

you will also get returned

all of the intermediate values, ending up with a matrix over time.
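The effect on output shapes can be sketched like this; the batch size, sequence length, and feature counts are made-up illustration values:

```python
import numpy as np
from tensorflow import keras

# A mini-batch of 4 sequences, 10 time steps each, 8 features per step.
x = np.random.rand(4, 10, 8).astype("float32")

# Default return_sequences=False: one vector per sequence.
last_only = keras.layers.LSTM(16)(x)

# return_sequences=True: all intermediate values, a matrix per sequence.
full_seq = keras.layers.LSTM(16, return_sequences=True)(x)

print(last_only.shape)  # (4, 16)
print(full_seq.shape)   # (4, 10, 16)
```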

Okay. To build our application we need to introduce

another layer which is quite useful for many such applications.

Embedding layers

are used as the very first layer in Keras.

And what you use them for is to transform integers into vectors.

To give you one example,

take the two numbers 3 and 12;

those can be embedded into vectors of length two,

which are shown on the right.

A prototypical example of what an embedding layer is going to do

is this: we want to embed a certain vocabulary into a vector space,

meaning each word in the vocabulary has to be

embedded, or has to be given a representation in a vector space,

and then you apply those embeddings to sequences of words.

This is how you can map a 2D input to a 3D output, which connects to an LSTM.

So meaning, if you have a mini-batch of sequences of IDs,

you can map those to a mini-batch of sequences of vectors, that is, a mini-batch of matrices.

And this can be fed into an LSTM.
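This 2D-to-3D mapping can be sketched as follows; the vocabulary size and the IDs are made up for illustration, apart from the 3 and 12 from the example above:

```python
import numpy as np
from tensorflow import keras

# Embed integer IDs from a vocabulary of 20 into vectors of length 2.
embedding = keras.layers.Embedding(input_dim=20, output_dim=2)

# A mini-batch of 2 sequences of 3 word IDs each: shape (2, 3) ...
ids = np.array([[3, 12, 5],
                [7, 1, 12]])

# ... becomes a mini-batch of 2 matrices of shape (3, 2): shape (2, 3, 2).
vectors = embedding(ids)
print(vectors.shape)
```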

Alright. How do we initialize an embedding with Keras?

There's a few things we have to specify.

First off, the input dimension is

our vocabulary size, which we have to define beforehand.

We also need to specify the output vector length, the output dimension.

In our example from the previous slide,

we chose two-dimensional vectors, but usually one takes

much larger vectors as the output dimension.

The third item we need to specify here

is the so-called embeddings initializer.

So the weights of an embedding in

Keras are called embeddings weights, and we initialize them like that.

A fourth keyword that's quite interesting is the mask_zero flag.

So, your input sequences into an embedding layer

may have different lengths.

If you think about sentences of different lengths,

we can use the value zero as a special value that we can then mask out.

So for instance, you start out with sequences of various lengths,

and then you pad those sequences with zeros to make them the same size.

Only then do we mask out the zero values.
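That padding-and-masking workflow can be sketched like this; the sequences are made-up IDs, and note that pad_sequences pads at the front by default:

```python
from tensorflow import keras

# Sequences of various lengths ...
seqs = [[5, 8], [3, 12, 7, 2], [9]]

# ... padded with zeros (at the front by default) to the same size.
padded = keras.utils.pad_sequences(seqs, maxlen=4)

# With mask_zero=True, the embedding layer masks out the zero values,
# so downstream layers like LSTM ignore the padding positions.
embedding = keras.layers.Embedding(input_dim=20, output_dim=2, mask_zero=True)
mask = embedding.compute_mask(padded)
print(padded)
print(mask)
```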

Okay.

Our use case here is sentiment classification of movie reviews,

using a dataset of 25,000 movie reviews from IMDB.

Those are labeled either as good or bad.

The dataset is also available through the Keras datasets module.

And the good thing is that the data is already preprocessed into sequences of integers.

So, we don't actually have to work with

the word vocabulary; the words have already been mapped to integers.

Our task then is to classify the sentiment, so good or bad, from the review content.

And the way we are going to tackle this is we first embed our sentences with

an embedding layer and then learn the sequential structure with an LSTM.

So how do we do this?

So we have a few imports: the Sequential model, the Dense layer,

the Embedding layer and LSTM, and we also import our dataset.

So first off, we start by specifying the maximum number of features, which we set to 20,000,

meaning we only choose to model the 20,000 most common items from our vocabulary.

We specify a maximum length for our sequences.

So, we only allow sequences of length 80, which is quite short

for movie reviews, but that's what we choose.

So then we can simply load the IMDB data into memory

by calling imdb.load_data with

the maximum number of features as specified, to get back training and test data.

Alright. Next, we pad our sequences according to the maxlen we have specified.

So that would mean if a sequence in the training or

test dataset is shorter than 80 items,

we pad it with zeros,

and if it's longer, we truncate it.

Next, we initialize the Sequential model and add

the Embedding layer with the given maximum number of features and 128 output dimensions.

This then connects to an LSTM layer, also with 128 units,

and we also specify a 20 percent dropout rate for both the regular and the recurrent weights.

At the very end, we add a Dense layer with a single output dimension and

the sigmoid function, which gives a value between zero and one,

zero meaning bad and one meaning good.

What's left is to compile the model, which we again do with binary cross-entropy,

pick an optimizer for gradient descent, and we also track the same accuracy metric as before.

Then we fit our model with batch size 32 and let it run for 15 epochs,

and again we use our test set as validation data.

Afterwards, we can evaluate our model.
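The whole script can be sketched as follows. Note two assumptions: the optimizer is not named clearly in the lecture, so Adam stands in here, and a tiny made-up batch replaces the real imdb.load_data / pad_sequences calls so the sketch runs without downloading the dataset (and for one epoch rather than 15):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras.layers import Dense, Embedding, LSTM

max_features = 20000  # model only the 20,000 most common words
maxlen = 80           # sequences padded/truncated to length 80

# Embedding -> LSTM -> Dense sigmoid, as described in the lecture.
model = keras.Sequential([
    Embedding(max_features, 128),
    LSTM(128, dropout=0.2, recurrent_dropout=0.2),
    Dense(1, activation="sigmoid"),  # 0 meaning bad, 1 meaning good
])

model.compile(loss="binary_crossentropy",
              optimizer="adam",      # assumption: optimizer not named in lecture
              metrics=["accuracy"])

# Stand-in for imdb.load_data(num_words=max_features) plus
# pad_sequences(..., maxlen=maxlen): random IDs and labels.
x_train = np.random.randint(1, max_features, size=(32, maxlen))
y_train = np.random.randint(0, 2, size=(32,))

history = model.fit(x_train, y_train, batch_size=32, epochs=1, verbose=0)

# Afterwards, evaluate the model.
loss, acc = model.evaluate(x_train, y_train, verbose=0)
print(loss, acc)
```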