Hi! Here we are again, optimizing our Smith-Waterman algorithm! Last time, we tried to compress the input strings of our algorithm to increase system performance, but, unfortunately, the design we were trying to express did not fit entirely on our FPGA chip. In this lesson, we will try to increase the performance of our algorithm by implementing a shift register on our second string. So, ready to go? Let’s start!

First things first! Let’s adapt the host code so that it sends the correct data to our FPGA device. We compressed our query and database, remember? So, let’s define two variables in our host code, and let’s call them query_param and database_param. We define them as arrays of unsigned integers. This data type is 32 bits long, and as we need 2 bits to represent a symbol, we can fit 16 elements in each variable. Let’s also add a plus 1 to their sizes, so that we will not have problems even if the query or database size is not divisible by 16. Awesome! Now, let’s initialize them to zero, and then we need a function that translates each element of the original strings into its two-bit representation. As for the directionMatrix, I have a function here that checks the value given in input in a switch-case and then shifts into the unsigned integer variable a number that goes from 0 to 3, representing its 2-bit value, which is the same encoding we are using in our kernel code. Hence, let’s create two for loops whose task is to call this function to populate our variables. You can find a sketch of this host-side compression right after this section.

Now we need to adapt the buffers. Here we have to change the dimension of the input_query and input_database buffers. Notice that the dimension, as well as the names of the variables, must also be changed where we send the data to our kernel using the clEnqueueWriteBuffer OpenCL API; this is sketched below as well. Awesome. We just need to add the free functions to clean up the code and test its correctness. Great! Everything is working fine!

Now we can move on to the implementation of our shift register! First of all, we can remove the partitioning pragma on our input variables. string1 will probably be automatically partitioned due to its small size, while for the big one we will mainly work on our shift register. As said in the previous lesson, we need N elements of the database to be accessed in parallel each clock cycle. Hence, let’s define a shift_db variable of type ap_uint, 512 bits wide, able to host N elements. At the beginning of the computation, we need to initialize its first N / NUM_ELEM words with the first elements available in our database string; a sketch of this initialization also follows below. Now, we can avoid passing string2 to the calculate_diagonal function; instead, we pass our shift_db variable. Since every time we calculate an anti-diagonal we now read all the elements stored in our shift register, we need to reset the database local index to zero at the beginning of each computation. We will update the values in our shift register later on in the code. Now, after replacing the occurrences of string2 with shift_db, let’s move the memcpy directly inside the calculate_diagonal function, as it is actually performed every time an anti-diagonal is computed. Let me change the index really quickly, include the parameter in the function call, and remove the now useless call to the store function. Great. Now we are calculating the anti-diagonal using our shift register; let’s see how it needs to be updated after each computation.
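Here is a minimal sketch of the host-side compression we just walked through. The array sizes, the pack_symbol and compress_inputs names, and the A/C/G/T-to-0..3 mapping are assumptions for illustration; the only hard requirement is that the encoding matches the one already used in the kernel.

```cpp
#include <cstring>

// Hypothetical sizes; the real ones come from the lesson's dataset.
#define QUERY_SIZE       256
#define DATABASE_SIZE    65536
#define SYMBOLS_PER_WORD 16   // 32-bit word / 2 bits per symbol

// The plus 1 guards against sizes not divisible by 16.
unsigned int query_param[QUERY_SIZE / SYMBOLS_PER_WORD + 1];
unsigned int database_param[DATABASE_SIZE / SYMBOLS_PER_WORD + 1];

// Map a symbol to its 2-bit code and shift it into the packed word.
// The codes (A=0, C=1, G=2, T=3) are an assumption: they must match
// the encoding already used in the kernel code.
void pack_symbol(unsigned int *words, int index, char symbol) {
    unsigned int code;
    switch (symbol) {
        case 'A': code = 0; break;
        case 'C': code = 1; break;
        case 'G': code = 2; break;
        default:  code = 3; break; // 'T'
    }
    // Place the 2-bit code at its slot inside the 32-bit word.
    words[index / SYMBOLS_PER_WORD] |= code << (2 * (index % SYMBOLS_PER_WORD));
}

void compress_inputs(const char *query, const char *database) {
    // Initialize the packed variables to zero, then populate them
    // with the two for loops described above.
    memset(query_param, 0, sizeof(query_param));
    memset(database_param, 0, sizeof(database_param));
    for (int i = 0; i < QUERY_SIZE; i++)
        pack_symbol(query_param, i, query[i]);
    for (int i = 0; i < DATABASE_SIZE; i++)
        pack_symbol(database_param, i, database[i]);
}
```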
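And this is how the buffer adaptation might look, continuing the sketch above. The send_compressed_inputs helper is hypothetical, and it assumes the cl_context and cl_command_queue were created earlier in the host code, as usual.

```cpp
#include <CL/cl.h>

// Hypothetical helper: allocate the resized buffers and send the
// packed words (not the original characters) to the device.
void send_compressed_inputs(cl_context context, cl_command_queue queue,
                            cl_mem *input_query, cl_mem *input_database) {
    cl_int err;

    // Buffer sizes now count packed 32-bit words, not characters.
    size_t query_bytes    = (QUERY_SIZE / SYMBOLS_PER_WORD + 1) * sizeof(unsigned int);
    size_t database_bytes = (DATABASE_SIZE / SYMBOLS_PER_WORD + 1) * sizeof(unsigned int);

    *input_query    = clCreateBuffer(context, CL_MEM_READ_ONLY, query_bytes, NULL, &err);
    *input_database = clCreateBuffer(context, CL_MEM_READ_ONLY, database_bytes, NULL, &err);

    // Write the compressed variables instead of the original strings.
    err = clEnqueueWriteBuffer(queue, *input_query, CL_TRUE, 0,
                               query_bytes, query_param, 0, NULL, NULL);
    err = clEnqueueWriteBuffer(queue, *input_database, CL_TRUE, 0,
                               database_bytes, database_param, 0, NULL, NULL);
}
```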
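Finally, a sketch of the kernel-side declaration and initialization of the shift register. The kernel name, the value of N, and the word-by-word copy are assumptions; the idea is simply that, since the host already uses the kernel's 2-bit encoding, the first N / NUM_ELEM packed words can be copied straight into the 512-bit register.

```cpp
#include <ap_int.h>

#define N 256        // hypothetical: database symbols visible per anti-diagonal
#define NUM_ELEM 16  // 2-bit symbols packed in each 32-bit input word

// Sketch of the kernel-side initialization: shift_db lives inside the
// kernel and is loaded with the first N symbols of the database.
void smith_waterman(unsigned int *query, unsigned int *database /* ... */) {
    ap_uint<512> shift_db; // 512 bits = N symbols at 2 bits each

init_shift:
    for (int i = 0; i < N / NUM_ELEM; i++) {
#pragma HLS PIPELINE II=1
        // Each packed word already uses the kernel's 2-bit encoding,
        // so it can be copied into the register as a 32-bit slice.
        shift_db.range(32 * (i + 1) - 1, 32 * i) = database[i];
    }

    // ... anti-diagonal computation: calculate_diagonal(shift_db, ...) ...
}
```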
In this regard, let’s create an update_database function, taking as parameters the database, the shift register, and the num_diagonals variable. In this function, we need a startingIndex pointing at the new element in the database that needs to be shifted into our shift register, but before shifting in this new value, we need to move the older ones. This action is performed in the update_database loop, which, starting from the element in position 1 up to the size of the shift register, takes each element in position i and moves it to position i-1. Once this operation is performed, the positions of the elements in our shift register are updated, and the element that was in position 0 is gone. We now need to select the new element from our database and shift it into the shift register. Notice that this is a very easy, yet very powerful, optimization! A sketch of this routine appears at the end of this lesson.

Let’s try to synthesize now and see the effect of this shift register on our Smith-Waterman architecture. The synthesis is over! As you can see, the kernel now occupies only 48% of the LUTs and 10% of the flip-flops, and the loops all have an initiation interval equal to 1! This is fantastic! Let’s build this architecture for our Virtex-7 FPGA device, so that we can know what kind of performance we obtain on the board.

While we wait for the build step to complete, let’s have a quick look at the operational intensity. We slightly reduced memory traffic thanks to the input compression, and also increased the number of operations performed. These two factors raised the operational intensity to 289 operations per byte in the best case and 305 in the worst one. Looking at our roofline, we can see that the predicted performance is now above 2048 giga-operations per second, still in the memory-bound area, but very close to the ridge point.

Let’s now see what kind of performance we obtain. I have built the kernel with three different datasets. For each dataset, the query size is set to 256, while the database size is 256, 65 thousand, and more than a million. Let’s start from the first one. Execution time here is 0.13 ms, resulting in 8.9 GOPS, far from the prediction. Let’s see what happens when we increase the size of the database... 0.55 milliseconds! Now the performance is 2329 GOPS! Wow, let’s see the last dataset. 7.61 milliseconds, equivalent to 2715 GOPS. Great! We are matching the performance predicted by our roofline model, and we have obtained a highly parallel hardware architecture for the Smith-Waterman algorithm. Notice how, as the size of the database increases, the overall performance increases too. This is because the FPGA can express its full potential when a huge amount of data is provided and executed in parallel.

In our last lesson, we will try to map the ports to the memory in a different way. In doing so, we will test our architecture on an FPGA board based on the Kintex UltraScale 3, an FPGA device featuring two memory ports to the DDR.
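As promised, here is a sketch of the update_database routine described above. It reuses N and NUM_ELEM from the earlier kernel sketch; the way startingIndex is derived from num_diagonals is an assumption about how the lesson's code tracks progress through the database.

```cpp
#include <ap_int.h>

// Sketch of update_database: slide every symbol one position toward
// the front and append the next database symbol at the back.
void update_database(unsigned int *database, ap_uint<512> &shift_db,
                     int num_diagonals) {
    // Assumption: num_diagonals counts the anti-diagonals computed so
    // far, so the next symbol to bring in sits at N + num_diagonals.
    int startingIndex = N + num_diagonals;

update_database_loop:
    for (int i = 1; i < N; i++) {
#pragma HLS UNROLL
        // Move the 2-bit symbol in position i to position i-1;
        // the symbol that was in position 0 is dropped.
        shift_db.range(2 * i - 1, 2 * (i - 1)) = shift_db.range(2 * i + 1, 2 * i);
    }

    // Extract the new 2-bit symbol from its packed 32-bit word and
    // shift it into the last position of the register.
    unsigned int word = database[startingIndex / NUM_ELEM];
    ap_uint<2> new_symbol = (word >> (2 * (startingIndex % NUM_ELEM))) & 0x3;
    shift_db.range(2 * N - 1, 2 * N - 2) = new_symbol;
}
```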