Hello! Let's see how to emulate and build for the FPGA using Xilinx SDAccel. After checking that the active build configuration in the project tab is set to Emulation-CPU, let's click on the hammer symbol to start compiling our code. If we double click on Console, we can also observe the compilation log. Oh, wait! What happened here? Well, the compiler is not able to find the kernel function because we are missing the extern "C" in our kernel file. The extern "C" is necessary only when the kernel file is not an OpenCL one. Let's go back to the source code and specify it. Great. Let's now save and start compiling again. The build now succeeded and we are ready to run our kernel.

Before actually testing it, I want to show you something. Now that we have specified all the necessary instructions for our kernel function, if we delete the binary container and try to specify it from scratch, removing the flag specified before, we will observe that SDAccel is able to automatically find the kernel function. Let's specify the file in the SRC folder and then click OK to confirm the selection. Now our kernel should be ready for a first execution.

Let's take a quick look at the host file. The host file is composed of multiple functions: some that you will probably never change, and others that are kernel specific. In our host code, we start by defining the local buffers for the application, such as the query, the database and the matrices. Then we need to define some OpenCL variables used to keep track of the platform and device IDs, the context, the command queue, the memory buffers, and others. At this point, we have multiple OpenCL APIs used to get the platform and device IDs, the vendor, and the device type, which in our case is an accelerator. We need to create a context and a command queue, and to load the kernel binary, which, unlike in other OpenCL programs, must be compiled offline and loaded from binary in our host.
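To make the extern "C" fix concrete, here is a minimal sketch. The function name and body below are purely illustrative, not the actual Smith-Waterman kernel from the project: the point is that when a kernel is written in C/C++ rather than OpenCL C, wrapping it in extern "C" prevents C++ name mangling, so the tool can find the kernel function by name.

```cpp
// Hypothetical C++ kernel file: without extern "C", the C++ compiler
// mangles the symbol name and the kernel function cannot be found by name.
extern "C" {

// Illustrative kernel signature, not the project's real kernel.
void vector_add(const int *a, const int *b, int *out, int size) {
    for (int i = 0; i < size; i++) {
        out[i] = a[i] + b[i]; // simple element-wise sum
    }
}

} // extern "C"
```

With the wrapper in place, the symbol `vector_add` is exported unmangled, which is what the linking step looks for.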
When we create the kernel, its name must match the name of the kernel function, otherwise SDAccel is not able to link the correct kernel function. Now we have to create the memory buffers, specifying whether they are read only, write only or read/write, enqueue the input buffers, set the arguments for our kernel, and then launch the kernel to be executed on the accelerator. In our example, we are using the clEnqueueTask function, as we have a C kernel; however, if we want to run our OpenCL kernel using multiple work-groups and work-items, we need to use the clEnqueueNDRangeKernel function. We can then use the clFinish function to wait for termination, and use the clEnqueueReadBuffer function to read the outputs of our kernel. It is very important to always use a test bench, in order to understand whether the results of the hardware accelerator are correct. In fact, in this host code we have a software version of the compute_matrices function that we use to check the correctness of the results. Finally, we release the memory objects and clean up the local buffers. Now you have an introduction to the host file too.

Let's clean the project and build it again. This time, let's execute the code to see if everything is working. Something is still not working. This is happening because the host file expects the kernel binary as an argument. To specify it, let's go next to the green arrow button, and then Run Configurations. In the Arguments tab, let's check "Automatically add binary container(s) to arguments", so that the binary container is always given as an argument to the program, and then click on Apply, and finally Run. Awesome! It worked and the results are correct! We have our first implementation of the Smith-Waterman algorithm running in Software Emulation! In our host file, we created a function that for each run provides the execution time of the kernel on the accelerator; however, SDAccel automatically provides a lot of additional information to the user.
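The software check described above follows a common golden-model pattern. The sketch below is only illustrative, not the project's actual compute_matrices code: a simple software reference is run on the same inputs, and its output is compared element by element with the buffer read back from the accelerator.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical software reference ("golden model"); a real host code
// would call its software compute_matrices implementation here.
static void reference_kernel(const std::vector<int> &in, std::vector<int> &out) {
    for (size_t i = 0; i < in.size(); i++)
        out[i] = in[i] * 2; // placeholder computation
}

// Compare the accelerator output against the golden model, reporting
// the first mismatch, as a typical host-side test bench does.
bool verify(const std::vector<int> &hw_out, const std::vector<int> &in) {
    std::vector<int> sw_out(in.size());
    reference_kernel(in, sw_out);
    for (size_t i = 0; i < in.size(); i++) {
        if (hw_out[i] != sw_out[i]) {
            std::printf("Mismatch at %zu: hw=%d sw=%d\n", i, hw_out[i], sw_out[i]);
            return false;
        }
    }
    return true;
}
```

In the real host code the `hw_out` vector would be the buffer filled by clEnqueueReadBuffer after the kernel completes.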
To observe this information, let's go to the Reports tab, under Emulation-CPU. We can see that there are two files: Profile Summary and Application Timeline. Profile Summary gives information regarding the execution of our kernel (9.345 ms in the example) and the data transfers, by means of memory reads and writes, between the host and the device global memory. Looking at its second tab, we can find information regarding the kernel function and the utilization of the compute unit. For example, we can see the global and local work-group and work-item sizes, how many times the kernel has been called, and how long it took to terminate. Then we have a tab for Data Transfers and one for all the APIs that have been called. The Application Timeline provides a chart specifying when each API has been called and how long it lasted. It is very important, as it allows us to understand when the computation of each kernel is actually issued and how long it takes relative to the overall latency of the system.

OK, now that we have had an overview of the logs, let's try to perform hardware emulation. So, let's go back to the project.sdx tab, change the active build configuration to Emulation-HW and then click on Build. After building, we can see that SDAccel creates a report called HLS Report, which is basically the same report we have seen in Vivado HLS. This file, in fact, gives information regarding the clock frequency, the estimated latency, the resource usage, and how the operations are scheduled in the different clock cycles, and then it has multiple tabs with information on the resources, performance, resource profile and so on. In the reports we can also find a file called System Estimate, which sums up the information contained in the HLS Report. So, the kernel has been synthesised for Hardware Emulation. We can now start running it.
Software Emulation is a very fast way to verify the functional correctness of the kernel code we are providing; however, it does not guarantee that the code is correct on the target FPGA. For this reason, it is necessary to test the provided kernel code with hardware emulation, as it allows us to check the correctness of the logic that SDAccel generates. The problem with hardware emulation is that it can take very long, especially if the code involves unoptimized reads and writes to memory, as in our case. During hardware emulation, we can see in the console each data transfer with the DDR. As expected, the computation took way longer than software emulation; in fact, if we look at the execution time value, it is some orders of magnitude bigger than the previous one. The reports we were talking about before are produced for hardware emulation too, but they provide some more specific information, especially concerning data transfers. We can see from the log that the tool provides information regarding the bandwidth usage. In this way, we are able to understand how to fully exploit the bandwidth of the device.

As our kernel passed both Software and Hardware Emulation, we can now move on to building our architecture for the FPGA device. This is the slowest operation, and it can take up to multiple hours, so make sure not to interrupt the process, otherwise you will not be happy to lose hours and hours of work. Once the build process has completed, we can find the xclbin binary and the host executable under the System folder. In the Reports tab there are three new reports to observe: the post-synthesis utilization, the post-placement utilization and the post-route utilization. All these reports are more specific than the ones we were observing in the previous phases, as they are generated after the corresponding step has actually been performed.
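As an aside on the unoptimized memory accesses mentioned above: a common cause of very slow hardware emulation is reading global memory one word per loop iteration. A widely used remedy is to stage data into a local buffer with memcpy-style copies, which HLS tools can commonly infer as burst transfers. The sketch below is a hypothetical kernel body illustrating that pattern; the names, tile size and computation are illustrative, not taken from the project.

```cpp
#include <cstring>

#define TILE 64 // illustrative tile size, not from the original project

// Hypothetical kernel body sketch: instead of issuing one global-memory
// access per loop iteration, data is staged into a local buffer in one
// burst, processed locally, and written back in one burst.
extern "C" void process(const int *global_in, int *global_out, int n) {
    int local_buf[TILE];
    for (int base = 0; base < n; base += TILE) {
        int chunk = (n - base < TILE) ? (n - base) : TILE;
        // memcpy over a contiguous range is commonly inferred as a burst
        std::memcpy(local_buf, global_in + base, chunk * sizeof(int));
        for (int i = 0; i < chunk; i++)
            local_buf[i] += 1; // placeholder computation
        std::memcpy(global_out + base, local_buf, chunk * sizeof(int));
    }
}
```

The same functional result could be obtained with a plain element-by-element loop over global memory; the local buffer only changes how the accesses are grouped, which is exactly what the bandwidth figures in the hardware-emulation log help to diagnose.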