Hi, and welcome to this class,

where we will see how to optimize

the implementation of our kernel in

order to efficiently use

the available resources on our target FPGA.

In particular, we will discuss

the loop unrolling optimization.

Let's come back to our vector sum example that was

introduced in the interface optimization classes.

The code version that you see here has

already some interface optimizations applied to it.

In particular, the code already

exploits burst data transfers and leverages

local memories in order to read the operands and

store the results of our floating point additions.

The core of our kernel

resides in the loop labeled sum_loop.

Here we iterate over the n elements of

our vectors and iteratively perform the additions.
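As a reference, a minimal sketch of this baseline loop might look as follows; the function signature and the constant N are assumptions for illustration, while the names local_A, local_B, local_res and the sum_loop label follow the ones used in the lesson:

```cpp
const int N = 1024;  // assumed trip count for this example

// Sketch of the baseline kernel loop: one floating-point
// addition per iteration over the local memories.
void vector_sum(const float local_A[N], const float local_B[N],
                float local_res[N]) {
sum_loop:
    for (int i = 0; i < N; i++) {
        local_res[i] = local_A[i] + local_B[i];
    }
}
```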

Looking again at our synthesis reports,

we can see that each iteration of

our sum_loop takes 10 cycles.

Since the loop needs 1,024 iterations,

the overall latency for computing

the loop is 10,240 cycles.

Note that the number of loop iterations is referred to as

the trip count within the Vivado HLS performance reports.

To understand why we need 10 cycles for each iteration,

we can look at the analysis report.

Here, we can see that two cycles are needed to load in

parallel the operands from arrays local_A and local_B.

Seven cycles are required to

perform the floating point addition.

And finally, one cycle is

needed for storing the result back into array local_res.

Is there any way to reduce

the overall latency of

the loop and achieve higher performance?

Well, luckily the answer is yes.

We will now look into

two different optimization directives

namely loop unrolling and loop pipelining.

If we take a closer look at our original code,

we can clearly see that

all the iterations of

the loop are independent of each other.

Indeed, each addition is

done on different elements of the input arrays,

and its result is stored in a different element

of the output array.

Hence, would it be possible to perform

multiple additions in parallel on different elements?

The answer is again yes.

And the way to achieve it is by unrolling the loop.

Loop unrolling effectively means replicating

the loop body so that

the number of iterations of the loop is reduced,

and the loop body performs extra computation.

This technique exposes

additional instruction-level parallelism that

Vivado HLS can exploit

to implement the final hardware design.

In this example, we have manually

unrolled our sum_loop by a factor of two.

As you can see, the variable i increments with a step of two,

hence effectively reducing the number of

loop iterations from 1,024 to 512.

On the other hand, each loop iteration

performs two additions instead of one.
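Under the same assumed signature as before, the manually unrolled loop might be sketched as:

```cpp
const int N = 1024;  // assumed trip count

// Sketch of sum_loop manually unrolled by a factor of two:
// i advances in steps of two, so the loop runs 512 iterations,
// and each iteration performs two independent additions.
void vector_sum_unrolled(const float local_A[N], const float local_B[N],
                         float local_res[N]) {
sum_loop:
    for (int i = 0; i < N; i += 2) {
        local_res[i]     = local_A[i]     + local_B[i];
        local_res[i + 1] = local_A[i + 1] + local_B[i + 1];
    }
}
```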

The same optimization can also be expressed in

a much more convenient way by

using the HLS UNROLL pragma.

The pragma must be placed directly

within the loop that we wish to unroll.

The pragma also allows us to specify

the unrolling factor by which we want to unroll our loop.

Notice that the unrolling factor can be

any number from two up to

the number of iterations of the loop.

If the factor parameter is not specified,

Vivado HLS will try to completely unroll the entire loop.

However, this can be achieved

only if the number of iterations is constant,

and not dependent on a dynamic value computed

within the function. All right.
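The pragma form might be sketched as follows (the function name and signature are again assumptions); since a regular C++ compiler simply ignores unknown pragmas, the function computes the same result when tested on a CPU:

```cpp
const int N = 1024;  // assumed, constant trip count (required for a full unroll)

void vector_sum_pragma(const float local_A[N], const float local_B[N],
                       float local_res[N]) {
sum_loop:
    for (int i = 0; i < N; i++) {
        // Ask Vivado HLS to unroll this loop by a factor of two;
        // omitting factor=2 would request a complete unroll.
#pragma HLS UNROLL factor=2
        local_res[i] = local_A[i] + local_B[i];
    }
}
```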

Let us now see the effect of our optimization.

If we run Vivado HLS and look at the synthesis report,

we can now see that

the latency of sum_loop has been halved.

The reduction comes from the fact that the loop

now iterates over 512 iterations,

while still performing

each loop iteration in 10 cycles,

as in the previous case.

To understand how Vivado HLS achieved this,

we can look at the analysis report.

Here, we can clearly see that Vivado HLS was able

to schedule the execution of the two floating point additions,

as well as the load and store

operations completely in parallel.

Nevertheless, this optimization comes at a cost.

In order to perform

the two floating point additions fully in parallel,

we need two floating point adders

in our hardware design,

which increases the overall

resource consumption of our kernel.

Indeed, if we look at the resource estimation report,

we can actually see

the two floating-point adder instances

and their corresponding resource consumption.

In our design, we are far away from

using all the available FPGA resources.

But in more complex designs,

it's very important to consider the impact on

resource consumption when

applying optimizations to our kernel.

In this example, unrolling by a factor of two provided

a straight 2x reduction in the latency of the loop

at the cost of 2x extra resources for its implementation.

Nevertheless, in some cases,

it might not be possible to achieve

such an ideal latency improvement.

When performing loop optimizations,

there are two potential issues

that need to be considered.

First, constraints on the number of

available memory ports and available hardware resources.

Second, loop-carried dependencies.

I know you are interested in knowing more.

Don't worry. More information

will be provided in the following lesson.