of our loop-unrolled code. We can do a bunch of loads up front here.
So we've intermingled these loops. And what's kind of cool is that the loads get pulled out and hoisted to the top, the stores get pushed down to the bottom, all the adds go in the middle, and the array-index update gets sprinkled in somewhere else.
And when we go to actually schedule that, we're going to do something similar: we're going to issue the loads first, then execute the floating-point adds, then the floating-point stores with the results.
But what you'll notice here is that we're actually starting to get some overlap. Because we've unrolled, we can overlap this load and the first floating-point addition, since we've effectively covered the latency of our functional units by doing work from other loop iterations during that time.
So if you look at this schedule versus the schedule back here, we've simply taken those dead cycles and put the other loop iterations into them.
In this loop-unrolled case, we're incrementing this counter, the index, not by four anymore. We're incrementing it by however many times we've unrolled the loop, times the element size, so we're incrementing it by sixteen now.
Does that make sense? Because in this code here we were incrementing R2 by four, because the size of a single value is four bytes.
So we have to move our array index over by four.
But now, because we're batching up all this work together, we actually have to move the index by a bigger value.
So we're moving it by four, because we've unrolled four times, times the size of the data value, which was four bytes. So we're moving it by sixteen.
And one of the nice things here, if we look at both the loads and the stores, is that we're using our register-indirect addressing mode with an offset.
So we're actually adding an offset, say twelve, to this base register R1 to figure out where we're actually doing the load from.
It's just a convenient way to avoid computing a bunch of separate addresses.
Okay, so going back here, we can see we're starting to overlap actual operations with
other loop iterations. Well, that's really cool.
So we're starting to get some performance here.
So, let's look at the performance. I'll ask the same question here:
how many floating-point operations per cycle?
Hopefully it's higher. One, two, three, four, divided by one, two, three, four, five, six, seven, eight, nine, ten, eleven cycles.
Okay, so that's 0.36, which is a lot better than 0.125. This is good.
Loop unrolling is helping us. But is this everything?
Or could we do more in our compiler? Well, the compiler people came up with an even fancier idea, which is called software pipelining.