So that gives you a matrix of things in flight in the processor. Now, these may be sitting in your instruction queue. We're not going to count instructions that get all the way to the reorder buffer, because by that point, depending on how you build your processor, you probably don't need to keep checking them against the instructions that are coming in. You've already broadcast that result back to your reservation stations or your instruction queue to issue further instructions.
But think about the complexity of what you're really doing when you try to issue an instruction. As these in-flight instructions complete, you have to check them, in your reservation stations or your instruction window, against basically all of the other instructions, all of the registers.
Now, we built some structures that make it so you don't have to do a complete all-to-all check. So, as an example of a structure that helps with this: first of all, the scoreboard. In our original design, we had the register destinations flowing down the pipe, and then we did an all-to-all compare. Instead, the scoreboard just keeps track of the location and latency of the most recent write to each register. That turns an all-to-all compare into an indexed lookup based on the register number.
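A minimal software sketch of the idea (the class and field names are my own, and the latencies are made up): the scoreboard is just a table indexed by register number, recording when the most recent write to that register will be ready.

```python
# Minimal scoreboard sketch: instead of comparing every in-flight
# instruction's destination against every source (all-to-all), we index
# a small table by register number. Each entry records when the most
# recent write to that register completes.

class Scoreboard:
    def __init__(self, num_regs):
        # ready_cycle[r] = cycle at which register r's latest write completes
        self.ready_cycle = [0] * num_regs

    def issue_write(self, reg, current_cycle, latency):
        # Record the latency of the most recent write to `reg`.
        self.ready_cycle[reg] = current_cycle + latency

    def source_ready(self, reg, current_cycle):
        # One indexed lookup replaces an all-to-all compare.
        return current_cycle >= self.ready_cycle[reg]

sb = Scoreboard(num_regs=32)
sb.issue_write(reg=5, current_cycle=0, latency=3)   # e.g. a 3-cycle multiply
print(sb.source_ready(5, 1))  # False: result not back yet
print(sb.source_ready(5, 3))  # True: write completes at cycle 3
```

The point of the sketch is only that the check is a single table read, not a comparison against every in-flight instruction.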
That's a little bit nicer, but think about what's happening in the issue queue: as an instruction completes, it still has to broadcast the registers it writes against the instruction queue, to wake up instructions and see which ones are ready to go. And that, roughly, scales as I said, because you're going to be returning multiple instructions per cycle, and if you have a bunch of stuff in the instruction queue, you have to check all of it. This can get painful.
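Here is a rough software sketch of that broadcast (the entry format is my own invention): every completing destination tag gets compared against every source tag of every queue entry, which is where the cost comes from.

```python
# Sketch of the wakeup broadcast: each completing destination register is
# compared against both source tags of every instruction queue entry.
# With W results per cycle and Q entries, that's roughly W * Q * 2
# tag comparisons every cycle -- the "painful" part.

def wakeup(iq, completed_regs):
    """Mark sources ready and return entries that are now fully ready."""
    woken = []
    for entry in iq:
        for i, (src, ready) in enumerate(entry["srcs"]):
            if not ready and src in completed_regs:   # the tag compare
                entry["srcs"][i] = (src, True)
        if all(ready for _, ready in entry["srcs"]):
            woken.append(entry["op"])
    return woken

iq = [
    {"op": "add r1,r5,r6", "srcs": [(5, False), (6, True)]},
    {"op": "mul r2,r5,r7", "srcs": [(5, False), (7, False)]},
]
print(wakeup(iq, completed_regs={5}))  # ['add r1,r5,r6'] -- mul still waits on r7
```

In hardware the inner compares all happen in parallel, so the cost shows up as comparator area and wire delay rather than loop iterations.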
Let's first take a look at the complexity of this for an in-order machine. An in-order machine is not so bad. L here is the pipeline length, and you have your width, but you basically just need a scoreboard to figure this out, and as we said, that's an index-based structure. Now, when you start to go out of order, you start to have the complexity that all the results coming back have to be broadcast, either to all the reservation stations if you're distributing instructions to them, or, if you have a single instruction queue, against all the instructions that could possibly issue.
And what's interesting is that one of the big challenges of building these structures is not quite what you would think. There's an interesting challenge when too many things could wake up in one cycle. Let's say all of the instructions in your instruction queue are waiting for register five to become valid. All of a sudden, some instruction, an add, let's say, writes register five and enters the reorder buffer, so it gets to the end of the pipe. Well, now you have to wake up all of those instructions and choose some subset of them, the width of your machine, that can issue.
You also still need to make sure you can satisfy all of the machine's constraints on functional unit mixes. So if you have, say, two adders, two multipliers, and one load/store unit, you can't just pick randomly from the ready instructions; you have to pick exactly that mix. This becomes a big mess of combinational logic, and it's why the front ends of out-of-order processors can start to get longer and longer.
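A sketch of what select has to respect, assuming a hypothetical machine with two adders, two multipliers, and one load/store unit (the mix and names are illustrative, not from any real design):

```python
# Select with functional-unit constraints: even among woken instructions,
# you can only issue as many of each kind as you have units.

def select(ready, fu_slots):
    """Issue ready ops, consuming at most the available slots per FU kind."""
    issued = []
    slots = dict(fu_slots)          # copy so we can decrement
    for op_kind, op in ready:       # assume `ready` is already oldest-first
        if slots.get(op_kind, 0) > 0:
            slots[op_kind] -= 1
            issued.append(op)
    return issued

ready = [("add", "i1"), ("add", "i2"), ("add", "i3"), ("mul", "i4"), ("load", "i5")]
print(select(ready, {"add": 2, "mul": 2, "load": 1}))
# ['i1', 'i2', 'i4', 'i5'] -- the third add (i3) is skipped: only 2 adders
```

In hardware this prioritized, resource-constrained choice is exactly the combinational logic that gets messy as width grows.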
You get more and more stages in there to do this, and as we talked about last time, in the Pentium 4 there are actually multiple stages out in front to do this, and unfortunately you have to pipeline that, which makes it even more challenging. So roughly, if you look at the out-of-order control logic here, it's growing somewhere between the width squared and the width cubed. And that is not very good.
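One back-of-the-envelope way to see that scaling (the constants here are illustrative, not taken from any real design): the wakeup comparator count goes as results per cycle times queue entries times source tags per entry, and, as we'll note below, the queue itself tends to grow with width.

```python
# Why control logic scales between W^2 and W^3: wakeup needs
# (results/cycle) x (queue entries) x (source tags per entry) comparators,
# and the instruction queue typically grows with the width W too.

def wakeup_comparators(width, iq_entries_per_width=16, srcs_per_entry=2):
    iq_entries = width * iq_entries_per_width   # assumed: IQ grows with width
    return width * iq_entries * srcs_per_entry

for w in (2, 4, 8):
    print(w, wakeup_comparators(w))
# doubling the width quadruples the comparator count: 128, 512, 2048
```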
Sometimes there are a couple of circuit-level tricks that can help you. If you look at people who do full-custom logic designs for processors, they'll build something they call pickers. A picker, instead of using straight combinational logic to compute which instruction is ready to execute, is a much more custom, analog-flavored circuit, almost a CAM, a content-addressable memory. It's not quite a CAM; it's more of a circuit that chooses which instruction is ready to go, and typically you want a heuristic that picks the oldest instruction. This is to prevent starving instructions. You don't really want to issue arbitrary instructions; you want to issue the oldest instructions, because you want to make forward progress in the machine.
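A sketch of the oldest-first policy a picker implements; real pickers are custom circuits, not software, but the heuristic might look like this (ages and names are made up):

```python
# Oldest-first picker policy: among ready instructions, issue up to
# `width` of them in program (age) order, so the oldest work always
# makes forward progress.

def pick_oldest(iq, width):
    ready = [e for e in iq if e["ready"]]
    ready.sort(key=lambda e: e["age"])       # smallest age = oldest
    return [e["op"] for e in ready[:width]]

iq = [
    {"op": "i7", "age": 7, "ready": True},
    {"op": "i3", "age": 3, "ready": True},
    {"op": "i5", "age": 5, "ready": False},
    {"op": "i1", "age": 1, "ready": True},
]
print(pick_oldest(iq, width=2))  # ['i1', 'i3'] -- oldest ready first
```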
So, what we're trying to get at here is that this is a lot of hardware complexity. It's growing with the width of the machine, and some things actually make it even worse. Note here that as you increase the width of your machine, say you go from a three-wide machine to a four- or five-wide machine, you typically also need a larger instruction queue, or instruction window, to look across in order to find enough parallelism to keep the machine busy. So as you go wider, you also typically have to make the machine deeper, or at least make your instruction queue deeper, to find enough parallelism to feed your functional units. Hm.
That's not good. So this causes these sort of blow-up factors here. You can build these things, but they're hard. So are there alternatives?
Yes. But before we move on to the alternatives, let's look at the complexity of this in a real processor. Here we have a die photo, or die micrograph, of a MIPS R10000 processor. This was used in workstations by SGI for many years. It's a real out-of-order superscalar, and one of the interesting things is, if you look at how much area is dedicated just to control logic, it's pretty substantial.
So in this processor, here's our actual datapath; it's relatively modest. The caches, the instruction cache and data cache, are pretty big structures; you can see those. We have some other biggish things: the data tags, the instruction tags, the TLB. And all of this stuff here is quite large. Let's look at what's inside. We have the next-instruction sorts of things. We have a free list; we talked about this, right? The free list tracks which registers are free, if you will, to reuse in your out-of-order processor. We have a big block that just does register renaming. And we have different queues here, so this is a distributed-instruction-queue, or distributed-reservation-station, processor: there's one queue just for address calculation instructions, one for integer instructions, and one for floating-point instructions. But the interesting thing to see here is the percentage of the die that's actually just doing compute, say the integer datapath, the floating-point datapath, the floating-point multiplier; it sort of pales in comparison to all the overhead. So that's just to get this idea across.
Another problem with out-of-order superscalars, or superscalars in general, is thinking about how we express parallelism. I alluded to this in a previous lecture, but it's worth repeating. So let's take a look at what happens for sequential codes on an out-of-order superscalar. We start off with some piece of C code. We put it into our fancy superscalar compiler. That generates some sort of dataflow graph, and probably also a control flow graph, inside the compiler. So there are instructions, and there are dependencies between the different instructions, and the compiler is trying to find independent operations, because it's going to come up with a schedule.
schedule. Unfortunately, you have to come up with a
sequential schedule, because the instruction sets on sequential processors
require you to come up with some order. Even though, it's very possible, that
there is no one perfect order. It's also possible that, you know, instead
of over-constraining the problem here, why are we coming up with some, some ordering
when we don't need to? But, you know, in our, in our sequential,
I say we have to come up with that. We write out to disk somehow, or you know,
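A tiny illustration of that over-constraint, with a made-up three-operation dependence graph: two independent operations admit more than one equally valid sequential order, yet the compiler must commit to exactly one.

```python
# Two independent ops (a, b) feeding a third (c): the dataflow graph
# allows multiple sequential schedules, but a sequential ISA forces the
# compiler to pick just one of them.
import itertools

deps = {"a": set(), "b": set(), "c": {"a", "b"}}  # c depends on a and b

def valid_orders(deps):
    """Yield every total order consistent with the dependence graph."""
    for perm in itertools.permutations(deps):
        if all(deps[op] <= set(perm[:i]) for i, op in enumerate(perm)):
            yield perm

print(list(valid_orders(deps)))  # [('a', 'b', 'c'), ('b', 'a', 'c')]
```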
save a file with this, this code. And then we have to go and actually try to
execute it. Now what's funny, as I always get a
chuckle out of this, is the compiler knew what the dependencies were.