And if you don't actually figure out whether the branch is taken until, let's say, somewhere here in the execute stage, then you're going to have more instructions to kill when you take a branch mispredict.
When you move to these out-of-order processors, even though the pipeline looks seemingly short and seemingly simple, more instructions can get queued up in some of these structures, especially if you have an issue queue. That effectively lengthens the front of your pipeline, so if you mispredict or fetch the wrong instructions relatively often, you're just going to be out in the weeds, killing lots of instructions and doing extra work that you didn't really want to do.
Also, if you wait all the way until the end of the pipe in these out-of-order processors to resolve your branch, that makes life even worse, because it makes your mispredict penalty even longer. Most people don't actually do that. You might say, well, I don't want to kill the instructions until I know the branch commits, and that was the simplistic approach we had when we were talking about these out-of-order processors: wait all the way to the end of the pipe and then clean things out. You can wait for the branch to reach the end of the pipe to fully clean out the machine state, but you do want to redirect the fetch, the PC at the front of the pipe, as quickly as possible, because otherwise you're just fetching off into the weeds and wasting cycles.
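To make that concrete, here is a minimal C sketch of the idea, a toy model rather than any real machine's control logic; the struct fields, sequence numbers, and the resolve_branch interface are all hypothetical. The point is simply that on a mispredict detected in execute, the fetch PC is redirected immediately and younger instructions are marked for squashing, while the full cleanup can still happen later at commit.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct core {
    uint64_t fetch_pc;       /* next PC the front end will fetch         */
    uint64_t squash_after;   /* kill anything younger than this seq. no. */
    bool     squash_pending;
};

/* Called from the execute stage when a branch resolves. */
void resolve_branch(struct core *c, uint64_t branch_seq,
                    bool mispredicted, uint64_t correct_target)
{
    if (mispredicted) {
        c->fetch_pc       = correct_target; /* redirect the front end now   */
        c->squash_after   = branch_seq;     /* younger ops are wrong-path   */
        c->squash_pending = true;           /* full cleanup still at commit */
    }
}

int main(void)
{
    struct core c = { .fetch_pc = 0x1000, .squash_after = 0, .squash_pending = false };
    resolve_branch(&c, 42, true, 0x2000);   /* hypothetical mispredicted branch */
    printf("fetch_pc now 0x%llx, squash everything after seq %llu\n",
           (unsigned long long)c.fetch_pc, (unsigned long long)c.squash_after);
    return 0;
}
```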
Here, going back to our superpipelining lecture from before, we look at what the branch mispredict penalty is for some real processors, the Pentium III and the Pentium 4. In the Pentium 4, you have twenty-odd cycles of branch mispredict penalty. That can be pretty painful if you mispredict often, because you're going to be taking branches, and the penalty is going to be quite high if you don't have the correct subsequent instructions behind the branch.
Now, we talked about some techniques. You could just stall and wait, so you don't actually predict the branch. But then, if you have to wait for every branch to reach, let's say, the twentieth stage of the pipe before you go and fetch the subsequent instruction, that's pretty painful.
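To get a feel for the numbers, here is a back-of-the-envelope calculation in C. The branch frequency and mispredict rate are assumed example values, not measurements; the 20-cycle figure is the rough Pentium 4 penalty mentioned above.

```c
#include <stdio.h>

int main(void)
{
    double branch_frac    = 0.20;  /* assume 1 in 5 instructions is a branch */
    double penalty_cycles = 20.0;  /* ~20-cycle mispredict/redirect penalty  */

    /* Stall on every branch until it resolves. */
    double cpi_stall = 1.0 + branch_frac * penalty_cycles;

    /* Predict, and only pay the penalty on mispredicts. */
    double mispredict_rate = 0.05; /* assumed 5% of branches mispredicted */
    double cpi_predict = 1.0 + branch_frac * mispredict_rate * penalty_cycles;

    printf("stall on every branch: CPI ~= %.2f\n", cpi_stall);   /* ~5.00 */
    printf("predict, 5%% wrong   : CPI ~= %.2f\n", cpi_predict); /* ~1.20 */
    return 0;
}
```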
So we talked about speculating the next PC as PC plus four, in a MIPS-style architecture, or in our architecture, where each instruction is 32 bits long. But that doesn't really help you when you think there's a high probability the branch is going to be taken, or that control flow is going to change. So you need to start thinking about how to actually deal with that in a pipeline. Up to this point we've only talked about speculating the fall-through case. We talked briefly about speculating the non-fall-through case, but we didn't say how you could possibly do that. Today we're going to talk about the hardware to do that.
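As a reminder of how simple that fall-through speculation is, here is a minimal sketch in C, assuming a MIPS-like fixed 32-bit instruction width; the address is made up. The guess is always wrong for a taken branch, which is exactly the case the hardware we cover today is meant to handle.

```c
#include <stdint.h>
#include <stdio.h>

/* Fall-through guess: correct only when the branch is not taken
 * (or the fetched instruction isn't a branch at all). */
uint64_t speculate_next_pc(uint64_t pc)
{
    return pc + 4;  /* fixed 32-bit instructions, MIPS-style */
}

int main(void)
{
    uint64_t branch_pc = 0x400100;  /* hypothetical branch address */
    printf("guess 0x%llx; wrong whenever the branch is taken\n",
           (unsigned long long)speculate_next_pc(branch_pc));
    return 0;
}
```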
Also making life worse is if you start to go wide; this hurts too. So let's say we have a dual-issue processor. If you go wide, then when you go to kill instructions, you're killing twice as many in-flight instructions in the pipe if you take a branch in the wrong direction or mispredict the branch.
Showing that from our pipeline diagram perspective, this is just recapping an example from a previous lecture. Here we have a fetch for this branch, and we're fetching two instructions per cycle. So even with a relatively short pipeline, you end up with one, two, three, four, five, six, seven dead instructions on a mispredict.
So what this really comes down to is that the number of instructions that get killed is approximately the pipeline width multiplied by the branch penalty: width times the number of cycles before you can resolve the branch. If you can shorten the time it takes you to resolve the branch, that's good. Or if you can make the processor narrower, that may be good in the sense that fewer instructions get killed, but we like to execute multiple instructions at a time because that improves our performance.