So one question that comes up is, how can we take advantage of that?
Anyone have any thoughts? Okay, so shared data, we don't know where
it's going to be accessed. It could be accessed by all 6 CPUs and
all 6 caches here. But it's very common that the stack for your
program is only going to be accessed locally, and the instruction memory for
your program is only going to be accessed locally. So you can potentially get
performance benefits by putting the instruction memory and the stack, and
maybe even some portion of the heap, close to this core, because then you can
access that really quickly, and only shared data has to go across this
interconnect. And in fact that has a fancy name.
fancy name. So systems where some data is close and
some data is far away are called Non Uniform Memory Access or NUMA and you
might see this actually even in your desktop processors are actually
Moving towards numerous systems. They're, they're, some of them are are
actually If you look at some of the.
I believe it's the A and D chips today are already numa systems even on a on a
single di, or not excuse me single chip with multiple dies or something like that
there, there actually two newer nodes inside of them sort of one for one memory
controller one for one another memory controller.
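
To make the benefit of local placement concrete, here is a rough back-of-the-envelope sketch in Python. The latencies (80 ns local, 140 ns remote), the 90% figure, and the function name average_latency are illustrative assumptions, not measurements of any real machine; only the 6 nodes come from the example above.

```python
def average_latency(local_fraction, local_ns, remote_ns):
    """Average memory latency when local_fraction of accesses hit the local node."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


# Assumed, made-up latencies for a 6-node machine.
LOCAL_NS = 80.0
REMOTE_NS = 140.0
NODES = 6

# Pages spread evenly over all nodes: only 1 access in 6 is local.
spread = average_latency(1.0 / NODES, LOCAL_NS, REMOTE_NS)

# Stack and instructions placed next to the core: assume 90% of accesses are local.
placed = average_latency(0.9, LOCAL_NS, REMOTE_NS)

print(f"pages spread evenly over {NODES} nodes: {spread:.1f} ns")
print(f"90% of accesses kept local: {placed:.1f} ns")
```

Under these assumed numbers the average drops from roughly the remote latency to close to the local latency, which is the whole argument for keeping private data near the core that uses it.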
So, if you go into something like Linux and look in the proc
directory, you can actually see NUMA information in there,
and it'll tell you the configuration of the different memory.
And then the OS can take advantage of this, so it can put, for instance, the
stack and the instruction memory for a particular program that's being used by
a particular core close to that core. And then, maybe for some other data,
it can make some other choice.
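
As a small sketch of what looking at that configuration can mean in practice: on the Linux systems I am familiar with, the per-node layout is exposed under /sys/devices/system/node, with one nodeN directory per NUMA node and a cpulist file in each, while per-process placement shows up in /proc/<pid>/numa_maps. The function names parse_cpulist and numa_topology below are hypothetical helpers, and the root path is a parameter so the code can be pointed elsewhere if your system differs.

```python
import glob
import os
import re


def parse_cpulist(text):
    """Turn a kernel-style CPU list such as '0-3,8' into [0, 1, 2, 3, 8]."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # an empty list means a node with memory but no CPUs
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def numa_topology(root="/sys/devices/system/node"):
    """Map each NUMA node id to the CPUs attached to it.

    Returns an empty dict if the directory does not exist, for example on a
    non-Linux machine.
    """
    topology = {}
    for path in glob.glob(os.path.join(root, "node[0-9]*")):
        match = re.search(r"node(\d+)$", path)
        if match is None:
            continue
        try:
            with open(os.path.join(path, "cpulist")) as f:
                topology[int(match.group(1))] = parse_cpulist(f.read())
        except OSError:
            continue  # skip nodes we cannot read
    return topology


if __name__ == "__main__":
    for node, cpus in sorted(numa_topology().items()):
        print(f"node {node}: CPUs {cpus}")
```

An OS or runtime that knows this mapping can then try to allocate a thread's stack and code pages from the node that owns the CPU the thread runs on.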
Now, I want to make a point here, which is that just
because the latency to memory is different does not mean that your system
is a directory-based cache coherent NUMA system.
You can still have non-directory-based systems where some memory is close and
some memory is far away. So you could still have basically a
bus, or something like that, or maybe some other interconnection network in
there, which is still running a snooping protocol, or effectively a snooping
protocol, but some data is close and some data is far away.
But if you see this in the literature, usually people talking about
directory-based cache coherent NUMA systems will call them CC-NUMA, or cache
coherent NUMA systems. That usually means that this is a cache coherent
non-uniform memory access architecture, and it usually implies that it's
directory based, though there may be other protocols that people are using
out there also.
Okay, so I want to go back one slide here, and I want to finish off talking
about one topology which is interesting. The difference between these two
slides is we went from a CPU here to CPUs, plural.
So this is a multi-core chip now, and where this gets interesting is you might
have a directory-based cache coherence system connecting multiple chips, but
then inside of a chip you may have something like a bus-based snooping protocol.
So we actually mix and match these two things.
And how we go about doing this is, if cores inside of this one chip,
for instance, have to go get data from each other, they can just effectively
snoop on each other. But outside of that, your cache controller, or maybe your
L3 cache for this particular chip, is going to respond to messages coming from
other directories, like invalidation requests, and do something about it.
So there's basically a transducer there between a directory-based cache
coherence protocol and a bus-based snoopy protocol. And this is pretty common
these days, especially given that you have a fair number of multi-core systems
showing up and being used in more of these directory-based cache coherence
systems. We'll talk about one of them at the end, actually: the SGI UV
systems, or UV 1000, which uses off-the-shelf Intel parts, modern-day sort of
Core i7 parts, mixed together with a NUMA, directory-based coherence system to
connect all the chips together. So there's a transducer from the external
snoop bus protocol to the directory-based coherence protocol.
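
The mix-and-match arrangement can be sketched as a toy Python model. This is a minimal illustration under heavy assumptions, not how the SGI UV or any real part works: the classes Chip and Directory are hypothetical, lines have only "S" (shared) and "M" (modified) states, data values and writebacks are not modeled, and lines are never evicted. The point it shows is that the directory tracks chips rather than cores, and each chip turns a directory message into a local bus broadcast.

```python
class Directory:
    """Tracks, per address, which chips (not cores) hold a copy."""

    def __init__(self):
        self.chips = {}    # chip id -> Chip
        self.sharers = {}  # addr -> set of chip ids holding the line
        self.owner = {}    # addr -> chip id holding the line modified

    def read(self, chip_id, addr):
        owner = self.owner.pop(addr, None)
        if owner is not None and owner != chip_id:
            self.chips[owner].downgrade(addr)
        self.sharers.setdefault(addr, set()).add(chip_id)

    def write(self, chip_id, addr):
        # Invalidate every other chip that holds a copy.
        for other in self.sharers.get(addr, set()) - {chip_id}:
            self.chips[other].invalidate(addr)
        self.sharers[addr] = {chip_id}
        self.owner[addr] = chip_id


class Chip:
    """Cores on one chip snoop each other; misses go to the directory."""

    def __init__(self, chip_id, num_cores, directory):
        self.chip_id = chip_id
        self.caches = [dict() for _ in range(num_cores)]  # addr -> "S" or "M"
        self.directory = directory
        directory.chips[chip_id] = self

    def read(self, core, addr):
        if addr in self.caches[core]:
            return "hit"
        holders = [c for c in self.caches if addr in c]
        if holders:  # another core on this chip answers the snoop
            for c in holders:
                c[addr] = "S"  # a modified copy is downgraded to shared
            self.caches[core][addr] = "S"
            return "snoop"
        self.directory.read(self.chip_id, addr)
        self.caches[core][addr] = "S"
        return "directory"

    def write(self, core, addr):
        if self.caches[core].get(addr) == "M":
            return "hit"
        for i, c in enumerate(self.caches):  # bus invalidate on this chip
            if i != core:
                c.pop(addr, None)
        self.directory.write(self.chip_id, addr)
        self.caches[core][addr] = "M"
        return "directory"

    # The two methods below play the role of the transducer: a directory
    # message arrives at the chip and is applied to every core's cache.
    def invalidate(self, addr):
        for c in self.caches:
            c.pop(addr, None)

    def downgrade(self, addr):
        for c in self.caches:
            if c.get(addr) == "M":
                c[addr] = "S"


directory = Directory()
a = Chip("A", 2, directory)
b = Chip("B", 2, directory)
print(a.write(0, 0x40))  # directory: first touch goes off chip
print(a.read(1, 0x40))   # snoop: served by the other core on chip A
print(b.read(0, 0x40))   # directory: chip B has no copy yet
print(b.write(1, 0x40))  # directory: invalidates chip A's copies
print(a.caches)          # [{}, {}]
```

Note that the read from core 1 on chip A never reaches the directory, which is the attraction of the hierarchy: traffic between cores on the same chip stays on the local snoop bus.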