0:00


Hello and welcome back to week four of computational neuroscience.

This week we will be talking about information theory.

We'll be exploring information theory as a way to evaluate the coding properties

of a neural system. So, going back to thinking of spiking output as binary strings of 0s and 1s: how good a code do these spike trains provide? We'll explore using information theory and related ideas as a way to understand how the coding properties of our nervous

system might be specially structured to accommodate the complex structure of the

natural environment. So today we'll be addressing three things. We're going to start by talking about entropy and information, defining

our terms. Then we're going to talk about how to

compute information in neural spike trains, and then finally we will explore

how information can tell us about coding. So, let's go back to our well-worn paradigm: a monkey choosing right from left.

And suppose we're watching the output of a neuron while different stimuli appear on a screen. Here's an example spike train: a time sequence in which we mark a spike in a given time bin with a one, and a silence with nothing. Now, here's another example.

And another. So, hopefully, when these oddball symbols

appeared, either stimulus or spike, you felt a tiny bit of surprise.

So, information quantifies that degree of surprise.

1:35

Let's say there was some overall probability p that there's a spike in some time bin, and 1 minus p that there's silence. Then the surprise for seeing a spike is defined as minus log 2 (that is, log base 2) of that probability; that's the information that we get from seeing a spike. And the information that we get from seeing silence is minus log 2 of 1 minus p, the probability of seeing the

silence. So why does the information have this

form? Like my husband and me, some of you probably play squash. And if you do, you'll know that what

you're trying to do is put the ball somewhere that will surprise your

partner. If you're a remarkable player you can put

the ball anywhere in the court. If your partner has one bit of

information, he knows which half of the court the ball is in.

There was an equal probability of being in either, but once he gets that bit he

knows which half. Each additional bit of information cuts

the possibilities down, by an additional factor of two.

So what we're really doing is multiplying probabilities. The probability of being in this half is p equals one half.

2:45

The probability of being in the front half of the court is an additional one half. Taking the negative log base 2 of this product turns it into 1 plus 1: two bits to specify being in the front left corner.
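The squash example can be checked numerically. Here is a minimal sketch (my own illustration, not from the lecture) of the surprise, minus log base 2 of the probability:

```python
import math

def information(p):
    """Surprise, in bits, of an outcome that occurs with probability p."""
    return -math.log2(p)

# One bit halves the possibilities: the left half of the court has
# probability 1/2, the front-left corner 1/2 * 1/2 = 1/4.
print(information(0.5))   # 1.0 bit
print(information(0.25))  # 2.0 bits
# Surprises add over independent halvings: -log2(a*b) = -log2(a) - log2(b)
```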

So now that we have a sense of what information is, we can understand entropy.

Entropy is simply the average information of a random variable.

So entropy measures variability. I'll warn you right now that in the future I'll usually drop this base 2 on the log and just assume it.

Entropies are always computed in log base 2 and their units are in bits.

An intuitive way to think about this is that the entropy counts the number of yes/no questions, as we saw in the case of the squash game, that it takes to specify a variable. So here's another example: let's say I

drive down from Seattle to Malibu, and park in the valet parking.

When I come back to get my car, the car park attendant is not very helpful and

won't tell me where my car is. He'll only grunt for yes answers.

So the car could be in any of these, say, eight spots.

How many questions will it take before I can find it?

So, let's say, is it on the left? Grunt.

Is it on the top? Grunt.

Is it on the top left? Grunt.

So, what's the entropy of this distribution?

Let's calculate it. So, remember, we defined the entropy, which I'll call H, as minus the sum over the probabilities times the log of the probability: H = -∑_i p_i log2(p_i). So what is p_i?

In this case, the probability of being in any one of these locations is 1/8, and that's the same for every location in this car park. And so now H = -∑_{i=1}^{8} (1/8) log2(1/8). Now what is that? Remember that 8 equals 2 to the power of 3, so the log base 2 of 8 is 3. So here we have a sum of 1/8 times minus 3; add that up over the eight possibilities, with the minus sign in front, and we get 3.

5:07
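The parking-lot calculation can be verified in a couple of lines. A small sketch (mine, not course code) of the entropy sum for a uniform distribution over eight spots:

```python
import math

def entropy(probs):
    """H = -sum_i p_i log2 p_i, in bits (terms with p = 0 contribute 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Eight equally likely parking spots, p_i = 1/8 each.
print(entropy([1/8] * 8))  # 3.0 -- three yes/no grunts
```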

So as we saw, it took three questions to specify our car, and that's exactly the

entropy of this distribution. So now let's go back to our coding sequences. Here are a few different examples; which of these do you think has the most intrinsic capability for encoding?

Encoding relies on the ability to generate stimulus driven variations in

the output. If an output has no variation, such as in this case, we're not very optimistic about its ability to encode inputs. So these three sequences differ in their

variability. Which do you think has the most inherent

coding capacity? So we can use the entropy to quantify

that variability. So what does having a large entropy do

for a code? It gives the most possibility for

representing inputs. The more intrinsic variability there is,

the more capacity that code has for representation.

So in this simple case, we can compute the entropy as a function of the

probability p, where, again, the other possibility has probability 1 minus p. So the entropy, again, is going to be given by H = -p log2(p) - (1 - p) log2(1 - p).
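This binary entropy formula is easy to tabulate; a small sketch (my own illustration, not from the course materials):

```python
import math

def binary_entropy(p):
    """H(p) = -p log2 p - (1 - p) log2 (1 - p), in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Scan p from 0 to 1 and find where the entropy peaks.
grid = [i / 100 for i in range(101)]
p_max = max(grid, key=binary_entropy)
print(p_max, binary_entropy(p_max))  # peaks at p = 0.5 with H = 1 bit
```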

So now when one plots that entropy as a function of P(r = +), which here we call p, we find that there is a maximum. So, what's the value of p at which H has a maximum? It's the value at which p equals one half. In that case, in this distribution, these

two symbols are used equally often. So, this is a concept we'll come back to

at the end of this lecture. Let's go back to squash.

So, we had the possibility of the ball being anywhere in the court.

Generally, you're not able to put the ball anywhere with equal probability.

It's exactly this reduction in possibility that makes it even possible

to play. You could model your opponent's

probability of x, the probability of placing the ball somewhere in the court,

and you can, to some extent, predict where the ball is.

The lower the entropy of your partner's p of x, the more easily you'll defeat him.

So, let's come back finally to our spike code.

We now appreciate that the entropy tells us about the intrinsic variability of our outputs, but obviously we really need to consider the stimulus, and how it's

driving those responses. So here's an example.

The stimulus can take one of two directions, and each is perfectly encoded

by either a spike or no spike. So here's the stimulus, and here's the spiking response. Every time there's a rightward stimulus,

we get a spike. So how about this case?

We'd probably still be comfortable saying that the response is encoding the stimulus; the two are largely correlated. On the other hand, there are several other events that are misfires. In this case, the stimulus occurred with no spike; in that case, there was a spike with no stimulus.

8:11

But how about this? At least at a glance, there seems to be

little or no relationship between the responses and the stimulus.

So, just as a sidebar, what if the problem were not so much that our code is noisy, but that we haven't exactly understood what the code is doing? That is, maybe there's some temporal sequence of S that would more appropriately be thought of as the true stimulus.

This is really the question that we were addressing in week two.

How do we know what our stimulus was? But let's go back to the main question.

What we really wanted to know is: how much of the variability that we see here in R is actually used for encoding S? We need to incorporate the possibility of error. So, let's do that by assuming now that a spike is generally produced in response to stimulus plus.

So here, there's also some possibility that there will be no spike, we'll

quantify that using the error probability q.

So the probability of a correct response in this case is 1 minus q, and the probability of an incorrect response is q. And let's assume the same error in this

case for a silence response. So, now we would like to know, how much

of the entropy of our responses is accounted for by noise, by these errors.

Because that's going to reduce the response's capacity to encode S.

9:33

The way we can address that is to compute how much of the response entropy can be

assigned to the noise. That is, if we give a stimulus plus, a rightward stimulus, we get a variety of responses. Those conditional responses for a fixed S have some entropy of their own. Similarly when we give stimulus minus. So we call these stimulus-conditional entropies the noise entropy. So this brings us to the definition of

the mutual information, the amount of information that the response carries

about the stimulus. This is given by the total entropy minus

the average noise entropy. That is, the amount of entropy that the responses r have for some fixed s, averaged over s, and that's drawn out

here. So, here's the total entropy of the

responses, and here's the conditional entropy.

So the entropy of the responses conditioned on a particular stimulus s,

averaged over s. So now let's go back to our binomial

calculations, and see how the mutual information depends on the noise.

Now fixing p. We're going to take p to be the one that

maximizes the entropy, so p equals one half.

Let's vary the noise probability, and again assume that the noise is the same

for spike and silence. That is, there is one value q.
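With p = 1/2 and a single symmetric error probability q, the mutual information works out to 1 minus the binary entropy of q. A sketch of this calculation (my own, not course code):

```python
import math

def h2(q):
    """Binary entropy of q, in bits."""
    if q in (0.0, 1.0):
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def info_vs_error(q):
    """Mutual information when both stimuli are equally likely (p = 1/2)
    and spike/silence each flip with the same error probability q.
    The total entropy of the response is then 1 bit, and the noise
    entropy for each stimulus is h2(q), so I = 1 - h2(q)."""
    return 1.0 - h2(q)

for q in (0.0, 0.1, 0.25, 0.5):
    print(q, info_vs_error(q))
# q = 0 gives 1 bit (a perfect code); q = 0.5 gives 0 bits (chance).
```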

10:58

So this should be intuitive. When there's no noise entropy the

information is just the entropy of the response, which in this case is one bit.

As the error rate increases, as the error probability grows larger and larger, spiking is less and less likely to actually represent the stimulus S, and

the mutual information decreases. When the error probability reaches a

half, that is, responses occur at chance, there's no mutual information between R

and S. So, let's just check that everyone's

still on board. More generally, what are the limits?

So, if the response is unrelated to the stimulus, what is the probability of r given s? It's simply the probability of the response, because there's no relationship between response and stimulus. So the noise entropy is equal to the total entropy, and the difference of the response and noise entropies is zero.

At the opposite extreme, the response is perfectly predicted by the stimulus.

So in this case the noise entropy is zero.

So the mutual information will be given by the total entropy of the response.

All of the response's coding capacity is used in encoding the stimulus.

So let's just see how that works for continuous variables.

We've talked a lot about, about binary choices.

Let's think more generally about cases where we have some continuous r, and some response variability for the encoding of a stimulus s by r.

So here's an example where we've given several different stimuli.

12:31

Each of these distributions is the probability of the response given a particular choice of the stimulus. And now that's going to be weighted by

the probability of that stimulus. And when we add all of these conditional

distributions together, we get the full probability, P of r.

Now, what we're doing by computing the entropy is we're going to compute the entropy of this blue distribution; that's going to give us the total entropy. And then we're going to compute the

entropy of these conditional distributions.

And now we're going to average them over the stimulus that drove them.

So for these two cases, they differ by the amount of intrinsic noise that each

response has. So we give a stimulus s in this case,

there's some range of variability that takes out some of my range of r.

13:23

In this case, when we give that same stimulus, now the degree of noise

stretches over a much wider range of the response distribution.

So much more of the variability in R is accounted for by variability in responses

to specific stimuli. And so, I hope you can see that this kind of set of response distributions is going to encode much more information about S, that the information between S and R is much larger in this case than it is for this case.
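The narrow-noise versus broad-noise comparison can be made concrete. In this sketch (my own construction, not from the lecture), two equally likely stimuli produce discretized Gaussian response distributions centered at r = -1 and r = +1, and we compute I(R;S) as the total entropy minus the average noise entropy:

```python
import math

def entropy(p):
    """H = -sum_i p_i log2 p_i, in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def gaussian_pmf(centers, mean, sigma):
    """Discretized, normalized Gaussian over the given bin centers."""
    w = [math.exp(-(c - mean) ** 2 / (2 * sigma ** 2)) for c in centers]
    z = sum(w)
    return [x / z for x in w]

def mutual_info(sigma):
    """I(R;S) = H(R) - <H(R|s)>_s for two equally likely stimuli
    whose responses are centered at r = -1 and r = +1."""
    r_axis = [i / 10 for i in range(-40, 41)]
    cond = [gaussian_pmf(r_axis, m, sigma) for m in (-1.0, 1.0)]
    # Total response distribution: P(r) = sum_s P(s) P(r|s)
    p_r = [0.5 * a + 0.5 * b for a, b in zip(cond[0], cond[1])]
    # Noise entropy: entropy of P(r|s), averaged over the two stimuli
    noise = 0.5 * entropy(cond[0]) + 0.5 * entropy(cond[1])
    return entropy(p_r) - noise

# Narrow noise: the conditionals barely overlap, so I is close to 1 bit.
print(mutual_info(0.2))
# Broad noise: the conditionals overlap heavily, so little information is left.
print(mutual_info(2.0))
```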

Let's play a little bit with these distributions, because I want to

demonstrate a couple of things that I think really illustrate why information is a useful and intuitive measure of the relationship between two variables.

I'm using capital letters to denote the random variable, and lower case letters

to denote a specific sample from that random variable.

So, what I'd like to show you is that the information quantifies how far from

independent these two random variables R and S are.

To demonstrate that, I'm going to use the Kullback-Leibler divergence, a measure of the difference between probability distributions that we introduced earlier.

If the mutual information measures independence, then we'd like to quantify the difference between the joint distribution of R and S and the distribution these two variables would have if they were independent. That is, that joint distribution would simply be the product of their marginal distributions. So, first, to refresh your memory, D KL,

let's redefine it. D_KL between two different probability distributions, say P and Q, is equal to an integral over the probability of x times the log of P of x over Q of x: D_KL(P, Q) = ∫ dx P(x) log2[P(x)/Q(x)].

So now let's apply that to these two distributions.

So, let's compute that, we have a integral over ds and over dr.

Joint distribution, times the log the joint distribution divided by the

marginal distributions. Now we can rewrite that, using the

conditional distribution. In the following form, we can rewrite

that as the probability of r, given s times the probability of s, that's just

equivalent to the joint distribution, divided by P of r, P of s.

And now you can see that P of s cancels out, and we can rewrite this as the difference of two terms. So we'll just expand that log.

All right, now let's concentrate on these terms. The first is going to be equal to the negative of the integral ds dr of the probability of s and r, times the log of P of r: -∫∫ ds dr P(s,r) log2 P(r). Then we have plus the integral ds dr of the joint distribution, which we'll break up into P of s times P of r given s, just dividing the joint distribution again into its conditional and marginal, times the log of P of r given s: +∫∫ ds dr P(s) P(r|s) log2 P(r|s). Now, let's look at the terms that we've

developed here. In the first, we can see that we can just integrate over ds; we can integrate the s part out of this joint distribution, and this part is simply going to be the entropy of P of r. Whereas the second is going to be the entropy of P of r given s, averaged over s with weight P of s.
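Written out compactly (my LaTeX transcription of the steps just described):

```latex
\begin{aligned}
D_{\mathrm{KL}}\!\left[P(r,s)\,\|\,P(r)P(s)\right]
  &= \iint ds\, dr\; P(r,s)\,\log_2\frac{P(r,s)}{P(r)\,P(s)} \\
  &= \iint ds\, dr\; P(r|s)\,P(s)\,\log_2\frac{P(r|s)}{P(r)} \\
  &= -\int dr\; P(r)\,\log_2 P(r)
     + \int ds\; P(s)\!\int dr\; P(r|s)\,\log_2 P(r|s) \\
  &= H(R) - \left\langle H(R \mid s)\right\rangle_s
   = I(R;S).
\end{aligned}
```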

And so what I've shown you is that this form, in terms of the Kullback-Leibler divergence, gives us back the form that we've already seen: the entropy of the responses minus the average, over the stimuli, of the noise entropy for a given stimulus. What I hope you realize is that

everything we've done here in terms of response and stimulus we could simply

flip, response and stimulus, redo the same calculation, and instead end up with

entropy of the stimulus minus an average over the responses, of the entropy, of

the stimulus given the response. So information is completely symmetric in the two variables it's being computed between.

Mutual information between response and stimulus is the same as mutual information between stimulus and response.

18:12

So here's our grandma's famous mutual information recipe.

What we're going to do to compute this mutual information is to take a stimulus s, repeat it many times, and that will give us the probability of responses given s. We're going to compute the variability

due to the noise. That is, we'll compute the noise entropy of these responses: for a given value of s, we'll compute the entropy of the responses for that s, repeat this for all s, and then average over s. Finally, we'll compute the probability of the responses; that'll just be given by the average, over all the stimuli that we presented, of the probability of the response given the stimulus, and that

will give us the total entropy of the responses.
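The recipe can be sketched end to end. Assuming (my construction, not course code) we already have the stimulus probabilities P(s) and the conditional distributions P(r|s) as lists:

```python
import math

def entropy(p):
    """H = -sum_i p_i log2 p_i, in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def mutual_information(p_s, p_r_given_s):
    """The recipe: total entropy of P(r) minus the noise entropy
    of P(r|s) averaged over the stimuli.
    p_s[i] is P(s_i); p_r_given_s[i][r] is P(r | s_i)."""
    n_r = len(p_r_given_s[0])
    # Total response distribution: P(r) = sum_s P(s) P(r|s)
    p_r = [sum(p_s[i] * p_r_given_s[i][r] for i in range(len(p_s)))
           for r in range(n_r)]
    # Average noise entropy: sum_s P(s) H[P(r|s)]
    noise = sum(p_s[i] * entropy(p_r_given_s[i]) for i in range(len(p_s)))
    return entropy(p_r) - noise

# Binary example from the lecture: two stimuli, error probability q = 0.1.
q = 0.1
print(mutual_information([0.5, 0.5], [[1 - q, q], [q, 1 - q]]))
```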

So, in the next section, we'll be applying that idea to calculating

information in spike trains. There'll be two methods that we'll work with: one will start by calculating information in spike patterns, and then we'll be calculating information in single spikes.