0:00

Okay, now, we are going to talk about a different class of random graph

models that begins to bring in attributes of individuals or nodes into the picture.

And as we have looked in terms of random graph models so far, we have talked about

Erdos-Renyi really basic uniform at random, used for benchmark models.

And we saw that they missed a lot of real world features.

Then we talked about other kinds of link

by link models that began to bring in features.

We looked at, at modes that kept clustering.

They could match different kinds of degree distributions.

They could get some aspect of co, correlation.

And what we are going to start to do now is, is bring in attributes of nodes and,

so, we'll start with stochastic blocks models which are

sort of the natural enrichment of Erdos-Renyi random networks.

So here they are going to allow

the probability of a link to actually

depend on some characteristics of the nodes.

And this could be characteristics attributes that,

that could be, that could be latent.

So it might be that that we are not fully observing all these things.

We're trying to infer them or there could be observed characteristics.

And so we are going to start with

simple block models that, where everything is, is observed.

1:14

And so here,

the idea now is that nodes have characteristics,

things like age, gender, religion, profession, and so forth.

And links between nodes depend on those characteristics.

So people of of similar ages are more

likely to link than people of different ages.

People of similar religions and so forth.

And so now let's start with a simple situation.

We have three types of nodes in this example,

blue, green, and yellow nodes. They're connected in a network.

And we can imagine, then, that there's, the

model is that there's different probabilities of connections.

So if there were homophily here, we think of

the blues being more likely to connect to blues.

There's a probability that yellows connect with yellows.

There's a probability that greens connect with greens.

Probability that blues connect with greens, and so forth.

So we have different probabilities for every different possible combination

of types of nodes in this situation.

But then everything else about this model is independent.

So, the links are formed independently, it's just

the probabilities are varying depending on node attributes.

And any two pair of blue nodes have the same probability of linking to each other.

Any, pair of a green and a blue node has the same

probability of, of linking as any other pair of, of green and blue.

It's just that we're allowing the probabilities to vary with types

of nodes.

Okay, so if this were the, the resulting network that we observed, this is

a very easy kind of model to go ahead and, and estimate probabilities for.

And in this case, for instance, if we're trying to estimate what's the probability

of, of blue links links from, between two blue nodes, then what do we see?

Well we see that that there is five actual links present among blues right,

one, two, three, four, five and out of four possible nodes

we have six possible total so we end up with an estimate of five six.

3:41

More generally, these kinds of models are

going to be worked with, often with continuous variables.

And, so if we have continuous covariates so, for instance, we keep track of age in,

in, in days or some for instance or in fractions of years, then we

would have, instead of just green yellow and blue, we might have people that are

34 and half years old and people that are 60 years old and so forth.

And so then when we start to think about a, links between two nodes,

we could allow them to, to depend

in more complicated ways on, on the characteristics.

So if we have some vector of characteristics that i has, some vector of

characteristics that j has, then we could have a, a vector of parameters beta i.

Vector of parameters could be the j.

4:28

It could also be that, that there's, that the probability of the link

depends on differences, so people that are close in age are more likely to link

than people that have distance between ages

or it could be geographic locations, GPS data.

If you live closer together you're more likely to,

to link together than if you live far apart.

And then we have basically a function,

which we're going to base the probability on.

And now the standard formulation for this, since this could be depending on

the parameters and so forth, this might be positive or negative.

The idea would then be that, that often one uses

a logistic form where you look at the log of the

probability that there's a link between i and j compared to

the probability that there's not a link between i and j.

Right, so we take the odds ratios here.

What's the ratio of the odds that there are a link, compared to not.

And we just say that the log of that is proportional to this.

So this would be a standard way of fitting this

kind of model.

And what that's going to do is then give you

an ability to continuously estimate

5:30

probabilities based on different attributes.

And so, so now it's a more complicated stochastic model,

but one that still very easy to go ahead and estimate.

And you can use this in most any standard

statistics package or allow you to do logistic regressions.

And estimate this quite easily.

Now one possibility to

using this kind of model would be to test for homophily for instance.

So is there real difference between the probability of,

of certain types linking compared to linking across types?

So for instance, here is an example.

This is from the data set I talked about earlier in, in the course from

work that I did with Abijeet Bannerjee, Arun Chandrasekar, and Esther du Fleau.

6:13

So this is the 2013 Science paper and, and this is what I

did here is I pulled out one of the villages, village 26 in particular.

And here the nodes are, are color-coded by caste.

And in particular, the blue nodes are what are known as scheduled castes and

scheduled tribes, so these are castes that

are under affirmative action by the Indian government.

The reds are general castes and otherwise backward castes.

So these are not affirmative action castes.

So.

This is a differences in caste, and then we could look here,

and we could say okay, let's do the very simple block model.

We're just going to see what's the probability of linking with somebody of

your same caste compared to linking with somebody of a different caste.

So here we can go through and just directly count how

many links there are among members of the same caste and

going across castes, where caste here is

just this dichotomous version actually, the particular caste.

There is quite a number of castes in these villages.

So what do we end up with, if you

do the count here the probability of having a link.

Between a red and a blue node in this graph is 0.006, the probability

of having a link between either two reds or two blues is 0.089.

So more than ten times more likely to have a link within the same

caste as a caste designation in this case, as oppose to going across this boundary.

And so we see that it indeed there is a

substantial homophily based on this kind of simple block model.

And that block model then allows us to, to, keep track of these things.

And if we hadn't colored these nodes, it's

not so obviously that we would see that there

is a, a strong dichotomy here as once we

start keeping track of this and then estimating directly.

Now, we could have not had the pictures of the, of

the colors here, and we could have tried to discover that.

That's known as community detection or clustering algorithms.

So there's ways in which we could begin to look for things.

And we might begin to say, okay, look actually this,

even the blues look like they're segmented into different groups.

We could try and, and fit,

and see whether there's additional blocks, and whether there is more density in here

than over here and your algorithms and techniques people use for doing that.

So what's the important aspect that's missed from block models?

Well, the likelihood does depend on node attributes, either observed or, or latent.

But in practice. Often that the probability that that two

nodes have been interacting, so people would been interacting

could depend on whether they have friends in common.

So there is real social structure to it.

So let's you know, the fact that, that A and

B both have a common friend C makes it more

likely that they are going to be linked to each

other than if, if they didn't have a friend in common.

And so that could be an aspect, which independently of their

characteristics, people tend to meet people through other, through friends or

have reasons for spending time together in,

in triples and so forth, and that's going

to lead to extra things, so we're going to need richer models to capture that.

So next up we'll be talking about classes of models

that allow us to keep track of these explicit dependencies.

Which is going to make our life a little

more difficult statistically, because things aren't independent any longer,

but will allow us to capture a lot

of things in networks that are going to be important.