
Local structure that doesn't require full table representations is important in both directed and undirected models. How do we incorporate local structure into undirected models? The framework for that is called log-linear models, for reasons that will be clear in just a moment.

So whereas in the original representation of the unnormalized density we defined P tilde to be a product of factors phi_i(D_i), each of which is potentially a full table, now we're going to shift to a representation that uses a linear form. Here is a linear form that is subsequently exponentiated, and that's why it's called log-linear: because the logarithm is a linear function. So what is this form? It's a linear function built from things called coefficients and things called features.

Features, like factors, each have a scope, which is the set of variables on which the feature depends, but different features can have the same scope: you can have multiple features, all of which are over the same set of variables. Notice that each feature has just a single parameter wj that multiplies it.

So what does this give rise to? If we have a log-linear model, we can push the exponent through the summation, and that gives us something that is a product of exponential functions. You can think of each of these as effectively a little factor, but it's a factor that has only a single parameter wj.
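As a concrete sketch (not from the lecture; the variables, features, and weights here are made up for illustration), the log-linear form P_tilde(x) = exp(-sum_j wj fj(x)) can be computed either as one exponentiated sum or, equivalently, as a product of little exp(-wj fj) factors:

```python
import math

def unnormalized_density(assignment, features, weights):
    """Log-linear model: P_tilde(x) = exp(-sum_j w_j * f_j(x))."""
    energy = sum(w * f(assignment) for f, w in zip(features, weights))
    return math.exp(-energy)

# Hypothetical features over two binary variables X1, X2.
features = [
    lambda a: a["X1"],                          # scope {X1}
    lambda a: a["X2"],                          # scope {X2}
    lambda a: 1 if a["X1"] == a["X2"] else 0,   # scope {X1, X2}
]
weights = [0.5, -0.3, 1.2]   # one coefficient w_j per feature

p = unnormalized_density({"X1": 1, "X2": 0}, features, weights)

# Pushing the exponent through the sum gives a product of small factors,
# one per feature, each with its single parameter w_j:
prod = 1.0
for f, w in zip(features, weights):
    prod *= math.exp(-w * f({"X1": 1, "X2": 0}))
```

The two computations agree exactly, which is the equivalence the lecture describes.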

Since these are a little bit abstract, let's look at an example.

Specifically, let's look at how we might represent a simple table factor as a log-linear model. Here is a factor phi over two binary random variables, X1 and X2; a full table factor would have four parameters, a00, a01, a10, and a11. We can capture this factor in a log-linear model using a set of features that are indicator functions. This is an indicator function: it takes the value 1 if X1 is 0 and X2 is 0, and it takes 0 otherwise.

This is the general notion of an indicator function: it looks at the event or constraint inside the curly braces and returns a value of 0 or 1 depending on whether that event is true or not.

And so, if we wanted to represent this factor as a log-linear model, we can simply sum over all four combinations of values k and l, each of which is either 0 or 1, so we're summing over all four entries here. And we have a parameter, or coefficient, wkl that multiplies this feature. So we would have a summation of wkl terms, where w00 contributes only in the case that X1 is 0 and X2 is 0. So we would have exp(-w00) when X1 = 0 and X2 = 0, we would have exp(-w01) when X1 = 0 and X2 = 1, and so on and so forth.

And it's not difficult to convince ourselves that if we define wkl to be the negative log of the corresponding entries in this table, then that gives us right back the factor that we defined to begin with.

So this shows that this is a general representation, in the sense that we can take any factor and represent it as a log-linear model simply by including all of the appropriate features. But we don't generally want to do that; generally, we want a much finer-grained set of features. So let's look at some examples of features that people use in practice.
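Before moving on, the table-factor construction above can be sketched in code (the table entries here are made up): setting wkl = -log(a_kl) over indicator features recovers the original factor exactly.

```python
import math

# A hypothetical full table factor phi(X1, X2) over binary variables.
a = {(0, 0): 2.0, (0, 1): 0.5, (1, 0): 1.5, (1, 1): 3.0}

# Define w_kl as the negative log of the corresponding table entry.
w = {kl: -math.log(v) for kl, v in a.items()}

def phi_loglinear(x1, x2):
    # Sum of w_kl * indicator{X1 = k, X2 = l}; for any assignment
    # exactly one indicator fires, selecting its single coefficient.
    energy = sum(w_kl * (1 if (x1, x2) == kl else 0)
                 for kl, w_kl in w.items())
    return math.exp(-energy)
```

For every assignment, phi_loglinear(x1, x2) equals the table entry a[(x1, x2)].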

So here are the features used in a language model that we've discussed previously. First, let's just remind ourselves: we have two sets of variables. We have the variables Y, which represent the annotations for each word in a sequence, corresponding to the category of that word. So this is the beginning of a person name, this is the continuation of a person name, the beginning of a location, the continuation of a location, and so on, as well as a bunch of words that are none of person, location, or organization and are all labeled "other". So the value of Y tells us, for each word, what category it belongs to, because we're trying to identify people, locations, and organizations in a sentence. We have another set of variables, X, which are the actual words in the sentence. Now, we could use a full table representation that tries to relate each and every Y, with a full factor, to every possible word in the English language, but that is going to be very expensive, with a very large number of parameters.

And so instead we're going to define features that look, for example, at f of a particular Yi, which is the label for the ith word in the sentence, and Xi, which is that ith word. That feature might say, for example: the indicator function for Yi equals person and Xi is capitalized. That feature doesn't look at the individual words; it just looks at whether the word is capitalized. Now we have just a single parameter that looks only at capitalization, and it parameterizes how important capitalization is for recognizing that something is a person. We could also have another feature.
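A minimal sketch of the capitalization feature just described (the label string "person" and the capitalization test are illustrative choices, not from the lecture):

```python
def f_person_capitalized(y_i, x_i):
    """Indicator{Y_i = "person" and X_i is capitalized}."""
    return 1 if y_i == "person" and x_i[:1].isupper() else 0

# A single weight w_j multiplies this feature wherever it fires, so one
# parameter captures how strongly capitalization suggests a person name.
```

Note the feature never enumerates individual words; it only asks one yes/no question about the word.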

This is a different feature, which could be part of the same model, that says Yi is equal to location, or, to be a little more precise, this might be "beginning of person" and this might be "beginning of location", and Xi appears in some atlas. Now, there are things that appear in an atlas other than locations, but if a word appears in an atlas, there's presumably a much higher probability that it's actually a location. And so we might have, again, a weight for this feature that increases the probability of Yi being labeled in this way. So you can imagine constructing a very rich set of features, all of which look at certain aspects of the word, rather than enumerating all possible words and giving a parameter to each and every one of them. Let's look at some other examples of feature-based models. So, this is an example from statistical

physics, it's called the Ising model.

The Ising model looks at pairs of adjacent variables, so it's a pairwise Markov network, and it gives us a coefficient for their product. This is a case where the variables are binary, but not in the space {0, 1}; rather, they take values -1 and +1. So now we have a model parameterized by features that are simply the products of the values of adjacent variables. Where might this come up?

It comes up, for example, in modeling the spins of electrons in a grid. Here you have a case where the electrons can rotate in one direction or the other: the atoms marked with a blue arrow have one rotational axis, and those marked with a red arrow rotate in the opposite direction. This basically says that the probability distribution over the joint set of spins in the model depends on whether adjacent atoms have the same spin or opposite spins. Notice that 1 * 1 is the same as -1 * -1, so this feature really just looks at whether the two spins are the same or different, and there is a parameter that weighs same versus different.

That's what this feature represents. Depending on the value of this parameter, if it goes one way, we're going to favor systems where adjacent atoms spin in the same direction; if it goes the opposite way, we're going to favor atoms that spin in different directions. Those two cases are called ferromagnetic and antiferromagnetic, respectively.

Furthermore, you can define in these systems the notion of a temperature. The temperature here says how strong the connections are. Notice that as the temperature T grows, the wij's get divided by T and all shrink toward 0, which means that the strength of the connection between adjacent atoms effectively vanishes and they become almost decoupled from each other. On the other hand, as the temperature decreases, the wij's grow in magnitude and the coupling between adjacent atoms becomes stronger.

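A sketch of the Ising energy with a temperature parameter (the grid, the weights, and the sign convention here are assumptions for illustration; sign conventions vary across texts):

```python
import math

def ising_unnormalized(spins, pair_weights, temperature=1.0):
    """exp(sum over edges of (w_ij / T) * x_i * x_j), spins in {-1, +1}.

    With this sign convention, positive w_ij favors equal spins
    (ferromagnetic) and negative w_ij favors opposite spins
    (antiferromagnetic).
    """
    energy = sum((w / temperature) * spins[i] * spins[j]
                 for (i, j), w in pair_weights.items())
    return math.exp(energy)

# Hypothetical 2x2 grid with ferromagnetic (positive) weights.
edges = {("a", "b"): 1.0, ("a", "c"): 1.0, ("b", "d"): 1.0, ("c", "d"): 1.0}
aligned = {"a": 1, "b": 1, "c": 1, "d": 1}     # all spins agree
checker = {"a": 1, "b": -1, "c": -1, "d": 1}   # every edge disagrees
```

As T grows, the preference for the aligned configuration over the checkerboard one shrinks toward nothing: the spins decouple, exactly as described above.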

Another kind of feature that's used in lots of practical applications is the notion of a metric feature in an MRF. So what's a metric feature? This is something that comes up

mostly in cases where you have a bunch of random variables X that all take values in some joint label space V. For example, they might all be binary, or they might all take values one, two, three, four. What we'd like, when Xi and Xj are connected to each other by an edge, is for Xi and Xj to take similar values. In order to enforce that, we need a notion of similarity.

We're going to encode that using a distance function mu that takes two values, one for Xi and one for Xj, and says how close they are to each other. So what does a distance function need to be? First of all, it needs to be a non-negative function. Beyond that, it needs to satisfy the standard conditions on a distance function, or metric. The first is reflexivity, which means that if the two variables take on the same value, then the distance must be zero.

Symmetry means that the distances are symmetrical: the distance between two values v1 and v2 is the same as the distance between v2 and v1. And finally there is the triangle inequality, which says that the distance between v1 and v2 is no greater than the distance between v1 and v3 plus the distance from v3 to v2 — the standard triangle inequality.

If a distance satisfies the first two conditions, it's called a semimetric; if it satisfies all three, it's called a metric. Both are actually used in practical applications.
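On a finite label space these conditions can be checked mechanically; here is a sketch (the label spaces and distance functions in the usage examples are illustrative):

```python
import itertools

def classify_distance(labels, mu):
    """Return "metric", "semimetric", or "neither" for mu on `labels`."""
    pairs = list(itertools.product(labels, repeat=2))
    # Non-negativity, reflexivity, and symmetry -> at least a semimetric.
    semimetric = (all(mu(v, v) == 0 for v in labels) and
                  all(mu(a, b) >= 0 and mu(a, b) == mu(b, a)
                      for a, b in pairs))
    # Adding the triangle inequality -> a full metric.
    triangle = all(mu(a, b) <= mu(a, c) + mu(c, b)
                   for a, b, c in itertools.product(labels, repeat=3))
    if semimetric and triangle:
        return "metric"
    return "semimetric" if semimetric else "neither"
```

For instance, absolute difference on {0, 1, 2} is a metric, while squared difference satisfies the first two conditions but violates the triangle inequality (4 > 1 + 1), making it only a semimetric.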


So, conversely, the farther two values are from each other under the distance metric, the lower the probability in the model.

So here are some examples of metric MRFs. The simplest possible metric is one that gives a distance of 0 when the two classes are equal to each other and a distance of 1 everywhere else; it's just like a step function. That gives rise to a potential that looks like this: we have zeros on the diagonal, so we get a bump in the probability when the two adjacent variables take on the same label, and otherwise we get a reduction in the probability, but it doesn't matter which particular values they take.

That's one example of a simple metric. A somewhat more expressive example comes up when the values V are actually numerical, in which case you can look at the difference between them: when vk is equal to vl, the distance is 0, and then you have a linear function that increases as vk and vl grow apart; this is the absolute value of vk minus vl. A more interesting notion that comes up a lot in practice is when you don't want to arbitrarily penalize things that are far away from each other in label space. This is what's called a truncated linear penalty: beyond a certain threshold, the penalty just becomes constant, so it plateaus. There is a penalty, but it doesn't keep increasing as the labels get

further from each other. One example where metric MRFs are used is image segmentation, where we tend to favor segmentations in which adjacent superpixels take the same class. Here we have no penalty when the superpixels take the same class and some penalty when they take different classes. This is a very common, albeit simple, model for image segmentation.
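The three distance functions discussed above can be sketched as follows (numerical labels are assumed for the linear variants):

```python
def zero_one(vk, vl):
    """Distance 0 when the labels agree, 1 otherwise (a step function)."""
    return 0 if vk == vl else 1

def linear(vk, vl):
    """Absolute difference |vk - vl| for numerical labels."""
    return abs(vk - vl)

def truncated_linear(vk, vl, threshold):
    """Linear in |vk - vl| up to a threshold, then a constant plateau."""
    return min(abs(vk - vl), threshold)
```

In the MRF, each pairwise potential would then be something like exp(-w * mu(xi, xj)) for the chosen distance mu, so larger distances mean lower probability.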

Let's look at a different MRF, also in the context of computer vision. This is an MRF used for image denoising. Here we have a noisy version of a real image: you can see this kind of white noise overlaid on top of the image, and what we'd like to do is get a cleaned-up version of the image. So we have a set of variables X that correspond to the noisy pixels and a set of variables Y that correspond to the clean pixels, and we'd like a probabilistic model that relates X and Y. We'd like to have two effects on the pixels Y. First, we'd like Yi to be close to Xi; but if you just do that, then you're just going to stick with the original image. So the main constraint we can employ on the image in order to clean it up is the fact that adjacent pixels tend to have the same value. In this case, we're going to constrain the Yi's to be close to their neighbors, and the further away a pixel is from its neighbors, the bigger the penalty. That's a metric MRF. Now, we

could use just a linear penalty, but that would be a very fragile model, because obviously the right answer isn't an image where all pixels are equal to each other in their actual intensity values; that would just be a single grayish-looking image. What you'd like is to let one pixel depart from its adjacent pixel if it's getting pulled in a different direction, either by its own observation or by other adjacent pixels. And so the right model to use here is actually the truncated linear model, which is commonly used and very successful for image denoising.
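A sketch of how the two effects combine into a denoising energy, on a 1D row of pixels with made-up weights (lower energy corresponds to higher probability):

```python
def denoising_energy(x, y, w_data=1.0, w_smooth=1.0, threshold=2.0):
    """Data term pulls each y_i toward its noisy observation x_i; a
    truncated linear smoothness term pulls adjacent y_i together but
    plateaus, so genuine intensity edges are not smoothed away.
    """
    data = sum(abs(yi - xi) for yi, xi in zip(y, x))
    smooth = sum(min(abs(y[i] - y[i + 1]), threshold)
                 for i in range(len(y) - 1))
    return w_data * data + w_smooth * smooth

noisy = [0, 1, 9, 10]        # hypothetical noisy pixel row with an edge
keep_edge = [0, 0, 10, 10]   # reconstruction that preserves the edge
flatten = [5, 5, 5, 5]       # reconstruction that smooths everything gray
```

Because the smoothness penalty is truncated, the one legitimate intensity jump costs only the threshold, so the edge-preserving reconstruction wins over the flat gray one.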

Interestingly, almost exactly the same idea is used in the context of stereo reconstruction. There, the values you'd like to infer, the Yi's, are the depth disparity for a given pixel in the image: how deep it is. Here, also, we have spatial continuity: we'd like the depth of one pixel to be close to the depth of an adjacent pixel. But once again, we don't want to enforce this too strongly, because you do have depth discontinuities in the image, and eventually you'd like things to be able to break away from each other. So once again, one typically uses some kind of truncated linear model for stereo reconstruction, often augmented by other little tricks. For example, we might look at the actual pixel appearance, say, the color and texture: if the color and texture of adjacent pixels are very similar to each other, we might want a stronger similarity constraint, whereas if they are very different from each other, the pixels are more likely to belong to different objects, and we don't want to enforce quite as strong a similarity constraint.