I should say that this is a pseudo-realistic learning problem, because the instances that one samples from a network are always cleaner than the instances one gets from a real-world data set. In a real-world scenario, it is rarely the case that the network whose structure you have, the one you're trying to learn, has the exact same structure as the true underlying distribution from which the data were generated. So this is a much cleaner scenario, but it's still useful and indicative.
So what we see here are the results of learning. The x-axis is the number of samples, and the y-axis is a distance function between the true distribution and the learned distribution. That distance function, which we're not going to discuss in detail at the moment, is a notion called relative entropy, also known as KL divergence. What we need to know about it for the purposes of the current discussion is that it is zero when the two distributions are identical, and non-negative otherwise.
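For reference, the standard definition of the relative entropy between the true distribution P and the learned distribution Q (the lecture defers the details, so this is included only as a pointer) is

D(P \,\|\, Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)},

which is zero exactly when P and Q are identical and positive otherwise.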
So, what we see here is that the blue line corresponds to maximum likelihood estimation. We can see several things about this line. First of all, it's very jagged, with a lot of bumps in it; and second, it's consistently higher than all of the other lines. Which means that maximum likelihood estimation, although it does continue to get lower as we get more data, still hasn't gotten close to the true underlying distribution even with as many as 5,000 data points. Conversely, let's see what happens with Bayesian estimation.
This is all Bayesian estimation with a uniform prior and different equivalent sample sizes; that is, it uses a prior network that is uniform, with different values of alpha. What we see here is that for alpha equals five, which is the green line, and alpha equals ten, the curves are almost sitting directly on top of each other, and they're both considerably lower than all of the other lines, including maximum likelihood estimation. As we increase the prior strength, so that we have a firmer belief in the uniform prior, we can see that we move a little bit away, and now the performance becomes a little worse.
But notice that by around 2,000 data points, we're already pretty close to where we were for an equivalent sample size of five. For an equivalent sample size of 50, which is this dark blue line, it takes a little bit longer to converge, and it doesn't quite make it. But even with an equivalent sample size of 50, which is pretty high, you get convergence to the correct distribution much faster than you do with maximum likelihood estimation.
So, to summarize: in Bayesian networks, if we're doing Bayesian parameter estimation and we're willing to stipulate that the parameters are independent a priori, then they're also independent in the posterior, which allows us to maintain the posterior as a product of posteriors over individual parameters.
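In the usual notation (assumed here, not written out in the lecture), with \theta_{X_i \mid \mathbf{u}} denoting the parameters of the CPD of X_i for the parent assignment \mathbf{u}, this factorization of the posterior reads

P(\theta \mid \mathcal{D}) = \prod_i \prod_{\mathbf{u} \in \mathrm{Val}(\mathrm{Pa}_{X_i})} P(\theta_{X_i \mid \mathbf{u}} \mid \mathcal{D}).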
For multinomial Bayesian networks, we can go ahead and perform Bayesian estimation using the exact same sufficient statistics that we used for maximum likelihood estimation, which are the counts corresponding to a value of the variable and a value of its parents.
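In symbols (a notational convention adopted here for the formulas below), the sufficient statistics are the counts M[x, \mathbf{u}], the number of instances in which X takes the value x and its parents take the joint assignment \mathbf{u}, together with the totals M[\mathbf{u}] = \sum_x M[x, \mathbf{u}].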
And whereas, in the context of maximum likelihood estimation, we would simply use the formula on the left, in the case of Bayesian estimation we are going to use the formula on the right, which has exactly the same form, only it also accounts for the hyperparameters.
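In that notation, the two formulas being referred to are presumably the maximum likelihood estimate

\hat{\theta}_{x \mid \mathbf{u}} = \frac{M[x, \mathbf{u}]}{M[\mathbf{u}]}

on the left, and the Bayesian posterior-mean estimate

\hat{\theta}_{x \mid \mathbf{u}} = \frac{M[x, \mathbf{u}] + \alpha_{x \mid \mathbf{u}}}{M[\mathbf{u}] + \alpha_{\mathbf{u}}}, \qquad \alpha_{\mathbf{u}} = \sum_x \alpha_{x \mid \mathbf{u}},

on the right, where the alpha's are the Dirichlet hyperparameters.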
And in order to carry out this process, we need a choice of prior, and we showed how that can be effectively elicited using both a prior distribution, specified, say, as a Bayesian network, as well as an equivalent sample size.
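As a minimal sketch of how these pieces fit together (the function names, the NumPy array layout, and the uniform-prior construction are my own illustration, not code from the lecture), here is the Bayesian estimate for a single CPD when the prior network is uniform and the prior strength is an equivalent sample size alpha:

```python
import numpy as np

def mle_cpd(counts):
    """Maximum likelihood estimate: normalize the counts M[x, u] within each parent assignment u."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)

def bayesian_cpd_uniform_prior(counts, alpha):
    """
    Bayesian (posterior-mean) estimate of P(X | U) for a multinomial CPD.

    counts : array of shape (num_u, num_x), where counts[u, x] = M[x, u]
    alpha  : equivalent sample size; a uniform prior network assigns every
             joint assignment (x, u) the hyperparameter
             alpha_{x|u} = alpha / (num_u * num_x).
    """
    counts = np.asarray(counts, dtype=float)
    num_u, num_x = counts.shape
    alpha_xu = alpha / (num_u * num_x)        # per-cell pseudocount alpha_{x|u}
    alpha_u = alpha_xu * num_x                # pseudocount total per parent assignment
    return (counts + alpha_xu) / (counts.sum(axis=1, keepdims=True) + alpha_u)

# Hypothetical example: binary X with a binary parent U
M = np.array([[8, 2],    # U = 0: M[X=0, U=0] = 8, M[X=1, U=0] = 2
              [0, 0]])   # U = 1: never observed in the data
print(mle_cpd(M))                         # second row is 0/0, i.e. undefined (nan)
print(bayesian_cpd_uniform_prior(M, 5))   # second row falls back to the uniform prior
```

With alpha = 5, the unobserved parent assignment defaults to a uniform distribution while the well-observed one stays close to the empirical frequencies; increasing alpha pulls every estimate more strongly toward the uniform prior, which is the trade-off visible in the plot above.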