We talked about how the problem of learning a general Bayes net structure can be viewed as a combinatorial optimization problem in which we are trying to find a high-scoring network structure, and how that problem is usually solved by some kind of heuristic search over the space of Bayes net structures. Let's talk now about the computational cost of this algorithm, and how we can use the property of decomposability, which we also previously used in the context of tree learning, to considerably reduce that cost. As a reminder, the heuristic search procedure that we discussed iteratively moves over the space of networks. If this is the network that we're currently at, we need to evaluate several different moves away from that network, which are usually local moves that add, delete, or reverse an arc. We then score each of these successor networks, see which has the highest score, and take that move. One can also have other algorithms that don't necessarily make the greedy move at each point, but the basic idea is the same. So what is the computational cost of this algorithm? Let's do the naive analysis: what we would get from a naive implementation. First, how many operators are there in each search step that we need to evaluate? In a Bayesian network with n nodes, we have n(n - 1) different possible directed edges. Each of those edges is either present in the graph or not. If the edge is present, we can either delete it or reverse it; if it's absent, we can add it. That means we effectively have either two possibilities or one for each of those n(n - 1) possible edges, and so we have O(n^2) possible operators that we need to evaluate at each step of the algorithm.
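To make the operator count concrete, here is a minimal sketch in Python. The graph representation (a set of directed edge tuples) and the helper name `candidate_operators` are assumptions of this sketch, not part of the course material, and acyclicity of the result is not checked here.

```python
from itertools import permutations

def candidate_operators(edges, n):
    """Enumerate the O(n^2) local moves from the current graph.

    edges: set of directed edges (i, j); n: number of nodes.
    Whether the resulting graph is acyclic is NOT checked here.
    """
    ops = []
    for i, j in permutations(range(n), 2):  # all n(n - 1) ordered pairs
        if (i, j) in edges:
            ops.append(("delete", i, j))    # present edge: two moves
            ops.append(("reverse", i, j))
        elif (j, i) not in edges:
            ops.append(("add", i, j))       # absent edge: one move
    return ops
```

For a 3-node graph with the single edge 0 → 1, this yields six candidate moves: delete or reverse that edge, plus four possible additions.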
What is the cost of evaluating each of the candidate successors that we would get by taking one of these operators? Remember that, because of the decomposability property, the score has multiple components, one for each variable in the network. So we have n different components, and for each of those we have to look at the sufficient statistics and compute the resulting entry in the score corresponding to that variable. Computing sufficient statistics requires a traversal over the training data, and so that takes O(M) time, where M is the number of training instances. Altogether, this step requires O(nM) time. Now, we haven't talked about this, but one also needs to make sure that the graph resulting from the operator is in fact acyclic, and so we need to do an acyclicity check. In general, that requires O(m) time, where little m is the number of edges in the graph. So altogether, if we sum up all of these pieces of the cost, we end up with a computational cost of O(n^2 (Mn + m)) per search step, where by and large the little m term is dwarfed by Mn. Now, even for a network that isn't that large, say one with 50 to 100 variables, so that n is 50 to 100, and with 10,000 training instances, this can get really, really large and is generally intractable in a lot of situations. So how can we improve on this? Let's see how we can exploit the decomposability property to get significant computational savings in the search process. Let's first look at a single local operator, such as the one that adds an edge from B to D in this network, where such an edge did not exist before. And let's consider the score of the original network, which, because of the decomposability property, is the sum of the family scores for the individual variables.
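To get a feel for the sizes involved, here is the order-of-magnitude arithmetic for the numbers mentioned above (n = 100 variables, M = 10,000 instances); the sparse-graph assumption m ≈ n and the neglect of constants are assumptions of this back-of-the-envelope sketch.

```python
n, M = 100, 10_000
m = n                          # edge count; assume a sparse graph, m ≈ n
ops = n * (n - 1)              # O(n^2) operators per search step
per_op = n * M + m             # O(nM) scoring plus O(m) acyclicity check
naive_per_step = ops * per_op
print(naive_per_step)          # roughly 10^10 basic operations per step
```

Roughly ten billion basic operations for a single search step, before the search has taken even one move, is why the naive implementation is impractical.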
So we have a component for the score of A relative to its empty parent set, the score of B relative to the empty set, the score of C relative to its parents A and B, and the score of D relative to its parent C. Let's compare that to the score of the network following the move. We can see that the score of the new network is actually very similar to the score of the original network because, by the same decomposability property, we can break up the score into these little components, and since most families haven't changed, those components of the score are going to be the same. Specifically, we're going to have exactly the same score for A relative to its empty parent set, the same score for B, the same score for C; only this last score, for D, is going to be different, because we've modified the family of D. That immediately suggests that we don't really need to recompute the earlier components of the score, because they haven't changed; we only need to compute the last component, the one corresponding to D. In fact, we can do better than that. Instead of considering the absolute score, we're going to compute what's called the delta score, which is the difference between the score of the network following the change and the score of the original network. That difference, as we can see, is just the difference between the scores of the two families for D: the family {B, C} in the new network versus the family {C} in the original network. And that delta score is something we can compute by looking at a single family, ignoring the rest of the network. The same thing happens for other local operators.
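The single-family delta score can be written in a couple of lines. Here `family_score` is a hypothetical stand-in for whatever decomposable score is being used (BIC, BDe, and so on); the toy score below is invented purely so the example runs.

```python
def delta_score(family_score, var, old_parents, new_parents):
    """Delta score of a move that changes only the family of `var`:
    score of the new family minus score of the old family."""
    return family_score(var, new_parents) - family_score(var, old_parents)

# Toy decomposable score (illustration only): rewards parents but
# penalizes large parent sets, vaguely like a penalized likelihood.
def toy_family_score(var, parents):
    return 2 * len(parents) - len(parents) ** 2 / 4

# Adding the edge B -> D grows D's family from {C} to {B, C}.
print(delta_score(toy_family_score, "D", {"C"}, {"B", "C"}))
```

The point is that evaluating the move never touches the families of A, B, or C at all.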
For example, if we consider the deletion operator that deletes the edge from B to C and we look at the delta score, that delta score only involves the family of C, because that's the only family to have changed. It's going to be, again, a difference: the score of C with the single parent A minus the score of C with the parent set {A, B}. So again, only one family is affected by this change. For an edge reversal, the situation is a little more complicated, because by flipping the edge from B to C and making it go from C to B, there are actually two families that are affected: the family of B and the family of C. But that's still just two, as opposed to the entire network, and so the delta score is going to have two components: the delta score for the C family and the delta score for the B family. In either case, we only end up affecting one family, in the case of edge addition or edge deletion, or at most two families, in the case of edge reversal. And that means we only have to consider a much smaller fraction of the network when computing the delta score. A second use of the decomposability property comes up when we consider multiple consecutive steps of the search algorithm. Let's imagine that in the previous step we decided to take the step that added the edge from B to D, and now we have to consider the next step in the search. What are the operators available in this next step of the search? For example, one such operator is to delete the edge from B to C. And notice that that edge-deletion operator is in fact an operator that we also considered in the previous step of the search, before we decided to add the edge from B to D.
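Because a reversal touches two families, its delta score is a sum of two single-family deltas. As before, `family_score` is a hypothetical scoring callback, and `parents` maps each variable to its current parent set; both are assumptions of this sketch.

```python
def reversal_delta(family_score, parents, x, y):
    """Delta score of reversing the edge x -> y: y loses parent x,
    x gains parent y, and no other family changes."""
    old = family_score(y, parents[y]) + family_score(x, parents[x])
    new = (family_score(y, parents[y] - {x})
           + family_score(x, parents[x] | {y}))
    return new - old

# Toy nonlinear score so the two family deltas don't cancel out.
score = lambda var, ps: len(ps) ** 2
parents = {"B": set(), "C": {"A", "B"}}
print(reversal_delta(score, parents, "B", "C"))
```

With this toy score, reversing B → C gives (1 + 1) - (4 + 0) = -2, computed without ever looking at any family other than B's and C's.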
Now, is this operator the same as before? Notice that the family of C hasn't changed between those two cases; in both of them, when we're considering the move, C has the parents A and B. And so the delta score for that particular move is not going to change either. Specifically, if in this case the delta score is the score of C given the family {A} minus the score of C given the family {A, B}, we had exactly the same delta score in the previous step. These two delta scores, in the previous step and the new step, are exactly the same, so there's really no point in recomputing it. So which scores do we need to recompute? The only delta scores that we need to recompute are the ones that were affected by the step that we just took. Specifically, if we took a step that modified the family of D, then any move that involves an additional change to the family of D will have a different delta score, because the family used in the comparison is now different. However, delta scores for families that were not affected by the move don't need to be recomputed. So to summarize: we only need to recompute delta scores for families that changed in the last move. Now let's summarize the computational cost of this procedure. What's the cost per move? Having decided to take a move, there are only one or two families that were affected by that move, which means that only O(n) delta scores need to be recomputed, because for a given family there are only n possible edges that impinge on that family. So only O(n) delta scores need to be computed, and for each of those we need to compute sufficient statistics, which takes O(M) time. Altogether, we have O(nM) as the cost of this step, a factor of n^2 better than the O(n^3 M) of the original naive computation. Now, this tells us the cost after we pick a move. What about the cost of picking a move?
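The "only rescore what the last move changed" rule can be expressed as a simple filter over the operator list. The tuple encoding of operators used here is an assumption carried over from the earlier sketches.

```python
def needs_rescoring(op, changed_vars):
    """True if the move `op` touches a family that the last step modified.

    Adding or deleting i -> j changes only j's family; reversing
    i -> j changes both i's and j's families.
    """
    kind, i, j = op
    touched = {i, j} if kind == "reverse" else {j}
    return bool(touched & changed_vars)

ops = [("add", "B", "C"), ("delete", "C", "D"), ("reverse", "A", "B")]
# The last move modified only D's family, so only one delta is stale:
print([op for op in ops if needs_rescoring(op, {"D"})])
```

Only the delete C → D move needs rescoring; the cached delta scores for the other two operators carry over unchanged.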
Now, naively, we might say that there are O(n^2) possible operators that we can consider in any given move, and so we need to evaluate each of them and pick the best one. But in fact, we can do considerably better by the use of clever data structures. Specifically, we can maintain a priority queue of operators sorted by their delta scores. When we rescore the O(n) operators affected by the last step, we need to modify the scores of those operators and reinsert them into the priority queue in their appropriate place. That's a computation that requires O(n log n) time, because there are only O(n) of them and each queue operation costs O(log n). Once we have done that, the best operator will be at the top of the queue, where we can identify it and take it in constant time. So this priority queue takes us from O(n^2) time for traversing the set of operators to something that requires only O(n log n), and altogether we've saved a considerable cost in both of these steps. It turns out that one can get an even higher level of computational efficiency based on a different observation. It's a fact that in most network learning problems, the plausible families, the ones that have some chance of being selected, are variations on a theme. That is, for a given variable A, there might be some number of variables, B1, B2, up to Bk for a very small k, that are reasonable candidates to be selected as parents for A. How do we exploit that property to get computational savings? It turns out there are two different ways one can utilize it. The first is that, because the same sets of variables are constantly being considered as candidate families, we can take sufficient statistics that we computed in one step and reuse them in a later step if we cache them, because we're likely to encounter the same family more than once.
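A standard way to realize such a priority queue in Python is a binary heap with lazy invalidation: rescoring pushes a fresh entry and leaves the stale one in the heap, and stale entries are skipped at pop time. This is a generic pattern, a sketch rather than the course's specific data structure.

```python
import heapq

class OperatorQueue:
    """Max-priority queue of operators keyed by delta score."""

    def __init__(self):
        self._heap = []    # entries: (-delta, op), so best score pops first
        self._score = {}   # op -> its latest delta score

    def update(self, op, delta):
        """Insert or rescore an operator in O(log n) time; the old
        heap entry, if any, becomes stale rather than being removed."""
        self._score[op] = delta
        heapq.heappush(self._heap, (-delta, op))

    def pop_best(self):
        """Return (op, delta) with the highest delta, skipping stale
        entries, or None if the queue is empty."""
        while self._heap:
            neg, op = heapq.heappop(self._heap)
            if self._score.get(op) == -neg:   # entry is current
                del self._score[op]
                return op, -neg
        return None
```

Rescoring an operator simply calls `update` again; the outdated heap entry is discarded the first time it reaches the top.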
We might encounter B as a possible parent of A and A as a possible parent of B, and for both of these, the same sufficient statistics over A and B are going to be utilized. So if we compute those sufficient statistics once and then cache them, we don't have to recompute them again. That turns out to make a huge difference in the computational cost of this algorithm. The second way in which this observation can be used is that if we can identify in advance the set B1 up to Bk of reasonable parents to consider for A, we can restrict our attention to that set in advance and not consider other parents at all, which reduces our complexity from O(n) to O(k), where k is some bound on the number of plausible parents that we're willing to consider. Now, this is a heuristic, in the sense that this restriction can actually change the outcome of our algorithm; it's not just a way of reducing the cost. But it turns out to be a very good approximation in practice. To summarize: even a fairly simple heuristic structure search, such as greedy hill climbing, can get expensive when n is large, because the naive implementation has a cubic dependence on n. But we can exploit the decomposability property, which we also exploited in the context of tree learning, to get several orders of magnitude of reduction in the cost of the search. We can also exploit the recurrence of plausible families at multiple steps in the search algorithm to get further computational savings, and we can restrict the search space to get better computational properties.
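The sufficient-statistics reuse works because the cache is keyed on the unordered set of variables, so the counts for {A, B} serve both "B as a parent of A" and "A as a parent of B". A minimal sketch, assuming the data is a list of tuples and variables are column indices:

```python
from collections import Counter

def make_suffstats_cache(data):
    """Return a counts(variables) function that caches joint counts,
    keyed by the *unordered* set of variables."""
    cache = {}

    def counts(variables):
        key = frozenset(variables)
        if key not in cache:                   # one O(M) data pass per new set
            cols = sorted(key)
            cache[key] = Counter(tuple(row[c] for c in cols) for row in data)
        return cache[key]

    return counts

data = [(0, 1), (0, 1), (1, 0)]
counts = make_suffstats_cache(data)
print(counts([0, 1]) is counts([1, 0]))   # True: the same cached statistics
```

The second call never touches the data at all; it just returns the Counter computed by the first call.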