So the goal of this video is to prove the correctness of Kosaraju's two-pass, depth-first-search-based, linear-time algorithm that computes the strongly connected components of a directed graph. I've given you the full specification of the algorithm, and I've also given you a plausibility argument for why it might work, in that it at least does something sensible on an example. Namely, it first does a pass of depth-first search on the reversed graph and computes a magical ordering. What's so special about this ordering is that when we then do a depth-first search using this ordering on the forward graph, it seems to do exactly what we want: every invocation of depth-first search on some new node discovers exactly the nodes of one strong component and no extra stuff. Remember our first observation: it was unclear whether depth-first search would be useful at all for computing strong components. If you call depth-first search from just the right place, you're going to get exactly the nodes of an SCC and nothing more. If you call it from the wrong place, you might get all of the nodes of the graph and no information at all about the structure of the strong components. And at least in this example, the first pass with its finishing times seems to be leading us to invoke DFS from exactly the right places.
So remember how this worked in the example. In the top graph, I've shown you the graph with the arcs reversed. This is where we first invoked DFS-Loop, with the outer loop over the nodes going from the highest node name, nine, all the way down to the node name one. Here we compute finishing times; that's the bookkeeping we do in the first pass. We just keep a running count of how many nodes we've finished processing, that is, how many nodes we've both explored and for which we've also explored all of the outgoing arcs. That gives us these numbers in red, the finishing times between one and nine for the various nodes. Those became the new node names in the second graph; then we reversed the arcs again to get the original graph back, and we saw that every time we invoked DFS in our second pass, we uncovered exactly the nodes of one SCC. When we invoked it from the node 9, we discovered 9, 8, and 7; those have the leader vertex 9. When we next invoked DFS from 6, we discovered 6, 5, and 1, and nothing else. And finally we invoked it from 4, and we discovered 2, 3, and 4, and nothing else. Those are exactly the three SCCs of this graph.
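Before diving into the proof, here is a minimal sketch of the two passes in Python, just to make the bookkeeping concrete. This is my illustration rather than the lecture's official pseudocode: it assumes the graph is a dict mapping each node to a list of out-neighbors, it uses an iterative DFS in place of the recursive one, and the names dfs_finish_order, dfs_collect, and kosaraju are invented for this sketch.

```python
from collections import defaultdict

def dfs_finish_order(g, order):
    """Run iterative DFS from each not-yet-visited node in `order`;
    return all nodes in the order they FINISH, i.e., once the node
    and all of its outgoing arcs have been fully explored."""
    visited, finished = set(), []
    for s in order:
        if s in visited:
            continue
        visited.add(s)
        stack = [(s, iter(g.get(s, ())))]
        while stack:
            u, it = stack[-1]
            child = next((v for v in it if v not in visited), None)
            if child is None:
                stack.pop()
                finished.append(u)  # u's finishing time is len(finished)
            else:
                visited.add(child)
                stack.append((child, iter(g.get(child, ()))))
    return finished

def dfs_collect(g, s, seen):
    """Yield every node newly reachable from s, marking it in `seen`."""
    seen.add(s)
    stack = [s]
    while stack:
        u = stack.pop()
        yield u
        for v in g.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)

def kosaraju(graph):
    """Two-pass sketch. `graph` maps each node to a list of out-neighbors.
    Returns a dict node -> leader; two nodes share a leader iff they
    belong to the same SCC."""
    nodes = set(graph) | {v for vs in graph.values() for v in vs}
    rev = defaultdict(list)  # reverse every arc for the first pass
    for u, vs in graph.items():
        for v in vs:
            rev[v].append(u)
    # Pass 1: finishing order on the reversed graph (the lecture scans
    # node names from n down to 1; any fixed order gives correct results).
    order = dfs_finish_order(rev, sorted(nodes, reverse=True))
    # Pass 2: forward graph, nodes in decreasing finishing time; each
    # fresh DFS call uncovers exactly one SCC, labeled by its start node.
    leader, seen = {}, set()
    for s in reversed(order):
        if s not in seen:
            for v in dfs_collect(graph, s, seen):
                leader[v] = s
    return leader
```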
So let's now understand why this works in any directed graph, not just in this one example. Let's begin with a simple observation about directed graphs, which is actually interesting in its own right. The claim is that every directed graph has two levels of granularity. If you squint, if you sort of zoom out, then what you see is a directed acyclic graph whose nodes comprise the strongly connected components. And if you want, you can zoom in and focus on the fine-grained structure within one SCC. A little more precisely, the claim is that the strongly connected components of a directed graph induce, in a natural way, an acyclic metagraph. So what is this metagraph? What are the nodes, and what are the arcs? Well, the metanodes are just the SCCs: we think of every strongly connected component as being a single node in this metagraph; call them, say, C1 up to Ck. What are the arcs in this metagraph? They're basically just the ones corresponding to the arcs between SCCs in the original graph. That is, we include in the metagraph an arc from the strong component C to C-hat if and only if there's an arc from a node in C to a node in C-hat in the original graph G. So, for example, if this is your C and the triangle is your C-hat, and you have one or maybe multiple edges going from C to C-hat, then in the corresponding metagraph you're just going to have a node for C, a node for C-hat, and a directed arc from C to C-hat. If we go back to one of the directed graphs we've used as a running example, from the beginning of the previous video, which looked something like this, the corresponding directed acyclic graph has four nodes and four arcs. And for the running example we used to illustrate Kosaraju's algorithm, with the three triangles, the corresponding metagraph would just be a path with three nodes.
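To make the construction concrete, here's a hypothetical helper that builds this metagraph from the node-to-leader map produced by the kosaraju sketch above; the representation, a dict from each SCC's leader to the set of successor leaders, is my choice for this sketch, not anything from the lecture.

```python
def metagraph(graph, leader):
    """Condense `graph` into its metagraph: one metanode per SCC (named
    by its leader), with an arc C -> C_hat iff some arc of `graph` goes
    from a node of C to a node of C_hat."""
    meta = {l: set() for l in set(leader.values())}
    for u, vs in graph.items():
        for v in vs:
            if leader[u] != leader[v]:          # skip arcs inside one SCC
                meta[leader[u]].add(leader[v])  # parallel arcs collapse
    return meta
```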
So why is this metagraph guaranteed to be acyclic? Well, remember that metanodes correspond to strong components, and in a strong component you can get from anywhere to anywhere else. Also remember that on a directed cycle you can likewise get from anywhere to anywhere else. So if you had a cycle involving two different metanodes, that is, two supposedly distinct strongly connected components that you could get between in both directions, they would collapse into a single SCC: you can get from anywhere to anywhere within one, from anywhere to anywhere within the other, and you can also go between them at will, so you can get from anywhere in the union to anywhere in the union.
Not just in this context of computing strong components, but also more generally, this is a useful fact to know about directed graphs. On the one hand, they can have very complex structure within a strong component: you have paths going from everywhere to everywhere else, and it may look quite complicated. But at a higher level, if you abstract out to the level of SCCs, you are guaranteed to have this simple DAG, this simple directed acyclic graph structure. So, to reinforce these concepts, and also to segue into thinking about Kosaraju's algorithm in particular, let me ask you a question about how reversing arcs affects the strong components of a directed graph. The correct answer to this quiz is the fourth one: the strong components are exactly the same as they were before. In fact, the relation that we described is exactly the same as it was before, and therefore its equivalence classes, the strong components, are exactly the same. If two nodes were related in the original graph, that is, there's a path from U to V and a path from V to U, that's still true after you reverse all the arcs; you just use the reversals of the two paths you had before. Similarly, if two nodes weren't related before, for example because you could not get from U to V, then after you reverse everything you can't get from V to U, so again the relation fails to hold. So the SCCs are exactly the same in the forward and the backward graph. In particular, in Kosaraju's algorithm, the strong component structure is exactly the same in the first pass of DFS and in the second pass of DFS.
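As a mechanical sanity check of the quiz's answer, here's a hypothetical snippet building on the kosaraju sketch above; note the leaders themselves may differ between the two runs, so it compares the resulting partitions rather than the leader maps.

```python
def reverse_graph(graph):
    """Flip every arc of `graph` (node -> list of out-neighbors)."""
    nodes = set(graph) | {v for vs in graph.values() for v in vs}
    rev = {u: [] for u in nodes}
    for u, vs in graph.items():
        for v in vs:
            rev[v].append(u)
    return rev

def scc_partition(graph):
    """The SCCs as a set of frozensets, so two partitions compare directly."""
    groups = {}
    for v, l in kosaraju(graph).items():
        groups.setdefault(l, set()).add(v)
    return {frozenset(s) for s in groups.values()}

# The quiz's answer, checked on any example graph g:
# assert scc_partition(g) == scc_partition(reverse_graph(g))
```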
So now that we understand how every directed graph has a metagraph, with the metanodes corresponding to the strongly connected components, and with an arc from one SCC to another if there's an arc from any node of the first SCC to any node of the second in the original graph, I'm in a position to state the key lemma that drives the correctness of Kosaraju's two-pass algorithm for computing the strongly connected components of a directed graph. Here's the lemma statement. It considers two strongly connected components that are adjacent, in the sense that there's an arc from a node in one of them to a node in the other. So let's say we have one SCC, C1, with a node i, and another SCC, C2, with a node j, and that in the graph G there's an arc directly from i to j. In this sense we say that these SCCs are adjacent, with the second one being, in some sense, after the first one. Now let's suppose we've already run the first pass, the DFS-Loop subroutine, and remember that works on the reversed graph. So we've invoked it on the reversed graph and computed the finishing times; as usual, we'll let f(v) denote the finishing times computed by that depth-first search subroutine on the reversed graph. The lemma then asserts the following: amongst all the nodes in C1, look at the one with the largest finishing time; similarly, amongst all the nodes in C2, look at the one with the largest finishing time. The claim is that the larger of these two finishing times will be in C2, not in C1.
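In symbols, with f(v) denoting the finishing time of v from the first pass on the reversed graph, the lemma says:

```latex
\text{If } C_1, C_2 \text{ are distinct SCCs of } G \text{ with an arc } (i,j),\ i \in C_1,\ j \in C_2,
\text{ then } \max_{v \in C_1} f(v) \;<\; \max_{v \in C_2} f(v).
```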
So what I want to do next is temporarily assume that this lemma is true and explore the consequences of that assumption. In particular, what I want to show you is that if this lemma holds, then we can complete the proof of correctness of Kosaraju's two-pass SCC algorithm. I'll give you the argument for why we're done, for why we just peel off the SCCs one at a time with the second pass of depth-first search. Now, of course, a proof with a hole in it isn't a proof, so at the end of the lecture I'm going to fill in the hole; that is, I'm going to supply a proof of this key lemma. But for now, as a working hypothesis, let's assume that it's true. Let's begin with a corollary, that is, a statement which follows essentially immediately from the statement of the lemma. For the corollary, let's forget about the maximum finishing time within a single SCC and instead think about the maximum finishing time in the entire graph. Why do we care about that? Well, notice that's exactly where the second pass of DFS is going to begin, since it processes nodes in order from largest finishing time to smallest. So, equivalently, let's think about the node at which the second pass of depth-first search is going to begin, i.e., the node with the maximum finishing time. Where could it be?
Well, the corollary is that it has to be in what I'm going to call a sink SCC, that is, a strongly connected component without any outgoing arcs. For example, let's go back to the metagraph of SCCs for the very first directed graph we looked at when we started talking about this algorithm. You'll recall there were four SCCs: a C1, a C2, a C3, and a C4. Of course, within each of these components there could be multiple nodes, but they're all strongly connected to each other. Now, let's use F1, F2, F3, and F4 to denote the maximum finishing time in each of these SCCs. We now have four different opportunities to apply the lemma, one for each of the four pairs of adjacent SCCs. And what do we find? Comparing F1 and F2: because C2 comes after C1, that is, there's an arc from C1 to C2, the max finishing time in C2 has to be bigger than that in C1; that is, F2 is bigger than F1. By the same reasoning, F3 has to be bigger than F1. Symmetrically, we can apply the lemma to the pairs C2, C4 and C3, C4, and we get that F4 has to dominate both F2 and F3. Notice we actually have no idea whether F2 or F3 is bigger; that pair we can't resolve. But we do know these relationships: F1 is the smallest and F4 is the biggest.
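Collecting the four applications of the lemma in one place:

```latex
F_2 > F_1, \qquad F_3 > F_1, \qquad F_4 > F_2, \qquad F_4 > F_3,
```

so F4 is the overall maximum, and the node with the largest finishing time in the whole graph lies in C4.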
You'll also notice that C4 is a sink SCC, a component with no outgoing arcs, and if you think about it, that's a totally general consequence of this lemma. A simple proof by contradiction goes as follows. Consider the SCC with the maximum F value, and suppose it were not a sink SCC, so that it has an outgoing arc. Follow that outgoing arc to some other SCC; by the lemma, the SCC you've gotten to has an even bigger maximum finishing time. That contradicts the fact that you started in the SCC with the maximum finishing time. So, just like in this cartoon, where the unique sink SCC has to contain the largest finishing time, that's totally general. As another sanity check, we can return to the nine-node graph on which we actually ran Kosaraju's algorithm. Looking at the forward version of the graph, the one on the bottom, we see that the maximum finishing times in the three SCCs are 4, 6, and 9. It turns out these are the same as the leader nodes, which is not an accident if you think about it for a little while. And again you'll observe that the maximum finishing time in this graph, namely 9, is indeed in the leftmost SCC, which is the only SCC with no outgoing arcs. But it's totally general: basically, you can keep following arcs and you keep seeing bigger and bigger finishing times, so the biggest one of all has to be somewhere you get stuck, where you can't go forward because there are no outgoing arcs, and that's what I'm calling a sink SCC.
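In terms of the hypothetical metagraph helper from earlier, the sink SCCs are simply the metanodes with no outgoing arcs:

```python
def sink_sccs(meta):
    """Metanodes (SCC leaders) with no outgoing arcs in the metagraph."""
    return [c for c, outs in meta.items() if not outs]
```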
So, assuming the lemma is true, we know the corollary is true. Now, using this corollary, let's finish the proof of correctness of Kosaraju's algorithm, modulo the proof of the key lemma. I'm not going to do this super rigorously, although everything I say is correct and can be made rigorous; if you want a more rigorous version, I'll post some notes on the course website which you can consult for more details. What the previous corollary accomplished is that it allows us to locate the node with the maximum finishing time: it's somewhere in some sink SCC. Let me remind you of the discussion we had at the very beginning of talking about computing strong components. We were trying to understand whether depth-first search would be a useful workhorse for finding the strong components, and the key observation was that it depends on where you begin that depth-first search. For example, in this graph with four SCCs shown in blue on the right, a really bad place to start depth-first search would be somewhere in C1, somewhere in this source SCC; that would be a bad DFS.
Why is it bad? Remember what depth-first search does: it finds everything findable from its starting point. From C1 you can get to the entire world; you can reach all the nodes in the entire graph, so you discover everything. And that's totally useless, because we wanted to discover much finer-grained structure: we wanted to discover C1, C2, C3, and C4 individually. So it would be a disaster if we invoked depth-first search from somewhere in C1. Fortunately, that's not what's going to happen. We computed this magical ordering in the first pass precisely so that we first look at the node with the maximum finishing time, and, by the corollary, the maximum finishing time is going to be somewhere in C4. That's going to be a good DFS, in the sense that when we start exploring from anywhere in C4, there are no outgoing arcs. So of course we're going to find everything in C4; everything in C4 is strongly connected to everything else in it. But we can't get out: we don't have the option of trespassing on other strong components, so we're not going to find them. We're only going to find C4, nothing more.
Now, here's where I'm going to be a little informal, although, again, everything I'm going to say is correct. So what happens once we've discovered everything in C4? Well, all the nodes in C4 get marked as explored as we do depth-first search, and then they're basically dead to us: the rest of our depth-first search loop will never explore them again. They're already marked as explored; if we ever see them, we don't even go there. So the way to think about it is that when we proceed with the rest of our for loop in DFS-Loop, it's as if we're starting afresh, doing depth-first search from scratch on a smaller graph, the residual graph: the graph G with the newly discovered strong component Ci deleted. In this example on the right, all of the nodes in C4 are dead to us, and it's as if we run DFS anew on the graph containing just the strong components C1, C2, and C3. So in particular, where is the next invocation of depth-first search going to come from? It's going to come from some sink SCC of the residual graph: it's going to start at the remaining node that has the largest finishing time. There's some ambiguity in this picture; recall we don't know whether F2 or F3 is bigger, and it could be either one. Maybe F2 is the largest remaining finishing time, in which case the next DFS invocation is going to begin somewhere in C2. Again, the only arcs outgoing from C2 lead to already explored nodes; those are effectively deleted, and we're not going to go there again. So C2 is essentially a sink SCC of the residual graph: we newly discover the nodes in C2 and nothing else, and those are now effectively deleted in turn. The next invocation of DFS will then come from somewhere in C3, the only remaining sink SCC of the residual graph, so the third call to DFS will discover exactly its nodes. And now, of course, we're left only with C1, and the final invocation of DFS will start from and discover the nodes of C1. In this sense, because we ordered the nodes by finishing times when we ran DFS on the reversed graph, that ordering has this incredible property: when you process the nodes in the second pass, you just peel off the strongly connected components one at a time. If you think about it, they come off in reverse topological order with respect to the directed acyclic graph of the strongly connected components.
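To watch the peeling happen, here's a hypothetical graph with the same shape as the lecture's three-triangle example (the labels and edges are mine, purely for illustration), run through the kosaraju sketch from earlier:

```python
# Three triangles in a path: SCC {1,2,3} -> SCC {4,5,6} -> SCC {7,8,9}.
g = {
    1: [2], 2: [3], 3: [1, 4],
    4: [5], 5: [6], 6: [4, 7],
    7: [8], 8: [9], 9: [7],
}
leader = kosaraju(g)
sccs = {}
for v, l in leader.items():
    sccs.setdefault(l, set()).add(v)
print(sorted(sccs.values(), key=min))  # [{1, 2, 3}, {4, 5, 6}, {7, 8, 9}]
```

The second pass discovers the sink SCC {7, 8, 9} first, then {4, 5, 6}, then {1, 2, 3}: reverse topological order, as claimed.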
So we've constructed a proof of correctness of Kosaraju's algorithm for computing strongly connected components. But again, there's a hole in it: we completed the argument assuming a statement that we haven't proved. So let's fill in that last gap in the proof, and we'll be done. What we need to do is prove the key lemma. Let me remind you what it says: if you have two adjacent SCCs, C1 and C2, with an arc from a node in C1, call it i, to a node in C2, say j, then the max finishing time in C2 is bigger than the max finishing time in C1, where, as always, these finishing times are computed in the first pass of the depth-first search loop on the reversed graph.
Since the finishing times are computed in the reversed graph, let's actually reverse all the arcs and reason about what's happening there. We still have C1, and it still contains the node i. We still have C2, which still contains the node j. But now, of course, the orientation of the arc is reversed: the arc now points from j to i. Recall we had a quiz which asked you to understand the effect of reversing all arcs on the SCCs, and in particular there is no effect: the SCCs in the reversed graph are exactly the same as in the forward graph. Now, we're going to have two cases in this proof, and the cases correspond to where we first encounter a node of C1 union C2. Remember, when we do this DFS-Loop, this first pass, because we have the outer for loop that iterates over all of the nodes, we're guaranteed to explore every single node of the graph at some point; in particular, we're going to explore at some point every node in C1 and C2. What I want you to do is pause the algorithm when it, for the first time, explores some node that's in either C1 or C2. There are going to be two cases, of course: that node might be in C1, or it might be in C2. So case one is going to be when the first node that we see from either component happens to lie in C1, and case two is when the first node v that we see happens to lie in C2. Clearly exactly one of these will occur.
So let's think about case one, where we see a node of C1 before we see any node of C2. In this case, the claim is that we're going to explore everything in C1 before we ever see anything in C2. Why is that true? The reason is that there cannot be a path that starts somewhere in C1, for example at the vertex v, and reaches C2. This is where we use the fact that the metagraph on the SCCs is acyclic. C1 is strongly connected and C2 is strongly connected; you can get from C2 to C1, and if you could also get from C1 back to C2, this would all collapse into a single strongly connected component. But that would be a contradiction: we're assuming C1 and C2 are distinct strongly connected components, therefore you can't have paths in both directions. We already have a path from right to left, via the arc from j to i, so there's no path from left to right. That's why, if you originate a depth-first search somewhere inside C1, like at this vertex v, you will finish exploring all of C1 before you ever see C2; you'll only see C2 at some later point in the outer for loop. So what's the consequence of completely finishing with C1 before you ever see C2? It means every single finishing time in C1 is going to be smaller than every single finishing time in C2. That's even stronger than what we're claiming: we're just claiming that the biggest finishing time in C2 is bigger than the biggest in C1, but actually the finishing times in C2 totally dominate those in C1, because you finish C1 before you ever see C2.
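In symbols, case one gives the stronger conclusion

```latex
f(u) < f(w) \qquad \text{for every } u \in C_1 \text{ and every } w \in C_2,
```

which in particular implies that the maximum finishing time in C2 exceeds the maximum in C1.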
Let's now have a look at case one in action. Return to the nine-node graph, the one on which we actually ran Kosaraju's algorithm to completion. If we go back to this graph, which has three strongly connected components, remember that the bottom version is the forward version and the top version is the reversed version. If you think of the middle SCC as playing the role of C1 and the leftmost SCC as playing the role of C2, then what we have is exactly case one of the key lemma. So which was the first of these six vertices visited during the DFS-Loop on the reversed graph? That would just be the node with the highest name, the node nine. So this was the first of these six vertices that depth-first search looked at in the first pass, and it lies in what we're calling C1. And indeed, everything in C1 was discovered in that pass before anything in C2, and that's why all of the finishing times in C2, namely 7, 8, and 9, are bigger than all of the finishing times in C1, namely 1, 5, and 6. So in case one we've proven the lemma, for the case where, amongst the vertices in C1 and C2, depth-first search in the first pass sees something from C1 first.
Now let's look at the other case, this grey case, which is also totally possible: the first thing we see when depth-first searching in the first pass is something from C2. And here is where we truly use the fact that we're running depth-first search rather than some other graph search algorithm like breadth-first search. There are a lot of places in this algorithm where you could swap in breadth-first search, but in case two you'll see why it's important that we're using depth-first search to compute the finishing times. What's the key point? The key point is that when we invoke depth-first search beginning from this node v, which we're now assuming lies in C2, depth-first search will not complete, we won't be done with v, until we've found everything there is to find from it. We recursively explore all of its outgoing arcs, those nodes recursively explore all of their outgoing arcs, and so on. It's only when all paths going out of v have been totally explored and exhausted that we finally backtrack all the way to v and consider ourselves done with it. That is, depth-first search in the reversed graph initiated at v won't finish until everything findable from v has been completely explored. Now, everything in C2 is obviously findable from v, since C2 is strongly connected; we can get from C2 to C1 using the arc from j to i, and, C1 being strongly connected, we can then find all of it. Maybe we can find other strongly connected components as well, but for sure depth-first search starting from v will find everything in C1 and C2, and maybe some other things. And we won't finish with v until we've finished with everything else; that's the depth-first search property. For that reason, the finishing time of this vertex v will be larger than that of anything reachable from it. In particular, it'll be larger than everything in C2, but, more to the point, it'll be larger than everything in C1, which is exactly what we're trying to prove.
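In symbols, the depth-first search property being used is

```latex
f(v) > f(w) \qquad \text{for every } w \neq v \text{ reachable from } v \text{ in } G^{\mathrm{rev}},
```

and every vertex of C1 and C2 is reachable from v: all of C2 by strong connectivity, then C1 via the arc (j, i) and C1's own strong connectivity.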
Again, let's quickly see this in action in the nine-node network on which we traced through Kosaraju's algorithm. To show the role that case two plays in this concrete example, let's think of the rightmost strongly connected component as being C1 and the middle SCC as being C2. Note that last time we called the middle one C1 and the leftmost one C2; now we're calling the rightmost one C1 and the middle one C2. So again we ask: of the six nodes in C1 and C2, which is the first one encountered in the depth-first search that we do in the first pass? The answer is again the node nine, the node originally labeled nine. It's the same node that was relevant in the previous case, but now, with this relabeling of the components, nine lies in the strongly connected component C2, not in the one labeled C1; that's the reason we're now in case two rather than case one. And what finishing time does this originally-labeled-nine node get? It gets the finishing time six. You'll notice six is bigger than the finishing time of any of the other nodes in C1 or C2; the other five nodes have the finishing times one through five. That's exactly because, when we ran depth-first search in the first pass and started it at the node originally labeled nine, it discovered these other five nodes and finished exploring them first, before finally backtracking all the way to nine and deeming nine fully explored. Only at that point did nine get its finishing time, after everything reachable from it had gotten smaller finishing times.
So that wraps it up. We had two cases, depending on whether, in these two adjacent SCCs, the first vertex encountered was in C1 or in C2. Either way, it doesn't matter: the largest finishing time has to be in C2. Sometimes everything in C2 is bigger, sometimes it's just bigger than the biggest in C1, but it's all the same to us. To recap how the rest of the proof goes: we have a corollary based on this lemma, which is that the maximum finishing time has to lie in a sink SCC, and that's exactly where we want our depth-first search to initiate. If you initiate DFS in a strong component with no outgoing arcs, the stuff you find is just the stuff in that strongly connected component; you don't have any avenues by which to trespass on other strong components, so you find exactly one SCC. In effect, you can peel that off and recurse on the rest of the graph. And our slick way of implementing this recursion is to do just a single second DFS pass, where you treat the nodes in decreasing order of finishing times; that, in effect, unveils all of the SCCs in reverse topological ordering. So that's it: Kosaraju's algorithm, and the complete proof of correctness, a blazingly fast graph primitive that, in any directed graph, will tell you its strong components.