Data repositories in which cases are related to subcases are identified as hierarchical. This course covers the representation schemes of hierarchies and algorithms that enable analysis of hierarchical data, as well as provides opportunities to apply several methods of analysis.

Associate Professor at Arizona State University in the School of Computing, Informatics & Decision Systems Engineering and Director of the Center for Accelerating Operational Efficiency School of Computing, Informatics & Decision Systems Engineering

K. Selcuk Candan

Professor of Computer Science and Engineering Director of ASU’s Center for Assured and Scalable Data Engineering (CASCADE)

In this lecture, we will learn several approximate algorithms and several

filtering approaches to quickly

identify this edit distance or value similar to edit distance,

the values approximating edit distance,

or to filter at least,

even if it cannot compute that efficiently,

to try to see very quickly

to say whether two strings can be similar to each other or not.

So we'll try to do that quickly so that we don't pay this order M times N,

if it is computation cost each and every time we have to

compute the similarity between two strings.

So the first algorithm that we will learn is called cross-parsing distance.

So, once again, this is going to be approximate the edit distance.

So it's going to be more efficient to compute,

it's going to be less accurate than edit distance but it's

going to give us an approximation of the edit distance.

So, let's go ahead and define what we mean by cross-parsing.

So, in this case, we are given two strings once again;

there is string P, and there is string Q.

What we are trying to do is we are trying to basically parse the string P,

we are trying to move in string P. As we move in string P,

as we pass string P,

we try to find matches on this relative to string Q,

and we try to do this efficiently.

If we say that if you can't basically parse this string P,

with finding matches with respect to Q, as we move along,

if we find large matches of string Q,

we see that string P is similar to string Q.

So how does this work? We'll take a look at the details of this.

How can we do this efficiently,

and how common is it if we compute this cross parsing?

This will provide the more formal definition.

So let's erase the board.

For this, let's basically consider an example.

Let's assume that our string P,

okay is babcad and say D,

and let's assume that our second string Q is abc and adc.

Note that these two strings are somewhat similar to each other,

but they are not identical.

There's obviously to convert to P to Q,

or to convert Q to P,

we need to do some edit operations.

But we don't want to computate the distance

precisely because we know that any distance computation is expensive.

So what we will do is we will first parse P relative to Q,

and we'll call this cpq.

Then we'll parse string Q relative to string P,

we will this cqp.

Then we'll say that the cost of cross-parsing strings

P and Q relative to each other is essentially the average of these two costs.

That will essentially be how we will approximate the edit distance cost between P and Q.

Of course we still didn't define what cross-passing means.

I will basically do that with an example.

So let's again take a look at our strings.

The cross-parsing of string P relative to Q is essentially

defined as follows: so we start at the beginning of the string P,

and we find the longest possible prefix of P,

this can be possibly MT,

longest possible prefix of P that appears anywhere in Q.

We are finding a prefix of P that appears anywhere in Q. Let's do that.

So, in this case,

we have the string babcadd,

so the first prefix of the first string is b,

and indeed b appears here, that's great.

What about prefix ba? Does it appear anywhere in the second string? No, it does not.

So the longest prefix that we have found so far that appears

anywhere in the string Q is b, b alone.

So what we do, is that we have so far parsed up to position b,

and we restart the parsing from the position b,

and we say cPQ has a one cost so far.

Okay. Now our string is abcadd.

Does a prefix of this occur anywhere in string Q? Let's take a look at that.

Well, if you look at this you will see that abcad,

a prefix of the string,

or Q actually also is a prefix of our string Q here,

but it doesn't matter if it's a prefix or not it appears somewhere in the string Q.

This is great. We have found a long chunk

of string P that matches a long chunk of string Q.

So, what are we going to do?

We will say that, okay, we have found a match up to this point.

So this is going to be our new starting position,

and we have edit one more cost to the cross-parsing of P relative to Q.

Okay. Now, we have d as our new string.

Does d occur anywhere in string Q?

Well, yes it will occurs here.

So, we can cross-parse also d with respect to Q,

and we've found it, and our string ended at that point.

So this gives one more cost.

So cPQ is such as a cross-parsing of P relative Q essentially has trickles.

I had to start the cross-parsing process three times.

Note that my string was length 1 2 3 4 5 6 7,

but I could cross-parse my P only with three restarts.

The reason for that was that a big chunk of P was occurring in Q.

Okay, what about, can we cross-parse Q relative to P?

Let's see that. Can we cross-parse Q relative to P?

Well, in this case let's,

let me just erase this.

Let me just clean this a bit.

Okay. Now I'm starting from scratch,

and I'm cross-parsing Q relative to P. Let's see.

The string here in this case: abcad indeed occurs here in

my string P. So I already found big chunk of prefix of Q already in P. So,

it means that basically at this point I need to come to this location,

and I can say that basically c,

Q and P is one.

But I need to still find the rest of the string Q in P to see whether it exists.

In this case, the remaining string is c,

c indeed exist somewhere in P,

so I cross-parse also c in my P and I'm done.

So, I basically added one more.

So with only two restarts,

restart one restart two,

with only two starting positions,

I parsed my entire string Q relative to P,

so the cost of this is two.

Given this, essentially, we can basically compute the cross-parsing distance P and Q,

essential the average of cross-parsing distances of

P versus Q and Q versus P divide by two,

sum of the two distances divided by two which is the average cross-parsing distance.

In our example, what we'll say is that, well,

three minus one this distance will be two, two minus one,

this basically distance will be one,

and their average is three by two, 1.5.

So I would say that basically these two strings,

the cross-parsing distance is 1.5. Why does this work?

Well, this works because of the following reason.

Let's assume that we have two very similar strings;

abbabcabbabc, these strings are the same.

Obviously, you can see that the cross-parsing distance between P and Q, in this case,

is just one because I start at this position,

I cross-parse P relative to Q,

I find P and Q exactly it is.

So, cPQ is equal to one.

When I cross-parse Q relative to P,

again I don't need to do any restart,

I start at this position and I find my string Q in P exactly.

So, cQP is also equal to one.

The distance between them is defined by minus one for both one of them,

zero, and taking their average;

zero by two which is zero.

So in this case, when the two strings are identical to each other,

this algorithm says that the cross-parsing distance

between the two strings is zero, as you would expect.

What if the two strings are very different from each other?

Say basically one of the strings is abc,

and the other thing is xyz, what happens then?

Well, let's basically see that.

So, this is our string P,

this is our string Q. I start at the beginning

when I'm completing cP relative to Q. I start at the beginning.

I don't find it, empty set.

So I go to next position.

I don't find it, empty.

Next position, I don't find it, empty.

Then I finish, for I had to start at least three times.

What about when I compared Q to xyz?

In this case, it's the same thing.

So, I basically start at this position, I don't find it.

Then I start at the next position, again I don't find it.

Then I start at the next position,

I don't find it, and I finish.

So, see QP is also three,

and the average here will be that I will compute,

both of them will be two and the average will be two.

That is, the cPQ will be,

if the length is N and this length is N,

if they are different from each other,

completely different from each other,

so the cross-parsing distance will be N minus one,

because in both cases,

I have to start for every single position of the string.

So, when the strings are similar,

too identical because cross-parsing distance becomes zero,

when the strings are completely different,

the cross-parsing distance becomes N minus one.

So this is actually the idea.

Cross-parsing distance doesn't tell whether the number of edits are zero,

whether the number of edits are N minus one, it doesn't say that,

but it essentially gives us some sense about whether P is similar to Q or not.

So it can be used as an approximation of the distance measured.

What about the cost of this?

Can this be computed efficiently?

Right? So we're doing it, it turns out,

and we will not learn about this algorithm here,

but it turns out that since it relies on the idea of prefix search,

and as we have learned in the past,

the prefix search can be implemented efficiently using

the using the suffix tree data structure.

There is actually a very efficient algorithm to

compute cross-parsing distance using the suffix tree algorithm.

So instead of going and paying the quadratic cost of edit distance,

we can actually do the linear time cross-parsing of the two strings,

which is very efficient.

So this is the first approximate algorithm that we

learn to approximate the edit distance cost.

Ознакомьтесь с нашим каталогом

Присоединяйтесь бесплатно и получайте персонализированные рекомендации, обновления и предложения.