Data repositories in which cases are related to subcases are identified as hierarchical. This course covers the representation schemes of hierarchies and algorithms that enable analysis of hierarchical data, as well as provides opportunities to apply several methods of analysis.

Associate Professor at Arizona State University in the School of Computing, Informatics & Decision Systems Engineering and Director of the Center for Accelerating Operational Efficiency School of Computing, Informatics & Decision Systems Engineering

K. Selcuk Candan

Professor of Computer Science and Engineering Director of ASU’s Center for Assured and Scalable Data Engineering (CASCADE)

What else can we do?

Do we have any other approaches to finding

the similarity or distance between two strings?

So in the next few slides,

what we will learn are what we call filtering

algorithms for edit distance or for string similarity.

Essentially in this case,

our goal will not be to compute a distance.

Say give a P and given a Q.

Our goal is not going to be really compute a numeric value that compares P and Q.

That's not what we will try to do,

but we will try to get a sense,

an indicator as to

whether P may match Q or P may contain a match to Q quickly.

Now, the idea that if P may match Q or if P may contain a match to Q,

then we will go and pay the expensive edit distance computation

to actually compute the precise edit distance between the two strings.

If the indicator says no,

then string P may not contain a match or string P may not match Q,

then we will not pay this edit distance cost.

So essentially filtering algorithms become sort of pre-processing step in which we do

a quick and dirty competition to

indicate whether there may be a match between the two strings.

If the answer is no,

then we don't actually go and pay

this very expensive price of computing edit distance between string P and string Q.

So the next few algorithm that you will learn is that it all will be off this category.

We will call them filtering algorithms.

So, we will learn a few approaches and these approaches will solve similar problems,

but there will be some differences between

the problems some different approaches will be applicable in slightly different problems.

So, in the next few slides I will first define the problem and present the approach.

The problem will be similar to each other,

but there will be possibly some slight differences

and I will highlight those differences.

So the algorithm solves similar problems but not exactly the same problem always.

So let's basically see the first filtering algorithm.

So in this algorithm,

we are solving the following problem.

We are given a string P of length N,

we are given a pattern string of length M and our goal is to determine whether

String P may contain a match to Q with at most K errors.

At most K mismatches.

So, what's happening is that you have a P,

string P. We have another string Q.

In this case Q is usually a smaller string,

and what we tried to do is that,

we tried to see if P somewhere in it contains a match

to Q with at most K mismatches.

In this example, k equal to 3.

So how can we solve this?

How can we solve this problem without actually having

to pay the price of doing this full comparison.

He had learned something called if you

remember nondeterministic finite automata in one of

the earlier videos which again could be costly.

We don't want to pay the price.

So We want to quickly decide whether P may contain a match to Q or not.

If the answer is no,

we will not pay that expensive process.

So the idea is very simple.

Okay, so the idea is this,

let's assume, so for the sake of for example.

So we have again string P if we have

string Q for the sake of example again let's assume that we are

allowing up to three errors in the match.

That is there can be three mismatches between Q and the part of

the string that basically matches Q in

P. I of course don't know where those mismatches are,

but I will allow it most three mismatches.

So what I will do is I will take my string Q,

and I will split string Q into K plus 1,

in this case four sub-segments.

So I will basically say okay let me be able split this into four equal sized segments.

Great.

What I will then do is, I will say,

well instead of searching this entire sequence Q in P,

I will search its sub-segments in P. How will this help?

Well, the idea is very simple, right.

So I know that there are three errors,

up to three errors but I don't know where those three errors are.

So all the three errors,

there is a chance that they might basically focus all of

them let's assume in the first of the segments,

or in the worst case,

the errors may basically be distributed,

maybe far from each other or maybe it could cover multiple segments.

But note that if I have at most three errors,

and if I have four segments,

that guarantees to me that there will be at least one segment which will be error free.

There will be at least one segment which will be error free so which can match exactly.

This is great, because what I can do is I can then make sure that

if there is a approximate match with at most K errors then at least one of the segments,

at least one of them must exist in Q exactly.

Exactly. If it exists then that may be a match.

Then I need to go and basically do

the expensive processing to see whether that match does exist or not.

But if such a match does not exist,

what I know is that that cannot be,

there is no chance that there is a match with at most K errors.

So, if such an exact match does not

exist then I can filter out this query pattern and say,

sorry I can tell you that there is no chance that there is at

most K error match between Q and P. I can

tell you without actually paying that expensive process.

So as I said the filtering algorithm,

gives us the quick way to eliminate candidates that are guaranteed not to have a match.

To see whether there's a match if

basically the filtering algorithm say yes while there may be a match,

you actually have to pay the expensive algorithm.

You need to go back and implement the expensive algorithm

to actually check whether that match exists or not.

So that's why we call these filtering algorithms.

Okay, so we have seen one algorithm to basically solve this problem.

Let's see another algorithm.

So this algorithm solves the same problem as we discussed before.

We are given a string P of length N. We are given a pattern Q of

length M and we are trying to see if string P may contain an approximate match,

a match to Q with at most K errors.

So this is the same problem as the first filtering algorithm I solved.

In this case, however,

the process will be a little bit different.

Instead of segmenting the query pattern Q,

what we will do is we will operate on the string P on our data string.

So, let's basically see this with an example.

So we have string P and we have string Q.

So in general and Q with shorters than string P because they

are interested in containment which means they have a long string P

and a short string Q and say this basically the length of

this M the length of this is N. So what we will do is,

we will basically go and slide a window of length M on

our data string for each and every position.

And what we will do is we'll count symbols.

So let's assume that our query string is abac.

What this says that basically have two a's one b and one

c. So we have accounting vectors that counts the symbols.

What we will do is that as we move our window on the data symbol for each position,

we will count the number of symbols.

Instead of trying to match the data string with the Queta string.

What we will do is that for every position,

we will count how many a's there are how many b's there are how many c's there are.

So let's assume that somewhere here there is basically

a window with two a's one b and one c. Let's assume

that there is another window somewhere here again with two a's one b and one

c. Now I don't know whether this window is actually matching this,

I have no idea.

I also don't know whether this window is matching this or not.

I have no idea.

But what I know is that this window here which I say one a,

one b and two c's is not a match. I know of that.

So it means that basically if I am going to pay

the expensive price off trying to match the query string with a given window,

I only need to consider the window that is an exact match in the number of counts.

But remember in this problem we allow upto k errors.

Which means that basically say I allow up to one error

then, let's assume that there's a window here and this

isn't that this window has in it two a's,

one b, zero c and one d. Note that what's happening is that my string was 2a,

1b, 1c this is 2a,

1b, 0c and 1d.

So there is one error. One of the symbols that I'm looking for is missing.

Now that doesn't mean that basically this is going to be exactly abac. It does mean that.

But what it tells me that,

if I am alone and one error,

I should look into this window.

So this window is a window of interest and I should look into that because

that window might contain an approximate match to my query pattern.

So once again I am instead of doing expensive matching,

I do a cheap counting to see whether there may be a match or not.

Again, a quick and dirty solution that eliminate

mismatches completely and that maintain candidate matches for future exploration.

Ознакомьтесь с нашим каталогом

Присоединяйтесь бесплатно и получайте персонализированные рекомендации, обновления и предложения.