In this lecture, we continue

discussing Paradigmatic Relation Discovery.

Earlier we introduced a method called

Expected Overlap of Words in Context.

In this method, we represent each context by

a word vector that represents

the probability of a word in the context.

We measure the similarity by using the dot product,

which can be interpreted as the probability that two

randomly picked words from

the two contexts are identical.
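As a rough sketch of this idea (the function names here are my own, not from the lecture), the dot product of two word-probability vectors can be computed like this:

```python
# Sketch of the Expected Overlap of Words in Context (EOWC) similarity:
# each context is represented as a probability distribution over words,
# and the dot product equals the probability that two randomly picked
# words, one from each context, are identical.
from collections import Counter

def prob_vector(context_words):
    """Map a list of context words to word probabilities."""
    counts = Counter(context_words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def eowc_similarity(ctx1, ctx2):
    """Dot product of the two probability vectors."""
    p1, p2 = prob_vector(ctx1), prob_vector(ctx2)
    return sum(p1[w] * p2.get(w, 0.0) for w in p1)

sim = eowc_similarity(["the", "cat", "eats", "fish"],
                      ["the", "dog", "eats", "meat"])
# The two contexts share "the" and "eats", each with probability 1/4,
# so the similarity is 2 * (1/4 * 1/4) = 0.125.
```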

We also discussed the two problems of this method.

The first is that it favors matching

one frequent term very well over

matching more distinct terms.

It puts too much emphasis on matching one term very well.

The second is that it treats every word equally.

Even a common word like "the" will contribute

as much as a content word like "eats."

So now we are

going to talk about how to solve these problems.

More specifically, we're going to introduce

some retrieval heuristics used in text retrieval.

These heuristics can effectively solve these problems,

as these problems also occur in text retrieval

when we match a query vector with a document vector.

So to address the first problem,

we can use a sublinear transformation of term frequency.

That is, we don't have to use

the raw frequency count of

a term to represent the context.

We can transform it into some form

that wouldn't emphasize so much on the raw frequency.

To address the second problem,

we can put more weight on rare terms.

That is, we can reward matching a rare word.

This heuristic is called the IDF

term weighting in text retrieval.

IDF stands for Inverse Document Frequency.
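As a minimal sketch of this heuristic, assuming the common log((M + 1) / df) variant (the exact formula is an assumption here, not fixed by the lecture):

```python
import math

def idf(term, documents):
    """Inverse Document Frequency of a term over a collection.

    Uses log((M + 1) / df), where M is the collection size and df is
    the number of documents containing the term. Rare terms get high
    weight; a term in every document gets a weight near zero.
    """
    M = len(documents)
    df = sum(1 for d in documents if term in d)
    if df == 0:
        return 0.0  # sketch-level guard: term never occurs
    return math.log((M + 1) / df)

docs = [{"the", "cat", "eats"}, {"the", "dog"}, {"the", "fish", "eats"}]
# "cat" (rare) receives a higher weight than "the" (in every document).
```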

So now, we're going to talk about

the two heuristics in more detail.

First let's talk about the TF Transformation.

That is to convert the raw count of

a word in the document into some weight

that reflects our belief

about how important this word is in the document.

So that will be denoted by TF of w,d.

That's shown on the y-axis.

Now, in general, there are many ways to map that.

Let's first look at the simple way of mapping.

In this case, we're going to say, well,

any non-zero counts will be mapped to

one and the zero count will be mapped to zero.

So with this mapping

all the frequencies will be

mapped to only two values; zero or one.

The mapping function is shown here as a flat line.
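This 0/1 mapping can be sketched in one line (the function name is my own, for illustration):

```python
# Presence/absence TF mapping: any non-zero count becomes 1, zero stays 0,
# so a word occurring ten times matches no better than one occurring once.
def tf_binary(count):
    """Map a raw word count to 0 or 1."""
    return 1 if count > 0 else 0

weights = [tf_binary(c) for c in [0, 1, 10]]  # → [0, 1, 1]
```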

Now, this is naive

because it ignores the frequency of words.

However, this actually has the advantage of

emphasizing matching all the words in the context.

So it does not allow

a frequent word to dominate the matching.

Now, the approach that we have taken

earlier in the expected overlap count approach,

is a linear transformation.

We basically, take y as the same as x.

So we use the raw count as a representation.

That created the problem

that we just talked about, namely that

it emphasizes too much on just matching one frequent term.

Matching one frequent term can contribute a lot.

So we can have a lot

of other interesting transformations

in between the two extremes,

and they generally form a sublinear transformation.

So for example, one possibility is to take

logarithm of the raw count,

and this will give us a curve that looks like this,

that you are seeing here.

In this case, you can see that the high

frequency counts are penalized a little bit,

so the curve is a sublinear curve and it brings down

the weight of those really high counts.

This is what we want, because it prevents those

frequent terms from dominating the scoring function.
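A minimal sketch of this logarithmic transform, assuming the common log(1 + count) form so that a zero count still maps to zero:

```python
import math

def tf_log(count):
    """Sublinear TF transform: log(1 + raw count).

    Grows with the count but much more slowly than the linear y = x
    mapping, so very high counts are dampened. Base e is an assumption;
    the lecture only specifies a logarithm.
    """
    return math.log(1 + count)

# A raw count of 99 is weighted log(100) ≈ 4.6 rather than 99, so one
# very frequent term can no longer dominate the similarity score.
dampened = tf_log(99)
```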