There are some similarities between the sequence-to-sequence machine translation model

and the language models that you worked with in the first week of this course,

but there are some significant differences as well.

Let's take a look. So, you can think of

machine translation as building a conditional language model.

Here's what I mean. In language modeling,

this was the network we had built in the first week.

And this model allows you to estimate the probability of a sentence.

That's what a language model does.

And you can also use this to generate novel sentences,

and sometimes we were writing x1 and x2 here,

where, in this example,

x2 would be equal to y1, or equal to y-hat 1, which is just fed back in.

But x1, x2, and so on were not important.

So just to clean this up for this slide,

I'm going to just cross these off.

x1 could be the vector of all zeros, and x2,

x3 are just the previous outputs you are generating.

So that was the language model.

The machine translation model looks as follows,

and I am going to use a couple different colors,

green and purple, to denote respectively

the encoder network in green and the decoder network in purple.

And you notice that the decoder network looks pretty much

identical to the language model that we had up there.

So what the machine translation model is,

is very similar to the language model,

except that instead of always starting off with the vector of all zeros,

it instead has an encoder network

that figures out some representation for the input sentence,

and it takes that input sentence and starts off the decoder network with

a representation of the input sentence rather than with the vector of all zeros.

So, that's why I call this a conditional language model,

and instead of modeling the probability of any sentence,

it is now modeling the probability of, say,

the output English translation,

conditioned on some input French sentence.

So in other words, you're trying to estimate the probability of an English translation.

Like, what's the chance that the translation is "Jane is visiting Africa in September,"

but conditioned on the input French sentence, like

"Jane visite l'Afrique en septembre."

So, this is really the probability of an English sentence conditioned on

an input French sentence which is why it is a conditional language model.
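Written out as formulas, the contrast might look like the following. The subscripted symbols and the lengths T_x and T_y are notation assumed here for illustration; the lecture itself only speaks of x, y, and "p of y given x."

```latex
\begin{align*}
\text{Language model:} \quad
  & P(y_1, y_2, \ldots, y_{T_y}) \\
\text{Conditional language model:} \quad
  & P(y_1, y_2, \ldots, y_{T_y} \mid x_1, x_2, \ldots, x_{T_x})
\end{align*}
```

Here x_1 through x_{T_x} stand for the words of the French input and y_1 through y_{T_y} for the words of the English output.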

Now, if you want to apply this model to actually

translate a sentence from French into English,

given this input French sentence,

the model might tell you the probability

of different corresponding English translations.

So, x is the French sentence,

"Jane visite l'Afrique en septembre."

And, this now tells you what is the probability of

different English translations of that French input.

And, what you do not want is to sample outputs at random.

If you sample words from this distribution,

p of y given x, maybe one time you get a pretty good translation,

"Jane is visiting Africa in September."

But, maybe another time you get a different translation,

"Jane is going to be visiting Africa in September."

Which sounds a little awkward but is not a terrible translation,

just not the best one.

And sometimes, just by chance,

you get, say, another one: "In September,

Jane will visit Africa."

And maybe, just by chance,

sometimes you sample a really bad translation:

"Her African friend welcomed Jane in September."

So, when you're using this model for machine translation,

you're not trying to sample at random from this distribution.

Instead, what you would like is to find the English sentence,

y, that maximizes that conditional probability.
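To make the difference between sampling and maximizing concrete, here is a minimal Python sketch. The four sentences are the ones from this example, but the probabilities are made up purely for illustration; a real model spreads probability over every possible sentence, not just four candidates.

```python
import random

# Made-up values of P(y | x) for illustration only (not from a real model).
candidates = {
    "Jane is visiting Africa in September.": 0.40,
    "Jane is going to be visiting Africa in September.": 0.30,
    "In September, Jane will visit Africa.": 0.25,
    "Her African friend welcomed Jane in September.": 0.05,
}

# Sampling at random: repeated draws can return different translations,
# including, occasionally, the bad one.
rng = random.Random(0)  # fixed seed so the draws are repeatable
samples = rng.choices(list(candidates), weights=list(candidates.values()), k=5)

# Maximizing: always return the single most likely translation.
most_likely = max(candidates, key=candidates.get)

print("sampled:", samples)
print("most likely:", most_likely)
```

With only four candidates the maximization is trivial; the hard part in practice is that the real candidate set is far too large to list.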

So in developing a machine translation system,

one of the things you need to do is come up with an algorithm that can actually find

the value of y that maximizes this term over here.

The most common algorithm for doing this is called beam search,

and it's something you'll see in the next video.

But, before moving on to describe beam search,

you might wonder, why not just use greedy search? So, what is greedy search?

Well, greedy search is an algorithm from computer science which says: to generate

the first word, just pick whatever is

the most likely first word according to your conditional language model,

that is, your machine translation model. Then, after having picked the first word,

you then pick whatever is the second word that seems most likely,

then pick the third word that seems most likely.

This algorithm is called greedy search.
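The procedure just described can be sketched in a few lines of Python. Everything named here is hypothetical: `next_word_probs` stands in for the decoder (given the words picked so far, it returns a probability for each possible next word), `TABLE` is a tiny hand-written stand-in for a trained model, and `<EOS>` is an assumed end-of-sentence token.

```python
from typing import Callable, Dict, List, Tuple

def greedy_decode(
    next_word_probs: Callable[[Tuple[str, ...]], Dict[str, float]],
    max_len: int = 10,
    eos: str = "<EOS>",
) -> List[str]:
    """Pick the single most likely next word at every step."""
    prefix: List[str] = []
    for _ in range(max_len):
        probs = next_word_probs(tuple(prefix))
        word = max(probs, key=probs.get)  # best word at this step only
        if word == eos:
            break
        prefix.append(word)
    return prefix

# Hand-written toy probabilities (an assumption, not real model output).
TABLE = {
    (): {"Jane": 0.9, "In": 0.1},
    ("Jane",): {"is": 0.8, "will": 0.2},
    ("Jane", "is"): {"visiting": 0.7, "going": 0.3},
    ("Jane", "is", "visiting"): {"Africa": 0.9, "<EOS>": 0.1},
    ("Jane", "is", "visiting", "Africa"): {"<EOS>": 1.0},
}

def toy_model(prefix: Tuple[str, ...]) -> Dict[str, float]:
    return TABLE[prefix]

print(greedy_decode(toy_model))
```

Note that each step only looks at the next word; it never reconsiders an earlier choice.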

And, what you would really like is to pick the entire sequence of words, y1, y2,

up to yTy,

that maximizes the joint probability of that whole thing.
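In symbols, this goal might be written as follows. The hat on y and the subscript notation are assumptions made here for illustration; the second line is just the chain rule of probability applied to the sequence.

```latex
\begin{align*}
\hat{y} &= \arg\max_{y_1, \ldots, y_{T_y}} P(y_1, \ldots, y_{T_y} \mid x) \\
P(y_1, \ldots, y_{T_y} \mid x) &= \prod_{t=1}^{T_y} P(y_t \mid x, y_1, \ldots, y_{t-1})
\end{align*}
```

Greedy search maximizes each factor in the product one at a time, given the words already chosen, which is not the same as maximizing the whole product.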

And it turns out that the greedy approach,

where you just pick the best first word,

and then, after having picked the best first word,

try to pick the best second word,

and then, after that,

try to pick the best third word,

that approach doesn't really work.

To demonstrate that, let's consider the following two translations.

The first one is a better translation,

so hopefully, in our machine translation model,

it will say that p of y given x is higher for the first sentence.

It's just a better, more succinct translation of the French input.

The second one is not a bad translation,

it's just more verbose,

it has more unnecessary words.

But, if the algorithm has picked "Jane is" as the first two words,

because "going" is a more common English word,

probably the chance of "Jane is going," given the French input,

this might actually be higher than the chance of "Jane is

visiting," given the French sentence.

So, it's quite possible that if you just pick

the third word based on whatever maximizes the probability of just the first three words,

you end up choosing option number two.

But, this ultimately ends up resulting in a less optimal sentence,

in a less good sentence as measured by this model for p of y given

x. I know this was maybe a slightly hand-wavy argument,

but, this is an example of a broader phenomenon,

where if you want to find the sequence of words, y1, y2,

all the way up to the final word that together maximize the probability,

it's not always optimal to just pick one word at a time.
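A small numeric sketch can make this phenomenon less hand-wavy. The probabilities below are invented for illustration and only cover the two words that follow "Jane is"; they are not taken from any real model.

```python
# Invented conditional probabilities for the words following "Jane is".
STEP1 = {"going": 0.6, "visiting": 0.4}  # P(third word | x, "Jane is")
STEP2 = {                                # P(fourth word | x, "Jane is", third word)
    "going": {"to": 0.55, "away": 0.45},
    "visiting": {"Africa": 0.9, "soon": 0.1},
}

def joint(w1: str, w2: str) -> float:
    """Joint probability of the two-word continuation."""
    return STEP1[w1] * STEP2[w1][w2]

# Greedy: best third word first, then best fourth word given that choice.
g1 = max(STEP1, key=STEP1.get)
g2 = max(STEP2[g1], key=STEP2[g1].get)
greedy = (g1, g2)

# Exhaustive: score every two-word continuation and keep the best.
best = max(
    ((w1, w2) for w1 in STEP1 for w2 in STEP2[w1]),
    key=lambda pair: joint(*pair),
)

print("greedy:", greedy, joint(*greedy))
print("best:  ", best, joint(*best))
```

Under these made-up numbers, greedy commits to "going" because 0.6 beats 0.4, yet the continuation starting with "visiting" has the higher joint probability (about 0.36 versus about 0.33).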

And, of course, the total number of combinations of

words in the English sentence is exponentially large.

So, if you have just 10,000 words in a dictionary and if you're

contemplating translations that are up to ten words long,

then there are 10,000 to the power of ten possible sentences that are ten words long,

picking words from a vocabulary,

a dictionary, of 10,000 words.
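The arithmetic is easy to check. This short Python snippet uses the vocabulary size and sentence length from the example above; the variable names are just for illustration.

```python
vocab_size = 10_000    # words in the dictionary
sentence_length = 10   # words per candidate sentence

# Each of the 10 positions can hold any of the 10,000 words.
num_sentences = vocab_size ** sentence_length

print(num_sentences)              # a 1 followed by 40 zeros
print(num_sentences == 10 ** 40)
```

So the count of ten-word sentences alone is 10 to the power of 40, before even considering shorter sentences.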

So, this is just a huge space of possible sentences,

and it's impossible to rate them all,

which is why the most common thing to do is use an approximate search algorithm.

And, what an approximate search algorithm does,

is it will try,

it won't always succeed,

but it will try to pick the sentence, y,

that maximizes that conditional probability.

And, even though it's not guaranteed to find the value of y that maximizes this,

it usually does a good enough job.

So, to summarize, in this video,

you saw how machine translation can be posed as a conditional language modeling problem.

But one major difference between this and

the earlier language modeling problems is that, rather

than wanting to generate a sentence at random,

you may want to try to find the most likely English sentence,

most likely English translation.

But the set of all English sentences of a certain length

is too large to exhaustively enumerate.

So, we have to resort to a search algorithm.

So, with that, let's go onto the next video where

you'll learn about the beam search algorithm.