0:18
So we have a word, say, the word 'word'.
And technically, to fit it into TensorFlow,
you'd probably have to represent it as some kind of number.
For example, the ID of this word in your dictionary.
And basically, the way you usually use this word in your pipeline is you take a one-hot vector,
a vector the size of your dictionary that has only one nonzero value,
and then push it through some kind of linear model or neural network, or similar stuff.
The only problem is, you're actually doing this thing very inefficiently.
So you have this one-hot vector, and then you multiply it by a weight vector, or
a weight matrix.
It's actually a very wasteful process,
because you have a lot of weights that get multiplied by zeros.
Now, you could actually compute this kind of weighted sum much more efficiently.
If you look slightly closer, you could actually write down the answer itself
without any sums or multiplications.
Could you do that?
Yeah, exactly.
You could just take the one weight corresponding to this one here, so
the weight with ID 1337.
And this weight would be equal to your whole product,
because everything else is 0.
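Here is a minimal numpy sketch of that observation (the dictionary size and the ID 1337 are just illustrative numbers):

```python
import numpy as np

vocab_size = 10_000          # illustrative dictionary size
word_id = 1337               # ID of our word in the dictionary

one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0

w = np.random.randn(vocab_size)   # a weight vector

# The "honest" weighted sum multiplies almost every weight by zero...
slow = one_hot @ w
# ...but the answer is just the single weight at position 1337.
fast = w[word_id]

assert np.isclose(slow, fast)
```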
We could use the same approach when we have multiple neurons.
So let's say instead of vector, we now multiply by a matrix.
So now in this case you could also try to deduce the result;
just think of a matrix product as a lot of vector products stacked together.
Now, how do you compute this particular kind of output activation
vector of your dense layer if you use this kind of one-hot representation here?
Yeah, exactly.
So you simply take the first column vector, and of its dot product with the
one-hot vector, the only thing remaining is the element under ID 1337.
Then you take the second column, and its corresponding element becomes your
second activation. Then the third, the fourth, and so on;
you have as many columns as you have hidden units.
And if you look at all the values that remain from this matrix,
they are just one row of this matrix.
Basically, what this says is that you can replace this large-scale
multiplication by simply taking one row.
And this of course speeds up your computation a lot.
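The same trick in the matrix case, as a small numpy sketch (shapes are illustrative):

```python
import numpy as np

vocab_size, hidden_units = 10_000, 100
word_id = 1337

one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0

W = np.random.randn(vocab_size, hidden_units)   # dense layer weight matrix

# Full multiplication: one dot product per hidden unit, almost all terms are zero.
slow = one_hot @ W        # shape (hidden_units,)
# Equivalent shortcut: just read out row 1337 of the matrix.
fast = W[word_id]

assert np.allclose(slow, fast)
```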
2:37
Now let's finally get back to word2vec.
Remember, we want to train some kind of vector representation so
that the words with similar contexts get similar vectors.
The general idea of how we do that is this: we define a model which,
much like an autoencoder, trains the representation you want as a byproduct
of the model training.
And what this model actually tries to do is predict the word's context.
So basically it only has one input, a word, say the word 'liked'.
And it wants to predict, for every other word,
the probability of that word being a neighbor of this word.
So basically, you would expect this model to output large probabilities for
words that co-occur with your word 'liked', like 'the restaurant', for example,
and small probabilities for the words that don't co-occur.
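As a rough sketch (not the exact word2vec implementation, just the shape of the idea; all names and sizes here are illustrative), the model could look like this in numpy:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                  # numerical stability
    e = np.exp(z)
    return e / e.sum()

vocab_size, emb_dim = 10_000, 100
W_in = 0.01 * np.random.randn(vocab_size, emb_dim)    # first matrix: word vectors
W_out = 0.01 * np.random.randn(emb_dim, vocab_size)   # second matrix: context prediction

def context_probabilities(word_id):
    """P(context word | input word) for every word in the dictionary."""
    h = W_in[word_id]                # the one-hot trick: just take one row
    logits = h @ W_out
    return softmax(logits)           # shape (vocab_size,)

p = context_probabilities(1337)      # e.g. the ID of the word "liked"
```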
3:27
Now, the problem here is that this is kind of an under-defined machine learning task,
so you can't actually predict that perfectly.
But we don't even need that.
So actually, we don't expect our model to predict the probabilities ideally;
imperfect predictions are acceptable here,
because we only need this model in order to obtain its first matrix.
So it's the guy on the left here.
Now, what this matrix actually does is take the one-hot vector representation
of one word and multiply it by a matrix of weights.
And since we already know that this multiplication can be simplified,
the idea is basically that you have this matrix, and for each word in your
mini-batch you take the corresponding row of this matrix,
and then send it forward along your network.
The second layer then tries to take this representation, this word vector,
let's not be afraid of that term, and predict the context from it.
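In TensorFlow, this "take the corresponding row per word" step is what tf.nn.embedding_lookup does; a minimal sketch, assuming TF 2 eager mode and illustrative shapes:

```python
import tensorflow as tf

vocab_size, emb_dim = 10_000, 100
embeddings = tf.Variable(tf.random.normal([vocab_size, emb_dim], stddev=0.01))

word_ids = tf.constant([1337, 42, 7])   # a mini-batch of word IDs
# Picks the corresponding rows instead of multiplying one-hot vectors by the matrix.
word_vectors = tf.nn.embedding_lookup(embeddings, word_ids)   # shape (3, emb_dim)
```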
5:53
Now, okay, so basically, those models are kind of symmetric,
and they learn similar representations up to, well, some minor changes.
So the general idea stays the same, and you can, again,
use one of those two matrices as your word embedding matrix.
Because, for example, in the word-to-context model, for every single possible
sample only one row of the first matrix was used at a time.
So basically, you can assume that this row of the matrix is the vector
corresponding to your word, and use this matrix as your word embedding.
If you train this model by yourself or
if you use a pretrained model, you'd actually notice that it has
a lot of peculiar properties on top of what we actually wanted it to have.
So of course, it does what we actually trained it for:
it learns similar vectors for synonyms,
and different vectors for semantically different words.
But there's also a very peculiar effect, a kind of linear word algebra.
For example, if you take the vector of 'king', subtract from it the vector of 'man'
and add the vector of 'woman', you get something very close to the vector of 'queen'.
So, king minus man plus woman equals queen.
Or another example: Moscow minus Russia plus France equals Paris.
And they kind of make sense, although they're kind of under-defined
in mathematical terms.
And this is just a side effect of how the model trains.
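A sketch of how you could query such analogies yourself, assuming `embeddings` is a {word: vector} dict taken from a trained model (the dict itself is not shown here):

```python
import numpy as np

def most_similar(query, embeddings, exclude=()):
    """Word whose embedding has the highest cosine similarity to `query`."""
    best_word, best_sim = None, -np.inf
    for word, vec in embeddings.items():
        if word in exclude:
            continue
        sim = query @ vec / (np.linalg.norm(query) * np.linalg.norm(vec))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# query = embeddings["king"] - embeddings["man"] + embeddings["woman"]
# most_similar(query, embeddings, exclude={"king", "man", "woman"})  # often "queen"
```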
So, again, like other models we've studied previously,
this is not a desired, kind of originally intended effect.
But it's very kind of interesting and sometimes it's even helpful for
applications of these word embedding models.
Now, if you visualize those word vectors, for example by taking the first two
principal components of your trained word vectors, it also turns out that this
linear word algebra translates very nicely into the structure of the space of
those word embeddings.
For example, in many cases, you may expect a kind of similar direction
vector connecting all countries to their corresponding capitals.
Or all male profession names to the corresponding female profession names.
So there's a lot of those particular properties.
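A minimal sketch of such a visualization with plain numpy (the word list and the plotting are left out; this only projects the vectors to 2-D):

```python
import numpy as np

def first_two_components(vectors):
    """Project word vectors onto their first two principal components."""
    X = np.asarray(vectors, dtype=float)
    X = X - X.mean(axis=0)                       # center the data
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:2].T                          # shape (n_words, 2)

# Stacking country and capital vectors and plotting these 2-D points often shows
# roughly parallel country -> capital arrows.
```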
Of course, you cannot expect these properties to emerge with 100% certainty.
Sometimes you get the desired effects, sometimes you just get rubbish.
And of course, the model doesn't strictly enforce that those exact distances
have to be preserved; it just learns this peculiar structure as a by-product
of how it trains.
And this kind of coincides with the idea that, for example, autoencoders and
other unsupervised learning methods have a lot of kind of unexpected properties that
they all satisfy.
So hopefully by now I managed to convince you that having those word vectors
around is really convenient, or at least cool,
because they have all those nice properties.
It's later going to turn out those word vectors are really crucial for
some other deep learning applications to natural language processing,
like recurrent neural networks.
But before we cover that, let's actually find out, how do we train them,
how do we obtain those vectors to start collecting the benefits from them.
9:46
The first one, which basically takes a one-hot vector and
multiplies it by a matrix, can be, as you already know,
replaced by just taking one row of this matrix.
But the second one doesn't have this property, because it uses a dense vector.
So if you compute this thing naively, you're actually going to face a matrix
multiplication on the scale of, say, a 100-dimensional vector times 10^4 or 10^5
possible words, which is a lot of work for a model that only has two layers in it.
And the hardest part here is that you cannot actually cheat by computing
only a partial output of this matrix, because the problem is that this
second layer tries to predict probabilities here.
And when you think probability in deep learning, you actually mean softmax here.
The problem with softmax is that to compute just one class probability
with softmax, you have to exponentiate the logit for this class and then divide
it by the sum of the exponentiated logits of all possible classes, including this one.
And the second part, the normalization term, is really hard,
because you have to add up, well, not quite probabilities, but unnormalized
probabilities, the exponentiated logits, from all classes in order to compute
just one output.
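In numpy terms, the expensive part is the denominator (the vocabulary size here is illustrative):

```python
import numpy as np

vocab_size = 100_000
logits = np.random.randn(vocab_size)   # second-layer outputs for one input word

target = 1337
z = logits - logits.max()              # numerical stability
# Even for this single probability, the denominator touches every class:
p_target = np.exp(z[target]) / np.exp(z).sum()
```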
10:59
Now, okay, you could of course do this; theoretically there is enough memory
to do that on modern GPUs, and it's even feasible on CPUs.
But the problem is that it's a very simple operation that nonetheless requires
a lot of computing power.
So instead there are some special modifications of softmax, like hierarchical
softmax or sampled softmax, which try to estimate this thing more efficiently,
sacrificing either some of the mathematical properties or
the fact that softmax gives deterministic probabilities,
so adding some noise.
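For example, TensorFlow provides tf.nn.sampled_softmax_loss; a rough sketch of how it might be wired in, with illustrative shapes and hyperparameters (assuming TF 2 eager mode):

```python
import tensorflow as tf

vocab_size, emb_dim, batch_size = 10_000, 100, 32

# Output-layer parameters (the expensive "second" matrix) and a batch of word vectors.
out_weights = tf.Variable(tf.random.normal([vocab_size, emb_dim], stddev=0.01))
out_biases = tf.Variable(tf.zeros([vocab_size]))
word_vectors = tf.random.normal([batch_size, emb_dim])          # from the embedding layer
context_ids = tf.random.uniform([batch_size, 1], maxval=vocab_size, dtype=tf.int64)

# Only num_sampled negative classes enter the loss instead of the whole vocabulary.
loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(
    weights=out_weights, biases=out_biases,
    labels=context_ids, inputs=word_vectors,
    num_sampled=64, num_classes=vocab_size))
```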
11:33
There's also a number of similar models that try to avoid computing probabilities,
avoiding softmax altogether,
like GloVe, Global Vectors, which uses no softmax nonlinearity in its pipeline.
Now, finally, word embeddings can be extended to higher-level representations.
You can find embeddings for other kinds of objects.
For example, you can find embeddings for an entire sentence,
which makes this a kind of hierarchical method.
Or you could try to find embeddings for domain-specific data, like, say, amino acids.
There is a model from bioinformatics called protein2vec
that tries to vectorize protein components.
And this is a more or less advanced part of natural language processing.
We'll add links describing it into the readings section.
But you can more or less expect that they'll be covered in more detail
in the natural language oriented course in our specialization.
Now, okay, so basically, to be continued.
If you're intrigued by this,
you can jump to the reading section before the NLP course starts.
So this basically concludes the part of today's lecture dedicated to natural
language, and word embeddings, in particular.
But don't worry.
Besides the reading section, we'll also have the entire next week dedicated to
advanced applications of natural language processing.
We'll study recurrent neural networks that can, when paired with word embeddings
of course, solve not only text classification problems like sentiment analysis,
but also the inverse problem,
like generating text given a particular kind of task.
This, of course, coincides very well with your course project,
which is generating text captions given images.
See you in the next section.