0:00

So far, the classification examples we've talked about have used binary classification, where you had two possible labels, 0 or 1: is it a cat, or is it not a cat? What if we have multiple possible classes? There's a generalization of logistic regression called Softmax regression. It lets you make predictions where you're trying to recognize one of C classes, rather than just recognize two classes. Let's take a look.

Let's say that instead of just recognizing cats, you want to recognize cats, dogs, and baby chicks. So I'm going to call cats class 1, dogs class 2, and baby chicks class 3. And if none of the above, then there's an "other" or "none of the above" class, which I'm going to call class 0. So here's an example of the images and the classes they belong to. That's a picture of a baby chick, so the class is 3. The cat is class 1, the dog is class 2, I guess that's a koala, so that's none of the above, so that is class 0; then class 3 again, and so on.

So the notation we're going to use is: I'm going to use capital C to denote the number of classes you're trying to categorize your inputs into. And in this case, you have four possible classes, including the "other" or "none of the above" class. So when you have four classes, the numbers indexing your classes would be 0 through capital C minus 1; in other words, 0, 1, 2, or 3. In this case, we're going to build a neural network where the output layer has four, or more generally capital C, output units.

Â 3:45

First, we're going to compute a temporary variable, which we're going to call t, which is e to the z[L]. This is applied element-wise. So z[L] here, in our example, is going to be 4 by 1; this is a four-dimensional vector. So t itself, e to the z[L], is an element-wise exponentiation, and t will also be a 4 by 1 vector. Then the output a[L] is going to be basically the vector t, normalized to sum to 1. So a[L] is going to be e to the z[L] divided by the sum from j equals 1 through 4, because we have four classes, of t subscript j. So in other words, a[L] is also a 4 by 1 vector, and the i-th element of this four-dimensional vector, let's write that as a[L] subscript i, is going to be equal to t_i over the sum of the t_j's, okay?
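The two steps just described, element-wise exponentiation and then normalization, can be sketched in a few lines of NumPy (the function name is my own; the computation follows the formula above):

```python
import numpy as np

def softmax(z):
    """Softmax activation: t = e^z element-wise, then normalize so entries sum to 1."""
    t = np.exp(z)          # temporary variable t = e^{z[L]}, element-wise
    return t / np.sum(t)   # a[L]_i = t_i / sum_j t_j
```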

In case this math isn't clear, let's go through a specific example that will make it clearer.

Let's say that you've computed z[L], and z[L] is a four-dimensional vector; let's say it's [5, 2, -1, 3]. What we're going to do is use this element-wise exponentiation to compute the vector t. So t is going to be [e^5, e^2, e^-1, e^3]. And if you plug that into a calculator, these are the values you get: e^5 is about 148.4, e^2 is about 7.4, e^-1 is about 0.4, and e^3 is about 20.1. And so, the way we go from the vector t to the vector a[L] is just to normalize these entries to sum to 1. So if you sum up the elements of t, if you just add up those four numbers, you get 176.3.

So finally, a[L] is just going to be this vector t, as a vector, divided by 176.3. So, for example, this first node here will output e^5 divided by 176.3, and that turns out to be 0.842. So this is saying that, for this image, if this is the value of z you get, the chance of it being class 0 is 84.2%. Then the next node outputs e^2 over 176.3, which turns out to be 0.042, so this is a 4.2% chance. The next one is e^-1 over that, which is 0.002. And the final one is e^3 over that, which is 0.114, so there's an 11.4% chance that this is class number 3, which is the baby chick class, right? So there's a chance of it being class 0, class 1, class 2, or class 3.
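The arithmetic in this worked example can be checked directly in NumPy:

```python
import numpy as np

z = np.array([5.0, 2.0, -1.0, 3.0])   # the z[L] from the example
t = np.exp(z)                          # roughly [148.4, 7.4, 0.4, 20.1]
a = t / t.sum()                        # t.sum() is roughly 176.3
print(np.round(a, 3))                  # [0.842 0.042 0.002 0.114]
```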

So the output of the neural network, a[L], which is also y hat, is a 4 by 1 vector whose elements are the four numbers we just computed. So this algorithm takes the vector z[L] and maps it to four probabilities that sum to 1.

And if we summarize what we just did to map from z[L] to a[L], this whole computation, of computing the exponentiation to get this temporary variable t and then normalizing, we can summarize into a Softmax activation function, and say a[L] equals the activation function g applied to the vector z[L]. The unusual thing about this particular activation function is that g takes as input a 4 by 1 vector and outputs a 4 by 1 vector. Previously, our activation functions used to take in a single real number as input; for example, the sigmoid and ReLU activation functions input a real number and output a real number. The unusual thing about the Softmax activation function is that, because it needs to normalize across the different possible outputs, it takes a vector as input and outputs a vector.
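To make this vector-in, vector-out contrast concrete, here's a small sketch (function names are my own):

```python
import numpy as np

def sigmoid(z):
    # real number in, real number out
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # vector in, vector out: each output depends on ALL inputs via the normalizer
    t = np.exp(z)
    return t / t.sum()

print(sigmoid(0.0))                               # 0.5
print(softmax(np.array([1.0, 1.0, 1.0, 1.0])))    # [0.25 0.25 0.25 0.25]
```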

So to see one thing a Softmax classifier can represent, I'm going to show you some examples where you have inputs x1, x2, and these feed directly to a Softmax layer that has three or four, or more, output nodes that then output y hat. So this is a neural network with no hidden layer, and all it does is compute z[1] equals W[1] times the input x plus b[1], and then the output a[1], or y hat, is just the Softmax activation function applied to z[1]. So this neural network with no hidden layers should give you a sense of the types of things a Softmax function can represent.
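A minimal sketch of this no-hidden-layer network, assuming two inputs and C = 3 classes (the weights here are random placeholders, not trained values):

```python
import numpy as np

rng = np.random.default_rng(0)
C, n_x = 3, 2                       # 3 output classes, inputs x1 and x2
W1 = rng.standard_normal((C, n_x))  # W[1], shape (C, n_x); placeholder values
b1 = np.zeros((C, 1))               # b[1]

def softmax(z):
    t = np.exp(z)
    return t / t.sum(axis=0)

x = np.array([[0.5], [-1.0]])       # one input example, shape (n_x, 1)
z1 = W1 @ x + b1                    # z[1] = W[1] x + b[1]
y_hat = softmax(z1)                 # (C, 1) vector of class probabilities
```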

So here's one example with just raw inputs x1 and x2. A Softmax layer with C equals 3 output classes can represent this type of decision boundary. Notice that these are several linear decision boundaries, but this allows it to separate the data into three classes.

And in this diagram, what we did was take the training set shown in this figure and train a Softmax classifier with the output labels on the data. And then the color on this plot shows the result of thresholding the output of the Softmax classifier, coloring the input space based on which one of the three outputs has the highest probability.
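A plot like this can be reproduced by evaluating the classifier over a grid of input points and labeling each point by its highest-probability class; a sketch with made-up (untrained) weights:

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.standard_normal((3, 2))    # placeholder weights for C = 3 classes
b1 = rng.standard_normal((3, 1))

# Evaluate z[1] = W[1] x + b[1] at every grid point and take the argmax.
# Since softmax is monotonic, the argmax of the probabilities equals the
# argmax of z, so class boundaries fall where two linear scores tie --
# which is why the decision boundaries are linear.
xs = np.linspace(-2.0, 2.0, 50)
grid = np.array([[x1, x2] for x1 in xs for x2 in xs]).T  # shape (2, 2500)
Z = W1 @ grid + b1                                       # (3, 2500) linear scores
labels = Z.argmax(axis=0)                                # predicted class per point
```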

So you can kind of see that this is like a generalization of logistic regression, with sort of linear decision boundaries, but with more than two classes: instead of the class being just 0 or 1, the class can be 0, 1, or 2.

Here's another example of the decision boundaries a Softmax classifier can represent when trained on a dataset with three classes. And here's another one. Right, so one intuition is that the decision boundary between any two classes will be linear. That's why you see, for example, that the decision boundary between the yellow and the red classes is linear, the boundary between the purple and red is linear, and the boundary between the purple and yellow is another linear decision boundary. But it's able to use these different linear functions in order to separate the space into three classes.

Let's look at some examples with more classes. So here's an example with C equals 4, adding the green class, and Softmax can continue to represent these types of linear decision boundaries between multiple classes. Here's one more example with C equals 5 classes, and here's one last example with C equals 6.

So this shows the types of things a Softmax classifier can do when there is no hidden layer. With a much deeper neural network, with x and then some hidden units, and then more hidden units, and so on, you can learn even more complex non-linear decision boundaries to separate out multiple different classes.
