0:00

Unfortunately, there exists a strong theoretical negative result in computer

science that states that this is impossible.

It's known as the No Free Lunch Theorem, and

it was established by David Wolpert in 1996.

The No Free Lunch Theorem says that no single classification algorithm can be

universally better than any other algorithm on all domains.

An even stronger and somewhat surprising formulation of this theorem

is that all classification algorithms have the same error rate

when averaged over all possible data-generating distributions.

A similar result applies to the case of regression that we presented earlier.

This statement might sound very surprising, so

let's consider a specific example.

Say we want to make a classifier that we will call hot stock or not hot stock.

The classifier will take a set of predictors or features,

X1 to XN, that we use to make a prediction.

For convenience, I added one more constant predictor,

x0 that equals 1 here, but don't worry if you don't know yet what it is for.

We'll talk about these technicalities later.

Now, the output z of the classifier would be a binary number of 0 or 1.

The value of 1 for a given stock means that it's expected to beat the market, and 0 means that it isn't.

We would use the output of such a classifier to make our investment decisions, so that the classifier would act as a kind of investment advisor.

1:50

Inside of this circle, I plotted one example of a nonlinear transformation for

the case of just one variable.

This function would have some number of parameters.

At least N + 1 parameters,

because this is the number of our input variables including our constant input x0.

2:08

One simple example of such a function would be the so-called logistic function shown here, whose argument would be a linear combination of all the features.

Such a function would have N + 1 parameters, so it would be simple enough.

This is not yet a binary output, as such a function produces a continuous output. But we will see in our follow-up videos how its output can be converted to a binary value of 0 or 1.
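To make this concrete, here is a minimal sketch of such a logistic model in NumPy. The function names, the example weights, and the common 0.5 threshold for converting the continuous output to a binary label are my illustrative assumptions, not part of the lecture:

```python
import numpy as np

def logistic(z):
    # Logistic (sigmoid) function: maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, w):
    # x holds features x1..xN; prepend the constant input x0 = 1,
    # so the model has N + 1 parameters w0..wN
    x_ext = np.concatenate(([1.0], x))
    return logistic(np.dot(w, x_ext))

def predict_label(x, w, threshold=0.5):
    # Convert the continuous output into a binary "hot / not hot" label
    # (thresholding at 0.5 is one common convention, assumed here)
    return 1 if predict_proba(x, w) >= threshold else 0

# Example: N = 3 features, so N + 1 = 4 parameters including w0
w = np.array([0.1, 0.5, -0.3, 0.2])
x = np.array([1.2, 0.4, -0.7])
label = predict_label(x, w)
```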

For now, let's just continue with this example and assume that we fine-tune the parameters of this model so that it's now trained on some large dataset of stocks, say 2,000 days of observations for 2,000 stocks, so that we have 4 million observations in total, each having, say, 50 features.

Again, I skip the details on how this can be done.

We will learn it in a short while, but for

now I want to focus on a high-level picture.

So assume that you built such a classifier and fine-tuned its parameters by looking at the data.

Now you have a predictor that will tell you

3:23

whether any particular stock is hot or not.

You can now start trading using this predictor.

For example, you can buy ten hot stocks and sell ten not hot stocks.

Chances are that in reality, you will not be too thrilled

with the performance of your strategy, and you will want to improve your classifier.
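Such a long-short rule could be sketched as follows. Everything here is a hypothetical illustration: the ticker names and scores are made up, and the scores simply stand in for the classifier's outputs:

```python
import numpy as np

# Hypothetical example: rank stocks by the classifier's predicted
# probability of beating the market, then buy the top 10 ("hot")
# and sell the bottom 10 ("not hot").
rng = np.random.default_rng(0)
tickers = [f"STOCK_{i:02d}" for i in range(50)]
scores = rng.uniform(0.0, 1.0, size=50)   # stand-in for model outputs

order = np.argsort(scores)                 # indices from lowest to highest score
longs = [tickers[i] for i in order[-10:]]  # 10 highest scores: buy
shorts = [tickers[i] for i in order[:10]]  # 10 lowest scores: sell
```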

4:20

Here I show just two such layers, but in principle,

we could add more layers someplace in between the inputs and the outputs.

Each transformation will have its own parameters.

So the whole pipeline would have many parameters and

would produce a sufficiently rich function.

If we have lots of data or lots of predictors, maybe such a model, after some parameter tuning, would perform better than the first, less sophisticated model.

4:49

In fact, what I have described is a schematic working of neural networks, which we will talk about a lot in this specialization.
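Stacking two such nonlinear transformations, each with its own parameters, might look like the sketch below. The layer sizes, the choice of the logistic function for both layers, and the random weights are purely illustrative assumptions:

```python
import numpy as np

def logistic(z):
    # Logistic (sigmoid) nonlinearity
    return 1.0 / (1.0 + np.exp(-z))

def two_layer_network(x, W1, b1, W2, b2):
    # First nonlinear transformation, with its own parameters W1, b1
    h = logistic(W1 @ x + b1)
    # Second transformation, applied to the output of the first
    return logistic(W2 @ h + b2)

# Illustrative shapes: 50 input features, 16 hidden units, 1 output
rng = np.random.default_rng(42)
W1, b1 = rng.normal(size=(16, 50)), np.zeros(16)
W2, b2 = rng.normal(size=(1, 16)), np.zeros(1)

x = rng.normal(size=50)
y = two_layer_network(x, W1, b1, W2, b2)
n_params = W1.size + b1.size + W2.size + b2.size  # many more than N + 1
```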

Now let's assume that we have built such a more advanced model and

even found that it indeed works better for our stock data.

So, because it works better for stock predictions, would such a more sophisticated model always be better than the first, less sophisticated one for any data of the same shape, which in our example was 4 million rows and 50 columns?

Indeed, if it has more parameters than the first model, shouldn't we always prefer it to the first model for any data matrix of dimension 4 million by 50?

And the answer given by the No Free Lunch Theorem is that a more sophisticated model would not only not always be better than the simple one, but the two would actually have exactly the same error rate when their performance is averaged over all possible datasets of the same size.
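A toy calculation can illustrate this averaging claim. Assuming, in the spirit of Wolpert's result, that we average a classifier's accuracy uniformly over all possible labelings of a handful of unseen test points, any fixed set of predictions comes out at exactly 0.5:

```python
from itertools import product

def average_accuracy(predictions):
    # Average a fixed set of binary predictions over ALL 2^k possible
    # labelings of the same k test points (all possible "data-generating
    # mechanisms" on unseen data).
    k = len(predictions)
    total_correct = 0
    for labels in product([0, 1], repeat=k):
        total_correct += sum(p == y for p, y in zip(predictions, labels))
    return total_correct / (k * 2 ** k)

# Any two classifiers, however different, average to the same 0.5
simple_model = [0, 0, 0, 0, 0]   # always predicts "not hot"
fancy_model = [1, 0, 1, 1, 0]    # some more "sophisticated" predictions
```

Each prediction is correct in exactly half of the labelings, so the average accuracy is 0.5 no matter what the model predicts.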

6:01

Now how is it possible and what does it mean?

It simply means that the set of all possible data-generating mechanisms is too

rich to be adequately represented by any given machine learning algorithm whose

capacity for generalization is determined by a particular model architecture.

For example, while a very large neural network can beat any other model for

image classification, the No Free Lunch Theorem guarantees that a much

simpler model would work better at least for some types of data.

Now is such lack of universality bad news or good news?

I personally believe that this is very good news because it's exactly what

makes machine learning so exciting and open to everyone who wants

to experiment with new types of datasets and new machine learning algorithms.

6:59

Another more practical conclusion from all of the above is

that we should not seek machine learning models

that would be universally better than any other across all possible domains.

Instead, the focus should be on models that work better for

particular domains of interest.

7:18

And this is exactly one of the objectives of this specialization,

where we explore methods that work best specifically in a financial domain

rather than those that are found to work better in other domains.

For example, for image recognition.