So far we've seen a number of supervised learning methods,
and when applying these methods we followed a consistent series of steps.
First, partitioning the data set into training and
test sets using the train_test_split function.
Then, calling the fit method on the training set to estimate the model.
And finally, applying the model by using the predict method
to estimate a target value for new data instances,
or by using the score method to evaluate the trained model's performance on the test set.
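As a minimal sketch of that workflow, using an illustrative k-nearest neighbors classifier on a small synthetic data set rather than the fruit data from the lectures, it might look something like this:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    # illustrative synthetic data standing in for the features X and labels y
    X, y = make_classification(n_samples=100, random_state=0)

    # step 1: partition into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # step 2: fit the model on the training set
    clf = KNeighborsClassifier().fit(X_train, y_train)

    # step 3: predict on new instances, or score accuracy on the test set
    predictions = clf.predict(X_test)
    test_accuracy = clf.score(X_test, y_test)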
Let's remember that the reason we divided
the original data into training and test sets was to use the test set
as a way to estimate how well the model trained on
the training data would generalize to new, previously unseen data.
The test set represented data that had not been seen during
training but had the same general attributes as the original data set,
or, in more technical language,
data that was drawn from the same underlying distribution as the training set.
Cross-validation is a method that goes beyond evaluating
a single model using a single Train/Test split of
the data by using multiple Train/Test splits,
each of which is used to train and evaluate a separate model.
So why is this better than our original method of a single Train/Test split?
Well, you may have noticed, for example when working on some examples or assignments,
that by choosing different values for
the random_state seed parameter in the train_test_split function,
the accuracy score you get from running a classifier can vary quite a bit
just by chance, depending on
the specific samples that happen to end up in the training set.
Cross-validation basically gives more stable and reliable estimates
of how the classifier is likely to
perform on average by running
multiple different train/test splits and then averaging the results,
instead of relying entirely on a single particular training set.
Here's a graphical illustration of how cross-validation operates on the data.
The most common type of cross-validation is
k-fold cross-validation, typically with k set to 5 or 10.
For example, to do five-fold cross-validation,
the original dataset is partitioned into five parts of equal or close to equal size.
Each of these parts is called a "fold".
Then a series of five models is trained, one per fold.
The first model, Model 1,
is trained using folds 2 through 5 as
the training set and evaluated using fold 1 as the test set.
The second model, Model 2,
is trained using folds 1, 3, 4,
and 5 as the training set,
and evaluated using fold 2 as the test set, and so on.
When this process is done,
we have five accuracy values, one per fold.
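As a rough sketch of what those five train-and-evaluate rounds look like under the hood, here is a manual loop over scikit-learn's KFold splitter, again using an illustrative synthetic data set and a k-nearest neighbors classifier rather than the lecture's fruit data:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import KFold
    from sklearn.neighbors import KNeighborsClassifier

    # a small synthetic classification data set for illustration
    X, y = make_classification(n_samples=100, random_state=0)

    scores = []
    for train_idx, test_idx in KFold(n_splits=5).split(X):
        # train on the other four folds, evaluate on the held-out fold
        model = KNeighborsClassifier().fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))

    print(scores)  # five accuracy values, one per fold

Note that plain KFold simply takes consecutive blocks of records as folds, a point that comes up again below when we discuss stratification.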
In scikit-learn, you can use
the cross_val_score function from the model_selection module to do cross-validation.
The parameters are: first,
the model you want to evaluate,
and then the data set,
and then the corresponding ground truth target labels or values.
By default, cross_val_score does threefold cross-validation.
So it returns three accuracy scores,
one for each of the three folds.
If you want to change the number of folds,
you can set the cv parameter.
For example, cv=10 will perform ten-fold cross-validation.
It's typical to then compute the mean of
all the accuracy scores across the folds and report
the mean cross-validation score as a measure
of how accurate we can expect the model to be on average.
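As a hedged sketch of that call, using an illustrative logistic regression classifier on synthetic data rather than the fruit data set:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=100, random_state=0)
    clf = LogisticRegression(max_iter=1000)

    # cv=5 requests five-fold cross-validation; omit cv to use the default
    # (three folds in the scikit-learn version used here, five in newer releases)
    scores = cross_val_score(clf, X, y, cv=5)
    print('fold accuracies:', scores)
    print('mean cross-validation score:', scores.mean())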
One benefit of computing the accuracy of a model
on multiple splits instead of a single split
is that it gives us potentially useful information about how
sensitive the model is to the nature of the specific training set.
We can look at the distribution of these multiple scores across
all the cross-validation folds to see how likely it is that by chance,
the model will perform very badly or very well on any new data set,
so we can do a sort of worst case or
best case performance estimate from these multiple scores.
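Continuing the sketch above, the spread of the fold scores gives a rough sense of that best-case and worst-case range:

    # scores is the array returned by cross_val_score in the previous sketch
    print('standard deviation of fold scores:', scores.std())
    print('worst fold:', scores.min())
    print('best fold:', scores.max())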
This extra information does come with extra cost.
It does take more time and computation to do cross-validation.
So for example, if we perform
k-fold cross-validation and we don't compute the fold results in parallel,
it'll take about k times as long to get
the accuracy scores as it would with just one Train/Test split.
However, the gain in our knowledge of how the model is likely to
perform on future data is usually well worth this cost.
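If you do have spare CPU cores, one way to offset that cost (continuing the earlier cross_val_score sketch) is the function's n_jobs parameter, which evaluates the folds in parallel:

    # n_jobs=-1 uses all available cores to compute the folds in parallel
    scores = cross_val_score(clf, X, y, cv=10, n_jobs=-1)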
In the default cross-validation setup,
using for example five folds,
the first 20% of the records are used as the first fold,
the next 20% for the second fold, and so on.
One problem with this is that the data might have been created in such a way that
the records are sorted or at least show some bias in the ordering by class label.
For example, if you look at our fruit data set,
it happens that all the examples for classes 1 and 2, the apples and mandarins,
come before the examples for classes 3 and 4 in the data file.
So if we simply took the first 20% of records for Fold 1,
which would be used as the test set to evaluate Model 1,
it would evaluate the classifier only on examples from classes 1 and 2,
and not at all on classes 3 and 4,
which would greatly reduce the informativeness of the evaluation.
So when you ask scikit-learn to do cross-validation for a classification task,
it actually does instead what's called "Stratified K-fold Cross-validation".
Stratified cross-validation means that when splitting the data,
the proportions of classes in each fold are made as close as possible
to the actual proportions of the classes in the overall data set as shown here.
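A small sketch of the difference, comparing the class counts that land in each test fold under plain KFold versus StratifiedKFold on deliberately class-sorted labels (a toy stand-in for a sorted data file like the fruit one):

    import numpy as np
    from sklearn.model_selection import KFold, StratifiedKFold

    # 30 toy samples whose labels are sorted by class, like a sorted data file
    X = np.arange(30).reshape(-1, 1)
    y = np.array([0] * 10 + [1] * 10 + [2] * 10)

    for splitter in (KFold(n_splits=3), StratifiedKFold(n_splits=3)):
        print(type(splitter).__name__)
        for _, test_idx in splitter.split(X, y):
            # how many samples of each class end up in this test fold
            print('  test fold class counts:', np.bincount(y[test_idx], minlength=3))

Plain KFold puts each class entirely into one test fold, while StratifiedKFold keeps the class proportions roughly equal in every fold.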
For regression, scikit-learn uses regular k-fold cross-validation since
the concept of preserving class proportions isn't
something that's really relevant for everyday regression problems.
At one extreme we can do something called "Leave-one-out cross-validation",
which is just k-fold cross-validation,
with K set to the number of data samples in the data set.
In other words, each fold consists of a single sample
as the test set and the rest of the data as the training set.
Of course this uses even more computation,
but for small data sets in particular,
it can provide improved estimates because it
gives the maximum possible amount of training data to a model,
and that may help the performance of the model when the training sets are small.
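A hedged sketch using scikit-learn's LeaveOneOut splitter (equivalent to k-fold with k equal to the number of samples), again on illustrative synthetic data:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    X, y = make_classification(n_samples=100, random_state=0)
    clf = LogisticRegression(max_iter=1000)

    # one fold per sample: 100 fits, each tested on a single held-out example
    loo_scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
    print('leave-one-out accuracy:', loo_scores.mean())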
Sometimes we want to evaluate the effect that
an important parameter of a model has on the cross-validation scores.
The validation_curve function makes it easy to run this type of experiment.
Like cross_val_score, validation_curve will do threefold
cross-validation by default, but you can adjust this with the cv parameter as well.
Unlike cross_val_score, you also specify a classifier,
a parameter name, and a set of parameter values
you want to sweep across.
So you first pass in the estimator object,
that is, the classifier or regressor to use,
followed by the data set samples X and target values y,
the name of the parameter to sweep,
and the array of parameter values that that parameter should take on in doing the sweep.
validation_curve will return
two two-dimensional arrays corresponding
to evaluation on the training set and the test set.
Each array has one row per parameter value in the sweep,
and the number of columns is the number of cross-validation folds that are used.
For example, the code shown here uses a radial basis function (RBF) support
vector machine and, for each of the four specified values of the kernel's gamma parameter,
fits three models on different subsets of the data, one per cross-validation fold.
That returns two four-by-three arrays,
that is, four levels of gamma times three fits per level,
containing scores for the training and test sets.
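The slide's code isn't reproduced here, but a sketch along the same lines, sweeping four values of gamma for an RBF SVC with three-fold cross-validation on illustrative synthetic data, might look like this:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import validation_curve
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=100, random_state=0)
    param_range = np.logspace(-3, 3, 4)  # four levels of gamma

    train_scores, test_scores = validation_curve(
        SVC(kernel='rbf'), X, y,
        param_name='gamma', param_range=param_range, cv=3)

    print(train_scores.shape, test_scores.shape)  # (4, 3) and (4, 3)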
You can plot these results from validation_curve as shown here to
get an idea of how sensitive the performance of
the model is to changes in the given parameter.
The x axis corresponds to values of
the parameter and the y axis gives the evaluation score,
for example the accuracy of the classifier.
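Continuing the previous sketch, one way to produce such a plot with matplotlib, averaging the folds at each parameter value:

    import matplotlib.pyplot as plt

    # average the three fold scores at each gamma value
    plt.semilogx(param_range, train_scores.mean(axis=1), label='training score')
    plt.semilogx(param_range, test_scores.mean(axis=1), label='cross-validation score')
    plt.xlabel('gamma')
    plt.ylabel('accuracy')
    plt.legend()
    plt.show()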
Finally, as a reminder,
cross-validation is used to evaluate a model, not to learn or tune a new model.
For model tuning,
we'll look at how to tune a model's parameters using
something called "Grid Search" in a later lecture.