Next up in our course on H2O supervised

machine learning algorithms is GLM, the Generalized Linear Model.

I'm sure everyone is familiar with normal linear models;

you almost certainly did them at school:

plot some points on graph paper,

and try to find the best straight line through them. That line's angle

is what we're going to call the coefficient.

Of course, once you move into the computer,

you can do a lot more dimensions.
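
As a rough sketch of that graph-paper exercise, here is an ordinary least-squares fit of a single straight line in plain Python. The function name `fit_line` and the four sample points are made up for illustration; the slope it returns plays the role of the coefficient.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-style and variance-style sums (unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx                      # the "angle" of the line
    intercept = mean_y - slope * mean_x    # where the line crosses x = 0
    return slope, intercept


# Illustrative points that lie exactly on y = 1 + 2x.
points_x = [0.0, 1.0, 2.0, 3.0]
points_y = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(points_x, points_y)
print(slope, intercept)
```

With more than one input column you get one coefficient per column, which is the "more dimensions" case.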

The generalized linear model

just takes that linear model idea and extends it in a couple of directions.

First and most importantly,

you can specify a statistical family,

where Gaussian is the normal distribution;

this is the default, and it's your typical linear model.

You can also specify a link function,

which is strongly related to the family.
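
To make the link function idea concrete, here is a minimal sketch in plain Python. The link maps the mean of the response onto the scale of the linear predictor, and its inverse maps back. The `LINKS` table and `predict_mean` are hypothetical names for illustration, not H2O code; identity, log and logit are the links commonly paired with the Gaussian, Poisson and binomial families.

```python
import math

# Each entry is (link, inverse_link).
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (math.log, math.exp),
    "logit": (
        lambda mu: math.log(mu / (1.0 - mu)),
        lambda eta: 1.0 / (1.0 + math.exp(-eta)),
    ),
}


def predict_mean(link_name, intercept, coefficient, x):
    """Compute the linear predictor, then map it back through the inverse link."""
    eta = intercept + coefficient * x
    _, inverse_link = LINKS[link_name]
    return inverse_link(eta)


# Same linear predictor, three different response scales.
for name in ("identity", "log", "logit"):
    print(name, predict_mean(name, 0.5, 0.25, 2.0))
```

The linear part is identical in all three cases; only the final transformation changes, which is why the link is so closely tied to the family.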

We can also specify regularization.

So, alpha is the balance of L1 and L2 regularization.

If you don't know what L1 and L2 are,

I'm going to leave that as theory that you need to go study yourself.

Please do, because it's very important and it will come up again in Deep Learning.

Lambda is how much regularization we want,

and the H2O GLM implementation comes with a lambda search.

So, we can try and find the optimum value of lambda.

We're going to look at that in a later video.
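
As a hedged sketch of how alpha and lambda fit together, here is the usual elastic net penalty in plain Python. The function name and the example coefficients are made up, and the exact scaling (the one-half on the L2 term) is my assumption about the common convention rather than something stated in this video.

```python
def elastic_net_penalty(coefficients, alpha, lam):
    """lam * (alpha * L1 + (1 - alpha) / 2 * squared L2) over the coefficients."""
    l1 = sum(abs(b) for b in coefficients)
    l2_squared = sum(b * b for b in coefficients)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared)


betas = [3.0, -4.0]  # illustrative coefficients only
print(elastic_net_penalty(betas, alpha=1.0, lam=0.1))  # pure L1 (lasso)
print(elastic_net_penalty(betas, alpha=0.0, lam=0.1))  # pure L2 (ridge)
print(elastic_net_penalty(betas, alpha=0.5, lam=0.1))  # an even mix
```

Alpha only changes the mix between the two terms; lambda scales the whole penalty, and setting it to zero switches regularization off.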

What I want to look at in this video though

is using H2O not for predictive machine learning,

but for exploratory data analysis.

So, I've got this nice data set, or perhaps a not very nice data set;

it's about deaths from lung cancer, and it tries to associate them with smoking.

It actually comes from a 1964 Canadian data set.

First column is their age,

you'll notice this is a factor, a category.

The second column is also a category:

whether they smoke or not. The levels are no,

just cigar and pipe,

cigarettes with cigar and pipe,

or just cigarettes.

Our data set only has 36 rows.

What we have next are counts.

This column, the third column is the total population.

In this case, nonsmokers who are 40-44.

This, I believe, is in hundreds or in thousands.

The next column is the number of deaths from lung cancer in that year, on record.

The fifth column I added; it's basically the deaths column divided by

the population column, scaled to a 0-1,000 range.

I scaled it and made it an integer because of what I want to do with the data next.
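
Here is a small sketch of how a column like that could be derived. The function name, the scale factor of 1,000 and the example numbers are all assumptions for illustration; they are not values taken from the data set.

```python
def scaled_rate(deaths, population, scale=1000):
    """Deaths divided by population, multiplied by `scale`, rounded to an integer."""
    return round(scale * deaths / population)


# Made-up numbers, only to show the arithmetic.
example_deaths = 18
example_population = 656
print(scaled_rate(example_deaths, example_population))
```

Rounding to an integer loses a little precision, but it gives a whole-number column that can be treated as a count.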