0:00

For an example of logistic regression,

we're going to use the urine data set from the boot package in R.

First, we'll need to load the boot package.

And if it is not already installed, you'll have to do that as well.

After loading the package, we can load the data which is called "urine".
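In R, these first steps look like the following (a minimal sketch):

```r
# install.packages("boot")  # uncomment if the package is not yet installed
library("boot")

data("urine")  # load the urine data set
?urine         # help file with variable descriptions and the original source
```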

This data set contains seven characteristics of 79 urine specimens.

The goal of the analysis is to determine

if certain physical characteristics of the urine are related to the formation

of calcium oxalate crystals.

Our seven variables are, first, R, which is an indicator

of the presence of a calcium oxalate crystal, so the value of R will be zero or one,

and our explanatory variables or covariates are the specific gravity,

the pH reading (this is for acidity),

the osmolarity, conductivity,

the urea concentration, and finally, the calcium concentration.

For more information about this data,

see the following source.

Let's see what the data looks like.

We'll take the head of the urine data set.

We can see that calcium oxalate crystals were not observed in the first six specimens,

and we have values for the six explanatory variables.

Notice that we have missing values in this data set.

So, before we conduct analysis, let's remove those missing values.

'dat' is the data set that we're going to use.

We went from 79 rows down to 77 rows.
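In R, those steps look roughly like this (the object name 'dat' comes from the narration):

```r
head(urine)  # first six specimens; note the NA values

dat = na.omit(urine)  # remove rows with missing values
nrow(urine)  # 79
nrow(dat)    # 77
```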

Now, let's look at a pairwise scatterplot for each

of the seven variables using pairs on our data set.
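The pairwise scatterplot is a single call:

```r
pairs(dat)  # pairwise scatterplots for all seven variables
```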

One thing that stands out

is that several of these variables are strongly correlated with one another.

For example, the specific gravity and the osmolarity

appear to have a very close linear relationship, as you can see in this scatterplot.

Collinearity between x variables or explanatory variables in linear regression models

can cause trouble for statistical inference.

Two correlated variables will compete for the ability to predict the response variable,

leading to unstable estimation.

This is not a problem for prediction of the response.

So, if prediction is our only goal, then collinearity is not a problem.

But, if our objective is to discover how the variables relate to the response,

we should avoid collinearity.

Because the primary goal of this analysis

is to determine which variables are related to the presence

of calcium oxalate crystals,

we will have to deal with the collinearity between the predictors.

This problem of determining which variables relate to the response

is often called variable selection.

We have already seen one way to do this.

We could fit several models that include different sets of variables

and see which one of those models has the best deviance information criterion value.

Another way to do this

is to use a linear model where the priors for the linear coefficients

favor values near zero.

Values near zero on the coefficients indicate weak relationships.

This way, the burden of establishing

an association between the variables lies with the data.

If there is not a strong signal, we assume that it does not exist

because the prior favors that scenario.

Rather than tailoring a specific prior for each individual beta coefficient

based on the scale of its covariate X, it is customary

to subtract the mean and divide by the standard deviation for each variable.

It's easy to do this in R.

We can create the design matrix X by scaling the data.

We want to omit the first column

because the first column contains our response variable, so we want the next six columns.

We want to center them, so center=TRUE,

and we also want to scale them, dividing each by its standard deviation, so scale=TRUE.

This only makes sense to do for variables that are continuous.

If we have categorical variables, we should exclude them from this

centering and scaling exercise.

Let's run that line to create X.

Oops, I called it 'data' instead of 'dat'.

Let's rerun that,

and take a look at the first several values.
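The centering and scaling described above can be written as (a sketch):

```r
# Drop column 1 (the response r) and standardize the six covariates
X = scale(dat[,-1], center=TRUE, scale=TRUE)
head(X)
```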

5:59

We can see that the values for these x variables have changed, so that

the column means for X are all close to zero.

For example, -9.8*10^(-15).

That's very close to zero,

and that is the case with all of these column means.

We can also calculate the column standard deviations using the apply function.

We'll apply the sd function to the columns,

which are the second index of the dimensions of X.

The rows are index one and the columns are index two.

So, let's calculate the column standard deviations of X, and they're all one.
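These two checks look like:

```r
colMeans(X)      # column means, all numerically zero (on the order of 1e-15)
apply(X, 2, sd)  # column standard deviations, all equal to 1
```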

Now that the explanatory variables or

covariates are centered and scaled, let's talk about that special prior

which will favor coefficients near zero.

The prior we're going to use

is the double exponential, also referred to as the Laplace prior.

Here we have the standard normal PDF, as a dashed line,

and the PDF of the double exponential distribution, as a solid line.

As the name implies,

the double exponential distribution looks like an exponential distribution

with tails extending in the positive direction

as well as in the negative direction,

whereas the exponential distribution extends only in the positive direction.

We also have a sharp peak at zero.

We can read more about the double exponential distribution in the JAGS documentation

or the JAGS manual under 'distributions'.
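A plot like the one described can be reproduced with a hand-rolled Laplace density (note: JAGS's ddexp is used inside model strings, so the helper function here is our own):

```r
# Density of the double exponential (Laplace) distribution; JAGS's
# ddexp(mu, tau) uses the inverse scale tau, giving variance 2/tau^2
ddexp = function(x, mu=0.0, tau=1.0) {
    0.5 * tau * exp(-tau * abs(x - mu))
}

curve(ddexp(x, mu=0.0, tau=1.0), from=-5.0, to=5.0,
      ylab="density", main="Double exponential (solid) vs. normal (dashed)")
curve(dnorm(x), from=-5.0, to=5.0, lty=2, add=TRUE)  # standard normal, dashed
```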

Next, let's set up and talk about this model in JAGS.

Here is the model string.

It looks pretty similar to the linear regressions we used earlier,

except with some important modifications.

In the likelihood portion of the model, we can see that y_i, the ith observation,

now comes from a Bernoulli distribution 'dbern' rather than a 'dnorm',

where the probability of success for

observing a one for the ith observation is probability sub i or p_i.

Instead of modeling p_i directly,

in logistic regression, we're going to model the logit of p.

The logit of p gets the linear model portion

where we have a parameter for the intercept, and now we have a beta coefficient

for each of the six individual variables.

We'll place a fairly non-informative prior on the intercept.

It'll be a normal prior with variance 25 or standard deviation 5.

That's fairly non-informative in a logistic regression model

where values as high as two or three can get you very close to probability one

and values of negative two or three can get you very close to probability zero.

Next, we will create our prior for the individual beta terms.

Beta j will get a 'ddexp' or a double exponential prior

with location parameter zero, and this value for the inverse scale parameter.

As we've noted, this value for the inverse scale

gives us a double exponential distribution with variance one.

The rest of these lines should look familiar to you.

So, let's go ahead and run them.
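Based on the narration, the model string and the calls to fit it look roughly like this (object names, chain count, and iteration counts are assumptions):

```r
library("rjags")

mod1_string = " model {
    for (i in 1:length(y)) {
        y[i] ~ dbern(p[i])
        logit(p[i]) = int + b[1]*gravity[i] + b[2]*ph[i] + b[3]*osmo[i] +
                      b[4]*cond[i] + b[5]*urea[i] + b[6]*calc[i]
    }
    int ~ dnorm(0.0, 1.0/25.0)        # variance 25: fairly noninformative
    for (j in 1:6) {
        b[j] ~ ddexp(0.0, sqrt(2.0))  # inverse scale sqrt(2) gives variance 1
    }
} "

data_jags = list(y=dat$r,
                 gravity=X[,"gravity"], ph=X[,"ph"], osmo=X[,"osmo"],
                 cond=X[,"cond"], urea=X[,"urea"], calc=X[,"calc"])
params = c("int", "b")

mod1 = jags.model(textConnection(mod1_string), data=data_jags, n.chains=3)
update(mod1, 1e3)  # burn-in
mod1_sim = coda.samples(model=mod1, variable.names=params, n.iter=5e3)
mod1_csim = as.mcmc(do.call(rbind, mod1_sim))  # combine the chains
```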

10:11

We won't do all of the convergence diagnostics right now, but

let's look at least at the trace plots.

Let's run that.

These trace plots look like they might have some high autocorrelation,

but they don't look too bad.

We'll have to click 'return' to get the next plots

for the remaining betas and for the intercept.

These trace plots look OK.

Before we move on, let's also calculate the DIC for this model.
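Assuming the object names from the model fit above, these diagnostics are:

```r
plot(mod1_sim, ask=TRUE)  # trace and density plots; Return pages through them

dic1 = dic.samples(mod1, n.iter=1e3)  # DIC for model 1
```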

Instead of a numerical summary of our posterior distribution,

let's take a look at a visual summary.

Let's create the marginal posterior density plots for our six beta coefficients.

To display these all on the same page, we'll use par(mfrow),

where we want three rows and two columns in our plots.

Now we'll use the densplot function

on our combined simulations mod1_csim and we want the first six columns

which contain the six beta variables.

We also want these to be displayed

on an identical x-axis, so let's give it limits for that x-axis.

Xlim will go from negative three to positive three.

To remind ourselves, which beta goes with which variable,

let's take the column names from X.

So, beta1 is the specific gravity, beta2 is the pH, acidity, and so forth.

Let's generate these plots.
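Those plotting steps, with the combined-chains object name assumed from above, look like:

```r
par(mfrow=c(3,2))  # 3 rows and 2 columns of plots on one page
densplot(mod1_csim[,1:6], xlim=c(-3.0, 3.0))  # betas are the first 6 columns

colnames(X)  # reminder of which beta goes with which variable
```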

The first thing we're looking for in these plots

is evidence that the different coefficients, the betas, are different from zero.

It is clear that the coefficients for gravity, which is beta1, conductivity,

which is beta4 here,

and calcium concentration, which is beta6 here,

all three of these betas are significantly away from zero.

Notice that for the coefficient for osmolarity, beta3,

the posterior distribution looks much like

our double exponential prior, and it is still almost centered on zero.

So, we're going to conclude that osmolarity

is not a strong predictor of calcium oxalate crystals.

The same goes for pH up here.

Urea concentration, or beta5 over here, looks like it's a borderline case.

However, if we refer back to our pairwise scatter plots,

we see that urea concentration is highly correlated with specific gravity,

and so we're going to remove this variable.

That is, we're going to create another model where we keep the coefficients for gravity,

conductivity and calcium concentration only.

Here is that model string for the next model.

We've removed the variables that were not significantly away from zero

and retained the coefficient for gravity, a coefficient for conductivity

and the coefficient for calcium.

We've also switched from the double exponential prior

to a fairly non-informative normal prior on these coefficients.

Let's now run this model.
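A sketch of the second model, reusing the data list and parameter names from the first fit (names assumed):

```r
mod2_string = " model {
    for (i in 1:length(y)) {
        y[i] ~ dbern(p[i])
        logit(p[i]) = int + b[1]*gravity[i] + b[2]*cond[i] + b[3]*calc[i]
    }
    int ~ dnorm(0.0, 1.0/25.0)
    for (j in 1:3) {
        b[j] ~ dnorm(0.0, 1.0/25.0)  # noninformative normal priors this time
    }
} "

# Reusing the full data list triggers the warning about unused variables
mod2 = jags.model(textConnection(mod2_string), data=data_jags, n.chains=3)
update(mod2, 1e3)  # burn-in
mod2_sim = coda.samples(model=mod2, variable.names=params, n.iter=5e3)
mod2_csim = as.mcmc(do.call(rbind, mod2_sim))
dic2 = dic.samples(mod2, n.iter=1e3)  # DIC for model 2
```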

We get a warning message that JAGS did not use the three variables pH,

osmolarity, and urea, which is not surprising to us.

It won't affect our analysis.

I'll skip ahead and run the remaining code for this model.

We need to run the burn-in period,

the posterior samples and the DIC calculation for this model.

I'll leave it up to you to check the convergence diagnostics here.

15:06

Now that we have DIC for both models one and two, let's compare them

by running 'dic1' and 'dic2'.
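Printing the two DIC objects side by side (assuming they were stored under these names):

```r
dic1  # model 1: larger penalty, since it has more parameters
dic2
```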

And we can see immediately that the first model has

more of a penalty, because there are more parameters in that model.

And interestingly,

the first model has the lower DIC of the two, although they're pretty close.

It's also important to note

that we did change the prior between the two models,

and generally we should not use DIC to choose between priors.

So, in this case, we may not be making a fair comparison.

Nevertheless, let's check our results from the second model.

Oops, that was a summary of the model object itself.

What we want is a summary of the model simulations.

There we go.
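That is, we want the summary of the simulation object rather than the model object (name assumed):

```r
summary(mod2_sim)  # posterior summary of the model 2 simulations
```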

So, we can see that with a positive coefficient, the specific gravity

is associated with an increase in the probability of observing crystals.

With a negative coefficient, the conductivity

is associated with a decrease in the probability of observing oxalate crystals,

and with a positive coefficient here,

we conclude that calcium concentration is associated with an increase

in probability of calcium oxalate crystals.

Each of these estimates is several posterior standard deviations

away from zero, so we would say they are significant contributors

to the probability of calcium oxalate crystals.

Are these the same conclusions we reached from the first model?

Let's look at the combined simulation and just look at the point estimates.
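Assuming the combined simulations from model 1 were stored as mod1_csim, the point estimates are:

```r
colMeans(mod1_csim)  # posterior means from model 1, for comparison
```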

Remember, specific gravity was the first coefficient: 1.63, pretty similar here.

The fourth coefficient, conductivity, had a negative effect

which we see in both models, and finally, calcium concentration was the sixth variable.

It was estimated at 1.62.

Here, it was estimated at 1.88.

Although the numbers are a little different,

the two models essentially yield the same conclusions.

There are more modeling options available in this scenario,

perhaps including transformations of variables,

different priors or interactions between the predictors,

but we'll leave it to you to see if you can improve this model.