0:06

Welcome back to course six on combining and analyzing survey data.

In this module, basic estimation will continue with another example, using R, just to show you how to define a more complex sample design than in the last video.

So what we're going to do is use another dataset out of the R PracTools package. This one's called nhis.large; again it's from the US National Health Interview Survey, but it's the full sample from that dataset. I'm not treating this as a population. It's got 21,588 persons, 75 strata, and 2 PSUs per stratum, so a total of 150 PSUs. So you can see that the number of persons for analysis is a lot bigger than the number of PSUs, so there's a substantial amount of clustering here. And, as in the simpler example that we saw earlier, we've got to define a design object for R, so it knows how to handle the data.

So the first thing I do is require the PracTools package, so I can get the data. I require the survey package, so I can analyze it. And then I use the data statement to specify which dataset I'm going to use.

Â 1:34

Now, this is a multi-stage survey, so I've got PSUs, and the first-stage sampling unit field is called psu. So I specify that here in the ids parameter. In the strata parameter, I give it the name of the stratum field, which is stratum; this is a field in the dataset. And then the survey weight, which is called svywt. Now note that R expects these things to be formulas, so you put a tilde in front of the field name.

Â 2:04

You can use more complicated expressions to define these things sometimes, but this is fairly easy with one field. We tell it the data is nhis.large. And then another parameter is nest = TRUE. What that means is the PSUs are not numbered consecutively across the whole dataset; they're renumbered within each stratum: one two, one two, one two, and so forth. If you leave out the nest statement, survey will actually detect the fact that they're not numbered consecutively, and it will suggest you use the nest statement, so you'll hear about it if you don't put it in initially.
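As a sketch, the design object described above can be set up like this, assuming the PracTools and survey packages are installed and the field names psu, stratum, and svywt are as given in the video:

```r
# Requires the PracTools package (for the data) and the survey
# package (for the analysis functions).
library(PracTools)
library(survey)

data(nhis.large)

# ids: first-stage sampling units; strata: stratum identifiers;
# weights: survey weights. nest = TRUE because PSUs are renumbered
# 1, 2 within each stratum rather than across the whole file.
nhis.dsgn <- svydesign(ids     = ~psu,
                       strata  = ~stratum,
                       weights = ~svywt,
                       data    = nhis.large,
                       nest    = TRUE)
summary(nhis.dsgn)
```

Every analysis function in the survey package then takes nhis.dsgn as its design argument.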

Now, what kind of variance estimator are we going to get, given the amount of information we provided? survey is going to use the ultimate cluster variance estimator, which assumes that PSUs are selected with replacement. That's the default. If we were able to specify more detailed information, then the survey package has other variance estimators available. But this is kind of typical in a public use dataset, where the only choice you've got is to use the ultimate cluster, with-replacement variance estimator.

Â 3:29

So we'll do a table of proportions, and the variable I'm going to analyze is something called delayed medical care because of cost. It's an indicator variable: did a person delay getting medical treatment for something in the prior year because it was too expensive, yes or no. So we'll do a table by age. To do that, I use the svyby function.

Â 4:03

I send the first parameter, which is a formula; that's our analysis variable. And to make sure it treats that as a factor, a yes/no variable, I say factor here, and then delay.med in parentheses. So factor is a function that's receiving delay.med. And then the stub of the table is going to be age groups; there's a variable called age.grp in the file, so I use that. The function FUN here is svymean, so I specify that. There are other possibilities, svytotal, for example. You tell it the design object, which I just created. And then, it's critical that you include this na.rm=TRUE, which means if the analysis variable or the stub of the table has missing values, just take those out. Otherwise, you're not going to get a table. Now, the survey package does not tabulate those missings separately. It might be nice if it did, but it doesn't; you'd have to code them as something other than NA in order to get those tabulated. So I save all that in an object called age.mns, mns for means.

And then it turns out the two columns out of this object that I want to look at, for the proportion and the standard error, are the second and the fourth, so I'm extracting those here. And then, just to make my table a little more readable, I specify rownames and colnames for the second and fourth columns of this age.mns object, which is what I extracted. Then I print those out here with the round function, to four decimal places.
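A sketch of that whole sequence, rebuilding the design object so the block stands on its own; the row labels are illustrative, so check the nhis.large codebook for the exact age-group definitions:

```r
library(PracTools)
library(survey)
data(nhis.large)

nhis.dsgn <- svydesign(ids = ~psu, strata = ~stratum, weights = ~svywt,
                       data = nhis.large, nest = TRUE)

# Proportions of factor(delay.med) within each age group; na.rm = TRUE
# drops records where either variable is missing.
age.mns <- svyby(~factor(delay.med), by = ~age.grp, design = nhis.dsgn,
                 FUN = svymean, na.rm = TRUE)

# Columns 2 and 4 hold the estimated proportion of "yes" (delay.med = 1)
# and its standard error; label and round them for readability.
out <- age.mns[, c(2, 4)]
rownames(out) <- c("< 18", "18-24", "25-44", "45-64", "65+")  # illustrative
colnames(out) <- c("Proportion", "SE")
round(out, 4)
```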

So you can see in the proportion column here that the proportions are lower for young people, under 18 years old, and older people, 65 or more, than they are for people in the working years. And the reason for that is that in the US the young and the old tend to have medical insurance at a higher rate than working-age people. So because they've got insurance, they tend not to delay treatment. So here are the standard errors; you can see they're a bit different. And those are the with-replacement standard errors.

Â 6:44

Now, just for comparison, let's think about what would happen if we just ignored the sample design and assumed I had a simple random sample with no weights, where everybody's got a weight of one. So I'll do that by hand, essentially. I'm going to save my output in an object called age.mns.srs, and I'm just using the by function here. So in this expression here, I'm taking the absolute value of nhis.large$delay.med; I've got a dollar sign in there to separate the object name from the column within it. I subtract two and take the absolute value. I did that because delay.med is coded as one or two, for yes or no. So if I subtract two and take the absolute value, I recode it to zero-one, which is easier to deal with.

Now, another parameter in the by function is INDICES. That's just the stub of the table again, so I say age.grp again. The function I'm going to apply is just the simple mean, and I also say take the missing values out. And then, for the standard error, I compute the standard error of a proportion by hand. So this age.mnsB, I haven't shown you the separate line of code, but it's this thing up here: it's built from the recode of delay.med to a zero-one variable. So I'm taking that zero-one proportion times 1 minus that.

Â 8:42

And I'm dividing by a table of the counts in the stub, age.grp. So I should say this age.mnsB is the proportion who delayed medical care, in a table. And then I round that and combine a couple of things; cbind means put two columns together.
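Here's a sketch of that by-hand calculation in base R. PracTools is still needed for the data, the intermediate object is named differently from the video's age.mnsB, and the square root in the standard error is made explicit:

```r
library(PracTools)
data(nhis.large)

# delay.med is coded 1 = yes, 2 = no, so abs(x - 2) recodes it to
# 1 = yes, 0 = no.
delay01 <- abs(nhis.large$delay.med - 2)

# Unweighted proportions by age group, dropping missing values.
age.mns.srs <- as.vector(by(delay01, INDICES = nhis.large$age.grp,
                            FUN = mean, na.rm = TRUE))

# srs standard error of a proportion: sqrt(p * (1 - p) / n), with n
# the count of non-missing cases in each age group.
n.grp  <- as.vector(table(nhis.large$age.grp[!is.na(delay01)]))
se.srs <- sqrt(age.mns.srs * (1 - age.mns.srs) / n.grp)
```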

Â 9:14

So my first column is age.mns.srs, the simple random sample proportions, divided by the complex sample estimates, which I saved in age.mns[, 1]. So that's just taking the ratio of the estimated proportions, so I can see whether using weights made any difference there. And I name that ratio p.hats. Then I do the same thing for the standard errors: here's the standard error for the srs version, which I just computed up here, and here are the standard errors which I extracted from the complex sample estimate object. And I round those to two decimal places, so we don't have as much to look at.
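Putting the two sets of estimates side by side might look like this; a self-contained sketch, so the design object and both sets of estimates are rebuilt here. Column 2 of the full svyby output is the "yes" proportion and column 4 its standard error, matching the extraction described earlier:

```r
library(PracTools)
library(survey)
data(nhis.large)

nhis.dsgn <- svydesign(ids = ~psu, strata = ~stratum, weights = ~svywt,
                       data = nhis.large, nest = TRUE)
age.mns <- svyby(~factor(delay.med), by = ~age.grp, design = nhis.dsgn,
                 FUN = svymean, na.rm = TRUE)

# Unweighted (srs) proportions and their standard errors by hand.
delay01 <- abs(nhis.large$delay.med - 2)        # recode 1/2 to 1/0
p.srs   <- as.vector(by(delay01, nhis.large$age.grp, mean, na.rm = TRUE))
n.grp   <- as.vector(table(nhis.large$age.grp[!is.na(delay01)]))
se.srs  <- sqrt(p.srs * (1 - p.srs) / n.grp)

# Ratios near 1 in p.hats mean the weights barely move the point
# estimates; SE ratios below 1 show the srs standard errors
# understating the complex-design ones.
round(cbind(p.hats = p.srs / age.mns[, 2],
            SEs    = se.srs / age.mns[, 4]), 2)
```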

So here's what I get. The ratios of p.hats in this column are around 1; this one's a bit off, 1.09. But using weights doesn't make a tremendous difference in the point estimates. On the other hand, ignoring the design makes a tremendous difference in the standard errors. The first one, for example: the srs standard error is 70% of the complex sample standard error, and the other three here are also smaller for srs; they're almost the same for the fifth category. Now, that doesn't mean you ought to use the srs estimates because they're more precise. What it means is you're getting a deceptively low estimate of the standard error by ignoring the clustering in this design. So if you were to put confidence intervals on these estimated proportions, you'd have confidence intervals that were much too short. They'd be this wide, but they ought to be that wide. And you'd be fooling yourself that you're getting that much precision from this complicated sample design.

Â 11:30

Now, we can also do a test of independence here, just to show you another analytic technique. You might be interested in whether delayed medical care and age are independent of each other. Now, we saw in the table of proportions that they're pretty different for the young and the old compared to the middle ages, so you'd expect that this test would reject the hypothesis of independence. Now, what happens here is we need to account for the complex sample design. So the function called svychisq, c-h-i-s-q, is the thing you use. And then you specify a table this way: begin a formula with ~ delay.med + age.grp. Here's my design object. There are a couple of choices of statistics; we're going to use one that's specified by "F" in quotes, which is called the Rao-Scott adjusted Pearson chi-square test statistic. This statistic amounts to calculating Pearson's chi-square, which would be appropriate for simple random sampling, but then multiplying it by an adjustment to account for the complex design.
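The call itself is short; a sketch assuming the same design object as before, with statistic = "F" selecting the Rao-Scott adjusted test:

```r
library(PracTools)
library(survey)
data(nhis.large)

nhis.dsgn <- svydesign(ids = ~psu, strata = ~stratum, weights = ~svywt,
                       data = nhis.large, nest = TRUE)

# Rao-Scott adjusted Pearson chi-square test of independence,
# referred to an F distribution with (possibly fractional)
# degrees of freedom.
svychisq(~delay.med + age.grp, design = nhis.dsgn, statistic = "F")
```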

So the function echoes back the way I called it, and here's the output. F is 48.295, numerator degrees of freedom 3.69, denominator 276.89. Now, notice that these are fractional, which is okay; you don't have to have integer degrees of freedom to deal with an F distribution. So we refer that to the F table, and the software does that for us. The p-value's essentially zero, so we reject the hypothesis of independence, pretty handily in this case.
