0:23
We've been covering a variety of sampling techniques and
principles of sample design through our first five units.
But here, what we're going to do is a few extensions and applications.
It's a collection of topics that add on to the things that we've been looking at.
We're not going to introduce anything new in the way of sampling techniques.
But we will introduce new ways of looking at them,
whether it's how to select samples using software,
doing stratified multistage sampling, or weighting in a couple of different forms:
sampling networks specifically, and some weighting procedures that
are sometimes described as multiplicity weighting.
And then finally something on non-probability sampling,
just a brief introduction to the topic.
1:26
How we're going to put that into a statistical system, in this case,
we're going to be using the R statistical system.
Now, if you don't know R, that's okay.
This is merely to illustrate what it looks like, some of the things that you need to
think about as you do sample selection using software.
And then we'll illustrate R on that single frame for four different sample designs.
Simple random sampling, as we've described it, but
I'm going to label that without replacement, simple random sampling with
replacement, systematic sampling and probability proportionate to size.
2:07
So our frame consists of a list of blocks,
these are census blocks, as we have talked about in these materials before.
And for each of the census blocks,
there are almost a thousand of them in our frame.
There's information there about the number of housing units that are there.
How many of those housing units are owned by the occupant?
How many of the housing units are rented by the occupant?
And there's quite a bit of variation in these numbers as you go through them.
So, here's just the first 30 of the blocks in our frame
with the basic information about renting and owning.
Now, you don't need to look at this very carefully and
see it in great detail because this is just to remind us that there is a frame
from which we're going to draw our sample.
In our particular case, there are 975 elements in the population and
we're just going to draw a sample of size 20.
That is, our sampling rate will be 1 in every 48 or 49 units.
Now, I put the full sampling rate here by taking 20 divided by 975 and
converting that into a fraction that has 1 in the numerator.
That is, the numerator, 20, is divided by 20, and the denominator,
975, is divided by 20, giving 48.75.
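The arithmetic behind that sampling rate can be checked in a couple of lines of R. This is only a small sketch; the object names N, n, f, and interval are chosen here for illustration and do not come from the lecture's code.

```r
N <- 975            # number of blocks in the frame
n <- 20             # sample size

f <- n / N          # sampling fraction, 20/975
interval <- N / n   # the "1 in every ..." rate, i.e. 1/f

f                   # about 0.0205
interval            # 48.75, so roughly 1 in every 48 or 49 blocks
```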
And we're going to use that same sample size applied to the same frame population
size, it was 975, that same sampling fraction, and do four different designs.
We will not cover, for example, stratified sampling and
some other designs that we've discussed before, but just some of these basic
ones to illustrate what happens when we do this with software.
3:39
Now, with this particular software system,
there are some features that are needed to get the data ready for sample selection.
We need to bring the data into the system.
In the R system, we need to tell the system where the data are located,
as with many systems.
In this particular case, we are going to set the directory.
Set working directory, setwd, S-E-T-W-D.
Here, I've just made up a particular location on my machine where
the sampling methods folder contains the frame.
The second step then is to open the data file.
In this particular programming language, there is a command,
read the data, in this case, read a table,
that takes the data through that function and puts it into what's called an object.
Now, in this case, an object is just our frame, and so
you'll see in our statement there.
What we're going to be doing is into the object frame,
putting in our data through the read.table function.
And that read.table function specifies the file that we're going to look at for
our data, that's the one that contains the 975 cases and three variables.
There is a header there that will be the names of the variables we're going to use.
And the separation between the different columns is a tab character.
But this is just an example of reading the data into such a system;
whatever system you're using will have similar kinds of commands and operations.
And then once we've got it in,
it's always important to do this, view it, look at it, print it out.
In our particular case, we're going to edit the frame, that object frame
just to make sure everything's in there, all 975 cases, the three variables,
nothing got corrupted, nothing was changed in an unexpected way.
So inspection, checking our work.
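As a rough sketch of those steps, reading a tab-separated frame with a header row and then inspecting it might look like the following. The real file is not available here, so the sketch first writes a made-up stand-in to a temporary folder. The file name frame.txt and the column names seq_no and renter_hu are assumptions for illustration, and the counts are random, not the lecture's data.

```r
# Hypothetical stand-in for the lecture's frame file: 975 blocks, three variables.
set.seed(1)
toy <- data.frame(
  seq_no    = 1:975,
  renter_hu = rpois(975, 10),   # made-up renter-occupied counts
  owner_hu  = rpois(975, 15)    # made-up owner-occupied counts
)
path <- file.path(tempdir(), "frame.txt")
write.table(toy, path, sep = "\t", row.names = FALSE, quote = FALSE)

# With a real file, setwd() would first point R at the folder that holds it.
# Read the tab-separated file, using the first row as variable names.
frame <- read.table(path, header = TRUE, sep = "\t")

# Inspect before sampling: dimensions, variable types, first few records.
dim(frame)
str(frame)
head(frame)
```

Checking dim(frame) against the expected 975 rows and 3 columns is a quick way to confirm that nothing was lost or shifted on the way in.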
5:29
Now we're ready to do a sample selection.
Here's the process that illustrates the outcomes in this particular case in which
we have listed the frame and are inspecting it.
Now, in our particular frame there's actually three variables there, a sequence
number, the number of renter occupied, the number of owner occupied dwellings.
And then there's also, in the program, a numbering of each record in the file.
All right, with this particular system, the R system,
there are a series of packages.
Not everything is available at one time.
We need to load information, load programs,
load particular commands for particular tasks.
And so, with this particular system,
there is a set of packages that are loaded through a library system.
The library in this particular case is a very nice package, called sampling, that has
a wide variety of sampling techniques built into it, so
we're calling that library.
We've already actually loaded that package and
are now calling on the system to recognize that that
library is something that it needs to have access to and ready to operate on.
And now we're ready to do our sample selection.
And the first will be simple random sampling.
And we've added the specification that this be without replacement.
This is because, in this particular package, simple random samples can be both
without replacement and with replacement, something we haven't talked very
much about with replacement sampling, they make the distinction.
In the definitions that we've done so far,
we didn't make much of a distinction there, but that's how they do it.
And the particular command that's used here, taken
from the package that's already been specified, is srswor,
simple random sampling without replacement.
That will now be recognized from that library as being a command.
And we're telling it that we have a sample of size 20 from a population of size 975.
Now, curiously, we haven't referenced the data.
Does it know automatically to do that?
No, actually, what it's doing now is building a new object, and
you'll notice that the output of this command
is being put into a new object called sam.srswor:
the sample for a simple random sample without replacement from a population
of 975, a sample of size 20.
So sam.srswor,
the sample, is just a series of indicators of which cases are selected.
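To make the idea of an indicator object concrete, here is a minimal sketch. The lecture uses srswor from the sampling package; so that the sketch runs without that package, it builds the same kind of 0/1 vector in base R, with the package call shown only as a comment.

```r
set.seed(2)
N <- 975   # population (frame) size
n <- 20    # sample size

# With the sampling package loaded, the lecture's call would be:
#   sam.srswor <- srswor(n, N)
# Base R stand-in that produces the same kind of object:
sam.srswor <- integer(N)              # start with all zeros
sam.srswor[sample.int(N, n)] <- 1L    # flag n distinct positions with a 1

length(sam.srswor)   # one indicator per frame record
sum(sam.srswor)      # number of selected cases
table(sam.srswor)    # counts of 0s and 1s
```

Note that nothing here touches the frame itself. The vector only records positions, which is why a separate extraction step is needed.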
8:02
Then that needs to be applied to our data.
And there are many ways that this could be done in R, several ways anyway.
But what we now need is to take our frame and convert it into a sample.
In particular, we're going to take the frame and
we're going to apply a function called which to it that says,
look in that file that said what sample we have, the sam.srswor.
And look at it case by case.
And any time that a case is a 1,
what we want to do is find the corresponding element in frame.
Now what it's doing is basically aligning the two files record by record, 975 long.
And any time it finds in sam.srswor
a case where the value is 1 rather than 0,
it will identify the particular case in frame and write that into our sample.
It's something you need to understand about the R language in order to do it.
But basically what's being done is, draw the sample,
identify which cases are in the sample, then go into the frame and
extract them, and this is the extraction statement.
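The extraction step can be sketched as follows. The frame here is a made-up stand-in with random counts, the indicator vector is built in base R rather than with the package, and the object name sample.srswor and the column names are assumptions for illustration.

```r
set.seed(3)
N <- 975
n <- 20

# Made-up frame standing in for the real one.
frame <- data.frame(
  seq_no    = 1:N,
  renter_hu = rpois(N, 10),
  owner_hu  = rpois(N, 15)
)

# 0/1 selection indicators, aligned record by record with the frame.
sam.srswor <- integer(N)
sam.srswor[sample.int(N, n)] <- 1L

# which() returns the positions where the indicator is 1;
# indexing the frame by those positions extracts the sampled records.
sample.srswor <- frame[which(sam.srswor == 1), ]

sample.srswor   # the 20 selected blocks, in frame order
```

Because which() walks the indicator vector from first to last, the extracted records come out in the original frame order, not in the order they were drawn.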
So how do we see it?
Well, we're going to list two things here.
First, I'm going to list the actual sample, so
here is sam.srswor.
And you see it's just a file that has a series of 1s and 0s.
Now, these are laid out here in rows of 37 elements.
9:31
I should note, that just happens to be how much
would fit on this particular screen.
So it's in groups of 37 as we go along.
So starting with the first, and then the 38th, and then the 75th, and so on.
And embedded in this string of 1s and 0s are sample cases, and you can see them.
Now these are the ones that were actually selected,
not in the order that they were selected, just the ones that were selected.
So that what we do is take this file and
using the which function applying it to frame, identify the cases.
Well, how do we know what the cases are?
Well, we've printed out the sample.srs without replacement.
Here are the selected cases.
It's already extracted the 20 cases from the file;
here they are, this happens to be in order by the original ordering of the frame.
10:19
All right, so simple random sampling without replacement,
we can also do with replacement.
Very similarly by using a slight modification,
instead of doing srswor, we do srswr.
Same specification for the population size and the sample size, and that's
placed into our object sam.srs, in this case, wr instead of wor.
And then we apply the which function, if you will,
to the object frame, telling it that whenever sam.srswr is equal to 1 or,
rather, greater than or equal to 1, we're going to select it.
Now that greater than or equal to 1 is important because what happens when we
select with replacement is that a case can be selected more than once.
And so we don't just have an indicator, zero one, but
we can have an indicator that is zero, not selected at all, one,
selected once, two, selected twice, three, selected three times and so on.
And so anytime that indicator is greater than or equal to one, or greater than
zero, we could have specified it that way, we're going to select the case.
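A short sketch of why the condition changes for with replacement selection. The lecture uses srswr from the sampling package; the base R stand-in below produces the same kind of object, a vector of selection counts, and the name draws is introduced here only for illustration.

```r
set.seed(4)
N <- 975
n <- 20

# With the sampling package loaded, the lecture's call would be:
#   sam.srswr <- srswr(n, N)
# Base R stand-in: draw n positions with replacement, then count the hits.
draws <- sample.int(N, n, replace = TRUE)
sam.srswr <- tabulate(draws, nbins = N)   # 0, 1, 2, ... per frame record

table(sam.srswr)                   # how many records were hit 0, 1, 2, ... times
selected <- which(sam.srswr >= 1)  # same as which(sam.srswr > 0)
length(selected)                   # distinct cases; can be fewer than n
```

Testing for equality with 1 would silently drop any case selected two or more times, which is why the comparison is written as greater than or equal to 1.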
So, here's again the sample, and as we inspect the sample, yeah,
the sam.srswr, that object where we put the sample.
We can see that indeed in our particular case there's a different sample that's
been selected across these cases,
this is only the first 75 to 100 of these in this particular layout;
the rest of the 975 are there, I just didn't print all of them out.
But now we can see in red the selections, the very first case was selected but
just one time.
There's another case selected in that first row once, but
then there's also a case that's been selected twice.
So that means that in the end, if this was the only case
that was selected twice, we would have 19 distinct cases in our file:
the 18 cases selected once, and the 1 case selected twice.
Now, that's going to pose a little problem for
us if we want to keep track of that, but basically, there's our sample.
Our sample, simple random sampling with replacement.
But if we wanted to, we would need to merge in that factor, and
I just called it duplication.
We won't go through the code to do this, but
we're going to merge in that duplication factor as well, so
that we know which case was selected more than once.
In this case, it's the third selection in the file.
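The lecture does not show the code for carrying the duplication factor along, so the following is only one possible sketch, not the lecturer's method. It uses a made-up frame and a base R stand-in for the selection counts; the column name duplication follows the lecture's label, and everything else is assumed.

```r
set.seed(5)
N <- 975
n <- 20

# Made-up frame standing in for the real one.
frame <- data.frame(seq_no = 1:N, owner_hu = rpois(N, 15))

# Selection counts from with replacement sampling (base R stand-in).
sam.srswr <- tabulate(sample.int(N, n, replace = TRUE), nbins = N)

# Extract each selected case once, then attach how many times it was drawn.
keep <- which(sam.srswr >= 1)
sample.srswr <- frame[keep, ]
sample.srswr$duplication <- sam.srswr[keep]

sample.srswr                    # one row per distinct case
sum(sample.srswr$duplication)   # adds back up to the n draws
```

Keeping the count as a column, rather than repeating rows, preserves the information needed later while leaving one record per distinct block.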
12:43
So, there we have it, we have our simple random sample without replacement,
one time for each case, that's what without replacement selection is all about.
And then with replacement where we can get duplicates in the selection as well.
Well, you can now see there's a pattern to this.
We're going to have the same kind of thing with this sampling package.
In your particular package,
you will have a different way of implementing these things.
But when we do something, say for example, like systematic sampling.
Well, the system is set up to do the selection for us and make it as easy as possible.
In our particular case, we're now going to do a systematic selection.
And the first thing that we're going to do is give information to
the system that tells it how to calculate an interval.
13:27
We're going to replicate, or repeat, a value
in this particular case, in our prob.sys object,
that is based on our sampling fraction,
20/975, once for each of the 975 elements in the frame.
Now it's not looking at the frame, it's just generating this particular object.
You may wonder why they don't say capital N here and lowercase n, but
it's a different language, they're set up in different ways, these are packages
written by individuals that are then assembled together and made available.
So you gotta read your documentation carefully to be able
to use these kinds of things.
So, in this particular case then, we've created an object, prob.sys,
that contains the basic information about our sample.
But we need to pick the selection, and
here there is something called UPsystematic.
And we won't go through the details of what this is, but
we're going to use that particular function, UPsystematic,
with probabilities, the pik values
inside prob.sys, to make our selection systematically, and
then put that into a sam.sys object.
A little more complicated, right?
It's a little harder to do, unless you're used to working in this particular system
and know how to implement this.
So you're going to have to read, as I said, documentation carefully.
But then we go back to the same which operation: we have a frame,
then we're going to use sam.sys equal to 1 indicating a systematic selection.
And then we've listed out the cases in here, I've actually highlighted in
the first about 100 cases, two of the selections that are there.
Note that our interval is 48.75, and
that there are gaps of 48 and 49 with this kind of a system,
if you recall our systematic sampling with the fractional interval.
And indeed the gap there between those two is 48, the next gap might be 49 and so on.
It will depend on where the random start occurred and the nature of the interval.
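To see where the gaps of 48 and 49 come from, here is a sketch of systematic selection with the fractional interval 48.75. The lecture uses UPsystematic from the sampling package; the base R version below is a stand-in written for illustration, and the names k, start, and picks are not from the lecture.

```r
set.seed(6)
N <- 975
n <- 20

# Equal inclusion probabilities, one per frame record.
prob.sys <- rep(n / N, N)
# With the sampling package loaded, the lecture's call would be:
#   sam.sys <- UPsystematic(prob.sys)

# Base R stand-in using the fractional interval directly.
k <- N / n                       # 48.75
start <- runif(1, 0, k)          # random start somewhere in the first interval
picks <- floor(start + k * (0:(n - 1))) + 1   # add the interval, drop the fraction

sam.sys <- integer(N)
sam.sys[picks] <- 1L

which(sam.sys == 1)         # selected positions
diff(which(sam.sys == 1))   # gaps between selections: a mix of 48s and 49s
```

The interval is added with its fractional part each time and only the result is truncated, which is what produces the alternating gaps.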
15:33
Okay, so again, it uses that same system,
generate the sample, put it into a file that has, or
an object that has as many elements as our population, then align the two,
our frame and that sample that has indicators about which cases are there.
Use the which function, applying it to the frame to grab information from the sample,
the file of sample indicators to select cases out of our frame.
16:10
We can even get more sophisticated than this when we do
probability proportionate to size.
Here we're going to be doing selection by probabilities,
in which we have a couple of variables that are created in this particular case,
and we are going to apply this to our frame, to our variable owner_hu.
But there's a problem: we get a warning message that comes up, and
it says that some of these inclusion probabilities are 0.
Well, that's because there are certain blocks that don't have any owner occupied
housing units on them.
16:46
And so, it's just giving us a warning saying,
are you aware that this is the case?
Now, we know that our Probability Proportional to Size Selection system
can operate in this framework, but it's just reminding us that this is going on.
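A small sketch of where the zero inclusion probabilities come from. The sampling package has a function, inclusionprobabilities, that computes probabilities proportional to a size variable and warns when the size variable contains zeros. The base R formula below is a simplified stand-in that assumes no block is large enough to push its probability above 1, and the owner_hu values are made up.

```r
set.seed(7)
N <- 975
n <- 20

# Made-up size measure: owner-occupied units per block,
# with some blocks forced to zero to mimic the real frame.
owner_hu <- rpois(N, 15)
owner_hu[sample.int(N, 40)] <- 0

# With the sampling package loaded, a comparable call would be:
#   pik <- inclusionprobabilities(owner_hu, n)   # warns about the zeros
# Simplified base R stand-in (valid only while every value stays below 1):
pik <- n * owner_hu / sum(owner_hu)

sum(pik)        # adds up to the sample size n
sum(pik == 0)   # blocks that can never be selected
max(pik)        # largest inclusion probability
```

A block with no owner-occupied units gets probability zero, so it simply cannot enter the sample. That is fine when owner-occupied units are the target, which is why the message is a warning and not an error.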
But, again, there is a system.
In this case, a UPbrewer,
17:06
a probability system named after a statistician named Brewer,
that does probability proportionate to size sampling.
It uses that information that we've set up already to create a sample indicator file,
and then we use our which function on the frame in order to get our sample.
And so there's nothing new here,
it just keeps extending it, the same basic operation.
Generate a sample, match it if you will to the frame, and
then generate our selections as we've done before.
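The same generate, match, extract pattern can be sketched for probability proportionate to size. The lecture uses UPbrewer from the sampling package. Brewer's method is not reproduced here; the sketch swaps in a simpler systematic PPS draw written in base R, on a made-up frame, just to show the pattern. The object names sam.pps and sample.pps are assumptions.

```r
set.seed(8)
N <- 975
n <- 20

# Made-up frame; some blocks have no owner-occupied units.
frame <- data.frame(seq_no = 1:N, owner_hu = rpois(N, 15))
frame$owner_hu[sample.int(N, 40)] <- 0

# Inclusion probabilities proportional to owner_hu (all below 1 here).
pik <- n * frame$owner_hu / sum(frame$owner_hu)

# With the sampling package loaded, the lecture's call would be:
#   sam.pps <- UPbrewer(pik)
# Stand-in: systematic PPS. Lay the probabilities end to end, take a random
# start in (0, 1), and step by 1; each point lands in one block's segment.
cum <- cumsum(pik)
points <- runif(1) + 0:(n - 1)
picks <- pmin(findInterval(points, cum) + 1, N)  # pmin guards against rounding overshoot

# Generate the indicator vector, match it to the frame, extract.
sam.pps <- integer(N)
sam.pps[picks] <- 1L
sample.pps <- frame[which(sam.pps == 1), ]

sample.pps
```

Blocks with zero owner-occupied units occupy no length on the cumulative scale, so no selection point can land on them.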