0:00
Welcome back,
we're talking today about web or online data collection.
So we're transitioning from the kind of automated self-administration we've been speaking about,
which occurs mostly in the context of face-to-face interviews.
Now we're talking about a situation in which respondents are self-administering
a questionnaire without any interviewer present at all.
So we'll talk first about the impact
of web surveys on different types of survey error.
The standard types we've been discussing: non-response, sampling, coverage, and
measurement error.
And we'll talk about a number of different types of web surveys,
because there are quite a few varieties in use today,
and they have different consequences for error.
We will subsequently talk about the effects of interactivity
in web surveys on data quality. Interactivity is something that really
isn't available in other types of automated self-administration, or
other types of self-administration, in particular paper.
1:16
So let's turn first to coverage error in web surveys.
So, you remember from our earlier discussion that coverage
error really is a function of two things: the rate of non-coverage and
the difference between those who are covered by the survey,
that is, are in the sampling frame, and those who are not.
You have to put these two measures together to
determine how much coverage error there is.
So, suppose there is a big difference between those who are covered and those who
are not. Of course, you can't directly know the attributes of those who are not covered
(they're not in your frame, and you'll never collect data from them), but
suppose that in principle, or through some means outside the survey, you're able
to determine that there's a large difference between those who are in
the frame, that is, covered, and those who are not.
2:04
But if almost everybody in the population is in the frame,
that is, if there's a very small rate of non-coverage,
then that big difference won't affect the estimates nearly as much as if
there were more undercoverage.
That is, if more people in the population were not contained in the sampling frame.
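That relationship can be written as a simple product: the bias of an estimate based only on covered cases is roughly the non-coverage rate times the difference between the covered and non-covered means. A minimal sketch, with made-up numbers rather than anything from the lecture:

```python
def coverage_bias(noncoverage_rate, mean_covered, mean_not_covered):
    """Approximate coverage bias of the covered-cases mean:
    (proportion of the population not covered) * (covered mean - not-covered mean)."""
    return noncoverage_rate * (mean_covered - mean_not_covered)

# A big 10-point difference, but only 2% non-coverage: small bias.
print(round(coverage_bias(0.02, 60.0, 50.0), 4))

# The same 10-point difference with 30% non-coverage: much larger bias.
print(round(coverage_bias(0.30, 60.0, 50.0), 4))
```

The same covered-versus-uncovered difference matters far more when the non-coverage rate is large, which is exactly the point above.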
So coverage error is actually a very serious concern with web surveys, because a substantial portion of the population is not online.
2:36
So, someone without access to the internet simply will not provide data, because they
can never be invited to take part in a survey.
At least not in the sort of survey that provides the kinds of estimates that we
rely on the most.
We will talk about some other varieties of web surveys where this is perhaps less of
an issue, because the type of web survey doesn't require
3:00
an explicit invitation. So when it comes to coverage, the big question is who is
online, because if they're not online, they cannot provide data in a web survey.
Who's not online?
There is very good data from Pew Research.
This figure illustrates that there are high rates of not being online, and
therefore high rates of undercoverage in any kind of frame based on internet
users, for older people, for people with lower incomes,
for people with only a high school education,
and for people who live in rural parts of the US.
3:42
So let's say that you want to measure
attitudes toward some agricultural issue,
say farm subsidies, using a web survey.
The fact that people who live in rural areas are underrepresented online
suggests that you might be inaccurately measuring attitudes toward agriculture
where they matter most, that is, in rural areas.
4:09
This undercoverage of people living in rural areas
is a very serious concern if one wants to generalize to
the population, whatever that is, because not all members of
most populations have equal access to the internet.
4:25
Some reasons that members of the public are not online, again
from Pew Research: the most frequent explanations are that it's
not relevant, they're not interested, it's a waste of time, they're too busy,
or they don't need or don't want to be online; or that it's hard to use,
the usability issue:
it's difficult, it's frustrating, they don't know how to do it.
Or they have a disability that prevents them from being online, or
they're worried about viruses or security issues, hackers, and so on.
There are other reasons as well, but the point is that not everybody is online, and a web
survey that is designed to generalize to the population has to address this issue.
5:10
Sampling for web surveys poses a number of interesting problems.
There are a number of different approaches to sampling for web surveys.
One is not to sample at all.
We'll look at a number of examples of types of web surveys that really don't do
any sampling.
There are list-based samples, in which all members of
a population are enumerated with contact information.
There are panels, which are pre-assembled groups, large lists
of sometimes millions of participants; these come in two varieties,
what are known as non-probability panels and probability panels.
We'll go into them in some detail later.
And there are transaction- or intercept-based methods:
for example, every Nth user at a website might be invited to take part in a survey.
There is no frame of general internet users, and no frame with email
contact addresses, which means that if members of the public are going to be
invited to take part in a web survey, it's hard to do this using email.
But it's desirable to use email, because it's possible to just include a link in
an email message, which really reduces the obstacles to starting the survey.
So instead one can sample from a frame of mobile phone numbers and
possibly contact members of the public by texting them a link.
This is complicated;
it's more common still to send a paper invitation through the mail
that contains a web address, a URL, and the sample member
then needs to manually enter that URL, which can be an obstacle.
So list-based samples are desirable, preferable.
These resemble the sampling frames in other modes.
We talked about telephone frames earlier, and they are list-based;
for example, you can randomly generate a list of phone numbers
using random digit dialing techniques. So there are some
list-based samples for some populations, though not national populations.
An organization that has membership lists would be an example of this;
students at a university are a good example, or employees in a company.
All have an email address and can be contacted that way, or
they may have other contact information in the frame, like a mobile phone number.
7:37
And so, they're invited and given a link.
The limiting factor, the key issue, is the quality of the list:
is it complete and up to date?
If it's a student frame at an undergraduate institution and
it's a year out of date, roughly 25% of the entries will no longer be current.
8:00
Online panels are an alternative to list-based samples, and
probably more commonly used.
One definition is a pool of pre-recruited people
who have registered to occasionally take part in web surveys.
This comes from Anja Göritz.
This is the same as what's known as an online access panel or access pool.
You shouldn't confuse it with a longitudinal panel.
8:37
In a longitudinal panel, the same people are asked the same questions over time repeatedly.
That's not the same as an online panel, in which members
are invited to take part in generally one-off or cross-sectional surveys.
They may receive many invitations a day, or on a weekly basis,
but they're not necessarily in this for repeated measurement the way
members of a longitudinal panel are.
9:06
So, as I said, the members of an access panel or an online panel may receive
frequent invitations to take part in questionnaires on various topics.
And they pick and choose, and often response rates are quite low.
9:24
The incentives that panel companies have
found useful are bonus points, or entrance into a lottery,
or money transferred online by a service like PayPal.
The recruitment for what are called opt-in panels is often done online.
These are volunteers who may see a banner advertisement on a website and
click on it.
And they're brought into the online panel that way,
they're asked for contact information.
This is quite different from recruitment for probability panels, which
are designed to support generalization to, generally, the national population.
The recruitment for probability panels resembles recruitment
for representative samples in other modes.
In fact, it uses techniques from other modes; that is, it's done offline,
either by random digit dialing, in which telephone numbers are randomly generated
and called, and the households or individuals associated with those phone numbers
are invited to join the panel, or by some sort of address-based
sampling method in which the recruitment is done face to face.
As I said, this is very much how recruitment is done in other modes.
It's done this way because there is no frame of internet users particularly with
e-mail addresses.
But the result can be a panel of
10:56
participants who very much resemble the national population.
So just as an example, there's a lot of text here, I won't go through it all, but
you can compare the Harris Poll, which is a non-probability or
opt-in panel, to GfK's KnowledgePanel, which is a probability panel.
The recruitment is done differently: in the case of the Harris Poll,
it's a mix of online and offline sources,
including clicking links encountered online.
They use a telephone survey to calibrate the results
of the online survey,
a small telephone survey which is believed to be representative, and
this is used to adjust the estimates derived from the online data.
11:53
Recruitment for GfK's KnowledgePanel, by contrast, begins with a combination of address-based sampling and random digit dialing.
Sample members, that is, members of the public who are contacted through those methods
but who are not online, are provided with an internet-enabled device.
Originally, many years ago, it was a WebTV; more commonly now
it's a tablet of some kind, along with internet service.
And this solves, to a large extent,
the problem of members of the public who are not online.
It is quite expensive, and so it makes the data collection process much more
expensive, potentially closer to telephone interviews in cost.
Response rates are generally better, but response rates are complicated,
because there's a series of opportunities for someone who is being recruited into
a probability panel to not respond to an invitation.
So we'll look at that shortly.
And the estimates of the population
make use of more conventional weighting and post-stratification.
There's no calibration the way there can be with a non-probability panel.
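As an illustration of that kind of conventional weighting, here is a minimal post-stratification sketch; the age strata, population shares, and 0/1 answers are invented for the example, not taken from any actual panel:

```python
from collections import Counter

# Assumed population shares for each stratum (e.g., from a census); illustrative only.
population_shares = {"18-34": 0.30, "35-64": 0.50, "65+": 0.20}

# One record per respondent: (age stratum, answer coded 1 = yes, 0 = no).
sample = [("18-34", 1), ("18-34", 1), ("35-64", 0), ("35-64", 1),
          ("35-64", 0), ("65+", 0), ("65+", 1), ("65+", 1)]

n = len(sample)
counts = Counter(stratum for stratum, _ in sample)

# Post-stratification weight: population share / sample share of the stratum.
weights = {s: population_shares[s] / (counts[s] / n) for s in counts}

# The weighted estimate re-balances each stratum toward its population share.
weighted_mean = sum(weights[s] * y for s, y in sample) / n
unweighted_mean = sum(y for _, y in sample) / n
```

Here the 65+ stratum is over-represented in the sample, so post-stratification down-weights it and the weighted estimate differs from the simple sample mean.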
So how accurate are the estimates from the different types of online surveys?
One comparison was reported by Yeager and his colleagues.
They compared results from the same questionnaire
13:17
administered on the telephone, to a sample recruited through
random digit dialing methods; to a sample from the KnowledgePanel,
the GfK KnowledgePanel (that's what the bar labeled "Internet" refers to);
and to six non-probability panels, access panels of the sort we've been describing.
Yeager and colleagues were able to derive a measure of error
based on comparisons to the telephone survey, treating it as a kind of gold
standard, and to administrative records, which served as benchmarks.
As you can see, the probability panel's error rate is very
close to that of the telephone survey, and
both are lower than the error rates of all of the non-probability panels.
So it does look, from this comparison anyway, like the non-probability
samples are associated with higher error rates
than the probability-based sample.
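The error measure in comparisons like this is typically an average absolute deviation of a survey's estimates from external benchmark values. A hedged sketch with invented estimates and benchmarks, not the actual figures from the Yeager et al. study:

```python
# Benchmark values from administrative records (invented for illustration).
benchmarks = {"smokes": 0.21, "has_passport": 0.30, "owns_home": 0.66}

# Estimates of the same quantities from one survey (also invented).
survey_estimates = {"smokes": 0.18, "has_passport": 0.35, "owns_home": 0.64}

# Mean absolute error across the benchmarked items.
mae = sum(abs(survey_estimates[k] - benchmarks[k]) for k in benchmarks) / len(benchmarks)
```

Computing this for each panel gives a single accuracy score per data source, which is the kind of number being compared across the bars in the figure.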
So those are panels.
There are a number of other approaches that organizations and
individuals use to collect data online.
For example, what are called entertainment polls, like question-of-the-day polls.
You can see in this example that there's a button labeled Quick Vote, and
the idea is that if a user happens to be at this media site,
they might want to click on this button and
14:51
vote, which in this case generally means answering some poll questions.
There is really no way to define the population of participants, and
therefore it is really hard to generalize from these data; when they're used for
entertainment purposes, there's really no need to generalize to the population.
But these kinds of unrestricted, self-selected surveys are problematic for
those reasons.
Nothing is known about the participants.
15:34
We don't know anything about the population, so we really can't generalize.
It's possible for
the same user, the same participant, to provide data multiple times.
And there's really no sampling.
Earlier we said that one approach to sampling is to do no sampling at all, and
this is really what we're talking about. Here is another example,
an ad taken from the screen of a smartphone:
"Take our survey on Brian Williams," etc.
Anybody can do this; there's really no information about who they are, and
that makes it very hard to
make any broad statements about larger groups of people.
16:17
Finally, intercept surveys are quite common online.
These are probability-based in the sense that,
for example, every Nth user at a website might be invited.
But again, nothing is known about the individuals,
in contrast to, say, a list sample or a panel.
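The "every Nth user" rule is ordinary systematic sampling applied to a stream of visitors. A minimal sketch, with an arbitrary interval and visitor count:

```python
import random

def should_invite(visitor_index, interval, start):
    """Systematic 'every Nth visitor' intercept rule with a random starting point."""
    return visitor_index % interval == start

interval = 50                       # invite every 50th visitor
start = random.randrange(interval)  # random start so position 0 isn't special
invited = [i for i in range(1000) if should_invite(i, interval, start)]

# 1000 visitors at an interval of 50 yields exactly 20 invitations.
```

The rule gives each visitor a known probability of invitation (1 in 50 here), which is what makes intercept designs probability-based even though nothing else is known about who the visitors are.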
16:46
So that's an introduction to coverage and
sampling concerns and error in online surveys.
What we'll turn to next is non-response and
measurement error, the other two error sources we've been concerned with.
And when we talk about measurement error, we'll really be talking about visual
aspects of the design that can affect the amount of measurement error.