After you've collected all your data, what do you do with it? The main technique
you'll learn in this video is how to compare rates. Rate comparison is my pick
as a first statistical test to learn in HCI because it's easy to do and relevant
to many real-world activities, like comparing click-through rates on websites.
Before we get to the details, let's begin at a higher level. Here are three really
important questions that you can ask and answer by analyzing your data. For
starters, what does my data look like? To do this, explore your data graphically.
Plot all of your data. And then, once you see some patterns that may be interesting,
try to look at aggregate summaries of different sorts. Second, after you've
looked at your data graphically, what are the overall numbers? These aggregate
statistics help give you a quick summary of what you've seen in your experiment.
Simple things to look at initially are the mean (the average) and the standard
deviation (how much variation there was within each condition). And third, are the
differences that you see in, say, the means real? Separating real differences
from mirage differences is the goal of statistical testing, and you'll learn a
technique for doing that in this lecture.
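To make those first two questions concrete, here's a minimal sketch in Python; the data and condition names are hypothetical, invented just for illustration, and it assumes numpy and matplotlib are available.

    import numpy as np
    import matplotlib.pyplot as plt

    # Hypothetical task-completion times (seconds) for two conditions.
    condition_a = np.array([12.1, 9.8, 11.4, 13.0, 10.2, 12.7])
    condition_b = np.array([10.3, 8.9, 9.5, 11.1, 9.0, 10.8])

    # Question 1: what does my data look like? Plot all of it.
    plt.boxplot([condition_a, condition_b])
    plt.xticks([1, 2], ["Condition A", "Condition B"])
    plt.ylabel("Task time (s)")
    plt.show()

    # Question 2: what are the overall numbers?
    for name, data in [("A", condition_a), ("B", condition_b)]:
        print(name, "mean:", data.mean(), "sd:", data.std(ddof=1))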
Say, for example, I have a coin, and we'd like to know whether it's loaded; that
is, is it equally likely to produce heads and tails? So I have a coin right here,
and I can toss it a whole bunch of times: tails, heads, and I can keep going.
Let's say I toss it twenty times, and thirteen of those
twenty tosses turn up heads. What would we expect to get? Well, if we have an even
coin that is equally likely to produce heads and tails, our expected value for
twenty tosses would be ten heads. Is three extra heads out of twenty a significant
difference? That is, is it weird enough that we'd be pretty confident saying that
our coin is biased and not an even coin? Well, to figure this out,
we are going to use a test statistic. And what attributes does our test statistic
need? Well, one thing that we'll want to encode is the difference from the expected
value: the fact that we got three extra heads here. And we want to account for how
many tosses that difference is out of, in a couple of ways. In a ratio sense:
three extra heads is less material if it's out of a thousand. And in a
number-of-trials sense: if we had a thousand trials and saw a 30 percent rise in
the number of heads, that's more unusual than a 30 percent rise over just a couple
of trials, because those are only a few tosses. And so in this lecture, we are
going to use
Pearson's Chi-Squared Test. This is a fancy name for the totally standard test for
comparing observed rates to expected rates. And it's going to use exactly the
attributes that we decided were necessary: it's going to compare our observed
value to our expected value, and it's going to do so in a way where a sizable
difference over more trials gives us increased confidence that the difference is
robust and significant. And
it's called the Chi-Squared Test because the value that we're going to get out of
this test is called the chi-squared value. To compute it, for each possible
outcome we take the difference between the observed and the expected value and
square it: (observed - expected)^2. Squaring means a divergence in either
direction comes out positive, and also that large divergences from the expected
value are more notable than smaller ones. Then, to get that as a proportion, we
divide by the expected value. Summing (observed - expected)^2 / expected over each
of the possible outcome values gives the chi-squared value. For our coin, there
are only two outcomes, but you can imagine a die or other cases where there are
more possible values of the outcome.
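As a quick sketch, that definition translates directly into a couple of lines of Python; the numbers are the ones from our coin example.

    # Chi-squared statistic: sum over outcomes of (observed - expected)^2 / expected.
    def chi_squared(observed, expected):
        return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

    # Our coin: 13 heads and 7 tails observed; 10 of each expected.
    print(chi_squared([13, 7], [10, 10]))  # 1.8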
Now, before we get to the outcome of our test, I'd like to introduce some
additional machinery. You can imagine that a normal coin, one that's not loaded,
will sometimes come up ten heads out of twenty if you toss it twenty times, right
at the expected value. But it wouldn't be shocking if it came up nine heads out of
twenty, or eleven heads out of twenty.
And this distribution of expected values for something like a coin will follow
what's called a normal, or Gaussian, distribution. You've got the expected value,
the mean, in the middle, and that's going to be the most probable outcome; then it
slowly falls off, becoming increasingly unlikely as you head out towards the
tails. And the area under our hill sums to 100%: all the possible probabilities
add up to 100%. And out here at the ends are the two tails, and we're going to
call the area in those very edges the really unusual
behavior. So to fill this in a little bit, in our coin case, our expected value is
going to be ten, so our mean will be ten, and we'll have some observed value, in
our example thirteen. And the question that we're going to ask with our
statistical techniques is whether that observed value of thirteen is sufficiently
weird: whether it's sufficiently far out into the tail as to qualify as unlikely
to have occurred by chance. By convention, that hinterland of "unlikely to have
occurred by chance" is the portion of the two ends of the tails that together
forms five percent of the distribution; this is for a two-tailed example. And so,
if you've ever read in a scientific publication that the probability was p < .05,
that p value is the probability of seeing a result at least as extreme as the
observed one by chance. And if our observed value falls far enough out in the
tails, we're going to say that it was unlikely to have occurred by chance.
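For our coin, the distribution of heads counts is actually binomial, which that bell curve approximates, so as a sketch (assuming scipy is available) you can compute the two-tailed area exactly; the chi-squared test we use below is an approximation to this kind of exact tail probability.

    from scipy.stats import binom

    n, k = 20, 13                      # 20 tosses, 13 heads observed
    p_upper = binom.sf(k - 1, n, 0.5)  # P(at least 13 heads) for a fair coin
    print(2 * p_upper)                 # both tails: about 0.26, not very unusual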
The second piece of machinery that we're going to need to be able to do our
statistical test is an idea called the null hypothesis. What the null hypothesis
means is that our opening bid in any statistical test is going to be: we don't
think there's a difference between the two conditions, or however many conditions
you have. So in the case of our coin, our opening bid, the null hypothesis, would
be that the coin is not loaded; or, to be a little bit more precise about it, that
the behavior of the coin does not differ significantly from that of a normal,
unloaded coin. And
what our statistical test is going to do is check whether the data falsifies the
null hypothesis. That's a fancy way of saying that if our opening bid is "this
coin's behavior doesn't diverge significantly from a normal, unloaded coin," and
you got, say, twenty heads out of twenty, you'd reasonably say
this isn't right. In that case, our data would allow us to say that we falsified
the null hypothesis: we reject the bid that the coin's behavior is
normal. And the very last thing that we're going to need out of our statistical
test, in the case of a chi-squared test, is our p value: what's the probability
that the observed behavior could have been generated by a normal coin? As that
probability goes down, it becomes increasingly unlikely that the behavior was
generated by a normal coin, and once we get below our magic threshold of .05,
we're going to say we reject the null hypothesis: that's a loaded coin. So the
thing to take away from the chi-squared table is that as the gap between expected
and observed gets larger, or as the number of trials gets larger, your chi-squared
value goes up, and in turn that makes it increasingly unlikely that this behavior
could have been generated by an unbiased coin. So now
we can return to our example and do so a little bit more formally. And we can ask,
twenty tosses, thirteen heads, at p<0.05, can we reject the null
hypothesis that there's no difference between this coin and an unbiased coin? So
let's work it out. For heads, we've got (13 - 10)^2/10: the thirteen observed
minus the ten expected, squared, divided by the ten expected. Then we add the
other side of the coin, (7 - 10)^2/10: seven observed tails minus ten expected
tails, again squared and divided by ten. Sum it all up and we get 0.9 + 0.9 = 1.8.
The other thing we need to figure out is the degrees of freedom. The degrees of
freedom is the number of choices that you have, minus one. For a coin, we have two
sides, so two choices minus one gives us one degree of freedom. That's because
you're really only free to pick one thing: once you know the number of heads, the
number of tails is determined. If you had a die, for example a six-sided die, your
degrees of freedom would be five, because six faces minus one gives you five. So
we can go to our table, and remember: as we go further to the right, it becomes
less and less likely that we have a normal coin; out there is our loaded coin.
And with one degree of freedom and a chi-squared value of 1.8, we can see that
we have coin behavior that is slightly unusual for an even coin, but not out of
the realm of the reasonable: between ten and 25 percent of the time that you toss
a fair coin twenty times, you will see a divergence from the mean of this
magnitude. And because our chi-squared statistic doesn't show up as sufficiently
unusual, that is to say, it doesn't make it to the .05 level, we can't reject the
null hypothesis in this case. That's a real fancy way of saying that we can't yet
stand up and say this coin is a loaded coin. If we really cared, the thing to do
would be to gather more data.
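If you'd rather not consult the table by hand, here's a sketch of the same computation using scipy (assuming scipy is available); it reports both the statistic and the p value.

    from scipy.stats import chisquare

    result = chisquare([13, 7], f_exp=[10, 10])  # two outcomes: one degree of freedom
    print(result.statistic)  # 1.8
    print(result.pvalue)     # about 0.18: not below .05, so we can't reject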
So let's say we keep going: we now toss it 60 times, and we see this same ratio
continue, with 39 out of 60 tosses showing up heads. That gives us (39 - 30)^2/30
for heads plus (21 - 30)^2/30 for tails, and that yields a bigger chi-squared
value even though the ratio is the same: because we have more trials, our
confidence increases, and the chi-squared value goes up. Now it's up to
2.7 + 2.7 = 5.4. And we can look it up in our table again; it's the same coin,
still a coin, so
degrees of freedom is still one, but we see now that our chi-squared value is way
over to the right and so the equivalent p value that pops out of this table is
somewhere around .02. So, we can reject the null hypothesis that our coin is no
different from an unbiased one, with 98 percent confidence. One thing I'd like to
point out is that if your trend is robust, that is, if the ratio continues, then
increasing your sample size, in this case by a factor of three, decreased our
p value by a factor of nine.
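The same scipy sketch, with the same caveat that scipy is assumed, confirms the 60-toss numbers.

    from scipy.stats import chisquare

    result = chisquare([39, 21], f_exp=[30, 30])
    print(result.statistic)  # 5.4
    print(result.pvalue)     # about 0.02: below .05, so we reject the null hypothesis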
So that's all well and good, you might say: you now know how to walk into a
gambling hall and check whether the coin that somebody presents you is fair or
not. But what does this have to do with HCI? Well, the mechanism that we've just
learned for comparing rates, which holds for coins, also holds for things like
click-through rates on websites. Let's say we have a website that
has a button labeled Sign Up, and ten percent of visitors click that button. To
try and improve traffic to that button and get more conversions, we might change the
button to Learn More and then start gathering data. Over a week, there are
1,000 visitors to the site, and 119 of them click the Learn More button. Can we
say with confidence that the Learn More button has a higher click-through rate
than the Sign Up button did? Let's work it through. We have 119 observed
click-throughs minus the 100 expected, and we have 881 observed non-clicks minus
the 900 expected: (119 - 100)^2/100 + (881 - 900)^2/900. Add it all up and you get
about 4.01 as the chi-squared value, and again we have one degree of
freedom, because we have two choices: clicked and didn't click. And when we look
it up in our table, what we see is that the chi-squared value is just slightly
larger than the threshold for p < 0.05. And so we can say that this change indeed
probably did influence the click rate.
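And again as a sketch, the same scipy check applied to the click-through data:

    from scipy.stats import chisquare

    result = chisquare([119, 881], f_exp=[100, 900])
    print(result.statistic)  # about 4.01
    print(result.pvalue)     # about 0.045: just under the .05 threshold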
So the chi-squared test, and statistical testing as a general methodology, gives
you two really powerful tools. First, it gives you a way of formalizing "we're
pretty sure." And deeply intertwined with this, it gives you a way of generalizing
from
small samples. Are the differences that I've observed on a small scale likely to
generalize, if I were to scale this up? And this idea of inference from small
samples owes a lot to beer. In 1908, William Gosset was a chemist at the
Guinness Brewery in Dublin. At the time, Guinness was hiring top graduates from
Oxford and Cambridge to apply biochemistry and statistics to Guinness's industrial
processes, and Gosset devised the t-test as a cheap way of monitoring the quality
of
stout. He published this test in Biometrika, but he used a pen name in the
journal so that Guinness could keep its use of statistics as a trade secret. For
Gosset and Guinness there were really two important benefits of being able to make
broad quality estimates from small samples. First, if Gosset's testing
consumed all the beer, there would be none left to sell. Second, many statisticians
find it difficult to do mathematics after having consumed a large quantity of
Guinness. And so the quality of the results is importantly contingent on testing
just a small sample. This general strategy of working from a small sample and
performing significance testing can be done under a variety of frameworks; today
we talked about the chi-squared test, but there are several others. For
example, if you have continuous data as opposed to discrete rate data, there's the
t-test; and if you have more than two conditions, there's a test called the ANOVA,
or Analysis of Variance. These all work for the same normal, Gaussian data that
we've been looking at so far. So for example, these tests can help you compare
which vacuum cleaner gets things cleaner, if you have a measure of cleanliness; or
which running shoes help you run faster; or which input device, trackpad, mouse,
or stylus, is fastest for input.
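As a sketch with hypothetical timing data (scipy assumed, and the numbers invented purely for illustration), the calls look like this:

    from scipy.stats import ttest_ind, f_oneway

    # Hypothetical pointing times (seconds) for three input devices.
    trackpad = [4.2, 3.9, 4.5, 4.1, 4.4]
    mouse    = [3.6, 3.8, 3.5, 3.9, 3.7]
    stylus   = [4.0, 4.3, 3.9, 4.2, 4.1]

    # Two conditions with continuous data: the t-test.
    t_stat, p_value = ttest_ind(trackpad, mouse)
    print(t_stat, p_value)

    # More than two conditions: one-way ANOVA.
    f_stat, p_value = f_oneway(trackpad, mouse, stylus)
    print(f_stat, p_value)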
And while handling it is beyond the scope of this class, I think it's important to
point out that data often isn't normally distributed. Data could be bimodal: if
everybody falls into one of two camps and nobody's in the middle, then you don't
have the nice big blob of a normal distribution. Or it might be shifted over to
one side; for example, anything that's time-based, because you can be infinitely
slow, but you can't be infinitely fast. For right now, I just want to point out
that that's something
to watch out for. In practice, a lot of the tests that you'll come across are
reasonably robust to modest deviations from normality, so for practical purposes
things often work out. Still, knowing what those assumptions are and whether a
particular test is appropriate for your data is half the
battle. Another clever technique, one that I picked up from Ronny Kohavi, is to
use A/A tests: take one condition, like all of the people who got the Sign Up
button, divide it in half, and see if you find a statistically significant
difference between one half and the other half. That can be a good warning sign
about whether you're seeing mirages; there's a small sketch of the idea below. Or,
and again this is way beyond what we'll cover here, you can use techniques like
randomization testing, which make no explicit model of your underlying data and
instead rely on repeated simulations as a way of modeling the data.
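Here's that A/A idea as a sketch with hypothetical numbers; it uses scipy's contingency-table form of the chi-squared test, and the traffic counts and 50/50 split are invented for illustration.

    import random
    from scipy.stats import chi2_contingency

    # Hypothetical: 1,000 Sign Up visitors, 100 of whom clicked.
    outcomes = [1] * 100 + [0] * 900
    random.shuffle(outcomes)
    half_a, half_b = outcomes[:500], outcomes[500:]

    # Compare the two halves of the same condition.
    clicks_a, clicks_b = sum(half_a), sum(half_b)
    table = [[clicks_a, 500 - clicks_a],
             [clicks_b, 500 - clicks_b]]
    chi2, p_value, dof, expected = chi2_contingency(table)
    print(p_value)  # should usually be well above .05; if not, beware mirages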
So to pop back up, what we learned today is that, to get a feel for your data,
graph it all. We also saw how statistics offers us tools that help you distinguish
real trends from mirages. And we learned a common technique, the chi-squared test,
for comparing rates. And here, as with other lectures, my goal is to give you both
an introduction to an area and a concrete skill that you can put to regular use.
The web has provided huge increases in the quantity of available data and has also
made it much easier for you to run experiments online. So my hope is that many of
you will use the experimental skills you've learned here all the time. And while
there's nothing fancy in this video, you may find it useful to review it once or
twice when you first use these techniques in your own work. And really, we've just
scratched the surface. If you'd like to learn more, as a next step I highly
recommend
Jake Wobbrock's course, Practical Statistics for HCI; he's got a series of
online materials. If you'd like to learn about practical strategies for doing
experiments in the general sense, I highly recommend David Martin's Doing Psychology
Experiments. If you'd like to learn the philosophy behind statistical testing,
there's a great book called Statistics as Principled Argument. And if you'd like a
nice flow chart of which test should you use when, I recommend the book Learning to
Use Statistical Tests in Psychology.