0:31

And for that, we'll use a one-way repeated measures ANOVA.

This is a parametric ANOVA.

And we've done a one-way ANOVA before, you'll remember, but now it's a one-way

repeated measures ANOVA, which indicates a within-subjects factor.

0:46

So we'll read in search, scroll, voice as our data table,

with that third level of our technique factor.

Let's view that, as we commonly do.

So we still have only 20 subjects, and we have technique levels search, scroll, and voice.
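As a hedged sketch of the shape of this data table, here is a synthetic stand-in built in R; the column names and values are invented for illustration, not the course's actual file:

```r
# Synthetic stand-in for the long-format data table described above.
# Column names (Subject, Technique, Order, Time) are assumptions.
set.seed(123)
df <- data.frame(
  Subject   = factor(rep(1:20, times = 3)),
  Technique = factor(rep(c("Search", "Scroll", "Voice"), each = 20)),
  Order     = factor(c(rep(1:2, each = 10, times = 2), rep(3, 20))),  # voice always third
  Time      = c(rnorm(20, 100, 12),   # search
                rnorm(20, 120, 14),   # scroll (slowest)
                rnorm(20,  85,  6))   # voice (fastest)
)
head(df)
```

One row per subject-technique pair, which is the long format the repeated measures ANOVA expects.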

1:06

We have order as before, one and two, where voice is always three.

Now, that would be a real challenge if we ran a study this way,

where we brought people in to do voice always as the third technique,

because we might be introducing a confound by having it always last.

But perhaps in an exploratory aspect of the experiment, we might tack on

a condition like voice, maybe to test a prototype at the end of the study.

1:49

And as we often like to do, we want to see a few more statistics

about each of the levels in terms of their mean and median.

So we can see here, for example, that

scrolling seems to be the longest, the slowest of the techniques.

Then searching, and voice is a little bit faster than searching.

Is it fast enough to be different?

That's the question, and looking at the standard deviations

in the next output helps us judge that a little bit.

And we can look at our histograms as well.

These haven't changed for search and scroll, but for voice, the new one,

we can see a lot of clustering between 80 and 90 there.

And the box plot helps us see their relative position in terms of the time it

takes to find a contact in the contacts manager.
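The descriptive statistics and plots described above can be sketched in base R like so, on a synthetic stand-in table (column names are assumptions):

```r
# Synthetic stand-in data frame with Technique and Time columns (assumed names)
set.seed(123)
df <- data.frame(
  Subject   = factor(rep(1:20, 3)),
  Technique = factor(rep(c("Search", "Scroll", "Voice"), each = 20)),
  Time      = c(rnorm(20, 100, 12), rnorm(20, 120, 14), rnorm(20, 85, 6))
)

# Mean, median, and standard deviation for each level of Technique
aggregate(Time ~ Technique, data = df,
          FUN = function(x) c(mean = mean(x), median = median(x), sd = sd(x)))

# A histogram per level, then a box plot showing relative position
par(mfrow = c(1, 3))
for (lvl in levels(df$Technique))
  hist(df$Time[df$Technique == lvl], main = lvl, xlab = "Time (s)")
par(mfrow = c(1, 1))
boxplot(Time ~ Technique, data = df, ylab = "Time (s)")
```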

3:02

If the omnibus test is significant, then we can look into pairwise comparisons.

If the overall or omnibus test is not significant,

we're not justified in looking further at pairwise comparisons.

We're going to use the ez library, and

I've got some comments here in the code that help explain how this is working.

So the ez library allows us to build this model m, specifying

the dependent variable time, the within-subjects variable technique,

the within-subjects ID as subject, and also the data table here.

So we have a one-factor, three-level within-subjects variable called technique.

And we built our model.

And then the comment says we have to check our model for

violations of something called sphericity.
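A sketch of that ezANOVA call, run on a synthetic stand-in table (the data frame and column names are assumptions, not the course's actual file):

```r
# Requires the ez package: install.packages("ez")
library(ez)

set.seed(123)
df <- data.frame(
  Subject   = factor(rep(1:20, 3)),
  Technique = factor(rep(c("Search", "Scroll", "Voice"), each = 20)),
  Time      = c(rnorm(20, 100, 12), rnorm(20, 120, 14), rnorm(20, 85, 6))
)

# One within-subjects factor (Technique) with three levels;
# Subject identifies the repeated-measures unit (wid).
m <- ezANOVA(data = df, dv = Time, within = Technique, wid = Subject)

m$`Mauchly's Test for Sphericity`  # significant p indicates a sphericity violation
m$ANOVA                            # uncorrected F test, with ges (effect size)
m$`Sphericity Corrections`         # Greenhouse-Geisser (GGe) and Huynh-Feldt (HFe)
```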

3:49

Sphericity is the situation where the variances

of the differences between all combinations of levels of the within-subjects

factor are equal, or very nearly equal.

It always holds for within-subjects factors that have just two levels,

so there we don't have to worry about it.

But with three or more levels, sphericity has to be tested and

examined with Mauchly's test of sphericity.

These are some of the complications and

complexities that within-subjects variables introduce.

We'll see later, when we use mixed-effects models, that we can actually

model covariance explicitly, and we don't have to test for sphericity.

So we first check in our model here the Mauchly output.

If it's significant, it indicates a violation,

and we have to use a corrected form of our ANOVA.

4:40

Here, we do have a p-value of less than .05.

That star means that that's the case.

So we have a violation of sphericity, and

we'll use a corrected output, which I'll show you in a moment.

If there's no violation, we can just use the regular ANOVA.

If there is a violation, we'll use the sphericity output, and

within that, the Greenhouse-Geisser correction.

So first let's look at the ANOVA table, without correction.

We can see an F test; recall it has two degrees of freedom.

Degrees of freedom in the numerator are two, and in the denominator, 38.

Here is our F statistic.

And the p-value is obviously quite a bit less than .05.

And GES is a value that tells us the effect size.

It's called the generalized effect size.

We won't go into that in this class, but

effect size has to do with the strength of the effect.

You don't want to interpret a p-value as effect strength, and so

the generalized effect size is a way of getting at that.

Actually, GES stands for generalized eta-squared, and

it compares to eta-squared or partial eta-squared, which are other effect sizes.

But because "ES" also matches "effect size,"

I find that an easier way to remember what it means.

6:08

Okay, we're actually going to do some calculations here to compute the degrees

of freedom for the corrected results.

So we'll just do those, and

add them to the sphericity table that's output from this ezANOVA function call.

So here's our table, and again we have technique as our effect.

We know there's a sphericity violation, so we're going to use one of the two outputs

here: the Greenhouse-Geisser correction and the Huynh-Feldt correction, the HFe.

We'll use the Greenhouse-Geisser correction.

This is the Greenhouse-Geisser statistic,

and the p-value that goes with it, obviously less than .05.

So technique is still statistically significant for the F test.

Because there is a sphericity violation,

if this weren't less than .05, we wouldn't have a significant result.

Now, we'll ignore the Huynh-Feldt results; we only need one set.

And then here are the Greenhouse-Geisser degrees of freedom in the numerator and

denominator.

And we can round those to the nearest, say, tenth.

And that's what we computed up above here, so

we have the full data we need to report the result.

So it's reported just like an F test result,

but with the adjusted degrees of freedom

and the F value from the original effect table.
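The corrected degrees of freedom come from multiplying the uncorrected df by the Greenhouse-Geisser epsilon; a sketch with a hypothetical epsilon value:

```r
# Uncorrected df for 3 levels and 20 subjects
DFn <- 2       # numerator: 3 levels - 1
DFd <- 38      # denominator: (3 - 1) * (20 - 1)
GGe <- 0.75    # hypothetical Greenhouse-Geisser epsilon from the sphericity table

round(GGe * DFn, 1)  # corrected numerator df   -> 1.5
round(GGe * DFd, 1)  # corrected denominator df -> 28.5
```

These rounded values, with the original F statistic and the corrected p-value, are what get reported.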

Incidentally, the same uncorrected results in R can be given by fitting the model

here, which you should be able to understand now, and then

summarizing over that.

I'll just do that briefly.

But that wouldn't give us the sphericity test, Mauchly's test of sphericity,

and so that's why we don't use that generic form here.
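That generic form in base R might look like the following sketch, again on a synthetic stand-in table; note it produces no Mauchly's test:

```r
set.seed(123)
df <- data.frame(
  Subject   = factor(rep(1:20, 3)),
  Technique = factor(rep(c("Search", "Scroll", "Voice"), each = 20)),
  Time      = c(rnorm(20, 100, 12), rnorm(20, 120, 14), rnorm(20, 85, 6))
)

# Same uncorrected repeated measures ANOVA via aov, with an Error term
# declaring Technique as within-subjects; summarize to see the F test.
m2 <- aov(Time ~ Technique + Error(Subject/Technique), data = df)
summary(m2)
```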

7:55

Now, because the overall test was statistically significant,

we can reach in and do post hoc comparisons.

And for that we will use the paired-samples t-test,

but we need a wide-format table for that.

So we'll use dcast, as we've done before, to make a wide-format table

based on technique, and we'll view that.

So we have subject in the left column and then scroll, search,

and voice across the top.
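The long-to-wide pivot can be sketched with reshape2's dcast (column names assumed):

```r
# Requires reshape2: install.packages("reshape2")
library(reshape2)

set.seed(123)
df <- data.frame(
  Subject   = factor(rep(1:20, 3)),
  Technique = factor(rep(c("Search", "Scroll", "Voice"), each = 20)),
  Time      = c(rnorm(20, 100, 12), rnorm(20, 120, 14), rnorm(20, 85, 6))
)

# One row per subject; one column per technique
df.wide <- dcast(df, Subject ~ Technique, value.var = "Time")
head(df.wide)
```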

8:22

We verified that and then in the next three rows,

Â we store up the individual paired sampled T tests.

Â And then we adjust for multiple corrections and display the results.

Â And we can see that all three results are statistically significantly different.
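Those three paired-samples t-tests and the correction step can be sketched in base R on a synthetic wide-format table:

```r
set.seed(123)
# Wide-format stand-in: one row per subject, one column per technique
wide <- data.frame(
  Search = rnorm(20, 100, 12),
  Scroll = rnorm(20, 120, 14),
  Voice  = rnorm(20,  85,  6)
)

# Three paired-samples t-tests, one per pair of techniques
se.sc <- t.test(wide$Search, wide$Scroll, paired = TRUE)
se.vo <- t.test(wide$Search, wide$Voice,  paired = TRUE)
sc.vo <- t.test(wide$Scroll, wide$Voice,  paired = TRUE)

# Holm correction for the three comparisons
p.adjust(c(se.sc$p.value, se.vo$p.value, sc.vo$p.value), method = "holm")
```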

8:54

Well, let's look at errors now for the three techniques.

As we've said, errors often don't conform to the assumptions of ANOVA.

So we'll do some looking at errors for the three techniques here.

We can see the means and medians there,

and the standard deviations as well in that next output.

And some histograms will give us a sense of the distribution of errors;

the first two, for search and scroll, haven't changed from before.

Here are the voice errors; those certainly don't look normally distributed.

And we can look at the box plots for errors, and we can see that, in fact, scroll

still seems the lowest.

And voice, although it seemed fast, was maybe more error prone.

If we go back a couple of graphs, we can see this was the time things took;

voice was the fastest, and we know that was a significant difference. But

when we go forward here and see errors, voice seems the most error prone.

What we have in our hands here is a speed-accuracy tradeoff in human performance.

That's very, very common.

When people are faster, they tend to make more mistakes.

That's not universally true when we're comparing techniques.

It may not always hold, but more often than not, that may be the case.

So keep that in mind as you measure both speed and errors or accuracy.

10:25

We can ask again, as we did before,

are those errors Poisson distributed in this new voice condition?

So we'd done a fit, and examining that, we see in fact that there

is no significant departure from a Poisson distribution.

That will be interesting to us later, when we return to this data and

analyze it using a Poisson distribution directly.
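One way to sketch such a check in base R is a chi-squared goodness-of-fit test of the counts against a fitted Poisson distribution. This is a hedged sketch on synthetic counts, not the course's actual fitting code, and it ignores the degrees-of-freedom adjustment for the estimated rate:

```r
set.seed(123)
errs <- rpois(20, lambda = 1.5)   # synthetic stand-in for the voice error counts

lam  <- mean(errs)                # maximum-likelihood Poisson rate
k    <- 0:max(errs)
expd <- dpois(k, lam)
expd <- expd / sum(expd)          # renormalize over the observed support
obs  <- table(factor(errs, levels = k))

# Simulated p-value avoids small-expected-count warnings;
# a non-significant p suggests no detectable departure from Poisson.
chisq.test(obs, p = expd, simulate.p.value = TRUE)
```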

But for now, we'll do a Friedman test on errors.

And again, we have the same syntax as we did for

the Wilcoxon signed-rank test, where we have errors by technique, with

subject as our blocking factor across rows here.

And so the Friedman test shows a p-value that certainly is much lower than .05.
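That Friedman test syntax can be sketched with base R's friedman.test on a synthetic stand-in table (column names are assumptions):

```r
set.seed(123)
df <- data.frame(
  Subject   = factor(rep(1:20, 3)),
  Technique = factor(rep(c("Search", "Scroll", "Voice"), each = 20)),
  Errors    = c(rpois(20, 0.8), rpois(20, 0.5), rpois(20, 2.0))
)

# Errors by Technique, blocking on Subject (the within-subjects unit)
friedman.test(Errors ~ Technique | Subject, data = df)
```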

And we might expect that from looking at the graph.

That means the overall test of errors is significant,

so we can reach in and look at the pairwise comparisons using

the Wilcoxon signed-rank test as our pairwise test.

We correct for multiple comparisons, and

we see that all of the results are less than .05, even when corrected.

So with confidence, we can say all of the pairwise comparisons,

the two-way comparisons here between search and scrolling, scrolling and voice,

and search and voice, are significantly different in terms of errors.
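Those pairwise Wilcoxon signed-rank tests, with a correction for multiple comparisons, can be sketched on a synthetic wide-format table of error counts:

```r
set.seed(123)
# Wide-format stand-in: error counts per subject for each technique
wide <- data.frame(
  Search = rpois(20, 0.8),
  Scroll = rpois(20, 0.5),
  Voice  = rpois(20, 2.0)
)

# Pairwise Wilcoxon signed-rank tests
# (exact = FALSE tolerates the ties that count data produce)
se.sc <- wilcox.test(wide$Search, wide$Scroll, paired = TRUE, exact = FALSE)
se.vo <- wilcox.test(wide$Search, wide$Voice,  paired = TRUE, exact = FALSE)
sc.vo <- wilcox.test(wide$Scroll, wide$Voice,  paired = TRUE, exact = FALSE)

# Holm correction for the three comparisons
p.adjust(c(se.sc$p.value, se.vo$p.value, sc.vo$p.value), method = "holm")
```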

11:58

Lastly, we can look at the Likert-scale ratings.

Ordinal ratings,

one to seven, also don't generally comply with the assumptions of ANOVA.

Let's explore that data.

Here, we can see means and medians again for how people rated effort.

How hard or effortful was it to use these techniques to find contacts?

And we can see that the standard deviations look similar, so

the spreads around the means are probably about the same.

Looking at some histograms, we see the effort on a seven-point scale for

search, for scroll, and for voice.

They all look like they were more towards seven; let's do a plot and see here,

where we see effort is about similar for scroll and search, but

maybe a little more effort for voice.

Perhaps it was, we know there were more errors, so

perhaps it was voice recognition making mistakes.

Let's do the Friedman test on the overall effort ratings, and

here we see an interesting outcome.

The p-value is not significant, meaning there's not a detectable difference

in the effort ratings on the one-to-seven scale that people gave for

these three different techniques.

I have a note here for what that means.

Since the omnibus test is not significant, the post hoc comparisons,

the pairwise comparisons, are not justified.

If we could do them, we would carry them out like we did for errors just above.

So we know how to do that.

But we're not justified in doing that in this case.

That's an important principle in these analyses to remember.

13:36

So we've just completed our analysis of the performance of subjects looking for

contacts in a smartphone contacts manager using three techniques: searching,

scrolling, and voice.

13:47

So we had one factor, and it had more than two levels.

It had three levels, as we just said.

It was a within-subjects factor.

All subjects did all three of those techniques to find

a set of contacts in a contacts manager.

14:02

We used a one-way repeated measures ANOVA for our parametric test,

and we used the Friedman test for the nonparametric test

across all three levels of technique.

We followed up the one-way repeated measures ANOVA with

paired-samples t-tests for post hoc contrast testing.

And for the Friedman test,

when it was significant, we followed it up with the Wilcoxon signed-rank test.

14:37

Now, what happens if we go beyond not just two or

three levels of a factor, but

into having multiple factors themselves?

This will bring us to the factorial ANOVA and the aligned rank transform.

It'll take us towards linear models and

eventually generalized linear models, which will be next.