0:06

So, here we are in RStudio with our coursera.r file and

we're moving on to the scenario where we're comparing the number

of distinct pages visited in an A/B test, and

we're going to go through a few analyses to do that here.

And as the comment indicates, what we'll be doing is an independent samples T

test and we'll talk more about that as we go.

So, as is our usual procedure,

we'll read in one of our data files, that goes with this work.

And that is pgviews.csv, the page views file, so we'll read that in.

0:44

And as is our typical process here, we'll take a view of what that is so

we can be comfortable with it.
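
As a rough sketch of that read-and-look step: the course file itself isn't reproduced here, so this writes a tiny made-up table to a temporary file first. The column names Subject, Site, and Pages are assumed from the columns described in the lecture, and the values are hypothetical.

```r
# Stand-in data (hypothetical): NOT the course's file, just a tiny table
# with the three columns the lecture describes.
demo <- data.frame(
  Subject = 1:6,
  Site    = c("A", "B", "A", "B", "A", "B"),
  Pages   = c(3, 5, 4, 7, 2, 4)
)
csv_path <- tempfile(fileext = ".csv")
write.csv(demo, csv_path, row.names = FALSE)

# Read the CSV into a data frame, as in the lecture.
pgviews <- read.csv(csv_path)

# View(pgviews) opens the spreadsheet-style viewer in RStudio;
# head() and str() give a similar first look from any R console.
print(head(pgviews))
str(pgviews)
```

With the real file, the only change would be passing its path to read.csv instead of the temporary one.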

So as you can see, we have a subject column, and it seems each subject is

measured just once, and then a site column showing which site they were

issued, A or B. As I scroll down here, it kind of refreshes.

I'll go all the way to the bottom, and then it'll refresh.

And so we do have 500 subjects, as we said in the description.

And then a column called pages.

And it looks to be pretty much kind of single digit

1:17

counts of how many pages were viewed.

Looks like, obviously one would be a minimum we would guess and

I saw maybe a ten in there or an 11 in there as maybe a maximum.

We can find out more formally what those are.

That gives us a sense of what we're dealing with.

1:32

We'll go ahead and recode that subject, that subject column as

a factor since it's just a number it thinks it's a numeric variable,

but as we've now talked about variable types, we know we want it to be a factor.

It won't be used directly in this analysis, but we're going to keep doing

this good practice because as we progress in the sophistication of our analyses,

we'll see that we end up using the subject later.

And then let's go ahead and take a little summary.

We can see that there are 500 distinct levels of subject:

these six shown plus 494 others.

That's just the subject identifier.

It looks like 245 of those subjects were exposed to site A, 255 to site B.

So very nearly a 50/50 test and

certainly kind of a realistic outcome, as often is the case.

And then here, because pages is a numeric response variable.

It computes for us a min and a max, 1 and 11 there, and some other data.

We can see the mean is right near four and the median is four.
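
Here is a minimal sketch of the recode-and-summarize step. The data frame is simulated stand-in data, not the course file: only the group sizes (245 and 255) come from the lecture, and the Pages values are made up. One assumption worth flagging: since R 4.0, read.csv no longer turns text columns into factors by default, so Site is converted explicitly here to get the per-site counts in the summary.

```r
# Stand-in data (hypothetical): simulated, NOT the course's file.
# Group sizes follow the lecture; the Pages values are arbitrary.
set.seed(1)
pgviews <- data.frame(
  Subject = 1:500,
  Site    = rep(c("A", "B"), times = c(245, 255)),
  Pages   = c(pmax(1, round(rnorm(245, mean = 3.4, sd = 1))),
              pmax(1, round(rnorm(255, mean = 4.5, sd = 2))))
)

# Subject is read as a number, but it is only an identifier.
print(class(pgviews$Subject))

# Recode the identifier, and the site label, as factors.
pgviews$Subject <- factor(pgviews$Subject)
pgviews$Site    <- factor(pgviews$Site)

# summary() now lists level counts for the factors and
# min / median / mean / max for the numeric Pages column.
print(summary(pgviews))
print(nlevels(pgviews$Subject))
print(table(pgviews$Site))
```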

We'll also look a little bit more at some descriptive statistics

using the plyr library.

2:40

This function, ddply,

allows us to apply a function over certain aspects of the table.

And remember, I'll remind you, you can always type a question mark and

then a function name, assuming that the library for

it is loaded, and it'll bring up the help for that name.

So ddply splits the data frame, applies a function, and

returns the results in a data frame.

So what we see as input here is the data table itself, pgviews.

We want to split by site and apply this inline function

where we are summarizing over the pages by site.

So when we do that, we can see for each site, A and B, we can see now

some of the same statistics that we saw before overall, but now split by site.

So we can see the mean for site A is 3.4, the mean for site B is almost 4.5.

So that suggests there may be a difference,

but we've learned that comparing means directly is not the full story.

We need to know something about the variance.

So, this other function allows us to summarize and

get the mean number of pages which we have here.

But also then the standard deviation which would be of interest.

We can see that in the site A condition, the standard deviation was

about half the size of the standard deviation in the site B condition.

So there were more pages viewed in site B, but

also with greater deviation around that mean.
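
A sketch of those per-site descriptive statistics, again on simulated stand-in data rather than the course file. The base R tapply calls are a swapped-in alternative that runs without extra packages. The ddply calls only run if plyr is installed, and the column names Pages.mean and Pages.sd are just illustrative.

```r
# Stand-in data (hypothetical): simulated, NOT the course's file.
set.seed(1)
pgviews <- data.frame(
  Subject = factor(1:500),
  Site    = factor(rep(c("A", "B"), times = c(245, 255))),
  Pages   = c(pmax(1, round(rnorm(245, mean = 3.4, sd = 1))),
              pmax(1, round(rnorm(255, mean = 4.5, sd = 2))))
)

# Base R: mean and standard deviation of Pages within each Site.
site_mean <- tapply(pgviews$Pages, pgviews$Site, mean)
site_sd   <- tapply(pgviews$Pages, pgviews$Site, sd)
print(data.frame(Site       = names(site_mean),
                 Pages.mean = as.vector(site_mean),
                 Pages.sd   = as.vector(site_sd)))

# The same idea with plyr's ddply: split by Site, apply, recombine.
if (requireNamespace("plyr", quietly = TRUE)) {
  # An inline function giving the usual summary() per site.
  print(plyr::ddply(pgviews, ~ Site, function(data) summary(data$Pages)))
  # summarise gives named columns for the mean and standard deviation.
  print(plyr::ddply(pgviews, ~ Site, plyr::summarise,
                    Pages.mean = mean(Pages), Pages.sd = sd(Pages)))
}
```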

One way to view that is with a histogram.

So we can call the hist function and we can look at the page views for

site A and the number of pages.

4:21

So I think we can just graph that there, and

we can see a couple of things about this.

We can kind of see the range from this from about one to six.

We can see in site A, it looks to be kind of a normal distribution,

kind of a bell curve or Gaussian curve there.

Let's go ahead and look at a histogram of site B.

And here we can see something a little bit different.

Very few counts up above, at seven and eight and ten pages visited,

and quite a few down lower.

Doesn't quite look like a bell curve.

It doesn't look normally distributed, and

those kinds of considerations will come up as we go forward in the course.

For now, we're going to ignore those differences, but they are relevant and

we will talk about them more in the future.
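
A sketch of the two histogram calls, on simulated stand-in data, so the shapes will not match the ones on screen in the lecture. The titles and axis labels are illustrative additions.

```r
# When run as a script, send plots to a null device so no file is written.
if (!interactive()) pdf(NULL)

# Stand-in data (hypothetical): simulated, NOT the course's file.
set.seed(1)
pgviews <- data.frame(
  Subject = factor(1:500),
  Site    = factor(rep(c("A", "B"), times = c(245, 255))),
  Pages   = c(pmax(1, round(rnorm(245, mean = 3.4, sd = 1))),
              pmax(1, round(rnorm(255, mean = 4.5, sd = 2))))
)

# Select the rows for one site, then take the Pages column.
pages_a <- pgviews[pgviews$Site == "A", ]$Pages
pages_b <- pgviews[pgviews$Site == "B", ]$Pages

# hist() draws the plot and also returns the bin counts.
h_a <- hist(pages_a, main = "Site A", xlab = "Pages")
h_b <- hist(pages_b, main = "Site B", xlab = "Pages")
```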

Another way to look at the data too is a box plot.

So with the plot command, we can see pages by site.

And now we understand that notation a little better.

Pages being the y variable, the outcome by site,

which is our independent variable or x variable if you will.
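
A sketch of that formula notation on simulated stand-in data. When the right-hand side of the formula is a factor, plot draws one box per level, so Site is made a factor explicitly here. The boxplot call with plot = FALSE is an extra, added only to show the numbers behind the boxes.

```r
# When run as a script, send plots to a null device so no file is written.
if (!interactive()) pdf(NULL)

# Stand-in data (hypothetical): simulated, NOT the course's file.
set.seed(1)
pgviews <- data.frame(
  Subject = factor(1:500),
  Site    = factor(rep(c("A", "B"), times = c(245, 255))),
  Pages   = c(pmax(1, round(rnorm(245, mean = 3.4, sd = 1))),
              pmax(1, round(rnorm(255, mean = 4.5, sd = 2))))
)

# Pages (the y variable, the outcome) by Site (the x variable).
plot(Pages ~ Site, data = pgviews)

# The five-number summaries behind each box, without drawing again.
bp <- boxplot(Pages ~ Site, data = pgviews, plot = FALSE)
print(bp$stats)
```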

In the meantime then, we're going to execute our independent samples t-test.

Why is it independent samples?

What does that mean?

Remember that factors can be between subjects or within subjects.

And between subjects is the type of factor that site would be,

because each visitor gets either website A or B, but not both.

So it's an independent samples T test.

In the future we'll see a paired samples t-test that is appropriate for

a within-subjects situation.

5:57

You can see this parameter at the end of the call

to t.test, var.equal.

That's saying the variances are equal.

We can see in this box plot that's obviously not true, and

we'll formalize that consideration in the future, as I said. But for

now we'll just do a basic uncorrected t-test assuming that the variance is equal.

In reality, t-tests are fairly robust to deviations in variance.

They don't have to be exactly equal anyway.
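
A sketch of the call on simulated stand-in data, so the numbers will differ from the lecture's. It also runs the test with R's default, var.equal = FALSE, which is Welch's t-test and does not assume equal variances, purely for comparison. The lecture itself uses only the equal-variance version.

```r
# Stand-in data (hypothetical): simulated, NOT the course's file.
set.seed(1)
pgviews <- data.frame(
  Subject = factor(1:500),
  Site    = factor(rep(c("A", "B"), times = c(245, 255))),
  Pages   = c(pmax(1, round(rnorm(245, mean = 3.4, sd = 1))),
              pmax(1, round(rnorm(255, mean = 4.5, sd = 2))))
)

# Independent-samples t-test assuming equal variances (pooled).
pooled <- t.test(Pages ~ Site, data = pgviews, var.equal = TRUE)
print(pooled)

# Default behaviour: Welch's t-test, with adjusted degrees of freedom.
welch <- t.test(Pages ~ Site, data = pgviews)
print(welch$parameter)
```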

6:24

So, let's go ahead and execute that and we can see that we have the T test here.

Well, what's this output mean?

So, the data line confirms we're looking at pages by site and

that's in fact exactly the design we talked about.

The t-value is the t-statistic, so just like with the chi squared statistic,

in the previous things we went through, the t-statistic is

the value in the t distribution that we are getting from this data.

The degrees of freedom is 498.

Obviously related to the 500 subjects that we have there, and

then the p-value is very, very small, very near zero. It is far less

than 0.0001, and that's about all we care about.
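
To make the link between the 500 subjects and the 498 degrees of freedom concrete, here is a sketch that recomputes the equal-variance t statistic by hand and checks it against t.test. It uses the standard pooled-variance formulas, with two group means estimated and so n_a + n_b - 2 degrees of freedom. The data are simulated stand-ins, so the statistic will not be the lecture's.

```r
# Stand-in data (hypothetical): simulated, NOT the course's file.
set.seed(1)
pgviews <- data.frame(
  Subject = factor(1:500),
  Site    = factor(rep(c("A", "B"), times = c(245, 255))),
  Pages   = c(pmax(1, round(rnorm(245, mean = 3.4, sd = 1))),
              pmax(1, round(rnorm(255, mean = 4.5, sd = 2))))
)

a   <- pgviews$Pages[pgviews$Site == "A"]
b   <- pgviews$Pages[pgviews$Site == "B"]
n_a <- length(a)
n_b <- length(b)

# Degrees of freedom: total subjects minus the two estimated means.
df <- n_a + n_b - 2

# Pooled variance, then the t statistic for mean(A) - mean(B).
sp2    <- ((n_a - 1) * var(a) + (n_b - 1) * var(b)) / df
t_stat <- (mean(a) - mean(b)) / sqrt(sp2 * (1 / n_a + 1 / n_b))

# Two-sided p-value from the t distribution.
p_val <- 2 * pt(-abs(t_stat), df)

# Compare with what t.test reports.
res <- t.test(Pages ~ Site, data = pgviews, var.equal = TRUE)
print(c(by_hand = t_stat, t_test = unname(res$statistic), df = df))
```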

7:12

Some other results as well, we can see the mean for Group A and

B are like we saw before in those summary statistics.

So the bottomline here is we have a significant difference between

the number of pages visited in website condition A and B.

Okay. So that is the T test for

our simple website AB test.

7:54

As you know from before, we completed the top test of proportions table previously,

and now we've come down to the analysis of variance table and

we're in that first row, and what's turned red there is that independent samples

T test that we just did.

If we look at the left column, it has one factor, and that was site.

It had two levels and it was a between-subjects factor, so

that's what the third column with the B means, and we're in a parametric test.

And next time we talk we'll get more into what

the difference between parametric tests and non-parametric tests is.

But you can see the table sets up a sort of equivalence relationship

where if we're in a parametric situation we have certain tests and

if we're in a non-parametric situation we have others.

For now you can think of the difference as whether or not we can make certain

assumptions about the data, which are required for parametric tests.

For example that the data is normally distributed is a common assumption

we'll have to contend with and for many measures the data is.

We can see in these box plots, however, that for site B,

the data is clearly not normal, and we saw that in the histogram as well.

9:07

So that's the difference between those columns.

And we'll formalize that more as we go.

But we've done the independent samples t-test and

that's where we'll leave it for now.

Let's see how we would report that t-test result in writing.

9:29

So we analyzed page views, and

our result was a t-test, which we indicate here.

It has one parameter for its degrees of freedom, and that was 498.

So this is its degrees of freedom.

This is the test type, obviously and

the test statistic was 7.21.

In our case it came out as negative 7.21.

You can put that in or not, it's up to you and

really it just means which order the two levels of the website were in.

If you compare A to B then you'll get negative 7.21.

If you flip that and compare

the mean of B to A, then it will be positive 7.21, so

it really doesn't matter whether you have the minus sign or not.

So that's the statistic.
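
A sketch of the sign flip and of assembling the written form, on simulated stand-in data, so the statistic will not be 7.21. The SiteBA column and the exact wording of the report string are illustrative choices, not something prescribed in the lecture.

```r
# Stand-in data (hypothetical): simulated, NOT the course's file.
set.seed(1)
pgviews <- data.frame(
  Subject = factor(1:500),
  Site    = factor(rep(c("A", "B"), times = c(245, 255)),
                   levels = c("A", "B")),
  Pages   = c(pmax(1, round(rnorm(245, mean = 3.4, sd = 1))),
              pmax(1, round(rnorm(255, mean = 4.5, sd = 2))))
)

# A compared to B: the statistic is based on mean(A) - mean(B).
ab <- t.test(Pages ~ Site, data = pgviews, var.equal = TRUE)

# Same data with B as the first level: mean(B) - mean(A).
pgviews$SiteBA <- relevel(pgviews$Site, ref = "B")
ba <- t.test(Pages ~ SiteBA, data = pgviews, var.equal = TRUE)

# Same magnitude, opposite sign.
print(c(A_vs_B = unname(ab$statistic), B_vs_A = unname(ba$statistic)))

# Build the written report: t(df) = statistic, p-value.
p_text <- if (ab$p.value < 0.0001) "p < .0001" else
  sprintf("p = %.4f", ab$p.value)
report <- sprintf("t(%d) = %.2f, %s",
                  as.integer(ab$parameter), ab$statistic, p_text)
print(report)
```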