0:01

So there's one thing worth pointing out that differs

between ggplot and the base plots: if you

have a plot where the data exceed the limits of

the plot, the behavior of base plot and ggplot can differ a little bit.

So here on the left-hand side I've just simulated some data.

So this is not the MAACS data. And I intentionally

introduced a little outlier here.

So in the 50th data point, I just changed that value to be 100.

So now it's just this random series of noisy data.

And then right in the middle, there's a point that's 100.

And so I'm going to make a standard base plot here.

I call plot on x and y, and I

say type equals "l", because I want to make a line plot.

0:45

But then typically, if you have an outlier like this, you don't want to

look at the outlier; you just want to look at the core of the data.

So it's typical to set the y-axis limits to

be roughly where the data are and just ignore the outlier.

So you can see that the time series that gets

drawn has all the data connected, and you can see

roughly where it shoots off to a hundred and

comes back down to roughly where it's supposed to be.

So you know that outlier is out there

somewhere, but you don't see it in the plot.
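
The base-plot behavior described above can be sketched as follows. This is a minimal sketch, not the code from the slides: the series length of 100, the seed, and the name `testdat` are assumptions made for illustration.

```r
# Simulated noisy series; 100 points and the seed are illustrative assumptions
set.seed(1)
testdat <- data.frame(x = 1:100, y = rnorm(100))
testdat[50, "y"] <- 100  # the outlier at the 50th data point

# In base graphics, ylim only sets the visible window. The line is still
# drawn through the outlier, so it runs off the top of the plot and comes back.
plot(testdat$x, testdat$y, type = "l", ylim = c(-3, 3))
```

The outlier stays in the data; only the viewing window changes.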

Now, if I do the equivalent plot in ggplot, I can create my ggplot with

the test data and map the aesthetics to x and y.

And then I add the geom_line function to make

a line plot as opposed to a scatter plot.

You can see that it just plots all the data, including the outlier.

And it's maybe not exactly the kind of plot

you want to make, because the outlier is maybe not that interesting.
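
A rough ggplot2 equivalent, under the same assumptions as before (the 100-point series, the seed, and the names `testdat` and `p_all` are illustrative, not taken from the slides):

```r
library(ggplot2)

# Same kind of simulated series with one outlier (sizes and seed are assumptions)
set.seed(1)
testdat <- data.frame(x = 1:100, y = rnorm(100))
testdat[50, "y"] <- 100

# Map x and y, then add a line layer instead of points
g <- ggplot(testdat, aes(x = x, y = y))
p_all <- g + geom_line()

# With no limits set, the y-axis stretches to include the outlier
print(p_all)
```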

So if you want to limit the y-axis in ggplot, you have to be careful about how you do it.

And so the first thing, on the left-hand side, is that you

might think, well, I'll just change the y limits to be

in the range of most of the data, between minus 3 and 3.

The issue here is that what ggplot will do is subset the data

to include only the values that are between minus 3 and 3.

And so, of course, the outlier is not included in this data

set, and so you won't see that data point in this plot.

So you can see this clearly: where the outlier is missing, the

two lines are not connected, but then everything else is connected afterwards.
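
The subsetting effect can be sketched like this. The data setup is again an illustrative assumption; `ylim()` is the ggplot2 shorthand for setting limits on the y scale, and values outside those limits are treated as missing before the line is drawn.

```r
library(ggplot2)

# Simulated series with one outlier (sizes and seed are assumptions)
set.seed(1)
testdat <- data.frame(x = 1:100, y = rnorm(100))
testdat[50, "y"] <- 100

# Setting the scale limits turns out-of-range values into missing values,
# so the line breaks at the 50th point instead of running off the panel.
p_ylim <- ggplot(testdat, aes(x = x, y = y)) +
  geom_line() +
  ylim(-3, 3)

# Printing usually emits a warning about values outside the scale range
print(p_ylim)
```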

So if you want to recreate the kind of phenomenon

that you saw with base plot, you have to add this special

function called coord_cartesian, which sets

the y-axis limits to be minus 3 and 3.

Now you can see in the plot here that

the outlier is in fact included in the dataset.

The dataset hasn't been subsetted to only include

the points that are in the y-axis range.
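
A sketch of the `coord_cartesian` approach, with the same illustrative data setup as an assumption. Here the limits go on the coordinate system rather than on the scale, so the plot zooms in without dropping any data.

```r
library(ggplot2)

# Simulated series with one outlier (sizes and seed are assumptions)
set.seed(1)
testdat <- data.frame(x = 1:100, y = rnorm(100))
testdat[50, "y"] <- 100

# Limits on the coordinate system zoom the view but keep every point,
# so the line still heads toward the outlier, as in the base plot.
p_zoom <- ggplot(testdat, aes(x = x, y = y)) +
  geom_line() +
  coord_cartesian(ylim = c(-3, 3))

print(p_zoom)
```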

So, I just want to go over a slightly more complex example of adding

pieces to a plot, just so you can get

a sense of how the different layers are added on.

And then hopefully that gets you going from there.

So here I've made the

scientific question just a little bit more complex.

I want to know how the relationship between PM2.5 and

nocturnal symptoms varies by both BMI and nitrogen dioxide, or NO2.

And so as NO2 or BMI values change, what does the relationship between

PM2.5 and nocturnal symptoms look like?

So one tricky thing about this is that, unlike our previous BMI

variable, which is categorized into normal and overweight,

the NO2 variable is continuous; it's really the

log of NO2, and it's a continuous variable.

So we can't really condition on a continuous variable

when we're making plots, because then there would be an infinite number of plots.

And so we need to categorize this variable into a reasonable series of ranges.

And so what we're going to do is use the cut function

for this purpose, to literally cut the data into a series of ranges.

3:32

So here is some code to split NO2 into tertiles, so this is going to be

three separate categories: between

the minimum and the 33rd percentile, the 33rd

to the 66th, and the 66th to the maximum.

And so the first thing I need to do is use the quantile function

to figure out where in the data range the 33rd and 66th percentiles are.

And once I've used the quantile function to find these cut points, I pass them to the

cut function, and I use the cut function

to actually cut NO2 into these three different ranges.

And so what the cut function does is it just

returns a factor variable where each of the original data

points is replaced by its category in terms of

which tertile it's in: the low,

the middle, or the high tertile.

It's a very handy function for when you're using things

like lattice or ggplot and you have to categorize continuous variables.

So now you can see the levels of this

variable, the cut variable: there are three different levels.

There's 0.378 to 1.2, 1.2 to 1.42, and then 1.42 to 2.55.

So those are the three categories that I've split the NO2 variable into.
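
The quantile-then-cut step can be sketched as below. The NO2 values here are simulated stand-ins, and the names `logno2`, `cutpoints`, and `no2tert` are hypothetical, so the resulting levels will not match the ones quoted above.

```r
# Simulated stand-in for the log NO2 variable (values are not the study data)
set.seed(2)
logno2 <- rnorm(300, mean = 1.3, sd = 0.3)

# Cut points at the minimum, the 1/3 and 2/3 quantiles, and the maximum
cutpoints <- quantile(logno2, seq(0, 1, length.out = 4), na.rm = TRUE)

# cut() returns a factor giving each observation's tertile.
# include.lowest = TRUE keeps the minimum value in the first interval.
no2tert <- cut(logno2, cutpoints, include.lowest = TRUE)

levels(no2tert)  # three interval labels
table(no2tert)   # counts per tertile
```

One design note: without `include.lowest = TRUE`, `cut()` uses intervals that are open on the left, so the single smallest observation would fall outside the first interval and become `NA`.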

So here's the final plot, just to show you what I'm going

for, and then we'll work backwards, figure out exactly how to do it.

So you can see that there's eight different plots here.

On the top you see all the normal weight children.

And on the bottom you see all the overweight children.

So those are the two categories of BMI.

5:18

And so it's often

important to look at the missing data,

just to see if there's anything special about those missing values.

You don't always want to exclude them right off the bat,

because there might be something special about them you've missed.

So, what does this plot have?

Well, first of all, I've modified

the transparency on the points.

So I've made them a little bit transparent so

you can see a little bit of the density there.

I've added a smoother to each panel, so this is a linear regression smoother.

So, it's not the default.

And I've turned off the confidence bands.

5:50

I've changed the default labels and the title, so

I've edited them to be a little bit more descriptive.

And then finally I used a

non-default font, so the default font is Helvetica,

and I've changed the font here to be Avenir.

And so, there are a number of options that I've modified here.

And so, here's the code for doing it.

So, in the first set of code, I just call ggplot.

I give it the data and I give it some

basic aesthetics in terms of the x and y variables.

And then, to this g object, I add a bunch of things.

I add points using geom_point. I make the panels

using the facet_wrap function, and I add a smoother using geom_smooth,

where I specify the lm method and I turn off the standard error bars.

6:34

I changed the theme to be this black and white theme,

and then I modified the font to be Avenir instead of the default.

And I've also made the font a little bit

smaller, to be ten points instead of the default 12.

And then finally, I've called the labs function three different times

to change the labels: the x label, the y label, and the title of the plot.
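
The layering just described can be sketched on simulated data. Everything about the data here is an assumption: the data frame `dat`, its column names, the label text, and the sample size are hypothetical stand-ins. The font family is set to `"sans"` so the sketch runs on any machine; `"Avenir"` can be swapped in where that font is installed.

```r
library(ggplot2)

# Simulated stand-in for the study data; all column names are hypothetical
set.seed(3)
n <- 300
dat <- data.frame(
  logpm25 = rnorm(n),
  bmicat  = factor(sample(c("normal weight", "overweight"), n, replace = TRUE)),
  logno2  = rnorm(n, mean = 1.3, sd = 0.3)
)
dat$symptoms <- rpois(n, lambda = exp(0.2 * dat$logpm25))
dat$logno2[sample(n, 30)] <- NA  # some missing NO2 values

# NO2 tertiles; rows with missing NO2 stay NA and get their own panels
cutpoints <- quantile(dat$logno2, seq(0, 1, length.out = 4), na.rm = TRUE)
dat$no2tert <- cut(dat$logno2, cutpoints, include.lowest = TRUE)

font_family <- "sans"  # the talk uses "Avenir"; swap it in if installed

# Base object with data and aesthetics, then one added piece per line
g <- ggplot(dat, aes(logpm25, symptoms))
p_final <- g +
  geom_point(alpha = 1/3) +                              # semi-transparent points
  facet_wrap(bmicat ~ no2tert, nrow = 2) +               # BMI by NO2 panels
  geom_smooth(method = "lm", se = FALSE) +               # linear fit, no bands
  theme_bw(base_family = font_family, base_size = 10) +  # theme and font
  labs(x = "log PM2.5") +
  labs(y = "Nocturnal symptoms") +
  labs(title = "Simulated cohort")

print(p_final)
```

Two BMI categories crossed with three tertiles plus the missing group gives eight panels. Recent ggplot2 releases use a default `base_size` of 11, so the size reduction is smaller than in the version shown in the talk.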

So you can see that I've added all these different things piece by

piece to make this plot a little bit more interesting every single time.

And it's easy to do this with ggplot, and

the nice thing about ggplot, which I didn't do here,

is that you could in fact save this to a new object,

and then you would have everything stored in a single R object.

And then if you wanted to add on more

layers, you could add to that new object;

you could continue to add different things if you wanted to.

So it's a very modular, very useful framework

for constructing new plots just for your data.
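
Saving a plot to an object and adding to it later might look like the sketch below. The data and the names `p_base` and `p_more` are hypothetical.

```r
library(ggplot2)

# Small simulated data set (an assumption made for illustration)
set.seed(4)
dat <- data.frame(x = rnorm(50))
dat$y <- 2 * dat$x + rnorm(50)

# Store the plot so far in a single R object
p_base <- ggplot(dat, aes(x, y)) + geom_point(alpha = 1/3)

# Later, add more pieces to the stored object; p_base itself is unchanged
p_more <- p_base +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Points plus a linear fit")

print(p_more)
```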

7:25

So, just to summarize very quickly, I know this has been a very

brief introduction to ggplot, and there are a lot of things that you could talk about.

But given that this is not a course specifically on ggplot, my hope was

to get you started, to get you typing in some basic code,

making some basic plots.

And then if you want to know more, you can

look at some of the references that I mentioned previously.

So, I think in summary, ggplot is very powerful, and it's very flexible if you

can learn the grammar and learn the different

pieces that you can add to a plot

and that can be tuned and modified.

There are lots of different types of plots you can make.

I left out a lot, but you can explore and mess around.

I think that's kind of the best way to learn about these things,

along with taking a look at some of the references that I mentioned in part one.