Hello everyone. This lecture will be about predicting missing data, which may sound like a strange concept, but it is important to consider for sample size analysis. One thing that a good researcher does is look for problems before they occur. One of the requirements in writing an NIH grant proposal is actually anticipating problems and describing what you're planning to do about them. Also, if you write a grant proposal that pretends there isn't going to be missing data in the context of social science research, you're going to be rejected because that's unrealistic. So, we're going to give you some ideas on how to predict what kinds of missing data you will have. We will define missing data and discuss the different types of missing data you may encounter. Then, we will talk about how to predict missing data in our designs. Missing data is present whenever an outcome isn't measured for some reason or is recorded in error. Anytime you have repeated measures or multiple layers of clustering, the probability of having missing data is extremely high. So, this is pretty common in multilevel and longitudinal studies. Here are some examples of how missing data can occur. Inconsistent participation could result in missing data if people have to come back to a clinic for a visit. Then, on the second visit, their dog gets sick or they have to take their child to another doctor's appointment. Another reason for missing data could be study dropout. People may move. There may be something unpleasant about the study so they don't participate anymore. Something may pop up in their lives that appeals to them more than the study. Dropout is a common problem. Machine failure and data entry errors can also occur, but that problem is less common as technology improves. Missing data complicates both estimation and inference. It can bias the estimates and it can affect the accuracy of hypothesis testing. In fact, a P-value can be thought of as an estimate of a probability.
So if you think about it, inference errors really end up being biased estimates of P-values. P-values don't necessarily mean what you think they mean. Finally, missing data typically reduces power. Data missing completely at random is the simplest case of missing data. This means the probability of missing data is not influenced by observable or unobservable variables. Data missing at random, a slightly weaker assumption than data missing completely at random, is what most power and data analysis assumes when data is missing. Data missing at random means the probability of an observation being missing is not influenced by unobserved data. Think about that definition. We're talking about the fact that data is missing and we're asking, what predicts that data being missing? In the case of data missing at random, the answer would be nothing in particular. Therefore, data missing at random essentially means that data is missing for no particular reason worth considering. No reason that is a result of some phenomenon we are observing. We will be using these graphics to represent different types of missing data. The rows represent independent sampling units, which in many cases are participants, and the columns represent observations, which in longitudinal studies would be over time. We typically assume data is missing at random. As you can see, there's one value missing for each observation, but there's no pattern to pick up on that would lead you to assume anything other than missing at random. Dropout is a common reason for missing data, especially in longitudinal studies. Remember, these observations, one through three, are occurring over time. They could be days, weeks, months, or even years apart from one another depending on the study. Dropout simply means that participants or independent sampling units stopped showing up for observations. Once a value is missing, all subsequent values are missing as well.
As you can see, this is represented by participant one, who stopped at observation two, and participant two, who stopped showing up at the third observation. We can also have different rates of data missing for different levels of predictors. Here you can see an example of treatment-related dropout. As you can see, the amount of data that is missing is correlated with whether you're in the treatment or control group. This should make researchers suspicious that their dropout is related to the treatment itself. So, you have to think about why that may be. For example, perhaps the treatment itself has negative side effects that the patients aren't tolerating, so they decide to drop out. The same way we just did with that example, researchers must evaluate missing data for patterns. As we said, this is a dropout pattern, as the missing values are later in the study, and one missing value results in continued missing values for a participant. Specifically, researchers need to look for differentially missing data, such as we saw in the example where the treatment group had dropout occurring and the control group did not. Here we can see that again. Remember, it could be side effects of the treatment, or perhaps one intervention is more time consuming than the other and people just say, "I'm tired of doing this. It's not fun." These are the types of things investigators need to be aware of. This is just another way of looking at missing data, by analyzing the percent of missing data. This is actually data analysis. It is easy to see here the differential missing data between groups, and the way it increases in the treatment group for each observation, suggesting dropout. It's important to think these things through conceptually before doing a power analysis in order to calculate a credible sample size. Which is why the next topic we're going to discuss is, what are the implications for a study plan?
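To make the percent-missing view concrete, here is a minimal sketch in Python. It is an illustration only, not the analysis behind the lecture's graphics. The function names, the group sizes, the per-visit dropout probabilities, and the random seed are all assumptions made up for this example. It simulates dropout, where every value after the first missing one is also missing, and then tabulates the percent missing at each observation for a control group and a treatment group.

```python
import random


def simulate_dropout(n_participants, n_observations, dropout_prob, rng):
    """Return one row per participant; True marks an observed value.

    Dropout is monotone: once a participant misses an observation,
    every later observation is missing as well.
    """
    rows = []
    for _ in range(n_participants):
        row = []
        still_enrolled = True
        for _ in range(n_observations):
            # A participant who is still enrolled may drop out before this visit.
            if still_enrolled and rng.random() < dropout_prob:
                still_enrolled = False
            row.append(still_enrolled)
        rows.append(row)
    return rows


def percent_missing_by_observation(rows):
    """Percent of participants with a missing value at each observation."""
    n_participants = len(rows)
    n_observations = len(rows[0])
    return [
        100.0 * sum(1 for row in rows if not row[j]) / n_participants
        for j in range(n_observations)
    ]


# Hypothetical values chosen only for illustration.
rng = random.Random(2024)
control = simulate_dropout(200, 3, 0.02, rng)
treatment = simulate_dropout(200, 3, 0.15, rng)

for name, rows in (("control", control), ("treatment", treatment)):
    percents = percent_missing_by_observation(rows)
    print(name, " ".join(f"{p:5.1f}%" for p in percents))
```

Because dropout is monotone, the percent missing in each group can only stay flat or rise from one observation to the next. A gap between the groups that widens over time is the kind of differential pattern to look for.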
The reasons behind missing data can affect power analysis, and differential dropout can affect design choices. Here are a couple of examples. If women miss more visits because of the nature of the study, then in order to maintain the sample size you need to recruit more women. At that point, if I'm a reviewer, I've got to ask the question, does this bias the result? I would suggest that there's a careful analysis that needs to be done there. It wouldn't stop me from funding the study, but I want people to think about it. If we do expect a gender difference, the question is whether it is because you're doing something that's biased against women, or whether you need to change the intervention. Another example would be treatment-related toxicity, which is always going to be a problem when you're bringing in new drugs. It's why you look at plots and graphs like the ones we have been talking about. Differential dropout related to this definitely needs to be taken into account, as it has ethical considerations related to the health and well-being of participants. There are several methods that can help you predict missing data patterns, as you can see in this list. We will talk about many of these in more detail in future lectures. What is important to understand is that you want to use these methods so you are able to predict missing data patterns rather than being surprised or blindsided by them. This can save you valuable time, resources, and headaches. Here you can see an excerpt from a published study. Take some time to read through it and try to pick out what you think will be important for predicting missing data. Let's imagine you read the excerpt as part of your literature review. Now your estimates for predicting missing data can be informed and more accurate based on different design aspects of the study. Here you can see key statements that were reported. The parts that we really care about for our own study include the dropout, non-compliance, and loss to follow-up percentages.
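One simple way to use dropout percentages like these is to inflate the recruitment target so that enough participants are expected to complete the study. The sketch below is a common rule of thumb, not a method taken from the lecture or from the published excerpt. The function name and all of the numbers are hypothetical. Note that it only restores the expected number of completers. It does nothing about bias when dropout is related to the treatment or to a participant characteristic.

```python
def recruitment_target(n_completers_needed, expected_dropout_percent):
    """Participants to recruit so the expected number of completers is met.

    expected_dropout_percent is a whole-number percent (for example, 20).
    """
    if not 0 <= expected_dropout_percent < 100:
        raise ValueError("expected_dropout_percent must be in [0, 100)")
    retained_percent = 100 - expected_dropout_percent
    # Ceiling division in integer arithmetic avoids floating-point rounding.
    return -(-n_completers_needed * 100 // retained_percent)


# Hypothetical planning values: 100 completers needed per group,
# with different anticipated dropout in each group.
for group, dropout_percent in (("control", 10), ("treatment", 30)):
    print(group, recruitment_target(100, dropout_percent))
```

With these made-up values, the group with the higher anticipated dropout needs a visibly larger recruitment target, which is the situation described above where more participants must be recruited to maintain the sample size.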
Realize that many data analysis approaches do not allow for missing data. The general linear multivariate model is one of those. So, one of the best design choices to help reduce your missing data is to simplify your study. Only do a few things and do them well. Also, making sure the technology is reliable, using validated instruments and processes, minimizing the time and discomfort that participants must go through, and thinking about retention protocols through the use of pilot studies will minimize the risk of missing data. It is worth noting that some researchers calculate sample size without considering missing data, then adjust accordingly. It depends on context and how easy it is to make design adjustments. Let's do a quick summary. We find that there's missing data when some observations fail to be recorded. We want to make sure we can recognize differentially missing data, when one group, treatment, or scheduled meeting time shows more dropout than others. Recognizing this allows researchers to design in ways that try to minimize missing data. Finally, missing data and patterns of missing data can be anticipated through different methods, like a literature review, an internal or planned pilot study, previously published or unpublished studies, or even experts in the field. That's it for this lecture on missing data. Thank you for your time.