Hello everyone, welcome back. Now that we have an idea of what it takes to maximize the quality of our data analysis when working with designed data, we're going to consider some case studies of what can go wrong if we fail to account for design features when performing analyses of designed data. This is called analytic error: when some of those different design features that we discussed previously are not correctly accounted for when we actually sit down to analyze designed data. Okay, so we're going to look at some case studies in analytic error. Remember, from our total data quality framework, we're now focused on data analysis at that final phase of the overall framework, making sure that while we've tried to maximize quality along all the other dimensions of measurement and representation, we continue to maximize quality when thinking about how we're analyzing the data at that final phase of the overall process. So again, that's our focus here, and these case studies are going to show you what can happen when we fail to perform a high-quality data analysis at that final phase. Okay, so the first case study we're going to look at is the SESTAT case study. We analyzed survey data from the Scientists and Engineers Statistical Data System; that's what SESTAT stands for. This data system is sponsored by the National Center for Science and Engineering Statistics, or NCSES. We're going to look at three main SESTAT surveys, all of which collected data from complex probability samples like we talked about previously. The key design features that we need to account for are sampling weights, which account for different probabilities of being selected into the sample and possibly nonresponse adjustment; stratification, that is, stratified sample designs; and cluster sampling, the selection of clusters of individuals at random as part of the overall sample design.
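To make the role of sampling weights concrete, here is a minimal sketch in Python with made-up toy numbers (not SESTAT data), showing how an unweighted mean can be badly biased when units with large values of the outcome were oversampled:

```python
import numpy as np

# Toy sample (hypothetical, not SESTAT data). The units with large y
# were oversampled, so they carry small sampling weights
# (weight = 1 / probability of selection).
y = np.array([10.0, 12.0, 11.0, 30.0, 32.0, 31.0])
w = np.array([5.0, 5.0, 5.0, 1.0, 1.0, 1.0])

naive_mean = y.mean()                      # ignores the design entirely
weighted_mean = (w * y).sum() / w.sum()    # design-based (weighted) estimate

# The naive mean is pulled upward by the oversampled, high-y units,
# while the weighted mean gives those units their correct, smaller share.
print(naive_mean, weighted_mean)
```

The same logic drives the extreme NSCG example discussed below: when the weights are informative, the unweighted estimate can land far from the design-based one.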
All of those design features need to be accounted for in a high-quality analysis of these different data sets. So, three different data sets: the National Survey of College Graduates, or NSCG; the Survey of Doctorate Recipients, or SDR; and the National Survey of Recent College Graduates, or NSRCG (this survey is no longer active today). The reference for this particular study is West and colleagues, in a 2016 article in the journal PLOS ONE (Public Library of Science ONE). This is an interdisciplinary journal, and we presented this general study of analytic error to be of general interest to researchers from a variety of different disciplines. So here's more about this case study. We examined the implications of making analytic errors in analyses of the SESTAT data. What we did is we downloaded the 2010 public-use SDR and NSCG data. These data are freely available online; we just downloaded these data sets and then performed different types of analyses. We obtained the replicate weights for variance estimation from NCSES. This is one way that you can estimate standard errors of weighted estimates while accounting for those complex sample design features, the strata and the clusters: by using what are called replicate weights. That's the tool that should be used to estimate these types of variances. So we obtained those data from NCSES, as discussed in the documentation for these data sets. We then worked with NCSES staff to identify key descriptive estimates that users of these data would be interested in computing to describe the target populations of these different surveys, and we also identified regression models that were of substantive interest: models that allow NCSES data users to make population inferences about relationships between variables. We wanted our example analyses to be grounded in reality; that was really the key point here.
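As a rough sketch of how replicate weights are used for variance estimation: the full-sample estimate is recomputed once per set of replicate weights, and the variability of those replicate estimates around the full-sample estimate gives the variance. The multiplier depends on the replication method documented by the data producer; the numbers below are made up for illustration.

```python
import numpy as np

def replicate_variance(theta_full, theta_reps, mult):
    """Replicate variance estimate: mult * sum_r (theta_r - theta_full)^2.

    theta_full: estimate computed with the full-sample weights.
    theta_reps: estimates recomputed with each set of replicate weights.
    mult: method-specific constant (e.g. (R - 1) / R for a JK1 jackknife).
    """
    theta_reps = np.asarray(theta_reps, dtype=float)
    return mult * np.sum((theta_reps - theta_full) ** 2)

# Hypothetical example with 4 replicates (real systems use far more).
theta = 30.38
rep_estimates = [30.10, 30.60, 30.50, 30.30]
var_theta = replicate_variance(theta, rep_estimates, mult=(4 - 1) / 4)
se_theta = var_theta ** 0.5
```

Because the replicate weights were built from the strata and clusters, the resulting standard error reflects those design features without the analyst ever seeing the stratum or cluster codes.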
So we wanted to work with the NCSES staff to identify analyses that would be of interest to users of these data. We then performed three types of analyses for both descriptive and analytic parameters (by analytic, I mean regression parameters). The first type of analysis fully accounted for the complex sampling features. In the context of our course right now, that means we're maximizing the quality of our data analysis: we're fully accounting for those complex sampling features. The second approach was a type of analytic error: we account for the sampling weights only, and we ignore the other complex sample design features, the stratification and the cluster sampling. Again, in the case of these data sets, the stratification and cluster sampling were captured in those replicate weights that we can obtain from NCSES; other data sets will make codes available containing the stratification and cluster sampling features. But in that second analysis we only accounted for the sampling weights; we failed to account for the stratification and cluster sampling. And the third analysis was what we might call a naive analysis: we completely ignored those complex sampling features. Okay, so that would be a low-quality approach to the overall data analysis, being completely ignorant of those complex sampling features and their effect on the overall analysis. So we compared and contrasted the estimates produced by these three different approaches when focusing on the same data and the same estimates of these different parameters. We also considered ratios of variance estimates to assess what are known as misspecification effects; that's what happens when you ignore key sampling features in the analysis.
We looked at ratios of, essentially, the standard errors squared, to see if there were notable differences in those standard errors due to ignoring these key sampling features. So what did we find in this particular case study? Among eight different categorical variables that we analyzed in the 2010 Survey of Doctorate Recipients, inferences related to estimated distributions in the population (basically, population percentages on these different variables) would have changed completely for five of the variables if these design features were ignored. These variables included current salary, race/ethnicity, attending professional meetings in the past year, major field of study, and labor force status. Okay, these were all different categorical variables available in the SDR, but our inferences about the distributions of these variables would have completely changed. Now, the changes were generally small, but we would have arrived at different conclusions altogether about the proportions falling into the different categories defined by these variables if we ignored these design features. So for five out of eight variables, we would arrive at different substantive conclusions about what the population looks like in terms of those variables. Then, looking at data from the NSCG: among 10 continuous and categorical variables in the 2010 NSCG, there would be substantial changes in inference for nine of those variables. That means if you look at an estimate of a mean and its confidence interval, or an estimate of a proportion and its confidence interval, we would arrive at completely different conclusions for nine of those ten variables if we completely ignored those design features. It's pretty shocking, but we would get a very different picture of the population by ignoring those design features. So here's an extreme example, from the 2010 NSCG, of this type of analytic error.
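The misspecification effect just described is the ratio of the two variances (the standard errors squared). A tiny illustration with hypothetical numbers:

```python
# Hypothetical standard errors for the same point estimate.
se_design = 0.33   # SE from the analysis that accounts for the design
se_naive = 0.21    # SE from the analysis that ignores the design

# Misspecification effect: ratio of the variances (SE squared).
# A value above 1 means the naive analysis understates the variance,
# i.e. overstates the precision of the estimate.
meff = se_design ** 2 / se_naive ** 2
```

A misspecification effect well above (or below) 1 is the signal that ignoring the design features would materially distort the reported precision.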
So a key indicator coming from the National Survey of College Graduates is whether respondents say that their primary job is in the science and engineering fields. That's a key indicator that the US uses to understand its workforce. When we fully account for the complex sampling features in the analysis (again, the high-quality analysis in the context of our course right now), the weighted estimate of the percentage of individuals with a primary job in science and engineering is about 30%, and you can see that the standard error is about 0.3% taking those design features into account. Okay, so that would be the estimate that we want to report in a paper or a technical report or something like that. If we only account for the final weights, then, as one might expect, we arrive at the exact same weighted estimate; it's still 30.38%, so that's fine. But notice that the standard error rises somewhat, and that's because we're not accounting for the stratification that was inherent to the National Survey of College Graduates; stratification, again, reduces our standard errors. It makes our estimates more precise. So we would fail to get that benefit of the reduced standard error by failing to account for the stratified sampling that was used to select this sample initially. But look what happens if we're completely ignorant of the complex sampling features, ignoring both the weights and those replicate weights, which are used for variance estimation and reflect the stratified cluster sampling. Our estimate now becomes 54.94% of people in the target population having a job in science and engineering: almost a 25 percentage point difference compared to the correct weighted, unbiased estimate, because we ignored those weights in the overall analysis. Furthermore, notice how that standard error drops; we are overstating the precision of our estimate by failing to account for those complex sampling features.
So basically everything is going wrong here when we completely ignore those complex sampling features, and we would have a very misleading picture of what that population looks like by failing to account for those design features in our overall analysis. Now why is this happening? The NSCG weights were highly correlated with several other measures of interest: race/ethnicity, highest degree obtained, salary, and major degree. There was oversampling of individuals in these different categories, especially those in science and engineering; that is, oversampling of individuals more likely to be in those job categories. So that's why failure to use the weights paints such a misleading picture here: because of the oversampling in these different categories and the correlation of those weights with several other key measures. It is critically important to account for the weights in this case. People with higher probabilities of being included in the NSCG sample had distinctive values on these variables of interest, so the weights are informative, and we need to make sure those weights are accounted for in the analysis. Otherwise, we're in a situation where we're painting a very misleading picture of what the population looks like. This is what we mean about maximizing the quality of the analysis. If we were to report 55%, we're ignoring those other design features and we're reporting a low-quality estimate that's subject to a lot of bias because of the failure to account for those design features. Now, those were descriptive estimates; what about regression models? Okay, in a regression model that we fitted to log-transformed current salary as the dependent variable in the 2010 National Survey of College Graduates, the main effect of having a science and engineering degree on log-transformed salary would change completely.
If we fully account for the complex sampling features, we get evidence of a positive relationship: that 0.16 is the regression coefficient for having a science and engineering degree. So if you have a science and engineering degree, you're expected to have a higher current salary, and the standard error for that estimated coefficient was 0.03. Had we completely ignored the complex sampling features in the analysis, that estimated coefficient for having a science and engineering degree is divided by eight. Okay, so we actually lose evidence of that significant relationship by failing to account for those design features. Furthermore, the standard error is lower than it actually should be. So we would have no evidence of a relationship; again, another type of analytic error and a poor-quality estimate resulting from the failure to account for those design features. Take a different model into consideration: suppose we had a logistic regression model for having a science and engineering job as the dependent variable in the 2010 NSCG, and we're interested in making inference about the interaction between race/ethnicity and gender. This, too, would have changed radically. Consider the P value for testing the null hypothesis that the interaction is zero. If we fully account for the complex sampling features, our P value is 0.19, which means we would fail to reject that null hypothesis; there is no evidence that the interaction matters. But if we ignored those complex sampling features altogether, we would find evidence of a strongly significant interaction between race/ethnicity and gender. And again, we might go on to report that in a paper and say that it's an important research finding, when in fact it's just a low-quality estimate coming from this dataset, because we failed to account for those complex sample design features.
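Here is a minimal sketch of how sampling weights enter regression point estimation: weighted least squares solved via the normal equations. This is a toy with hypothetical data; design-based standard errors would additionally require the strata/cluster codes or replicate weights, which this sketch omits.

```python
import numpy as np

def weighted_ols(X, y, w):
    """Weighted least squares coefficients: solve (X'WX) b = X'Wy."""
    Xw = X * w[:, None]                     # scale each row of X by its weight
    return np.linalg.solve(X.T @ Xw, X.T @ (w * y))

# Hypothetical data: intercept column plus one predictor.
X = np.column_stack([np.ones(4), np.array([0.0, 1.0, 2.0, 3.0])])
y = np.array([1.0, 2.0, 3.0, 4.0])   # here y = 1 + x exactly
w = np.array([1.0, 2.0, 2.0, 1.0])   # sampling weights

beta = weighted_ols(X, y, w)
```

In this toy the relationship is exact, so weighted and unweighted fits agree; with informative weights and a non-exact relationship, as in the NSCG salary model, the two sets of coefficients can diverge sharply.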
When we account for those features, there's no evidence of this kind of interaction in the overall population. Okay, so here's the second case study. This is from the BRDIS, and we'll talk about what BRDIS means in a second. Very little work to date has considered the analytic error problem in the establishment survey context. The two prior surveys we were talking about were surveys of individuals; BRDIS was a survey of establishments. So what we did in this case study is consider the implications of failing to account for sample design features in a real establishment survey. Given that establishments can vary widely in terms of their size, the number of employees, their profits, etc., probabilities of selection are often a function of size, and the weights can therefore vary widely, because the weights are a function of the probabilities of being included in the sample. So weights become even more critical to think about when we're dealing with establishment survey data. We performed the same types of alternative analyses of the data, for example with and without accounting for the weights in the estimation, using the 2013 BRDIS, where BRDIS stands for the Business Research and Development and Innovation Survey. Okay, that's where the BRDIS acronym comes from. We also considered the effects of ignoring the stratification in the BRDIS sample design. Different establishments were stratified in the overall national sample, and we need to account for that stratification if we want our standard errors to be reduced and our estimates to be more efficient. So, just a disclaimer about the BRDIS: this research uses data from the US Census Bureau's Longitudinal Employer-Household Dynamics program, which was partially supported by grants from the National Science Foundation, grants from the National Institute on Aging, and grants from the Alfred P. Sloan Foundation.
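To illustrate why establishment weights vary so widely, here is a hypothetical sketch of size-proportional selection probabilities, a common establishment-survey device (the frame and sample size below are made up for illustration):

```python
import numpy as np

# Hypothetical frame of 4 establishments; "size" could be employee counts.
sizes = np.array([10.0, 50.0, 200.0, 1000.0])
n_sample = 2  # expected sample size

# Selection probability proportional to size, capped at 1
# (the largest units become certainty selections).
p = np.minimum(1.0, n_sample * sizes / sizes.sum())
weights = 1.0 / p

# The largest establishment is selected with certainty (weight 1.0),
# while the smallest carries a weight of 63: enormous weight variation.
```

Because the resulting weights are strongly tied to size, and size is tied to outcomes like R&D spending, ignoring them in estimation is especially damaging for establishment data.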
I just need to make clear that any opinions and conclusions expressed herein are my own and do not necessarily represent the views of the US Census Bureau. All results have been reviewed to ensure that no confidential information is disclosed. And you can read more details about the BRDIS survey at the web link at the bottom of the slide. Okay, so here are results from the descriptive analyses. These are descriptive analyses of selected variables, and you can see approximate sample sizes (we didn't report the actual sample sizes here) for variables including total salary expenditures (in millions), total worldwide employees (in thousands), total US expenditures on R&D (in thousands), and total worldwide expenditures on R&D (in thousands). We want to calculate means for each of these different variables using our different analytic approaches. Okay, and as you can see, the BRDIS selected a pretty big sample of these different establishments to compute these different estimates. So in the first approach, again, we're completely naive: we ignore the weights and we ignore the stratified sampling. There are codes available in the BRDIS data files representing the different strata, and we ignore those codes entirely. Okay, so you see the estimates based on the naive approach, a low-quality analysis of the BRDIS data. Now look at the shocking differences in the estimates of these means under approaches 2 and 3, where we're fully accounting for the weights and we're either adjusting for the strata to compute our standard errors (that's approach 3) or not accounting for the strata when computing our standard errors (that's approach 2). Look at how different the estimates of these means become when we account for the weights in the BRDIS; there's just no comparison.
And we would be painting such a misleading picture of what the population tends to look like, in terms of means on these different variables for these US establishments, if we failed to account for those weights in our estimation. Look at the differences in the estimates of these means. We also see much lower standard errors for those estimated means when we're accounting for the weights, especially when we account for the strata in the estimation. You can see that in approach 3, accounting for the strata in the variance estimation reduces the standard errors even further. We would clearly arrive at substantially different conclusions about these population parameters when describing these establishments if we failed to account for those design features in the analysis. Here are results from the regression models that were fitted. In this case, we looked at total research and development expenditures in the US as the dependent variable, and our predictors included total salary expenditures, total worldwide employees, and the interaction between those two variables. We're in the same situation here: look at what happens to these estimated coefficients when we fail to account for the weights. In approach 1, where we fail to use the weights, you see the coefficients; look at how much they change under approaches 2 and 3, when we're actually using the weights. And again, in approach 3, where we're accounting for the sampling strata, look at how the standard errors drop compared to approach 2; we're getting more precise estimates when we're fully accounting for all those design features. So approach 3 in this case would be the high-quality, maximized-quality data analysis that we're looking for when analyzing these kinds of designed survey data sets, and that's what we would want to report in a publication.
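Why does accounting for the strata shrink the standard errors? A toy sketch with hypothetical numbers: when strata are homogeneous within but different between, the stratified variance of a mean is far below the variance computed as if the pooled sample were a simple random sample.

```python
import numpy as np

# Two hypothetical strata, homogeneous within, very different between.
strata = {
    "small_firms": np.array([1.0, 1.2, 0.9, 1.1]),
    "large_firms": np.array([9.0, 9.3, 8.8, 9.1]),
}
W_h = 0.5  # assume each stratum is half of the population

# Stratified variance of the estimated mean: sum_h W_h^2 * s_h^2 / n_h
var_strat = sum(W_h**2 * yh.var(ddof=1) / yh.size for yh in strata.values())

# Variance if we (wrongly) treat the pooled sample as a simple random sample:
pooled = np.concatenate(list(strata.values()))
var_srs = pooled.var(ddof=1) / pooled.size
```

The between-stratum spread inflates the pooled variance, but stratified sampling has already removed that component, which is exactly the precision gain lost in approach 2.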
Okay, so just to summarize the BRDIS case study: failing to account for the survey weights in the 2013 BRDIS led to substantial changes in inference for both means and regression coefficients. Again, this was because the weights were informative; they were correlated with key measures, which means they play a role in the estimation of these different parameters, so it's important to account for them. A failure to account for the stratified sample design of the 2013 BRDIS led to overly conservative inferences; in other words, our standard errors were too large. The publication by myself and Joe Sakshaug in 2018 presents additional details about this analysis if you're interested. The key takeaway here is that high-quality analyses will, at a minimum, begin by accounting for critical design features in the data analysis, as these features often have large impacts on the estimates that we're reporting and on their standard errors. That's how we want to make sure that we're maximizing the quality of the analysis performed. Okay, so what's next this week? We're going to turn to a discussion of maximizing the quality of data analysis when we're working with gathered datasets. Okay, thank you.