When trying to answer an analytic question, even if there is agreement on the interpretation of the question, there can be differences in how that interpretation is implemented programmatically. Let's look at a relatively simple example, again from Dr. Michael Kohn: which patients have ever had the diagnosis code for congenital anomalies at Children's Hospital Colorado? You might be thinking this should just be something like SELECT * FROM diagnoses WHERE icd = '759.89'. Here is the real SQL code that his informatics team generated to answer this query. Why? Well, this query takes into account that diagnoses in their system can actually be recorded in four different places: the problem list, billing, encounters, and past medical history. Each of these subqueries is joined together with a UNION (a simplified sketch of that structure appears below).

Let's just look at the problem list query first. This query requires a six-table join just to get two pieces of information: the patient ID and the corresponding diagnosis. But there's also some unexpected code in the WHERE statement: "and pl.problem_status_c not equal to 3." Thankfully, this developer commented their code well, and we can see that a problem_status_c of 3 means the entry has been deleted. In other words, there was an entry on the problem list that the medical team removed from the patient record. We can't know why it was removed, so we'll just not consider it in our query. This is an example of a hidden influencer in our query: something that is part of the code but that we rarely describe in our analysis.

Let's look at another example. In this case, the team wanted to identify all patients who were readmitted to the emergency department with an infection. This query crossed 32 tables. As you can imagine, with these more complex queries, it isn't entirely clear that any two analysts would build the exact same set of joins in the same order. To be fair, I'm not sure that a single analyst answering the same question months apart would build the same query. When clinical data models are complex, over time you learn and forget the different tables and relationships that exist in the model. There are usually a number of different ways to get the information that you need, some of which are more elegant than others.

Importantly, I don't want you to get too focused on the fact that this is all in SQL. You might be thinking, "well, that's just too complex; why don't they bring it into R or Python to make the analytic steps more clear, sequential, and commented?" Beyond the computational benefit that comes from performing these steps in the database, fundamentally the same issues of table choices, order of operations, and hidden influencers come into play regardless of which programming toolkit you use.

Now, there are a lot of efforts to create more analysis-oriented data models that essentially pre-join these tables to make a single table with similar types of information in one place: for example, a table with all the diagnosis information, or a single table for laboratory data. More complex data models, like the OHDSI consortium's OMOP model, also try to account for semantic consistency across all the tables by using standard terminologies to reduce variability. However, in all cases, those data models are implemented by writing queries like the examples we've shown in order to build those derivative tables. While there is a lot of standardization that goes into that ETL process, you can bet that there are also a lot of hidden influencers that come along for the ride.
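To make the diagnosis example above more concrete, here is a heavily simplified, hypothetical sketch of a query with that structure. The table and column names are illustrative assumptions, not the hospital's actual schema, and the real problem list subquery alone joined six tables rather than one.

    -- Hypothetical sketch only: table and column names are illustrative,
    -- not the hospital's actual schema.

    -- Rather than the single-table query most people imagine
    -- (SELECT * FROM diagnoses WHERE icd = '759.89'),
    -- the real query has one subquery per place a diagnosis can be recorded,
    -- stitched together with UNION.
    SELECT pl.patient_id, pl.icd_code
    FROM problem_list pl
    WHERE pl.icd_code = '759.89'
      AND pl.problem_status_c != 3   -- hidden influencer: status 3 means the entry was deleted

    UNION

    SELECT bd.patient_id, bd.icd_code
    FROM billing_diagnoses bd
    WHERE bd.icd_code = '759.89'

    UNION

    SELECT ed.patient_id, ed.icd_code
    FROM encounter_diagnoses ed
    WHERE ed.icd_code = '759.89'

    UNION

    SELECT pmh.patient_id, pmh.icd_code
    FROM past_medical_history pmh
    WHERE pmh.icd_code = '759.89';

Even in this stripped-down form, the deleted-entry filter is exactly the kind of decision that shapes the result set but rarely shows up in a methods description.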
And these influencers are truly hidden from the end users of the data, because at this time we rarely get access to the ETL code that built those models. It's really not clear to what extent those hidden influencers differ between each site's implementation, nor is it clear what effect those differences might have on our analysis.
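To illustrate the kind of ETL code we usually don't get to see, here is a hypothetical, heavily simplified sketch of a step that builds a pre-joined diagnosis table from the same sources as before. The table, column, and source names are illustrative assumptions, not any real site's implementation or the actual OMOP ETL, which is far more involved and also maps codes to standard terminologies.

    -- Hypothetical ETL sketch: build a single analysis-friendly diagnosis table.
    -- Names are illustrative assumptions, not a real site's ETL.
    CREATE TABLE derived_diagnoses AS
    SELECT pl.patient_id, pl.icd_code, 'problem_list' AS source
    FROM problem_list pl
    WHERE pl.problem_status_c != 3        -- the deleted-entry rule gets baked in here
    UNION
    SELECT bd.patient_id, bd.icd_code, 'billing' AS source
    FROM billing_diagnoses bd
    UNION
    SELECT ed.patient_id, ed.icd_code, 'encounter' AS source
    FROM encounter_diagnoses ed
    UNION
    SELECT pmh.patient_id, pmh.icd_code, 'past_medical_history' AS source
    FROM past_medical_history pmh;

An analyst who only ever queries the derived table would have no way to know, for example, that deleted problem list entries were excluded upstream, unless they could read this ETL code.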