When you reflect on your own experience with your local healthcare system, how many different organizations, hospitals, clinics, physicians, and other healthcare professionals have you engaged for your care? No matter what healthcare environment you use in your country, over time your healthcare activities are spread across many different organizations that, in most instances, do not share data with each other. This fragmentation of healthcare data is true even in single-payer, government-centered healthcare systems and in comprehensive, vertical healthcare organizations. Depending on how your healthcare system is organized, you may have a single health record identifier that is unique to you and can be used to identify all of your healthcare interactions across the entire healthcare system. Lucky you. However, even in single-payer systems, this is not always the case, and of course, data entry errors can also cause records on the same individual to appear as distinct records. Record linkage seeks to resolve two issues. The first is to detect and merge records about the same individual within a single organization. This use case is called record de-duplication. The second is to merge data across different organizations, which makes it possible to obtain data elements from one dataset that are not available in the other dataset, or to obtain higher-quality data values from one dataset that improve the data quality in the other dataset. This second use case is called data enrichment. Our real-world example later in this video is an example of data enrichment focused on improving data quality in an electronic medical record dataset. The technical features of record linkage are very complex, and recently have exploded with novel methods using encryption and machine learning techniques. Here, we only provide a high-level conceptual framework.
As mentioned, should you be so lucky as to be in a health system with a single, unified, all-encompassing medical record number, represented in this slide by the label MRN, then you could directly link records within a dataset or across two datasets by a simple test for equality. In most real-world settings, this dream case just doesn't apply. Here, we show that dataset 1 has the MRN, but dataset 2 identifies its patients using a Medicaid ID number, or MIN. Since these two numbers are completely independent, they cannot be used to link two records. More often, one must infer a match across the datasets using other variables, which are called pseudo identifiers. Pseudo identifiers are variables that can be used, usually in various combinations, to identify records that refer to the same individual. There are strong pseudo identifiers that alone can provide significant linkages, such as social security numbers or driver's license numbers in the United States. Other pseudo identifiers are weaker, such as combinations of first names and last names. A typical linkage method will use multiple pseudo identifiers to increase the accuracy of the linkages. In this slide, we show the use of last name, labeled as LN; first name, labeled as FN; date of birth, or DOB; and zip code. Combinations of pseudo identifiers are used to improve the accuracy of record linkage. This slide shows just a very small number of the combinations of record linkage variables that were used in a real-world multi-pass record linkage system. It highlights that as the number of linkage variables grows, the number of possible combinations of variables grows as 2 to the nth power. Thus, as one attempts to link millions of records together, the record linkage process can become extremely computationally expensive. There exists a huge range of record linkage methods, and we do not examine any specific method in this video. Here, we show the major classes of record linkage methods.
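To make the pseudo-identifier idea concrete, here is a minimal sketch of a multi-pass linkage that requires exact matches on a chosen combination of variables in each pass. It is an illustration only, not the real-world system shown on the slide: the records, the pass order, and the function names `candidate_key_sets` and `multi_pass_link` are all made up for this example. The field labels LN, FN, DOB, ZIP, MRN, and MIN follow the slide. The sketch also counts the non-empty combinations of n variables, which is 2 to the nth power minus 1.

```python
from itertools import combinations

# Pseudo identifiers, labeled as on the slide.
FIELDS = ("LN", "FN", "DOB", "ZIP")

def candidate_key_sets(fields):
    """Return every non-empty combination of linkage variables (2**n - 1 of them)."""
    return [combo
            for size in range(1, len(fields) + 1)
            for combo in combinations(fields, size)]

def multi_pass_link(left, right, passes):
    """Link MRN records to MIN records, one exact-match pass per key combination.

    Passes run in the order given (strictest first). A record linked in an
    earlier pass is not considered again, and only unambiguous matches are kept.
    """
    links = {}          # MRN -> MIN
    used_right = set()  # MIN values already linked
    for key_fields in passes:
        # Index the still-unlinked right-hand records by their key values.
        index = {}
        for rec in right:
            if rec["MIN"] not in used_right:
                key = tuple(rec[f] for f in key_fields)
                index.setdefault(key, []).append(rec["MIN"])
        for rec in left:
            if rec["MRN"] in links:
                continue
            key = tuple(rec[f] for f in key_fields)
            candidates = [m for m in index.get(key, []) if m not in used_right]
            if len(candidates) == 1:
                links[rec["MRN"]] = candidates[0]
                used_right.add(candidates[0])
    return links

# Made-up records: dataset 1 carries an MRN, dataset 2 carries a MIN.
dataset1 = [
    {"MRN": "A1", "LN": "SMITH", "FN": "ANNA", "DOB": "2010-03-04", "ZIP": "80045"},
    {"MRN": "A2", "LN": "JONES", "FN": "BEN", "DOB": "2012-07-19", "ZIP": "80220"},
    {"MRN": "A3", "LN": "LEE", "FN": "CARA", "DOB": "2011-01-30", "ZIP": "80301"},
]
dataset2 = [
    {"MIN": "M9", "LN": "SMITH", "FN": "ANNA", "DOB": "2010-03-04", "ZIP": "80045"},
    {"MIN": "M8", "LN": "JONES", "FN": "BENJAMIN", "DOB": "2012-07-19", "ZIP": "80220"},
    {"MIN": "M7", "LN": "LEE", "FN": "CARA", "DOB": "2011-01-03", "ZIP": "80301"},
]

# An assumed pass order: all four variables, then two looser combinations.
PASSES = [("LN", "FN", "DOB", "ZIP"), ("LN", "DOB", "ZIP"), ("LN", "FN", "ZIP")]

links = multi_pass_link(dataset1, dataset2, PASSES)
print(len(candidate_key_sets(FIELDS)), links)
```

In this toy data, the first pass links only the record that agrees on all four variables. The looser passes then pick up the record with a first-name variant and the record with a transposed date of birth. With four variables there are already 15 possible combinations, which hints at why a real system picks a small, ordered subset of passes.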
Deterministic methods use combinations of variables that must match exactly to link two records. Probabilistic methods have the capability to incorporate partial matches as evidence for linking records. Unlike deterministic methods, probabilistic methods require the user to set a threshold or a confidence value that is used to declare two records to be close enough to be linked. In the other dimension, clear-text linkage uses data in its original human-readable form. Encrypted linkage, also called privacy-preserving record linkage or PPRL, can be used in settings where it is unacceptable to reveal human-readable values in the linkage process. While record linkage can be used to achieve many different goals, as described previously, we illustrate the ability of record linkage to improve data quality by introducing a concept that we have called a relative gold standard. The term gold standard may be familiar to some of you. The term refers to a standard that represents the highest possible level of quality. A gold standard reflects the best that exists or can be produced with today's technology or methods. Applying this definition to data, a gold standard dataset is a set of data whose data elements have been deemed to have the highest level of quality and integrity possible, acknowledging that today's gold standard may be replaced by an even better gold standard based on new technology or methods. Not all datasets can achieve the standards required to be deemed a gold standard. But there are some datasets where the data owners have invested substantial effort to achieve a very high level of quality, higher than is expected or found in other datasets whose owners have not invested the same amount of effort in data quality or integrity. We define a relative gold standard dataset as an institutional data source that is known to have higher data quality in a data domain when compared to other data sources that contain the same data domain.
There are two key implications to this definition. First, we define the relationship between two data sources as relative rather than absolute, which is a tacit acknowledgment that both datasets are likely to contain errors. Second, we limit the relationship to a specific data domain. The same data source may have extremely high quality and integrity in one data domain, but it may not have the same quality standards in some other data domain. In fact, two data sources may switch roles as relative gold standards, depending on the data domain being considered. We illustrate the use of record linkage by examining the results of linking a data source with poor data quality to a relative gold standard dataset. The challenge is where to find a dataset that can be considered a relative gold standard. In our experiment, we used a clinical registry. A clinical registry is a specialized database that is maintained by a small group of highly motivated individuals focused on a specific disease of interest, such as diabetes, autism, or cancer. In this setting, the limited scope of the registry allows the data owners to apply intense effort to each data element that is entered into their registry. This intense extra effort, while not scalable to the large clinical populations found in the electronic medical record, results in smaller databases that have highly accurate data on a well-defined patient sub-population. For our experiment, we used a pediatric bone marrow transplant registry, or BMT registry, as our relative gold standard. The BMT registry owners are meticulous in their pursuit of complete and accurate data on their patient population. For the standard dataset, we used our electronic medical record, which contains data collected during routine clinical practice. We used patient race as the variable to assess the impact of record linkage on data quality. Why patient race?
Race is a concept for which there is no absolute right answer that can be used to determine correctness. For the BMT registry, reporting on patient race is a requirement of their national funding agency for continued financial support, so it is a critical data element for this database. For the electronic medical record, patient race is not used in any clinical reporting or funding decision, so there's much less pressure to get meaningful values into this field. Finally, we picked this specific variable because researchers who wanted to use patient race from the medical record in their research were very unhappy with the values recorded there. This experiment was specifically designed to show how record linkage between the medical record and the BMT registry relative gold standard could be used to improve data quality, which is why I'm showing these results in this video. Skipping over many details, which are described in the full article included in the course materials, we were able to link 1,192 patients across the two databases using deterministic linkage methods. We extracted the values for patient race for each linked individual in both databases. One task that needed to be performed was mapping the different values for race from the two databases into a single common set of values. Fortunately, there was pretty high agreement between the two value sets, so mapping from one value set into the other was relatively straightforward. This is not always the case. This slide shows the results for the 1,192 linked patients. Race values recorded in the BMT registry are across the top. Race values recorded in the electronic medical record are down the left column. The diagonal represents patients where the same value for race appears in both databases. As you can see, the two databases agreed only 68 percent of the time.
Cohen's Kappa, a statistical measure of agreement that corrects for the agreement expected by chance, was only 0.26. Perfect agreement would result in a Kappa score of 1.0. Let's look at how record linkage improved data quality, which is the main reason for doing all of this work. In this slide, we show that there were 45 cases in the BMT database that were marked as unknown, which represents only three percent of the patient population. Of these 45 cases, the medical record had values other than unknown in four cases. But looking in the other direction, the medical record had 400 cases recorded as unknown race, which represents 33 percent of the population. The BMT registry had values other than unknown for 359 of the 400 cases. That is, the BMT registry could provide values other than unknown for 90 percent of the unknown values recorded in the electronic medical record. Let's take a look at the potential analytical impact of this record linkage experiment. Patient race is often used in medical research either as a predictive variable or as a confounder in a statistical model. You do not need to understand the difference between these two terms to appreciate how this record linkage experiment could significantly change any analytic result. In this slide, in the left column, I provide the distribution of patient races that would have been calculated using the raw medical record values prior to record linkage. In the right column, I provide the race distribution for the same patient population after updating the unknown patient races with the more complete values recorded in the BMT relative gold standard. Remember that we consider the BMT registry to be a relative gold standard for patient race, which means values for this variable from the registry are considered to be of higher quality than the corresponding values in the medical record. We are therefore justified in replacing the original medical record values with the BMT registry values.
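The two calculations just described, agreement between the linked values and filling in unknown values from the relative gold standard, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the analysis code from the study: the category labels "A" and "B", the toy pairs, and the function names `cohens_kappa` and `enrich` are invented for the example, and none of the numbers come from the BMT experiment. Each pair holds the medical record value first and the registry value second, already mapped to a common value set.

```python
from collections import Counter

def cohens_kappa(pairs):
    """Cohen's kappa for a list of (value_a, value_b) pairs.

    kappa = (observed agreement - chance agreement) / (1 - chance agreement).
    Undefined (division by zero) if chance agreement equals 1.
    """
    n = len(pairs)
    observed = sum(a == b for a, b in pairs) / n
    count_a = Counter(a for a, _ in pairs)
    count_b = Counter(b for _, b in pairs)
    # Chance agreement: product of the two marginal proportions, summed over categories.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (observed - expected) / (1 - expected)

def enrich(pairs, missing="Unknown"):
    """Replace a missing medical record value with the registry value, when the registry has one."""
    return [reg if emr == missing and reg != missing else emr
            for emr, reg in pairs]

# Made-up linked pairs: (medical record value, registry value).
toy_pairs = [
    ("A", "A"), ("A", "A"), ("B", "B"), ("Unknown", "A"),
    ("Unknown", "B"), ("A", "B"), ("B", "B"), ("Unknown", "Unknown"),
]

kappa = cohens_kappa(toy_pairs)
before = Counter(emr for emr, _ in toy_pairs)
after = Counter(enrich(toy_pairs))
print(round(kappa, 3), dict(before), dict(after))
```

Note that this sketch updates only the unknown medical record values and leaves disagreeing known values alone, which matches the before-and-after comparison described for the slide. Replacing every medical record value with the registry value would be a different design choice.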
Notice that there are significant differences between the before and after distributions, which could change an analytic result. There are many ways in which record linkage can increase data quality, including incorporating novel variables not present in a dataset or filling in missing data that are captured in one dataset but not the other. Similarly, record de-duplication can increase the quality and quantity of data elements by merging data that previously was divided between two records into one merged or linked record. These are additional examples of data enrichment. If your work enables you to obtain data from more than one data source, it is always worthwhile to investigate using one of the many record linkage methods as a way to increase the quality of data available for your studies.