Translating and evaluating historic phenotyping algorithms using SNOMED CT

Abstract Objective Patient phenotype definitions based on terminologies are required for the computational use of electronic health records. Within UK primary care research databases, such definitions have typically been represented as flat lists of Read terms, but Systematized Nomenclature of Medicine—Clinical Terms (SNOMED CT) (a widely employed international reference terminology) enables the use of relationships between concepts, which could facilitate the phenotyping process. We implemented SNOMED CT-based phenotyping approaches and investigated their performance in the CPRD Aurum primary care database. Materials and Methods We developed SNOMED CT phenotype definitions for 3 exemplar diseases: diabetes mellitus, asthma, and heart failure, using 3 methods: “primary” (primary concept and its descendants), “extended” (primary concept, descendants, and additional relations), and “value set” (based on text searches of term descriptions). We also derived SNOMED CT codelists in a semiautomated manner for 276 disease phenotypes used in a study of health across the lifecourse. Cohorts selected using each codelist were compared to “gold standard” manually curated Read codelists in a sample of 500 000 patients from CPRD Aurum. Results SNOMED CT codelists selected a similar set of patients to Read, with F1 scores exceeding 0.93, and age and sex distributions were similar. The “value set” and “extended” codelists had slightly greater recall but lower precision than “primary” codelists. We were able to represent 257 of the 276 phenotypes by a single concept hierarchy, and for 135 phenotypes, the F1 score was greater than 0.9. Conclusions SNOMED CT provides an efficient way to define disease phenotypes, resulting in similar patient populations to manually curated codelists.


INTRODUCTION
Computational use of electronic health records (EHRs) for individual patient care (eg, clinical decision support, risk prediction models) or improving health services (eg, audit, service evaluation, research) requires patient phenotypes and outcomes to be defined based on data contained within the EHR. [1][2][3] Many EHR systems record diagnoses using clinical terminologies, for example, the Clinical Modification of the International Classification of Diseases is used in the United States for coding encounter diagnoses for billing, and the Read Clinical Terminology (Read terms) has been used by GPs in the United Kingdom since 1985. 4 Thus, definition of phenotypes often involves the creation of lists of clinical terms (often called "codelists," "code sets," or "value sets"). However, a single clinical meaning may be represented by more than 1 term, which may be in different parts of the terminology hierarchy. As a result, codelist generation can be an onerous task requiring the identification of all codes of potential relevance to a concept followed by manual adjudication. 5 Systematized Nomenclature of Medicine-Clinical Terms (SNOMED CT) is a newer terminology 6 which is becoming more commonly used in EHR data worldwide, 7 and in the United States has been required for representing patient problems since 2013 in order for an EHR to receive government certification. 6 The SNOMED CT ontology includes multiple hierarchies and concept attributes which may enable simpler, more efficient, and more reproducible phenotype definitions. 8 However, these special features of SNOMED CT are currently under-used by researchers, particularly when using EHR data from UK primary care, which was historically coded using Read terms.

SNOMED CT
SNOMED CT 6 was initially developed in 1964 as the Systematized Nomenclature of Pathology, 9 and merged with the Read Clinical Terminology to create SNOMED CT. It is now maintained by SNOMED International, formerly the International Health Terminology Standards Development Organisation 10 and is mapped to multiple other terminologies. 11 In the United Kingdom, SNOMED CT is replacing Read terms used for coding diagnoses in primary care. 4 In SNOMED CT, alternative descriptions with the same meaning (eg, "Cardiac Insufficiency" and "Weak Heart") are modeled as synonyms which belong to the same concept ("Heart failure [disorder]"), minimizing the need for extensive text searches. SNOMED CT contains a "relationship" table encoding attributes of concepts and links between concepts, such as parent-child "is a" relationships which link a concept to its subtypes. This makes it simple to define a codelist based on a small number of ancestor concepts. The polyhierarchical structure allows concepts to be linked to more than 1 ancestor, for example "Acute diastolic heart failure (disorder)" is a descendant of both "Acute heart failure (disorder)" and "Diastolic heart failure (disorder)." As well as parent-child relationships, SNOMED CT defines over 100 potential attributes for each concept, 12 such as procedures associated with a body site, and enables new concepts to be defined by linking concepts using SNOMED CT expressions (also known as "post-coordination"). 13 SNOMED CT has a mechanism for recording changes to concepts and their relationships over time. 14 The NHS Digital SNOMED CT release includes a Query Table which allows inactive terms to be searched according to their original location in the SNOMED CT hierarchy. 15

Related work
The SNOMED CT "is a" relationships have been widely used to define groups of conditions for clinical and research purposes, particularly by the Observational Health Data Sciences and Informatics (OHDSI) community. OHDSI curate a set of vocabulary resources including SNOMED CT for the Observational Medical Outcomes Partnership (OMOP) Common Data Model. OMOP phenotypes can be defined using SNOMED CT hierarchies 16 as has been used in the eMERGE network. 17 SNOMED CT value sets used in Health Level 7 Fast Healthcare Interoperability Resources (HL7 FHIR) Val-ueSet resources for sending messages between EHR systems can also be defined using SNOMED CT expressions ("Content Logical Definition") and expanded into a simple list of concepts. 18 Chu et al compared the process of creation of hierarchy and listbased SNOMED CT value set creation for 10 conditions, and found that the hierarchy-based value sets were simpler (median 3 vs 78 concepts) and faster to construct. 8 Willett et al found that 125 diagnosis value sets could be created using SNOMED CT hierarchies, requiring a median of only 2 concept hierarchies per value set, and were easy to understand and share. 19 Winnenburg and Bodenreider derived metrics to quantify the completeness, correctness, and nonredundancy of published value sets compared to the hierarchy from the source terminology that has the same intended meaning. 20 A number of software packages have been developed to assist in navigating SNOMED CT, 21 interpreting SNOMED CT Expression Constraint Language 22,23 and enabling fast evaluation of SNOMED CT expressions using graph-oriented databases. 24 Rationale and aim for the present study EHR research databases from the UK primary care, such as the Clinical Practice Research Datalink, 25,26 the Health Improvement Network, 27 and UK Biobank, 28 have been extensively used for research. Researchers using these databases have previously used Read term value sets to define phenotypes, as general practice data were previously recorded using Read terms. Hundreds of Read-based research phenotype definitions have been deposited in repositories such as ClinicalCodes.org, 29 the CALIBER phenotype portal, 30 and the new Health Data Research UK Phenotype Library. 31 There is a lack of prior studies on the use of SNOMED CT with UK primary care research databases. We hypothesize that SNOMED CT can simplify the process of creating accurate and parsimonious definitions of disease diagnoses for use in Read or SNOMED CT-based EHR research databases.
In this study, we investigate the phenotyping process in detail for 3 exemplar diseases: diabetes mellitus, asthma, and heart failure, comparing patient cohorts derived using different phenotype definitions in the CPRD Aurum primary care database. Building on these findings and our previous semisupervised approach to phenotype definition, 28 we also develop a semiautomated method for converting existing Read Version 2 phenotype definitions published by the CALIBER program 30,32,33 to SNOMED CT. We apply this method to several hundred existing phenotype definitions to evaluate how well the method performs across a wide range of phenotypes. We also present a new package (Rdiagnosislist) 34 for the R statistical system to make SNOMED CT convenient to use for researchers familiar with R.

Overview
This study has 2 parts. First, we evaluated different methods of using SNOMED CT to create codelists for defining a small number of phenotypes, ranging from a simple hierarchy-based method to a more detailed, time-consuming but potentially more thorough method. Second, we chose a method that can be used semiautomatically at scale. We applied this method to the phenotyping task of identifying prevalent or historic medical conditions, which is commonly required for epidemiological studies using EHR. For both parts, we used as our reference Read V2 codelists which have been manually curated, validated, and previously used for EHR health research.

Data source
We used a random sample of 500 000 patients from CPRD Aurum, 24 a longitudinal database of routinely-collected EHR from UK primary care practices using the Egton Medical Information Systems EHR. CPRD Aurum was chosen because it contains data recorded using both Read and SNOMED CT.
The study period was from January 1, 2013 to December 31, 2018. We included patients aged over 18 years registered at practices contributing data of acceptable quality for research, and a minimum of 1 year of follow-up after the start date. In CPRD Aurum, all clinical concepts are denoted by a CPRD-specific identifier ("medcode"), and CPRD provides a dictionary linking each medcode with equivalent Read and SNOMED CT concepts. In order to provide a valid comparison between methods, we excluded CPRD data rows with observations that lacked Read V2 maps. We mapped all codelists to CPRD Aurum medcodes and excluded those that were not present in CPRD Aurum.

Phenotype development using Read
We used Read version 2 codelists previously created by researchers for a range of EHR studies using the CALIBER data platform (based on the CPRD Gold primary care EHR database). 30 These codelists had been created by keyword searches of term descriptions and by browsing the Read code hierarchy 32 and were used in a study of 308 diseases across the lifecourse ("chronological map of human health"). 33

SNOMED CT terminology
We used the August 2020 SNOMED CT UK Edition (which includes the international release and the UK clinical and drug extensions), the UK Query Table and History Substitution Table  August 2020 edition, UK Data Migration Workbench (Read to SNOMED mapping) April 2020 edition, and the CPRD Aurum medical dictionary. We obtained the SNOMED CT "Snapshot" source files from NHS Digital in Release Format 2. The SNOMED CT release is composed of 3 main tables: "concepts" which represents the clinical meanings, "descriptions" which contains the words or phrases used to describe each concept, and "relationships" which defines the connections between concepts. It also contains "reference sets" which are lists of SNOMED CT concepts for particular business uses, such as translation between SNOMED CT and other coding systems.
The software script to create the SNOMED CT phenotypes was developed in Python 3.8 using Spyder 4.1 IDE and the Networkx library version 2.5, 35 which is a python package for network analysis. We also carried out analyses in R 4.1 using a new package that we developed, called "Rdiagnosislist." 34 Diabetes mellitus, asthma, and heart failure phenotypes For each disorder, we created codelists for phenotype definitions using 3 methods-a concise method using just the "is a" hierarchy ("primary"), a method that uses additional SNOMED CT relationships ("extended"), and a thorough but more labor-intensive method involving manual searches of term descriptions ("value set") ( Figures 1 and 2).
For the "value set" method, we aimed to identify all relevant SNOMED CT concepts regardless of their place in the hierarchy, by performing a keyword search of SNOMED CT descriptions. We used the SNOMED CT knowledge model to simplify the list of concepts by subsumption, and manually reviewed the list to decide if each concept would be included. This broad approach aimed to achieve complete coverage of phenotype-related codes including diagnostic findings, test findings, complications, related diseases, management, procedures, and follow-up. We additionally defined exclusion criteria for each codelist, consisting of historical, possible, suspected, or negated concepts, which were excluded from the final phenotype definition. We used python Networkx-based methods to explore the SNOMED CT hierarchy, develop SNOMED CT expressions, and visualize the connections between concepts.
For the "primary" method, the inclusion codelist consisted of a single concept describing the disease ("195967001 j Asthma [disorder]," "84114007 j Heart failure [disorder]," or "73211009 j Diabetes mellitus [disorder]") and its descendants. For diabetes, we excluded "11687002 j Gestational diabetes mellitus (disorder)" and its descendants, to match the intent of the reference Read codelist, which did not include gestational diabetes. None of the descendants of the primary concept were excluded for the asthma or heart failure codelists.
We also created "extended" codelists including: (1) all concepts in the primary codelist, (2) all concepts related to the primary concept hierarchy using with the "Due to" or "Associated with" relationships, and (3) concepts linked by the relationship "Associated finding" with the additional criteria: "Finding context" attribute equals "Confirmed present" or "Known present," and "Subject relationship context" equals "Subject of record." The third category included concepts such as "Heart failure confirmed" and "History of heart failure," which have the "situation" semantic tag and are not descendants of the "Heart failure" concepts. The aim of the extended codelists was to encompass as many concepts as possible that imply that the patient currently has the diagnosis or experienced it in the past.
For each codelist, we added inactive SNOMED CT concepts using the Query Table (Figure 1), because historic EHR data in CPRD may include inactive concepts.

Phenotypes from the chronological map of human health
We developed an R script to download and process Read Version 2 codelists from a GitHub repository (https://github.com/spiros/chronological-map-phenotypes) for 276 diseases defined using primary care data in a study of diseases across the lifecourse. 33 Each codelist consisted of Read terms for the disease diagnosis or history of the disease (terms for suspected diseases were not included). We used NHS Digital mapping files and the CPRD Aurum dictionary to map the Read codelists to SNOMED CT concepts, and denoted these converted codelists "Read-derived. " We also sought to derive a parsimonious SNOMED CT expression to define each phenotype. For each mapped concept, parent of a mapped concept, or descendant of a mapped concept, we calculated the precision and recall for using the concept and its descendants to represent the entire codelist ( Figure 3 and Table 1). We clinically reviewed the highest performing concepts for each phenotype, and judged whether any of these concepts had the same meaning as the phenotype itself. We used the functionality of the Rdiagnosislist package to display SNOMED CT hierarchies to assist this process. If it was possible to find a suitable concept to represent the entire codelist, this concept was selected and the codelist was generated using SNOMED CT relationships ("is a" descendants and linked personal history and confirmed situation concepts). Inactive SNOMED CT concepts were then added using the Query Table. We included personal history concepts as the codelists were intended to be used for classifying disease prevalence or history; definitions for incident disease should not include personal history concepts.

Evaluation of phenotypes
We extracted the earliest recorded diagnosis for each condition for patients in the study population using different phenotype definitions. We compared the precision, recall, and F1 scores of SNOMED CT phenotype definitions in selecting patient cohorts, taking the Read-derived cohorts as the gold standard. We also com-pared the number of patients, mean age, and sex distribution of cohorts defined using different methods.

Study population
The sample used in analysis consisted of 237 122 men and 262 878 women with a mean (SD) age of 40.2 (19.1) years at the mid-point of their registration period.  Diabetes mellitus, asthma, and heart failure phenotypes

Read V2 codelists
We successfully created value set SNOMED CT-based phenotype definitions based on keyword searches for words relevant to diabetes mellitus (Supplementary Table S1), asthma (Supplementary  Table S2, Figure 4), and heart failure (Supplementary Table S3). The total number of concepts in the codelist was for 258 diabetes, 53 for asthma, and 44 for heart failure. The majority of concepts had the semantic tag "disorder," comprising 58.5%-96.6% of concepts in each codelist. Other semantic types comprised a small number of concepts in each codelist except for the asthma value set, which contained 12 concepts (22.6%) with the semantic tag "regime/therapy" (Supplementary Table S4, Figure 4). The vast majority of SNOMED CT concepts and patients were obtained by using "is a" SNOMED CT relationships with primary concepts ( Table 2). Among diabetes concepts, the relationship "Has focus" selected 2051 additional patients (7.69%) compared to using "is a" relationships alone. The majority of these additional patients (1820) were selected by the "Diabetes mellitus screening" concept, but some of these patients may not actually have diabetes (if they underwent screening and the screening test was negative).
Definitions for primary codelists in tabular form and as SNOMED CT expressions are given in Supplementary Table S5 Cohorts derived from value set codelists included more patients compared to those derived from primary or extended codelists. Considering Read-derived codelists as the gold standard, value set codelists had higher recall, but slightly lower precision than primary or extended codelists (Table 3). F1 scores exceeded 0.93 for all comparisons.
There were minimal differences in the age and sex distributions between cohorts defined using the different methods, although some differences were statistically significant because of the large sample size. For the diabetes phenotype, the primary and extended cohorts had a slightly lower proportion of female patients than the Read cohort (difference in proportions 0.012 or 0.010, respectively). Some of the SNOMED CT asthma and heart failure cohorts were younger than cohorts selected using Read, for example the primary asthma cohort was mean 0.36 years (95% CI: 0.09-0.64) younger and the primary heart failure cohort was 0.67 years (95% CI; 0.29-1.05) younger than the Read cohort (Table 4).

Phenotypes from the chronological map of human health
On manual review of the Read to SNOMED CT mappings, we were able to represent 257 of the 276 phenotypes with a SNOMED CT concept, and in 59% (151/257), this was the concept that was suggested automatically (the concept hierarchy with the highest F1 score for the inclusion of SNOMED CT concepts against the gold standard of the original Read codelist mapped to SNOMED CT). Some phenotypes could not be represented as a single concept hierarchy because they are a collection of different diseases (eg,  Examples of SNOMED CT concepts related to heart failure, showing which would be included in codelists created using the "primary," "extended," and "value set" methods. "enteropathic arthropathies"), or because they are defined by exclusion (eg, "other hemolytic anemia," "stroke not otherwise specified") (Supplementary Table S7). Full results are included in Supplementary File 5.
For 29 phenotypes, the SNOMED CT concept hierarchy selected an identical cohort to the mapped Read codelist, and in 135 (52.5%), the F1 score was greater than 0.9.

Summary of main findings
By examining 276 diseases, we have demonstrated that the SNOMED CT knowledge model can simplify the process of selecting patient cohorts and outcomes in EHR research databases, and enable the transformation of historic phenotyping algorithms. In cases where the SNOMED CT knowledge model corresponds well to the researcher's intended definition of a phenotype, it may be possible to represent the phenotype using a single concept hierarchy. For the examples of diabetes mellitus, heart failure, and asthma, patient cohorts derived using different Read and SNOMED CT phenotyping methods were very similar. We have also produced an R package 34

Add inactive SNOMED CT concepts
Map to CPRD Aurum medcodes

Defining phenotypes using SNOMED CT
Conventionally, phenotype and outcome definitions consist of an enumerated list of terminology concepts ("codelist," "code set," or "value set") created by the manual process of keyword searching, which does not take into account the ontology contained within the terminology system. 1 Consistent with previous studies, we found that the SNOMED CT ontology can simplify the process of produc-ing diagnosis codelists that are concise, understandable to clinicians, and useful for data analytics. 8,19 We have extended these findings to a large EHR database containing data encoded using both Read V2 and SNOMED CT from hundreds of primary care practices in the United Kingdom. We have also developed semiautomated methods for converting a Read V2 codelist into a parsimonious SNOMED CT phenotype definition, implemented in an R package. 34 Figure 4. Concept network diagram for the asthma "value set" phenotype definition, with key concepts labeled. Key to node colors: black with white text ¼ primary concept, orange ¼ linked finding or disorder, white with blue outline ¼ linked concepts of other semantic types eg, procedure or regime/therapy. Edges represent relationship types: black ¼ "Is a," green ¼ "Has focus," red ¼ "Associated procedure," and orange ¼ "Associated finding." Table 1. Example of semiautomated codelist conversion from Read Version 2 to SNOMED CT, by calculating concept-based precision and recall for the hierarchy of each mapped concept and its parents. In this example, the SNOMED CT hierarchy for the concept "12295008 j Bronchiectasis" also includes Kartagener syndrome, which is not included in the Read codelist, hence the precision is 0.83 instead of 1 Terminologies change over time as medical knowledge changes or errors in the terminology system are corrected. 14 A codelist defined using SNOMED CT expressions or hierarchy will automatically include new concepts if they are modeled in an appropriate place in the hierarchy, and SNOMED CT enabled EHR systems could enable patient populations for decision support, audit, or research to be defined in a parsimonious way. The principal parent-child "is a" relationship enabled the majority of relevant concepts to be captured. Additional concepts could be included by using other relationships, for example "Due to" could enable inclusion of concepts for which the current concept is a sequela or consequence, and "Associated with" provides a link to procedures associated with a diagnosis. SNOMED CT concepts with the semantic tag "situation" include concepts for suspected and historic conditions, as well as a few concepts for confirmed disorders, which may be important in some cases. For example, the latest UK release of SNOMED CT contains 2 important "situation" concepts related to COVID-19: "1300721000000109 j COVID-19 confirmed by laboratory test" and "1300731000000106 j COVID-19 confirmed using clinical diagnostic criteria," which are related to, but not direct subtypes of "840539006 j Disease caused by Severe acute respiratory syndrome coronavirus 2 (disorder)" (the main COVID-19 diagnosis concept).

Recommendations for the use of SNOMED CT in defining research phenotypes
Based on our findings, we recommend the use of SNOMED CT concept hierarchies and attributes for generating phenotype definitions in a parsimonious, explainable, and reproducible way. However, there researchers need to be aware of the following caveats: First, SNOMED CT concepts can become inactive but may have been used to record historic data that exist in research databases. There may be no current equivalent for the meaning of some inactive concepts (if they are ambiguous or deprecated). In this project, we included inactive SNOMED CT concepts using the NHS Digital Query Table, which categorizes the provenance of the link on a 4level scale (ranging from 0 ¼ subsumption is always true, to 3 ¼ original concept had at least 2 distinct meanings). Second, the researcher or clinician may have a different intent regarding the inclusion of subtypes compared to the SNOMED CT hierarchy, such as whether gestational diabetes is included within a diabetes phenotype.
Third, SNOMED CT semantic types may not be adhered to strictly when data are entered, which means that essential information may not be in the expected location in the hierarchy. For example, cancer diagnoses may have been recorded using "morphologic abnormality" concepts which are intended for use on histology reports. Semantic tags are not always consistent with the SNOMED CT hierarchy within SNOMED CT itself. 36 Fourth, the SNOMED CT methods rely on complete and accurate relationships between concepts being modeled within the SNOMED CT ontology. While this is broadly the case for diagnoses, concepts in other domains (eg, symptoms/findings) may not be as well modeled, and it is possible that relevant terms may be overlooked if researchers rely solely on the hierarchy. We recommend that ongoing engagement between clinicians, researchers, and SNOMED CT editors should continuously improve the quality of the SNOMED CT model.

Limitations of this study
A limitation of this study is that we considered the existing Read V2 phenotype definitions as the gold standard, and assumed that they were complete and accurate based on the manual review process involved in creating them. Thus it was assumed that any additional Read V2 terms selected by the SNOMED CT method and mapping were extraneous, but this may not have been the case.
Second, we relied on accurate mappings between SNOMED CT and Read V2 as supplied by NHS Digital and CPRD in order to ensure that codelists created using different terminologies had the same meaning.
Third, we did not have access to patients' original text notes, and were unable to verify the accuracy of the original Read or SNOMED CT concepts entered in the health records. CPRD does not provide researchers with access to the free text within EHRs because of concerns over confidentiality. However, previous validation studies using the primary care databases 37 have shown that coded diagnoses are likely to be correct when present, but may be incomplete and may not use the most specific, precise term available. Fourth, our study was limited to the primary care setting. We assume that SNOMED CT coding in secondary care settings will be similarly accurate in order to enable data to be used for research, but it will be important to validate the accuracy and completeness of coding.

CONCLUSION
In conclusion, the SNOMED CT knowledge model provides an efficient way to define disease phenotypes in EHR databases. We have demonstrated that the method can result in similar patient populations compared to manually curated codelists, and we have developed tools to assist researchers in making use of these models. For some disease phenotypes, a simple definition based on a single concept hierarchy may be sufficient to define the phenotype. These methods may facilitate the use of EHR data for research and improving patient care.

FUNDING
This work was supported by Health Data Research UK, which receives its funding from the UK Medical Research Council, Engi-

AUTHOR CONTRIBUTIONS
ADS devised the protocol, sought approvals and supervised the project. ME, AG, MQ, and ADS analyzed the data. ME drafted the manuscript. All authors contributed intellectual content and approved the final manuscript. ADS is guarantor.

SUPPLEMENTARY MATERIAL
Supplementary material is available at Journal of the American Medical Informatics Association online.