Diabetic retinopathy environment-wide association study (EWAS) in NHANES 2005-8.

Background: Several circulating biomarkers are reported to be associated with diabetic retinopathy (DR). However, their relative contributions to DR compared to known risk factors, such as hyperglycemia, hypertension, and hyperlipidemia, remain unclear. In this data driven study, we used novel models to evaluate the associations of over 400 laboratory parameters with DR. Methods: We performed an environment-wide association study (EWAS) of laboratory parameters available in National Health and Nutrition Examination Survey (NHANES) 2007-8 in individuals with diabetes with DR as the outcome (test set). We employed independent variable ('feature') selection approaches, including parallelized univariate regression modeling, Principal Component Analysis (PCA), penalized regression, and RandomForest. These models were replicated in NHANES 2005-6 (replication set). Findings: The test and replication set consisted of 1025 and 637 individuals with available DR status and laboratory data respectively. Glycohemoglobin (HbA1c) was the strongest risk factor for DR. Our PCA-based approach produced a model that incorporated 18 principal components (PCs) that had AUC 0.796 (95% CI 0.761-0.832), while penalized regression identified a 9-feature model with 78.51% accuracy and AUC 0.74 (95% CI 0.72-0.77). RandomForest identified a 31-feature model with 78.4% accuracy and AUC 0.71 (95% CI 0.65-0.77). On grouping the selected variables in our RandomForest, hyperglycemia alone achieved AUC 0.72 (95% CI 0.68-0.76). The AUC increased to 0.84 (95% CI 0.78-0.9) when the model also included hypertension, hypercholesterolemia, hematocrit, renal and liver function tests. Interpretation: All models showed that the contributions of established risk factors of DR especially hyperglycemia outweigh other laboratory parameters available in NHANES.


What is already known about this subject?
There are >500 publications that report associations of candidate circulating biomarkers with diabetic retinopathy (DR).
Although hyperglycemia, hypertension, and hyperlipidemia are established risk factors, they do not always explain the variance of this complication in people with diabetes; DR also shares risk factors with other diabetes complications including markers of renal and cardiovascular disease.
'Holistic' studies that quantify risk across all of these parameters combined are lacking.

What is the key question?
It is unclear whether risk models for DR may be improved by adding some of these reported biomarkers -there is an unmet need to systematically evaluate as many circulating biomarkers as possible to help rank their associations with DR.

What are the new findings?
We show that hyperglycemia is the strongest risk factor across all models.
We stratified the rest of the highest ranked parameters into groups related to diabetes control, renal and liver function, and hematocrit changes.
How might this impact on clinical practice in the foreseeable future?
The importance of focusing on parameters beyond hyperglycemia control to reduce risk of progression from diabetes to DR is emphasized.

INTRODUCTION
Diabetes represents the most common cause of microvascular changes in the retina.The initial retinal lesions of diabetic retinopathy (DR) are microaneurysms but they can occur in eyes with and without diabetes (1)(2)(3).With increasing duration of diabetes, other lesions develop and co-exist in the retina such as retinal hemorrhages, exudates, intraretinal microvascular abnormalities and neovascularization of the retina or optic disc.Based on the presence of individual lesions or a constellation of them, DR severity level is graded from mild, moderate and severe non-proliferative diabetic retinopathy (NPDR) to proliferative diabetic retinopathy (PDR) (4,5).Diabetic macular edema (DME) can occur in any stage of DR (5).In population-based studies, approximately a third of people with diabetes have DR (6,7).The established systemic risk factors for DR are suboptimal control of hyperglycemia, hypertension and hyperlipidemia (8,9).Hypertension can also cause some of these retinal lesions independent of diabetes (10).
There are several laboratory parameters that have been shown to be abnormal in people with DR such as hyperuricemia (11), low vitamin D levels (12), low thyroxine levels (13), anemia (14), oxidative stress and inflammatory markers (15).In addition, DR is also associated with markers of diabetic kidney disease including microalbuminuria and serum creatinine (16,17) and cardiovascular disease markers such as raised C-reactive protein (CRP) (18).Most of these associations and risks of DR are reported based on analysis of candidate laboratorybased serum or urinary markers.
In addition to these risk factors, there are several other non-modifiable and modifiable risk factors that have been attributed to the development and progression of DR.Some of these include age of onset of diabetes, duration of diabetes, male sex, and ethnicity (19)(20)(21).
There is an unmet need to rank these reported retinal, systemic and laboratory risk factors to understand their relative contributions or associations with DR in people with diabetes.The National Health and Nutrition Examination Survey (NHANEShttps://wwwn.cdc.gov/Nchs/Nhanes/) was initiated in the 1960s in order to examine the health and nutritional status of US citizens and has been surveying the population up to the present time.Since 1999, it has examined ~5000 citizens per year and includes various topics, including cardiovascular disease, diabetes, environmental exposures, eye diseases, hearing loss, infectious diseases, kidney disease, nutrition, etc.The data also contains several laboratory markers including environmental toxins, allergens, pollutants.
In this study, we used an environment wide association study (EWAS) methodology (22)(23)(24)(25) on NHANES 2007-8 to evaluate the rank order of systemic and laboratory risks of DR among individuals identified as having diabetes to evaluate their relative associations with DR.Our findings are then replicated in NHANES 2005/6.Our objective is not to only use previously reported risk factors but also provide new research avenues from this data driven agnostic modelling study.

Study data preparation
We used National Health and Nutrition Examination Survey (NHANES) 2007-8 as our primary cohort and 2005-6 as a replication cohort.Both datasets were prepared in the same fashion, however, for ease of interpretation, the following methods describe 2007-8.Specifically, three main categories of data were used: examination data (Ophthalmology -Retinal Imaging data; OPXRET_E), demographics data (DEMO_E), and laboratory data (Figure 1 footnote).The main outcome of interest in the examination data was 4 levels retinopathy severity, worse eye (OPDURL4) -this variable was recoded as binary with levels: no retinopathy; retinopathy (including mild NPR, moderate/severe NPR, and proliferative).All datasets were downloaded as SAS XPORT (xpt) format and read into R (v4.0.2) via the Hmisc package.
Individuals with a missing value in the main outcome variable were removed before aligning the examination, demographics, and laboratory data via each individual's respondent sequence number (SEQN).This dataset was then further filtered for only those individuals who had diabetes (Figure 1).Variables were removed from the data that had 0 variance (i.e.constant values) (Supplementary Table 1).Prior to any analysis, in addition, any variable that contained a single value occupying > 90% of total values was removed, as were variables that had > 90% missingness.Further specific filtering and encoding was then applied per dataset.[A] Examination data: variables that were different encodings of the main outcome were removed; variables that related to the status of the examination appointment were removed; OPDUHMA was removed, as it is a combination of 2 other variables that were retained (OPDUMA and OPDUHEM); variables related to glaucoma, for which there is already a single variable, were removed; variables related to the left or right eye where there was already a variable for 'worse' eye were removed; values encoded as missing were recoded as NA; and all other remaining variables were encoded as binary, with 0 representing the absence of the condition, and 1 representing any recorded presence (at any level) of the condition.[B] Demographic data: variables associated with interpreters and the language of the interview were removed; variables that were duplicates or different encoding of each were removed.[C] Laboratory data: categorical variables were removed and only continuous retained; duplicate variables related to the oral glucose tolerance test (OGTT_E) were removed; variables related to time since domestic activities ('pump gas', 'shower', etc.) were removed; variables that were duplicates or different encodings of each were removed; variables measured on the imperial system of weights and measures were removed if they had a corresponding variable in SI units.We focused only on continuous laboratory variables for the following reasons: 1, in NHANES, the majority of categorical variables are derived from the continuous variables; 2, our PCA-based approach can only work on continuous variables; 3, for RandomForest™, having continuous variables increases the number of splitting points in the data, and metrics of importance such as Gini are known to exhibit less bias on such data (26).

Covariates
Age, ethnicity, and diabetes duration were used as covariates.Diabetes duration was calculated as age at screening minus the age at which the individual was first informed that he/she had diabetes.

Statistical analysis
Prior to statistical analysis, continuous laboratory variables were logged (log e ) and then transformed into z-scores to ensure that these were on the same scale.In regression analysis, the complex sampling design of the NHANES dataset was accounted for through use of survey sampling weights via the survey package in R / CRAN.To do this, the following value-pairs were used with the svydesign function: (id, SDMVPSU; strata, SDMVSTRA; weights, WTMEC2YR; nest, TRUE).
Univariate analysis was performed on all candidate predictors using a survey-weighted compute-parallelized logistic regression model via the R / Bioconductor package RegParallel, adjusting for age, ethnicity, and duration of diabetes separately.The Benjamini-Hochberg (27) procedure was used to control the type I error false discovery rate (FDR).A customized Manhattan plot was generated using ggplot2, while pairwise scatter and correlation plots were generated via a customized pairs plot.Finally, a heatmap was generated via the R / Bioconductor package ComplexHeatmap.
As our study is also hypothesis-generating, multivariate approaches based on principal component analysis (PCA), penalized regression, and the RandomForest™ classification algorithm were additionally used.Variables were pre-filtered and prepared as per univariate testing.Principal component analysis was performed via the R / Bioconductor package PCAtools.After conducting PCA, each eigenvector was then independently regressed against retinopathy outcome via binary logistic regression and those that passed p≤0.05 were used to construct a multivariable model that was further tested in ROC analysis via the pROC package in R.
Separately, as model complexity and multi-collinearity can arise from a large number of predictors, elastic net regularization (penalized regression with L1 and L2 penalties of the Lasso and Ridge methods) was used to reduce the number of predictor variables using glmnet in R / CRAN.To fit the model, 100x cross-validation was used and alpha (α) set to 0.5.The final chosen variables were those whose coefficients were not shrunk to zero -these were plot as violin plots with scatter overlays to show differences between non-DR and DR via ggplot2.To determine accuracy, model predictions were made on the data using the lambda (λ) one-standard-error rule using the predict function from the stats package in R.
Finally, the RandomForest™ (RF) model was fitted via the randomForest R / CRAN package.For this, the dataset was divided randomly into 50% training and 50% validation.Prior to model fitting, the initial model was tuned using functionality provided by the caret package in R / CRAN, as follows: 1), a 10x cross-validation control function was defined via trainControl function; 2) the best value for 'mtry', i.e., the ideal number of variables to randomly sample, was determined using the train function across a search / tuning grid ranging between 1-40 and with Kappa as the metric; and 3) using the selected value of 'mtry', the ideal number of trees, 'ntrees', was determined also via the train function with selection metric based on Kappa.After the initial model was fit, variables with mean decrease in accuracy ≤ 1% were excluded and the model re-fit.This was then repeated in a recursive fashion until all variables with negative mean decrease accuracy were removed from the model.

Final risk models
Variables selected from RandomForest™ were grouped based on similarity of function or clinical use.Each group was then used to create independent univariate or multivariable binary logistic regression models with DR as the end-point.A single Wald test p-value was derived for each model using wald.testfrom the aod package.ROC analysis was performed using pROC.McFadden's and Nagelkerke's pseudo-R 2 were derived via the pscl and rms packages, respectively.

Role of the funding source
The funders had no role in study design, data collection, data analysis, data interpretation, writing, editing the report, or the decision to submit for publication.

Study cohort
In NHANES 2007-8, retinal imaging data is available for 3863 individuals, demographics data is available for 10149 individuals, and laboratory data is available for between 394 and 9307 individuals, depending on the individual laboratory dataset in NHANES (see Figure 1 footnote).After aligning all data and filtering for those who had diabetes by our classification, 1025 individuals remained in our dataset.The selection process is illustrated in Figure 1, while Table 1 provides an overview of the demographics of these individuals.
For our replication cohort, NHANES 2005-6, we prepared laboratory data following the same filter criteria as NHANES 2007-8 and produced a final dataset of 2459 individuals, among which 637 (with retinopathy, 176; no retinopathy, 461) had diabetes.

Retinal lesions of diabetic retinopathy
To help validate our methodology and cohort selection, we aimed to determine retinal lesions that define DR.To this end, we identified 9 retinal lesions in NHANES 2007-8 that were statistically significantly associated with DR and survived to p-value adjustment for false discovery (Table 2).The top lesions were retinal microaneurysms (p≤0 .0001), followed by retinal hard exudates (typically due to lipoprotein deposition in the retina and may be associated with macular edema) (p≤0 .0001).Other key lesions at p≤0 .0001 were retinal soft exudate (now termed cotton wool spots), retinal blot hemorrhages, intraretinal microvascular abnormalities (IRMA), and macular edema.In NHANES, retinal microaneurysms and retinal blot hemorrhages are encoded to be mutually exclusive, i.e., an individual is recorded as having retinal microaneurysms only when not accompanied with retinal blot hemorrhages, and vice-versa (Table 2).

Univariate logistic regression analysis
In total, 6 variables reached statistical significance in the unadjusted univariate analysis, 11 after adjustment for age, 2 after adjustment for ethnicity, and 7 for diabetes duration (Supplementary Figure 1; Table 3).Glycohemoglobin (HbA1c) was the only variable that was statistically significant in both the unadjusted and adjusted analyses.Other risk variables of note that reached statistical significance in the unadjusted analysis included serum glucose (mmol/L) (i.e., RBS), osmolality (mmol/Kg), urinary albumin (mg/L), and fasting glucose (mmol/L) (i.e., FBS).The only protective variable, i.e., negatively associated, was hemoglobin (g/dL).These variables indicate suboptimal diabetes control, abnormal kidney function and presence of anemia as risk factors for DR.There was evidence of co-correlation among these statistically significant variables from the unadjusted analysis (Supplementary

Principal component analysis
Unsupervised PCA using all laboratory variables revealed that 59 PCs could account for 80% or more variation in the dataset.Eighteen PCs were statistically significantly associated with DR at p≤0.05 via independent binomial regression models testing each PC (Supplementary Table 2).The top variables responsible for variation along these PCs included measures of blood glucose (HbA1c, random blood glucose and fasting blood glucose), kidney function markers (urinary albumin, blood urea nitrogen [BUN]), hematological markers (hematocrit, hemoglobin, red blood cell distribution width), inflammatory markers (CRP), white blood cell count, urinary nitrates, segmented neutrophil count), and toxic elements (urinary beryllium and cotinine) among others -these PCs were also statistically significantly correlated to microaneurysms, the previously-identified top retinal lesion, and the covariates used during univariate testing (Supplementary Figure 3).Through ROC analysis, these 18 PCs achieved AUC 0.796 (95% CI: 0.761-0.832).

Penalized regression model
We fitted an unbiased elastic-net penalized regression model to the laboratory variables and cross-validated it 100x.The model selected 9 variables whose coefficients were not shrunk to zero: urinary albumin, BUN, urinary cobalt, CRP, HbA1c, blood osmolality, serum potassium, systolic blood pressure, and urinary nitrate (Figure 2).Of note, these measurements mainly represent diabetes and blood pressure control and kidney function.This model had an accuracy of 78.51% and AUC 0.74 (95% CI: 0.72-0.77)when predicted on the same dataset on which the model was produced.

DISCUSSION
This EWAS of NHANES 2007-8 data with DR outcomes in individuals with diabetes included an unbiased feature selection approach based on a rudimentary univariate regression enabled for compute parallelization, PCA, penalized regression, and RandomForest™ of a large number of laboratory parameters.In contrast, epidemiological studies are typically conducted based on pre-conceived hypotheses and involve a single or just a few variables.
These methods can be scaled to datasets of any size and therefore provide ways of working with large clinical and epidemiological datasets for the purpose of searching for novel hypotheses that could then lead to further focused investigations.
In our rudimentary approach, which is ultimately running many univariate models in a parallelized fashion, HbA1c was the only variable to reach statistical significance after adjusted for age, ethnicity, and diabetes duration.The relationship between HbA1c and DR has been explored extensively and was selected as the strongest risk factor in every approach we undertook, with a mean decrease accuracy of 31 .94% via RandomForest™.
Our penalized regression and RandomForest™ algorithms also identified an association between elevated systolic blood pressure -but not diastolic-and DR (mean decrease accuracy, 1.9%), again confirming literature (10,(28)(29)(30).Further risk variables identified by both penalized regression and RandomForest™ were renal function tests including BUN, urinary albumin, potassium, osmolality, and urinary nitrate.These confirm the strong association of DR with markers of impaired kidney function.Other known risk factors that contributed higher up in the ranking order include hematocrit (%) and cholesterol.Although HbA1c has the strongest association with DR, our study highlight how the addition of other clinical parameters, e.g., from renal and liver function and hematocrit can increase the sensitivity and specificity of predicting DR outcome, with our final clinical risk model achieving AUC 0 .83 (95% CI: 0 .77-0 .89) (p=0 .00012) (Nagelkerke R 2 0 .33), higher than any traditional diabetes control parameter in isolation or in combination.
The EWAS methodology and our RandomForest™ approach of non-targeted recursive feature also indicates a small contribution from toxins and metals, including 3hydroxyphenanthrene, 9-hydroxyfluorene, phthalates, blood o-Xylene, and blood nitromethane.Therefore, retina may be a target tissue for environmental contamination.Some of the associations provide directions to future mechanistic research in DR.For example, we found that retinal microaneurysms (FDR-adjusted p≤0 .0001), the most statistically significant retinal lesions in individuals with DR, is already correlated with some of the variables such as HbA1c, CRP, BUN, beryllium, and hematocrit, suggesting early effects.In contrast, increased urinary cobalt, triclosan and barium became significant only when adjusted for duration of diabetes.Most of these parameters are also linked to risk of allergies and lung disease, an association that has not been previously explored systematically.
As this is a cross-sectional study, a cause-effect relation cannot be established.Moreover, we are unable to rule out any confounding effects of any unmeasured factors.On the other hand, the main strength of the study is the use of the well characterised NHANES cohort in whom standardised protocols were used to measure laboratory parameters.We are not aware of any other association studies in DR where over 400 laboratory parameters were analysed simultaneously to develop multiple models.As the top variables of all four data driven agnostic models were similar, we also believe our findings are generalisable.

CONCLUSION
We confirm that DR is a complex disease and that the already established risk factors contribute significantly to the risk models of DR, with HbA1c being the strongest risk factor.
Although our model provides an accuracy of approximately 80%, it also provides  Notes: The model was initially trained on all laboratory variables in an unsupervised fashion, with Kappa-based model tuning to select the optimum values for 'mtry' (the ideal number of variables to randomly sample) and 'ntrees' (the ideal number of trees).Only variables contributing >1% mean decrease in accuracy from the initial model were retained, followed by recursive steps to remove low-informative variables.Variables are manually assigned to groups based on similar organ function or other characteristic.The Gini importance measure relates to the 'splitting' criterion that is employed in classification trees, and it is known to be less biased for continuous variables (26), which naturally have more splitting points compared to categorical variables.'Group' is manually curated.Features from RandomForest™ were grouped logically based on similar function or clinical use (Table 4) and then tested independently in a univariate or multivariate regression model against DR outcome.
± The only toxins / metal included was Phosphorus (mmol/L) -others filtered out due to high missingness (>50%), resulting in difficulty fitting model.Penalized regression-selected variables.
Variables were selected from a 100x cross-validated model with α=0.5.Final variable selection was based on coefficients not shrunk to 0. Model accuracy was determined to be 78.4% accuracy and AUC 0.71 (95% CI: 0.65-0.77).

Figure 3. Final clinical risk models.
Features from RandomForest™ were grouped logically based on similar function or clinical use (Table 4) and then tested independently in a univariate or multivariate regression model against DR outcome.A final risk model including markers of hypertension, hypercholesterolemia, renal and liver function tests, and hematocrit achieved AUC 0.84 (0.78-0.9).
mechanistic insights into future research on DR including interrogating the interaction of low-ranking risk factors with more established factors in the models and highlights need to explore epigenetic screens to gauge better how risk factors influence gene expression.Most importantly, the study reinforces the need to control known risk factors of DR, especially hyperglycemia.

Table 2 . Retinal lesions that constitute diabetic retinopathy outcome.
Notes: Lesions are taken from the NHANES ophthalmology -retinal imaging (OPXRET_E) dataset.Only lesions with FDR-adjusted p≤0 .05 are listed.Soft exudate is now termed cotton wool spots.

Table 3 . Laboratory variables associated with retinopathy in individuals with diabetes.
Variables were first tested in an unadjusted / non-covariate adjusted analysis, and then again adjusting for age, ethnicity, and diabetes duration.To provide a broad overview, any variable passing nominal (i.e.prior to FDR-correction) p≤0 .05 from either the non-covariate adjusted or any of the covariate-adjusted analyses are listed.