Research Review: How to interpret associations between polygenic scores, environmental risks, and phenotypes

Background: Genetic influences are ubiquitous, as virtually all phenotypes and most exposures typically classified as environmental have been found to be heritable. A polygenic score summarises the associations between millions of genetic variants and an outcome in a single value for each individual. Ever-decreasing costs have enabled the genotyping of many samples relevant to child psychology and psychiatry research, including cohort studies, leading to a proliferation of polygenic score studies. It is tempting to assume that associations detected between polygenic scores and phenotypes in such studies only reflect genetic effects. However, these associations can reflect many pathways (e.g. via environmental mediation) and biases.

Methods: Here, we provide a comprehensive overview of the many reasons why associations between polygenic scores, environmental exposures, and phenotypes exist. We include formal representations of common analyses in polygenic score studies using structural equation modelling. We derive biases, provide illustrative empirical examples and, when possible, mention steps that can be taken to alleviate these biases.

Results: Structural equation models and derivations show the many complexities arising from jointly modelling polygenic scores with environmental exposures and phenotypes. Counter-intuitive examples include that: (a) associations between polygenic scores and phenotypes may exist even in the absence of direct genetic effects; (b) associations between child polygenic scores and environmental exposures can exist in the absence of evocative/active gene–environment correlations; and (c) adjusting an exposure–outcome association for a polygenic score can increase rather than decrease bias.

Conclusions: Strikingly, using polygenic scores may, in some cases, lead to more bias than not using them. Appropriately conducting and interpreting polygenic score studies thus requires researchers in child psychology and psychiatry and beyond to be versed in both epidemiological and genetic methods, or to build on interdisciplinary collaborations.


General comments
In order to understand bias, the derivations in the supplementary material express the beta from the fitted model as a function of the corresponding true parameter (i.e. the beta from the underlying data-generating model). In each example, we follow these steps:
Step 1: Write the true and fitted model equations.
Step 2: In the fitted model, find the fitted betas as a function of the observed correlations.
Step 3: In the true model, find the observed correlations as a function of the true betas.
Step 4: In the equations from step 2, replace the observed correlations with their expressions derived in step 3, thereby obtaining the fitted betas as a function of the true parameters, e.g. fitted beta = true beta + bias.
Notation throughout:
G* = true additive genetic factor
X* = true exposure
Y* = true outcome
G = observed genetic effects (polygenic score)
X = observed exposure
Y = observed outcome
l = measurement error (loading)
r = correlation
b = fitted beta
β = true beta
e = error in fitted models
ε = error in true models
V = variance
pM = fitted proportion mediated
πM = true proportion mediated

Derivations for the first two examples are presented in detail for readers new to structural equation modelling. The following variance-covariance rules are used throughout:
Cov(aA, bB) = ab Cov(A, B)
Cov(A + B, C) = Cov(A, C) + Cov(B, C)
V(aA + bB) = a²V(A) + b²V(B) + 2ab Cov(A, B)

We assume uncorrelated errors:
• Error terms of different variables are uncorrelated
• The error term of one variable is uncorrelated with other variables
• The error term of an outcome is unrelated to the predictor variable
However, the error term of a given variable (e.g. the residuals of Y on X) remains correlated with that variable (e.g. Y) and cannot be ignored.
The structural equation models in Figures 1 and 2 include error terms and illustrate how error terms are handled: for example, the absence of a double-headed arrow between εX and εY encodes the assumption of uncorrelated errors, while the arrow from εY* to Y* encodes the association between the error term (or residual) and the corresponding variable.

Figure S1 illustrates terms used throughout the article. The terms are borrowed from directed acyclic graphs (DAGs), which can be used to encode causal models and assumptions. In (i) the directed arrow encodes a causal effect of A on B. In (ii) the directed path goes from A to C via B; B is here a mediator, in the sense that the causal effect of A on C happens indirectly via B. In (iii) C directly causes both A and B; C is therefore a confounder of the association between A and B. The path between A and B via C is called a 'backdoor' path. A backdoor path creates an observed association between A and B even in the absence of a causal effect, represented by the absence of a directed arrow between A and B. In (iv) C is a collider, as the arrows from A and B 'collide' in C. In this situation, the path is blocked at C; contrary to the confounder situation, there is no observed association between A and B. However, if C is adjusted for, this creates a spurious association between A and B. In (v) C is a confounder of X and Y and should therefore be adjusted for in order to retrieve the causal effect of X on Y. However, C is also a collider of A and B, so adjusting for C creates a spurious association between A and B, which opens a backdoor path from X to Y via A and B.
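The collider situation in (iv) can be illustrated with a small simulation (hypothetical effect sizes, chosen only for illustration): A and B are independent causes of C, so they are uncorrelated marginally, but adjusting for C induces a spurious (here negative) association between them.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# A and B are independent causes of the collider C (hypothetical effects)
A = rng.standard_normal(n)
B = rng.standard_normal(n)
C = 0.6 * A + 0.6 * B + rng.standard_normal(n)

# Marginal association between A and B: essentially zero
r_marginal = np.corrcoef(A, B)[0, 1]

# "Adjusting for C": regress A and B on C and correlate the residuals
resid_A = A - np.polyfit(C, A, 1)[0] * C
resid_B = B - np.polyfit(C, B, 1)[0] * C
r_adjusted = np.corrcoef(resid_A, resid_B)[0, 1]

# A and B: uncorrelated marginally, negatively correlated given C
print(round(r_marginal, 2), round(r_adjusted, 2))
```

The adjusted correlation is clearly negative even though A and B share no causal connection, which is the spurious association described above.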

Appendix S1. Direct genetic effect with measurement error (Figure 2)
Here the outcome is measured with error, but the polygenic score can also be conceived of as an error-prone measure of the true additive genetic effects of common SNPs. This is encoded in Figure 2 by the arrow from G* to G, with the loading lG.
Step 1. Model equations for the true and fitted model

Model equations for the true model:
Y* = βG*Y* G* + εY*
G = lG G* + εG
Y = lY Y* + εY

Model equations for the fitted model:
Y = bGY G + e

Step 2. Find fitted betas as a function of the observed correlations for the fitted model
We have only two variables, so the standardised regression estimate equals the observed correlation between them: bGY = rGY.
Step 3. Find the observed correlations as a function of the true betas for the true model
Standardised version with all variances equal to 1:
rGY = Cov(G, Y) = Cov(lG G* + εG, lY Y* + εY) = lG lY Cov(G*, Y*) = lG lY βG*Y*
Step 4. Replace observed correlations from step 2 with their expressions derived from step 3
We get:
bGY = rGY = lG lY βG*Y*
The fitted bGY is thus equal to the true beta (βG*Y*) attenuated by two loadings capturing measurement error in Y and G.
The bias is defined as the fitted effect minus the true effect:
Bias = bGY − βG*Y* = (lG lY − 1) βG*Y*
When lY = lG = 1, there is no measurement error and thus no bias. Conversely, if the polygenic score or the outcome is pure noise (lY = 0 or lG = 0), the bias is total and equal to −βG*Y*, leading to the observed association bGY = 0.
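The attenuation bGY = lG lY βG*Y* can be checked with a simulation sketch (hypothetical values lG = 0.6, lY = 0.8, βG*Y* = 0.5, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000

# Hypothetical true parameters
beta, l_G, l_Y = 0.5, 0.6, 0.8

# True (standardised) variables: Y* caused by G*
G_star = rng.standard_normal(n)
Y_star = beta * G_star + np.sqrt(1 - beta**2) * rng.standard_normal(n)

# Observed, error-prone measures (also standardised, with loadings l_G, l_Y)
G = l_G * G_star + np.sqrt(1 - l_G**2) * rng.standard_normal(n)
Y = l_Y * Y_star + np.sqrt(1 - l_Y**2) * rng.standard_normal(n)

# Standardised fitted beta = observed correlation
b_GY = np.corrcoef(G, Y)[0, 1]
print(round(b_GY, 2), l_G * l_Y * beta)  # both ≈ 0.24
```

The fitted coefficient matches the derived value lG lY βG*Y* = 0.24 rather than the true βG*Y* = 0.5.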
Note that from above we have:
bGY² = lG² lY² βG*Y*²
lG² can be approximated by using SNP-heritability (h²SNP), assuming that G* represents the additive genetic effects captured by common SNPs (G* = GSNP), so that βG*Y*² = h²SNP. If we further assume that h²SNP and the variance explained by the polygenic score are both affected in the same way by measurement error in the outcome (i.e. concretely, assuming equal measurement error in Y in the h²SNP study and in the polygenic score study), then the lY² terms cancel and we obtain the path from GSNP to Y:
bGY = lG lY √h²SNP
and
lG² ≈ R²PGS / h²SNP
where R²PGS is the variance in the observed Y explained by the polygenic score and h²SNP is estimated on the same observed Y. Note that the equality G* = GSNP does not fully hold, as SNP-heritability is not fitted on all SNPs and does not capture the total heritability. Hence the resulting reliability refers only to the reliability of the polygenic score as a measure of SNP-based genetic effects rather than of total heritability.
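As a worked example with hypothetical numbers (not drawn from any particular study): if a polygenic score explains 5% of the variance in Y and the SNP-heritability of Y is 20%, the approximation above gives the reliability of the polygenic score as:

```python
# Hypothetical values, for illustration only
r2_pgs = 0.05   # variance in observed Y explained by the polygenic score
h2_snp = 0.20   # SNP-heritability of Y, estimated on the same observed Y

l_G_sq = r2_pgs / h2_snp   # reliability l_G² ≈ R²PGS / h²SNP
l_G = l_G_sq ** 0.5        # loading l_G
print(round(l_G_sq, 2), round(l_G, 2))  # 0.25 0.5
```

So even a polygenic score explaining a seemingly respectable share of the variance can be a rather noisy measure of the SNP-based genetic factor (here lG = 0.5).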

Appendix S2. Exposure model (Figure 3)
We consider measurement error only in G, i.e. we assume X* and Y* are measured without error:

Step 1. Model equations for the true and fitted model
Model equations for the true model:
X* = βG*X* G* + εX*
Y* = βX*Y* X* + βG*Y* G* + εY*
G = lG G* + εG
Variance required for the derivation (standardised variables):
V(X*) = βG*X*² + V(εX*) = 1
Equations for the fitted model (in this case we keep X* and Y*, assuming we have observed them, in discrepancy with the actual fitted model in Figure 3b):
X* = bGX* G + eX*
Y* = bX*Y* X* + bGY* G + eY*
and the variance of X* in the fitted model is:
V(X*) = bGX*² + V(eX*)

Step 2. Find fitted betas as a function of the observed correlations in the fitted model
Calculate correlations in the fitted model. Taking the covariance of each fitted-model equation with G and X* (standardised variables) gives:
rGX* = bGX*   (1)
rGY* = bGY* + bX*Y* rGX*   (2)
rX*Y* = bX*Y* + bGY* rGX*   (3)
Next, we express betas as a function of observed correlations. From equation (2) we can retrieve:
bGY* = rGY* − bX*Y* rGX*
and we can substitute this expression in equation (3), where we further include equation (1). So that:
bX*Y* = (rX*Y* − rGY* rGX*) / (1 − rGX*²)
We can now turn to expressing bGY* as a function of observed correlations. Substituting the expression for bX*Y* back into equation (2), we get:
bGY* = (rGY* − rX*Y* rGX*) / (1 − rGX*²)
Step 3. Find the observed correlations as a function of the true betas for the true model
Taking covariances in the true model (standardised variables) gives:
rGX* = lG βG*X*
rGY* = lG (βG*Y* + βX*Y* βG*X*)
rX*Y* = βX*Y* + βG*X* βG*Y*

Step 4. Replace observed correlations from step 2 with their expressions derived from step 3

We can now rewrite the fitted betas as:
bX*Y* = βX*Y* + βG*X* βG*Y* (1 − lG²) / (1 − lG² βG*X*²)
and
bGY* = lG βG*Y* (1 − βG*X*²) / (1 − lG² βG*X*²)
The fitted bX*Y* is therefore the true βX*Y* plus the true genetic confounding βG*X*βG*Y* scaled by the measurement error of G (1 − lG²). This makes sense: the fitted beta should be larger than the true beta, as genetic confounding has not been adjusted for entirely, to the extent that G is a noisy measure of G*.
The bias, BiasX*Y* = βG*X* βG*Y* (1 − lG²) / (1 − lG² βG*X*²), can thus be quite large if the reliability (lG²) is low. Lastly, we have:
bGX* = lG βG*X*
The fitted mediated effect bM using the polygenic score is:
bM = bGX* bX*Y* = lG βG*X* (βX*Y* + BiasX*Y*) = lG βM + lG βG*X* BiasX*Y*
where βM = βG*X* βX*Y* is the true mediated effect. The fitted mediated effect is therefore the true effect scaled by the loading of G (lG βM), plus an additional term corresponding to an additional "mediation" path via BiasX*Y*. This is because the fitted bX*Y* is under-corrected for genetic confounding; this under-correction exaggerates the path bX*Y*, which in turn exaggerates the fitted mediation.

In terms of proportion:
The true proportion mediated is:
πM = βM / (βM + βG*Y*)
and the fitted proportion mediated is:
pM = bM / (bM + bGY*) = πM + βG*X*² βG*Y* (1 − lG²) / [(1 − lG² βG*X*²)(βM + βG*Y*)]
using bM + bGY* = lG (βM + βG*Y*), which follows from the expressions above. Note that the second term is always positive when G is a polygenic score for Y, as shown below. We have:
rG*Y* = βM + βG*Y*
so the term can be rewritten as:
βG*X*² βG*Y* (1 − lG²) / [(1 − lG² βG*X*²) rG*Y*]
where rG*Y* is the square root of the heritability of Y, which is positive, and βG*Y* is the direct path, which is either 0 if the effect is totally mediated by X* or positive in the case of partial mediation. Both terms (1 − lG²) and (1 − lG² βG*X*²) are > 0 and < 1.
There will thus be either no bias, in the case where X* is a total mediator of the true genetic effects (i.e. βG*Y* = 0), or, which is much more likely, the proportion of the polygenic score effect on Y* mediated by X* will be overestimated compared with the true proportion of additive genetic effects on Y* mediated by X*.
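Both results of this appendix (the under-corrected bX*Y* and the inflated proportion mediated) can be illustrated with a simulation sketch. All parameter values are hypothetical: βG*X* = 0.5, βX*Y* = 0.4, βG*Y* = 0.3 and a polygenic score with loading lG = 0.6, so the true proportion mediated is 0.2/0.5 = 0.4.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000

# Hypothetical true parameters (G* and X* standardised)
b_gx, b_xy, b_gy, l_G = 0.5, 0.4, 0.3, 0.6

G_star = rng.standard_normal(n)
X_star = b_gx * G_star + np.sqrt(1 - b_gx**2) * rng.standard_normal(n)
Y_star = b_xy * X_star + b_gy * G_star + rng.standard_normal(n)
G = l_G * G_star + np.sqrt(1 - l_G**2) * rng.standard_normal(n)  # noisy PGS

# Fitted mediation model using the noisy polygenic score G
b_fit_gx = np.corrcoef(G, X_star)[0, 1]                  # G -> X* path
XG = np.column_stack([X_star, G])
b_fit_xy, b_fit_gy = np.linalg.lstsq(XG, Y_star, rcond=None)[0]

# Proportion mediated: fitted vs true
p_fitted = b_fit_gx * b_fit_xy / (b_fit_gx * b_fit_xy + b_fit_gy)
p_true = b_gx * b_xy / (b_gx * b_xy + b_gy)
print(round(p_fitted, 2), round(p_true, 2))  # fitted (~0.51) exceeds true (0.4)
```

The fitted proportion mediated clearly exceeds the true one, consistent with the positive bias term derived above.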
The following derivations are conducted as above but in less detail.

Figure S2. Collider bias including measurement errors
Step 1. Model equations for the true and fitted model
For the fitted model, we have the following equations:
and for the true model, we have the following equations:
The required variance of X* is:

Steps 2 and 3. Find the fitted betas as a function of the observed correlations, and the observed correlations as a function of the true betas
and
which when standardised corresponds to:
and
which when standardised corresponds to:

Step 4. Replace observed correlations from step 2 with the corresponding expressions derived from step 3
We can now rewrite the fitted betas as:
If we now assume no measurement error (lG = lX = lY = 1), we find the expression first presented in the manuscript for collider bias, i.e.:
If we assume only perfect measurement of X* and Y* (lX = lY = 1), so that bXY becomes bX*Y*, we have:
We see that when the polygenic score imperfectly captures the heritability, the fitted bXY is equal to the true βX*Y* plus additional bias terms that are more complex. To explain them, let us first assume that the polygenic score does not measure anything, i.e. lG = 0. We then have:
That is, the fitted beta is equal to the true beta, plus the confounding effect via U* and the genetic confounding, which is reintroduced in full because the polygenic score does not adjust for genetic confounding at all. The collider bias disappears as, in effect, there is no adjustment.
If the polygenic score measures some of the liability, then 0 < lG < 1, and we get two terms adding bias to βX*Y*. The first is:
Here, a higher lG leads to a smaller denominator, which increases the collider bias via U*. This is important, as it means that the bias is high when G* explains more of X* (e.g. a highly heritable X*), and that better polygenic scores will better capture the association between G* and X* and thereby increase collider bias.

The second term is:
This means that when lG < 1, genetic confounding is reintroduced to some extent. An increasing lG reduces this reintroduced genetic confounding, and lG = 1 eliminates it entirely.
Overall, more accurate polygenic scores will thus remove confounding more efficiently but also increase collider bias.
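This trade-off can be illustrated with a simulation sketch. All parameters are hypothetical: X* has a genetic cause G* and an environmental cause U*, both of which also affect Y*, and the true causal effect of X* on Y* is 0.3 (measurement error in X and Y is ignored here for simplicity). We then adjust the exposure–outcome regression for polygenic scores of increasing accuracy (lG = 0, 0.6, 1).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000

# Hypothetical true parameters: both G* and U* cause X and Y;
# the true causal effect of X on Y is b_xy = 0.3
b_gx, b_ux, b_xy, b_gy, b_uy = 0.5, 0.5, 0.3, 0.1, 0.4

G_star = rng.standard_normal(n)
U_star = rng.standard_normal(n)
X = (b_gx * G_star + b_ux * U_star
     + np.sqrt(1 - b_gx**2 - b_ux**2) * rng.standard_normal(n))
Y = b_xy * X + b_gy * G_star + b_uy * U_star + rng.standard_normal(n)

results = {}
for l_G in (0.0, 0.6, 1.0):  # increasingly accurate polygenic scores
    G = l_G * G_star + np.sqrt(1 - l_G**2) * rng.standard_normal(n)
    preds = np.column_stack([X, G])
    b_fit = np.linalg.lstsq(preds, Y, rcond=None)[0][0]  # fitted X -> Y beta
    results[l_G] = b_fit
    print(l_G, round(b_fit, 3))
```

In this particular parameterisation, the fitted exposure effect moves further away from the true 0.3 as the polygenic score becomes more accurate: the added collider bias via U* outweighs the genetic confounding that the adjustment removes, echoing the point that using a polygenic score can, in some cases, lead to more bias than not using one.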
We further have:
Note that if measurement is perfect (lG = lY = lX = 1), the expression simplifies to the expression for bG*Y* presented in the manuscript. Here, if lG = 0, meaning that the polygenic score is entirely unreliable, the entire expression is null, i.e. bGY = 0, which is logical as the polygenic score is then not associated with Y. As lG increases, the numerator increases and the denominator decreases, leading to an increased bGY. Measurement error in the outcome decreases bGY.
Note the importance of measurement error in the exposure, lX². If the measurement of the exposure is not perfect, i.e. lX < 1, there is an additional bias term, which arises because the effect of X* is not entirely adjusted for. This means that even if the genetic effects on Y were entirely explained by X*, there would be a residual correlation between the polygenic score and the outcome. The third term corresponds to the collider bias via U*:
This term is also scaled by lX², meaning that measurement error in the exposure weakens the collider effect. Measurement error in different variables can thus affect bGY in opposite directions, making it difficult to assess whether bGY will be over- or underestimated. For example, assuming all betas are positive, increased measurement error in G (lower lG) will decrease bGY, whereas increased measurement error in X (lower lX) will increase bGY.