Active inference under intersensory conflict: Simulation and empirical results

It has been suggested that the brain controls hand movements via internal models that rely on visual and proprioceptive cues about the state of the hand. In active inference formulations of such models, the relative influence of each modality on action and perception is determined by how precise (reliable) it is expected to be. The ‘top-down’ affordance of expected precision to a particular sensory modality presumably corresponds to attention. Here, we asked whether increasing attention to (i.e., the precision of) vision or proprioception would enhance performance in a hand-target phase matching task, in which visual and proprioceptive cues about hand posture were incongruent. We show that in a simple simulated agent, using a neurobiologically informed predictive coding formulation of active inference, increasing either modality’s expected precision improved task performance under visuo-proprioceptive conflict. Moreover, we show that this formulation captured the behaviour and self-reported attentional allocation of human participants performing the same task in a virtual reality environment. Together, our results show that selective attention can balance the impact of (conflicting) visual and proprioceptive cues on action, rendering attention a key mechanism for a flexible body representation for action.

Author summary
When controlling hand movements, the brain can rely on seen and felt hand position or posture information. It is thought that the brain combines these estimates into a multisensory hand representation in a probabilistic fashion, accounting for how reliable each estimate is in the given context. According to recent formal accounts of action, the expected reliability or ‘precision’ of sensory information can, to an extent, also be influenced by attention. Here, we tested whether this mechanism can improve goal-directed behaviour.
We designed a task that required tracking a target’s oscillatory phase with either the seen or the felt hand posture, which were decoupled by introducing a temporal conflict via a virtual reality environment. We first simulated the behaviour of an artificial agent performing this task, and then compared the simulation results to behaviour of human participants performing the same task. Together, our results showed that increasing attention to the seen or felt hand was accompanied by improved target tracking. This suggests that, depending on the current behavioural demands, attention can balance how strongly the multisensory hand representation is relying on visual or proprioceptive sensory information.


Introduction
[...] in order to comply with task instructions, would adopt an 'attentional set' (Posner et al., 1976; 1978; cf. Rohe & Noppeney, 2018) prioritizing the respective instructed target tracking modality over the task-irrelevant one. In other words, the instructed tracking or response modality should become "situationally dominant" by attentional allocation (Kelso et al., 1975; cf. Warren & Cleaver, 2001; Redding et al., 1985).
Figure 1. Task design and behavioural requirements. We used the same task design in the simulated and behavioural experiments, focusing on the effects of attentional modulation on hand-target phase matching via (near-)stationary, prototypical (i.e., well-trained) oscillatory grasping movements at 0.5 Hz. Participants (or the simulated agent) controlled a virtual hand model (seen on a computer screen) via a data glove worn on their unseen right hand. The virtual hand (VH) therefore represented seen hand posture (i.e., vision), which could be uncoupled from the real hand posture (RH; i.e., proprioception) by introducing a temporal delay (see below). The task required matching the phase of one's right-hand grasping movements to the oscillatory phase of the fixation dot ('target'), which was shrinking-and-growing sinusoidally at 0.5 Hz. In other words, participants had to rhythmically close the hand when the dot shrank and to open it when the dot expanded. Our design was a balanced [...] (i.e., the fixation dot's size change), only one of the hands' (virtual or real) movements could be phase-matched to the target in the incongruent conditions, necessarily implying a phase mismatch of the other hand's movements. In the VH incong condition, participants had to adjust their movements to counteract the visual lag; i.e., they had to phase-match the virtual hand's movements (i.e., vision) to the target by shifting their real hand's movements (i.e., proprioception) out of phase with the target.
Conversely, in the RH incong condition, participants had to match their real hand's movements (i.e., proprioception) to the target's oscillation, and therefore had to ignore the fact that the virtual hand (i.e., vision) was out of phase. The curves show the performance of an ideal participant (or simulated agent).
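The phase relations described in this caption can be made concrete with a small sketch. It assumes the 0.5 Hz target and the 500 ms visual delay reported in the task design; at 0.5 Hz the period is 2 s, so this delay is a quarter cycle (90°). The sampling step, the trial length, and all function names are assumptions made for illustration; this is not the analysis code used in the study. The lag is recovered by projecting each signal onto a sine and a cosine at the known target frequency.

```python
import math

FREQ_HZ = 0.5     # target oscillation frequency (from the task description)
DELAY_S = 0.5     # visual delay in incongruent trials (500 ms)
DT = 0.01         # sampling step in seconds (assumption for this sketch)
N_SAMPLES = 1000  # 10 s = 5 full target cycles (assumption for this sketch)


def oscillation(t, shift_s=0.0):
    """Sinusoid at FREQ_HZ, shifted later in time by shift_s seconds."""
    return math.sin(2.0 * math.pi * FREQ_HZ * (t - shift_s))


def phase_lag_seconds(samples, times):
    """Lag (s) of a FREQ_HZ oscillation relative to the target sinusoid.

    For x(t) = sin(w (t - lag)), the projection onto sin(w t) is
    proportional to cos(w * lag) and the projection onto cos(w t) is
    proportional to -sin(w * lag). Positive values mean the signal trails
    the target. Unambiguous only for lags within half a period.
    """
    w = 2.0 * math.pi * FREQ_HZ
    a = sum(x * math.sin(w * t) for x, t in zip(samples, times))
    b = sum(x * math.cos(w * t) for x, t in zip(samples, times))
    return math.atan2(-b, a) / w


def ideal_lags(task):
    """Lags (real hand, virtual hand) of an idealised performer in an
    incongruent trial; task is 'VH' or 'RH' (the instructed hand)."""
    # To align the delayed virtual hand, the real hand must lead the target.
    real_shift = -DELAY_S if task == "VH" else 0.0
    times = [i * DT for i in range(N_SAMPLES)]
    real = [oscillation(t, real_shift) for t in times]
    virtual = [oscillation(t, real_shift + DELAY_S) for t in times]
    return phase_lag_seconds(real, times), phase_lag_seconds(virtual, times)


for task in ("VH", "RH"):
    real_lag, virtual_lag = ideal_lags(task)
    print(task, round(real_lag, 3), round(virtual_lag, 3))
```

In this idealised setting, one hand always sits at zero lag and the other is offset by the full delay, which is the trade-off the incongruent conditions impose.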

Results
The simulated agent had to match the phasic size change of a central fixation dot (target) with the grasping movements of the unseen real hand (proprioceptive hand information) or the seen virtual hand (visual hand information). Under visuo-proprioceptive conflict (i.e., a phase shift between virtual and real hand movements introduced via temporal delay), only one of the hands could be aligned with the target's oscillatory phase (see Fig. 1 for a detailed task description). The aim of our numerical analyses or simulations was to test whether, in the above manual phase matching task under perceived intersensory conflicts, increasing the expected precision of sensory prediction errors from the instructed modality (vision or proprioception) would result in improved task performance, whereas increasing the precision of prediction errors from the 'distractor' modality would impair it. Such a result would demonstrate that in an active inference scheme, behaviour under intersensory conflict can be augmented via top-down precision control; i.e., selective attention (cf. Feldman & Friston, 2010; Edwards et al., 2012; Brown et al., 2013).
Figures 2-3 show the results of these simulations, in which an active inference agent performed the target matching task under the two kinds of instruction (virtual hand or real hand task; i.e., the agent had a strong prior belief that the visual or proprioceptive hand posture would track the target's [...] incongruence was realized by temporally delaying the virtual hand's movements with respect to the real hand). In this setup, the virtual hand corresponds to hidden states generating visual input, while the real hand generates proprioceptive input.
[...] adjusted model beliefs about visuo-proprioceptive congruence; i.e., in the congruent tasks, the agent believed that its real hand generated matching seen and felt postures, whereas it believed that the same hand generated mismatching postures in the incongruent tasks. Each pair of plots shows the simulation results for one grasping movement in the VH and RH tasks under congruence or incongruence; the left plot shows the predicted sensory input (solid coloured lines; yellow = target, red = vision, blue = proprioception) and the true, real-world values (broken black lines) for the target and the visual and proprioceptive hand posture, alongside the respective sensory prediction errors (dotted coloured lines; blue = target, green = vision, purple = proprioception); the right plot (blue line) shows the agent's action (i.e., the rate of change in hand posture, see Methods). Note that target phase matching is near perfect and there is practically no sensory prediction error (i.e., the dotted lines stay around 0).
Under congruent mapping (i.e., in the absence of visuo-proprioceptive conflict) the simulated agent showed near perfect tracking performance (Fig. 2). We next simulated an agent performing the task under incongruent mapping, while equipped with the prior belief that its seen and felt hand postures were in fact unrelated, i.e., never matched. Not surprisingly, the agent easily followed the task instructions and again showed near perfect tracking with vision or proprioception, under incongruence (Fig. 2). However, as noted above, it is reasonable to assume that human participants would have the strong prior belief (based upon life-long learning and association) that their manual actions generated matching seen and felt postures (i.e., a prior belief that modality specific sensory consequences have a common cause). Our study design assumed that this association would be very hard to update, and that consequently performance could only be altered via adjusting expected precision of vision vs proprioception (see Methods).
Therefore, we next simulated the behaviour (during the incongruent tasks) of an agent embodying a prior belief that visual and proprioceptive cues about hand state were in fact congruent. As shown in [...] states of vision and proprioception, resulting in elevated prediction error signals. The agent was still able to follow the task instructions, i.e., to keep the (instructed) virtual or real hand more closely matched to the target's oscillatory phase, but showed a drop in performance compared with the 'ideal' agent (cf. Fig. 2).
We then simulated the effect of our experimental manipulation, i.e., of increasing precision of sensory prediction errors from the respective task-relevant (constituting increased attention) or task-irrelevant (constituting increased distraction) modality on task performance.
We expected this manipulation to affect behaviour; namely by how strongly the respective prediction errors would impact model belief updating and subsequent performance (i.e., action). The key result of these simulations (Fig. 3A) was that increasing the log precision of vision or proprioception (the respective instructed tracking modality) resulted in reduced visual or proprioceptive prediction errors. This can be explained by the fact that these 'attended' prediction errors were now more strongly suppressed by model belief updating and action. Conversely, one can see a complementary increase of prediction errors from the [...]
Importantly, the above 'attentional' alterations substantially influenced hand-target phase matching performance (Fig. 3B). Thus, increasing the precision of the instructed task-relevant sensory modality's prediction errors led to improved target tracking (i.e., a reduced phase shift of the instructed modality's grasping movements from the target's phase). In other words, if the agent attended to the instructed visual (or proprioceptive) cues more strongly, its movements were driven more strongly by vision (or proprioception), which helped it to track the target's oscillatory phase with the respective modality's grasping movements. Correspondingly, increasing the precision of the 'irrelevant' (not instructed) modality in each case led to worse simulated tracking performance.
The simulations also show that the amount of action itself was comparable across conditions (blue plots in Figs. 2-3; i.e., movement of the hand around the mean stationary value of 0.05), which means that the kinematics of the hand movement per se were not biased by attention. Action was particularly evident in the initiation phase of the movement and after reversal of movement direction (open-to-close).
At the point of reversal of movement direction, conversely, there was a moment of stagnation; i.e., changes in hand state were temporarily suspended (with action nearly returning to zero). In our simulated agent, this briefly increased uncertainty about hand state (i.e., which direction the hand was moving), resulting in a slight lag before the agent picked up its movement again, which one can see reflected by a small 'bump' in the true hand states (Figs. 2-3). These effects were somewhat more pronounced during movement under visuo-proprioceptive incongruence and prior belief in congruence, which indicates that the fluency of action depended on sensory uncertainty.
In sum, these results show that attentional effects of the sort we hoped to see can be recovered using a simple active inference scheme; in that precision control determined the influence of separate sensory modalities (each of which was generated by the same cause, i.e., the same hand) on behaviour by biasing action towards cues from that modality.
[...] Fig. 2. Note that, in these results, one can see a clear divergence of true from predicted visual and proprioceptive postures, and correspondingly increased prediction errors. The top row shows the simulation results for the default weighting of visual and proprioceptive information; the middle row shows the same agent's behaviour when precision of the respective task-relevant modality (i.e., vision in the VH task and proprioception in the RH task) was increased (HA: high attention); the bottom row shows the analogous results when the precision of the respective other, irrelevant modality was increased (HD: high distraction). Note how in each case, increasing (or decreasing) the log precision of vision or proprioception resulted in an attenuation (or enhancement) of the associated prediction errors (indicated by green and purple arrows for vision and proprioception, respectively).
Crucially, these 'attentional' effects had an impact on task performance, as evident by an improved hand-target tracking with vision or proprioception, respectively. This is shown in panel (B): The curves show the tracking in the HA conditions. The bar plots represent the average deviation (phase shift or lag, in seconds) of the real hand's (red) or the virtual hand's (blue) grasping movements from the target's oscillatory size change in each of the simulations shown in panel (A). Note that under incongruence (i.e., a constant delay of vision), reducing the phase shift of one modality always implied increasing the phase shift of the other modality (reflected by a shift of red and blue bars representing the average proprioceptive and visual phase shift, respectively). Crucially, in both RH and VH incong conditions, increasing attention (HA; i.e., in terms of predictive coding: the precision afforded to the respective prediction errors) to the task-relevant modality enhanced task performance (relative to the default setting, Def.), as evident by a reduced phase shift of the respective modality from the target phase. The converse effect was observed when the agent was 'distracted' (HD) by paying attention to the respective task-irrelevant modality.
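The precision-weighting logic behind these simulation results can be illustrated with a deliberately minimal sketch. It is not the hierarchical active inference scheme used for the simulations above: it assumes a single static belief about hand posture, two conflicting cues with Gaussian noise, and a belief in a common cause, so that the belief settles on the precision-weighted average of the cues. The cue values, log precisions, update rate, and function names are all hypothetical.

```python
import math


def settle_belief(y_vision, y_proprio, log_pi_vision, log_pi_proprio,
                  rate=0.05, steps=500):
    """Gradient descent of a single belief `mu` about hand posture on
    precision-weighted prediction errors from two conflicting cues."""
    pi_v = math.exp(log_pi_vision)     # expected precision of vision
    pi_p = math.exp(log_pi_proprio)    # expected precision of proprioception
    mu = 0.5 * (y_vision + y_proprio)  # start midway between the cues
    for _ in range(steps):
        err_v = y_vision - mu          # visual prediction error
        err_p = y_proprio - mu         # proprioceptive prediction error
        # Each error moves the belief in proportion to its precision.
        mu += rate * (pi_v * err_v + pi_p * err_p)
    return mu


def fused_estimate(y_vision, y_proprio, log_pi_vision, log_pi_proprio):
    """Closed-form fixed point: the precision-weighted average of the cues."""
    pi_v = math.exp(log_pi_vision)
    pi_p = math.exp(log_pi_proprio)
    return (pi_v * y_vision + pi_p * y_proprio) / (pi_v + pi_p)


# Hypothetical conflicting cues: vision signals 1.0, proprioception 0.0.
default = settle_belief(1.0, 0.0, 0.0, 0.0)
attend_vision = settle_belief(1.0, 0.0, 2.0, 0.0)
attend_proprio = settle_belief(1.0, 0.0, 0.0, 2.0)
print(default, attend_vision, attend_proprio)
```

Raising the log precision of one cue pulls the settled belief towards that cue and leaves a smaller residual prediction error for it, at the expense of a larger error for the other cue. This mirrors, in a much reduced form, the complementary changes in prediction error described for the HA and HD settings.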

We first analysed the post-experiment questionnaire ratings of our participants (Fig. 4) to the following two questions: "How difficult did you find the task to perform in the following conditions?" (Q1, answered on a 7-point visual analogue scale from "very easy" to "very difficult") and "On which hand did you focus your attention while performing the task?" (Q2, answered on a 7-point visual analogue scale from "I focused on my real hand" to "I focused on the virtual hand"). For the ratings of Q1, a Friedman's test revealed a significant difference between conditions (χ²(3,69) = 47.19, p < 0.001). Post-hoc comparisons using Wilcoxon's signed rank test showed that, as expected, participants reported finding both tasks more difficult under visuo-proprioceptive incongruence (VH incong > VH cong, z(23) = 4.14, p < 0.001; RH incong > RH cong, z(23) = 3.13, p < 0.01). There was no significant difference in reported difficulty between VH cong and RH cong, but the VH incong condition was perceived as significantly more difficult than the RH incong condition (z(23) = 2.52, p < 0.05). These results suggest that, per default, the virtual hand and the real hand instructions were perceived as equally difficult to comply with, and that in both cases the added incongruence increased task difficulty, more strongly so when (artificially shifted) vision needed to be aligned with the target's phase.
For the ratings of Q2, a Friedman's test revealed a significant difference between conditions (χ²(3,69) = 35.83, p < 0.001). Post-hoc comparisons using Wilcoxon's signed rank test showed that, as expected, participants focused more strongly on the virtual hand during the virtual hand task and more strongly on the real hand during the real hand task. This was the case for congruent (VH cong > RH cong, z(23) = 3.65, p < 0.001) and incongruent (VH incong > RH incong, z(23) = 4.03, p < 0.001) movement trials.
There were no significant differences between VH cong vs VH incong, and RH cong vs RH incong, respectively. These results show that participants focused their attention on the instructed target modality, irrespective of whether the current movement block was congruent or incongruent. This supports our assumption that participants would adopt a specific attentional set to prioritize the instructed target modality.
Figure 4. Self-reports of task difficulty and intersensory attentional focus given by our participants. The bar plots show the mean ratings for Q1 and Q2 (given on a 7-point visual analogue scale), with associated standard errors of the mean. On average, participants found the VH and RH task more difficult under visuo-proprioceptive incongruence, more strongly so when artificially shifted vision needed to be aligned with the target's phase (VH incong, Q1). Importantly, the average ratings of Q2 showed that participants attended to the instructed modality (irrespective of whether the movements of the virtual hand and the real hand were congruent or incongruent).
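For readers unfamiliar with the Friedman test used for the questionnaire ratings, the following sketch computes its chi-square statistic from a participants-by-conditions table of ratings. It is a plain textbook version without tie correction, the ratings shown are made up for illustration, and the function names are assumptions; it does not reproduce the reported analysis.

```python
def average_ranks(values):
    """Rank values from 1 to k, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for m in range(i, j + 1):
            ranks[order[m]] = mean_rank
        i = j + 1
    return ranks


def friedman_statistic(table):
    """Friedman chi-square (no tie correction) and its degrees of freedom.

    `table` has one row per participant and one column per condition.
    """
    n = len(table)
    k = len(table[0])
    rank_sums = [0.0] * k
    for row in table:
        # Ranks are computed within each participant, across conditions.
        for c, r in enumerate(average_ranks(row)):
            rank_sums[c] += r
    chi2 = (12.0 / (n * k * (k + 1))) * sum(r * r for r in rank_sums) \
        - 3.0 * n * (k + 1)
    return chi2, k - 1


# Made-up ratings (1-7) for four conditions:
# VH cong, VH incong, RH cong, RH incong.
ratings = [
    [2, 6, 2, 4],
    [1, 5, 2, 4],
    [3, 7, 2, 5],
]
print(friedman_statistic(ratings))
```

With four conditions the statistic has three degrees of freedom, matching the first value in the parentheses of the reported test results.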
Next, we analysed the task performance of our participants; i.e., how well the virtual (or real) hand's grasping movements were phase-matched to the target's oscillation (i.e., the fixation dot's size change) in each condition. Note that under incongruence, better target phase-matching with the virtual hand implies a worse alignment of the real hand's phase with the target, and vice versa. As predicted (Fig. 1; and as confirmed by the simulation results, Figs. 2-3), we expected an interaction between task and congruence: participants should show a better target phase-matching of the virtual hand under visuo-proprioceptive incongruence, if the virtual hand was the instructed target modality (but no such difference should be significant in the congruent movement trials, since virtual and real hand movements were identical in these trials). All of our participants were well trained (see Methods), therefore our task focused on average performance benefits from attention (rather than learning or adaptation effects).
The participants' average tracking performance is shown in Figure 5. A repeated-measures ANOVA on virtual hand-target phase-matching revealed significant main effects of task (F(1,22) = 31.69, p = 0.00001) and congruence (F(1,22) = 173.42, p = 3.38e-12) and, more importantly, a significant interaction between task and congruence (F(1,22) = 50.69, p = 0.0000003). Post-hoc t-tests confirmed that there was no significant difference between the VH cong and RH cong conditions (t(23) = 1.19, p = 0.25), but a significant difference between the VH incong and RH incong conditions (t(23) = 6.59, p = 0.000001). In other words, in incongruent conditions participants aligned the phase of the virtual hand's movements significantly better with the dot's phasic size change when given the 'virtual hand' than the 'real hand' instruction.
Furthermore, while the phase shift of the real hand's movements was larger during VH incong > VH cong (t(23) = 9.37, p = 0.000000003; corresponding to the smaller phase shift, and therefore better target phase-matching, of the virtual hand in these conditions), participants also exhibited a significantly larger shift of their real hand's movements during RH incong > RH cong (t(23) = 4.31, p = 0.0003). Together, these results show that participants allocated their attentional resources to the respective instructed modality (vision or proprioception), and that this was accompanied by significantly better target tracking in each case, as expected based on the active inference formulation, [...]
[...] as thin lines. In the congruent conditions, the virtual hand's and the real hand's movements were identical, whereas the virtual hand's movements were delayed by 500 ms in the incongruent conditions. Right: The bar plot shows the corresponding average deviation (lag in seconds) of the real hand (red) and the virtual hand (blue) from the target in each condition, with associated standard errors of the mean. Crucially, there was a significant interaction effect between task and congruence; participants aligned the virtual hand's movements better with the target's oscillation in the VH incong > RH incong condition (and correspondingly, the real hand's movements in the RH incong > VH incong condition), in the absence of a significant difference between the congruent conditions. Bonferroni-corrected significance: **p < 0.01, ***p < 0.001.

Discussion
We have shown that behaviour in a manual hand-target phase matching task, under visuo-proprioceptive conflict, benefits from adjusting the balance of visual versus proprioceptive precision by increased attention to either task-relevant modality. Our results generally support a predictive coding formulation of active inference, where visual and proprioceptive cues affect multimodal beliefs that drive action, depending on the relative precision afforded to each modality (Brown et al., 2013).
Firstly, a simulated agent exhibited better phase matching when the expected sensory precision of the instructed 'task-relevant' modality (i.e., attention to vision or proprioception) was increased relative to the 'task-irrelevant' modality. This effect was reversed when attention was increased to the 'task-irrelevant' modality, effectively corresponding to cross-modal distraction. These results suggest that more precise sensory prediction errors have a greater impact on belief updating, which in turn guides goal-directed action. Our simulations also suggested that intersensory conflict (and its possible partial resolution) was based on a prior belief that one's hand movements generate matching visual and proprioceptive sensations. In an agent holding the unrealistic belief that visual and proprioceptive postures are per default unrelated, no evidence for an influence of intersensory conflict on target tracking was observed. Secondly, the self-report ratings of attentional allocation and the behaviour exhibited by human participants performing the same task, in a virtual reality environment, suggested an analogous mechanism: Our participants reported shifting their attention to the respective instructed modality (vision or proprioception), and they were able to correspondingly align either vision or proprioception with an abstract target (oscillatory phase) under intersensory conflict.
Together, our results suggest a tight link between precision control, attention, and multisensory integration in action, conforming to the principles of reciprocal message passing under hierarchical predictive coding for active inference, whereby the brain can choose how much to rely on specific sensory cues, and how strongly to resolve these prediction errors by action, in a given context.
Previous work on causal inference suggests that Bayes-optimal cue integration can explain a variety of multisensory phenomena under intersensory conflict; including the recalibration of the less precise modality onto the more precise one (van Beers et al., 1999; Deneve et al., 2001; Ernst & Banks, 2002; Ernst, 2007; Ma & Pouget, 2008; Hospedales & Vijayakumar, 2008; Kayser & Shams, 2015; Samad et al., 2015; Rohe & Noppeney, 2016). Our work advances on these findings by showing that inferring the precision of two conflicting sources of bodily information (i.e., seen or felt hand posture, which were expected to be congruent based on fixed prior model beliefs in a common cause) enhances the accuracy of goal-directed action (i.e., target tracking) with the respective 'attended' modality. [...] 'unattended' one. In other words, we showed an interaction between sensory attention and (instructed) behavioural goals in a design that allowed the agent (or participant) to actively change sensory stimuli. In short, this work goes beyond modelling perceptual inference to consider active inference, where the consequences of action affect perceptual inference. In other words, by moving beyond perceptual inference, we were able to model the optimisation of precision (that underwrites multisensory integration) and relate this to sensory attention and attenuation. Our results therefore extend previous models of sensorimotor control (Körding et al., 2007; Vijayakumar et al., 2011) to address attentional effects on action.
Moreover, our results complement previous simulations of oculomotor control within the active inference scheme (Perrinet et al., 2014; Adams et al., 2015, 2016), which use the same implementation of active inference that, in contrast to most work in this area, commits to a neurobiologically plausible implementation scheme. Unlike most normative models based upon causal inference, this means that, in principle, the current model can be validated in relation to evoked neuronal responses.
More generally, our results support the notion that an endogenous attentional 'set' (Posner et al., 1978) can influence the precision afforded to vision or proprioception during action, and thus prioritize either modality for a current behavioural context. Several studies have shown that visuo-proprioceptive recalibration is context dependent in that either vision or proprioception may be the 'dominant' modality, with corresponding recalibration of the 'non-dominant' modality (Warren & Cleaves, 1971; Kelso et al., 1975; Posner et al., 1976; Redding et al., 1985; Foulkes & Miall, 2000; Ingram et al., 2000; Foxe & Simpson, 2005; Cressman & Henriques, 2009; Rand & Heuer, 2019). Thus, our results lend (at least tentative) support to arguments that visuo-proprioceptive (or visuo-motor) adaptation and recalibration can be enhanced by increasing the precision of visual information (attending to vision; cf. Kelso et al., 1975; Posner et al., 1976). Notably, our results also suggest that the reverse can be true; i.e., that visuo-proprioceptive recalibration can be counteracted by increasing one's attention to proprioception. In sum, our results suggest that updating the predictions of a 'body model' affects goal-directed action. However, as it has been suggested that prediction updating may happen without control updating (Mathew et al., 2018), future work could establish whether the effects observed in our study can have long-lasting impact on the (generalizable) learning of motor control.
A noteworthy difference between our simulation results and the result of the behavioural experiment was that our participants exhibited a more pronounced shift of their real movements in the 'real hand' condition (which partly aligned the delayed virtual hand with the target's phase).
This effect was reminiscent of the behaviour of our simulated agent under 'high distraction' (i.e., attention to the task-irrelevant modality) and occurred despite the fact that, as indicated by the ratings, participants focused on their real hand and tried to comply with the task instructions. Interestingly, however, our participants reported the 'real hand' task to be easier than the 'virtual hand' task under visuo-proprioceptive incongruence, which suggests that they did not notice their 'incorrect' behavioural adjustment. In contrast, the simulated agent even showed slightly better RH than VH alignment; this can be explained by the fact that proprioception was 'naturally' the modality driving movement, while vision was experimentally delayed (which had to be inferred by the agent).
One tentative interpretation of the much stronger visual bias in the behavioural experiment is possible in light of predictive coding formulations of shared body representation and self-other distinction; i.e., the relative balance between visual and proprioceptive prediction errors to decide whether 'I am observing an action' or whether 'I am moving' (Kilner et al., 2003, 2007; Friston, 2012; cf. Vasser et al., 2019). Generally, visual prediction errors have to be attenuated during action observation to prevent actually performing (i.e., mirroring) the observed movement (Friston, 2012). However, several studies have demonstrated 'automatic' imitative tendencies during action observation, reminiscent of 'echopraxia', which are extremely hard to inhibit; for example, seeing an incongruent finger or arm movement biases participants' own movement execution (Brass et al., 2001; Kilner et al., 2003). In a predictive coding framework, this can be formalized as an 'automatic' update of multimodal beliefs driving action by precise (not sufficiently attenuated) visual body information (cf.
Kilner et al., 2007).
Such an interpretation would be in line with speculations that participants in visuo-motor conflict tasks attend to vision, rather than proprioception, if not instructed otherwise (Kelso et al., 1975; [...] a seen visual hand posture that is incongruent (note that this could mean leading or lagging in our case) to the felt one, can account for our behavioural results could be clarified by future work. Likewise, an interesting question is whether these effects could perhaps be reduced by actively ignoring or 'dis-attending' (Clark, 2015; Limanowski, 2017) away from vision. An analogous mechanism has been tentatively suggested by observed benefits of proprioceptive attenuation (thereby increasing the relative impact of visual information) during visuo-motor adaptation and visuo-proprioceptive recalibration (Taub & Goldberg, 1974; Ingram et al., 2000; Balslev et al., 2004; Bernier et al., 2009; cf. Limanowski et al., 2015a,b; Zeller et al., 2016). These questions should best be addressed by combined behavioural and brain imaging experiments, to illuminate the neuronal correlates of the (supposedly attentional) precision weighting in the light of recently proposed implementations of predictive coding in the brain (Bastos et al., 2012; Shipp et al., 2013; Shipp, 2016).
It should be noted that our results need to be validated by future work using more complicated movement tasks (here, we focused on a simple, well-trained grasping movement), different target modalities (we used a visual, albeit non-spatial target), and more biophysically realistic models of motor (hand movement) control. Moreover, we interpret our simulation and empirical results in terms of evidence for top-down precision modulation, which corresponds to the process of 'attention' within the active inference account of predictive coding (Feldman & Friston, 2010; Edwards et al., 2012; Brown et al., 2013).
This interpretation needs to be applied with some caution to the behavioural results, as we can only infer any attentional effects from the participants' self-reports. We assume that participants monitored their behaviour continuously, but with the present data we cannot rule out that movements might have been executed automatically between discrete time points at which behaviour was monitored. Future work could therefore use explicit measures of attention, perhaps supplemented by forms of supervision, to validate behavioural effects. Finally, our experimental design is not able to disentangle the (likely interdependent) effects of sensory noise and attention (i.e., expected precision). Therefore, another important question for future research is the potential attentional compensation of experimentally added sensory noise (e.g., via jittering or blurring the visual hand or via tendon vibration in the proprioceptive domain, cf. Jaeger et al., 1979), whereby it should be remembered that these manipulations may in themselves be 'attention-grabbing' (Beauchamp et al., 2010).

Task design

We used the same task design in the simulations and the behavioural experiment (see Fig. 1). For consistency, we will describe the task as performed by our human participants, but the same principles apply to the simulated agent. We designed our task as a non-spatial modification of a previously used hand-target tracking task (cf. Limanowski et al., 2017; Limanowski & Friston, 2019). The participant (or simulated agent) had to perform repetitive grasping movements paced by sinusoidal fluctuations in the size of a central fixation dot (sinusoidal oscillation at 0.5 Hz). Thus, this task was effectively a phase matching task, which we hoped would be less biased towards the visual modality due to a more abstract target quantity (oscillatory size change vs spatially moving target, as in previous studies). The fixation dot was chosen as the target to ensure that participants had to fixate the centre of the screen (and therefore look at the virtual hand) in all conditions. Participants (or the simulated agent) controlled a virtual hand model via a data glove worn on their unseen right hand (details below). In this way, vision (seen hand position via the virtual hand) could be decoupled from proprioception (felt hand position). In half of the movement trials, a temporal delay of 500 ms between visual and proprioceptive hand information was introduced by delaying vision (i.e., the seen hand movements) with respect to proprioception (i.e., the unseen hand movements performed by the participant or agent). In other words, the seen and felt hand positions were always incongruent (phase-shifted) in these conditions. Crucially, the participant (agent) had to perform the phase matching task with one of two goals in mind: to match the target's oscillatory phase with the seen virtual hand movements (vision) or with the unseen real hand movements (proprioception).
This resulted in a 2 x 2 factorial design with the factors 'visuo-

Simulations
We based our simulations on predictive coding formulations of active inference as situated within a free energy principle of brain function, which have been used in many previous publications to simulate perception and action (e.g. Friston et al., 2010; Friston, 2012; Brown & Friston, 2013; Perrinet et al., 2014; Adams et al., 2015). Here, we briefly review the basic assumptions of this scheme (please see the above literature for details).

Hierarchical predictive coding rests on a probabilistic mapping of hidden causes to sensory [...] input is generated from causes in the environment; where increasingly higher-level beliefs represent increasingly abstract (i.e., hidden or latent) states of the environment. The generative model therefore maps from unobservable causes (hidden states) to observable consequences (sensory states). Model inversion corresponds to inferring the causes of sensations; i.e., mapping from consequences to causes. Operationally, this inversion rests upon the minimisation of free energy or 'surprise' approximated in the form of prediction error. In other words, prediction errors are used to update expectations to accommodate or 'explain away' ascending prediction error. This corresponds to Bayesian filtering or predictive coding (Rao & Ballard, 1999; Friston & Kiebel, 2009; Bastos et al., 2012), which, under linear assumptions, is formally identical to linear quadratic control in motor control theory (Todorov, 2008). In such an architecture, descending connections convey predictions suppressing activity in the cortical level immediately below, and ascending connections return prediction error (i.e., sensory data not explained by descending predictions).
Crucially, the ascending prediction errors are precision-weighted (where precision corresponds to the inverse variance), so that a prediction error that is afforded [...]

Active inference extends hierarchical predictive coding from the sensory to the motor domain; i.e., by equipping standard Bayesian filtering schemes (a.k.a. predictive coding) with classical reflex arcs that enable action (e.g., a hand movement) to fulfil predictions about hidden states of the world. In brief, desired movements are specified in terms of prior beliefs about state transitions (policies), which are then realised by action; i.e., by sampling or generating sensory data that provide evidence for those beliefs (Perrinet et al., 2014). Thus, action is also driven by optimisation of the model via suppression of prediction error: movement occurs because high-level multi- or amodal prior beliefs about behaviour predict proprioceptive and exteroceptive (visual) states that would ensue if the movement was performed (e.g., a particular limb trajectory). Prediction error is then suppressed throughout a motor hierarchy; ranging from intentions and goals over kinematics to muscle activity (Kilner et al., 2007; cf. Grafton & Hamilton, 2007). At the lowest level of the hierarchy, spinal reflex arcs suppress proprioceptive prediction error by enacting the predicted movement, which also implicitly minimises exteroceptive prediction error; e.g. the predicted visual consequences of the action ([...]; Friston, 2011). Thus, via embodied interaction with its environment, an agent can reduce its model's free energy ('surprise' or, under specific assumptions, prediction error) or, in other words, maximise Bayesian model evidence. Put succinctly, all action is in the service of self-evidencing (Hohwy, 2016).
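To make the role of precision weighting concrete, the following minimal Python sketch combines two conflicting cues according to their precisions. It is a static illustration under Gaussian assumptions, not the dynamic scheme used in our simulations; the function name `fuse_cues`, the cue values and the log precisions are hypothetical choices made for illustration only.

```python
import math


def fuse_cues(s_vis, s_prop, log_pi_vis, log_pi_prop):
    """Precision-weighted average of a visual and a proprioceptive cue.

    Precision is the inverse variance; log precisions are exponentiated first.
    All names and values here are illustrative assumptions.
    """
    pi_vis = math.exp(log_pi_vis)
    pi_prop = math.exp(log_pi_prop)
    return (pi_vis * s_vis + pi_prop * s_prop) / (pi_vis + pi_prop)


# Conflicting cues: vision reports 0.0, proprioception reports 1.0.
baseline = fuse_cues(0.0, 1.0, log_pi_vis=4.0, log_pi_prop=3.0)
# "Attending" to proprioception: raise its log precision by 1.
attend_prop = fuse_cues(0.0, 1.0, log_pi_vis=4.0, log_pi_prop=4.0)

print(f"baseline estimate: {baseline:.3f}")
print(f"attending to proprioception: {attend_prop:.3f}")
```

With these illustrative values, the combined estimate moves from about 0.27 to 0.5, i.e., towards the proprioceptive cue, once its precision is increased.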
Following the above notion of active inference, one can describe action and perception as the solution to coupled differential equations describing the dynamics of the real world (boldface) and the behaviour of an agent (italics, cf. Friston et al., 2010 for details).

$$
\begin{aligned}
\mathbf{s} &= \mathbf{g}(\mathbf{x}, \boldsymbol{\nu}, \mathbf{a}) + \boldsymbol{\omega}_\nu \\
\dot{\mathbf{x}} &= \mathbf{f}(\mathbf{x}, \boldsymbol{\nu}, \mathbf{a}) + \boldsymbol{\omega}_x \\
\dot{a} &= -\partial_a F(\tilde{s}, \tilde{\mu}) \\
\dot{\tilde{\mu}} &= D\tilde{\mu} - \partial_{\tilde{\mu}} F(\tilde{s}, \tilde{\mu})
\end{aligned}
\tag{1}
$$

The first pair of coupled stochastic (i.e., subject to random fluctuations $\boldsymbol{\omega}_x, \boldsymbol{\omega}_\nu$) differential equations describes the dynamics of hidden states and causes in the world and how they generate sensory states. Here, $(\mathbf{s}, \mathbf{x}, \boldsymbol{\nu}, \mathbf{a})$ denote sensory input, hidden states, hidden causes and action in the real world, respectively. The second pair of equations corresponds to action and perception, respectively; they constitute a (generalised) gradient descent on variational free energy, known as an evidence bound in machine learning (Winn & Bishop, 2005). The differential equation describing perception corresponds to generalised filtering or predictive coding. The first term is a prediction based upon a differential operator $D$ that returns the generalised motion of conditional (i.e., posterior) expectations about states of the world, including the motor plant (vector of velocity, acceleration, jerk, etc.). Here, the variables $(\tilde{s}, \tilde{\mu}, a)$ correspond to generalised sensory input, conditional expectations and action, respectively. Generalised coordinates of motion, denoted by the ~ notation, correspond to a vector representing the different orders of motion (position, velocity, acceleration, etc.) of a variable. The differential equations above are coupled because sensory states depend upon action through hidden states and causes $(\mathbf{x}, \boldsymbol{\nu})$, while action $\mathbf{a}(t) = a(t)$ depends upon sensory states through internal states $\tilde{\mu}$.
Neurobiologically, these equations can be considered to be implemented in terms of predictive coding; i.e., using prediction errors on the motion of hidden states (such as visual or proprioceptive cues about hand position) to update beliefs or expectations about the state of the lived world and embodied kinematics.

By explicitly separating hidden real-world states from the agent's expectations as above, one can separate the generative process from the updating scheme that minimises free energy. To perform simulations using this scheme, one solves Eq. 1 to simulate (neuronal) dynamics that encode conditional expectations and ensuing action. The generative model thereby specifies a probability density function over sensory inputs and hidden states and causes, which is needed to define the free energy of sensory inputs:

$$
\begin{aligned}
s &= g^{(1)}(x^{(1)}, \nu^{(1)}) + \omega_\nu^{(1)} \\
\dot{x}^{(1)} &= f^{(1)}(x^{(1)}, \nu^{(1)}) + \omega_x^{(1)} \\
&\;\;\vdots \\
\nu^{(i-1)} &= g^{(i)}(x^{(i)}, \nu^{(i)}) + \omega_\nu^{(i)} \\
\dot{x}^{(i)} &= f^{(i)}(x^{(i)}, \nu^{(i)}) + \omega_x^{(i)} \\
&\;\;\vdots
\end{aligned}
\tag{2}
$$
This probability density is specified in terms of nonlinear functions of hidden states and causes $(f^{(i)}, g^{(i)})$ that generate dynamics and sensory consequences, and Gaussian assumptions about random fluctuations $(\omega_x^{(i)}, \omega_\nu^{(i)})$ on the motion of hidden states and causes. These play the role of sensory noise or uncertainty about states. The precisions of these fluctuations are quantified by $(\Pi_x^{(i)}, \Pi_\nu^{(i)})$, which are the inverse of the respective covariance matrices.

Given the above form of the generative model (Eq. 2), we can now write down the differential equations (Eq. 1) describing neuronal dynamics in terms of prediction errors on the hidden causes and states as

[...] (3)

The above equation (Eq. 3) describes recurrent message passing between hierarchical levels to suppress free energy or prediction error (i.e., predictive coding, cf. Friston & Kiebel, 2009; Bastos et al., 2012). Specifically, error units receive predictions from the same hierarchical level and the level above. Conversely, conditional expectations ('beliefs', encoded by the activity of state units) are driven by prediction errors from the same level and the level below. These constitute bottom-up and lateral messages that drive conditional expectations towards a better prediction to reduce the prediction error in the level below; this is the sort of belief updating described in the introduction.
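The belief updating described above can be illustrated with a deliberately reduced toy example. The Python sketch below is not Eq. 3; it assumes a single state unit, a static sensory input, no generalised coordinates and no action. The function name `settle` and all parameter values are hypothetical. It Euler-integrates a gradient descent on the free energy $F = \tfrac{1}{2}\Pi_s (s - \mu)^2 + \tfrac{1}{2}\Pi_\eta (\eta - \mu)^2$.

```python
import math


def settle(s, eta, log_pi_s, log_pi_eta, dt=0.01, n_steps=2000):
    """Euler-integrate one state unit driven by two precision-weighted errors.

    s: sensory input; eta: top-down (prior) prediction of the state.
    Hypothetical single-level toy, not the scheme used in the paper.
    """
    pi_s, pi_eta = math.exp(log_pi_s), math.exp(log_pi_eta)
    mu = eta  # start at the prior expectation
    for _ in range(n_steps):
        err_s = s - mu      # sensory prediction error (level below)
        err_eta = eta - mu  # error relative to the top-down prediction (same level)
        # Gradient descent on F: d(mu)/dt = pi_s * err_s + pi_eta * err_eta
        mu += dt * (pi_s * err_s + pi_eta * err_eta)
    return mu


mu_final = settle(s=1.0, eta=0.0, log_pi_s=3.0, log_pi_eta=3.0)
print(f"settled expectation: {mu_final:.3f}")
```

In this toy case the expectation settles at the precision-weighted average of input and prior, so raising the sensory log precision pulls the settled value towards the sensory input.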
[...] down the motor (proprioceptive) hierarchy to be unpacked into proprioceptive predictions at the level of (pontine) cranial nerve nuclei and spinal cord, which are then 'quashed' by movement so that predicted movements are enacted to 'fulfil' predictions.

In our case, the generative process and model used for simulating the target tracking task are straightforward (using just a single level) and can be expressed as follows:

[...] (4)

The first pair of equations describe the generative process; i.e., a noisy sensory mapping from hidden states and the equations of motion for states in the real world. In our case, the real-world variables [...] becomes a random variable. Therefore, the single movement we simulated in each condition may be interpreted as a participant-specific average over realizations; i.e., in which the effects of random fluctuations are averaged out (cf. Perrinet et al., 2014; Adams et al., 2015). This ensured that our simulations reflect systematic differences depending on the parameter values chosen to reflect alterations of sensory attention via changing parameters of the agent's model (as described below).

The second pair of equations describe the agent's generative model of how sensations are generated using the form of Eq. 2. These define the free energy in Eq. 1 and specify behaviour (under active inference). The generative model has the same form as the generative process, with the important [...] hand and the target $x_t - x_h$. In other words, the agent believes that its grasping movements will follow the target's oscillatory size change, which is itself driven by some unknown force at a constant rate (and thus producing an oscillatory trajectory as in the generative process). This effectively models (the compliance with) the task instruction, under the assumption that participants already know about the oscillatory phase of the target; i.e., they have been well trained.
Importantly, this formulation models the 'real hand' instruction; under the 'virtual hand' instruction, the state of the hand was driven by $x_t$ - ($x_h$ [...] $v$), reflecting the fact that any perceived visual delay (i.e., the inferred displacement of vision from proprioception $v$) should now also be compensated to keep the virtual hand aligned with the target's oscillatory phase under incongruence; the initial value for $v$ was set to represent the respective information about visuo-proprioceptive congruence, i.e., 0 for congruent movement conditions and 0.35 for incongruent movement conditions. We defined the agent's model to entertain a prior belief that visual and proprioceptive cues are normally congruent (or, for comparison, incongruent). This was implemented by setting the prior expectation of the cause $v$ to 0 (indicating congruence of visual and proprioceptive hand posture information), with a log precision of 3 (corresponding to a precision of about 20.1). In other words, the hidden cause could vary, a priori, with a standard deviation of about exp(-3/2) = 0.22. This mimicked the learned association between seen and felt hand positions (under a minimal degree of flexibility), which is presumably formed over a lifetime, is very hard to overcome, and underwrites phenomena like the 'rubber hand illusion' (Botvinick & Cohen, 1998; see Introduction).

Crucially, the agent's model included a precision-weighting of the sensory signals, as determined by the active deployment of attention under predictive coding accounts of active inference. This allowed us to manipulate the precision assigned to proprioceptive or visual prediction errors ($\Pi_p$, $\Pi_v$) that, per default, were given a log precision of 3 and 4, respectively (corresponding to precisions of about 20.1 and 54.6, respectively). This reflects the fact that vision usually is afforded a higher precision than proprioception in hand position estimation (e.g. van Beers et al., 1999; cf.
Kelso et al., 1975). To implement increases in task-related (selective) attention, we increased the log precision of prediction errors from the instructed modality (vision or proprioception) by 1 in each case (i.e., by a factor of about 2.7); in an [...] modality by increasing the precision of the appropriate prediction errors. We did not simulate increases in both sensory precisions, because our study design was tailored to investigate selective attention as opposed to divided attention. Note that in the task employed, divided attention was precluded, since attentional set was induced via instructed task-relevance; i.e., attempted target phase-matching. In other words, under incongruence, only one modality could be matched to the target. The ensuing generative process and model are, of course, gross simplifications of a natural movement paradigm. However, this formulation is sufficient to solve the active inference scheme in Eq. 1 and examine the agent's behaviour under the different task instructions and, more importantly, under varying degrees of selectively enhanced sensory precision afforded by an attentional set.

Experiment

26 healthy, right-handed volunteers (15 female, mean age = 27 years, range = 19-37, all with normal or corrected-to-normal vision) participated in the experiment, after providing written informed consent. Two participants were unable to follow the task instructions during training and were excluded from the main experiment, resulting in a final sample size of 24. The experiment was approved by the local research ethics committee (University College London) and conducted in accordance with the usual guidelines.
During the experiment, participants sat at a table wearing an MR-compatible data glove (5DT Data Glove MRI, 1 sensor per finger, 8-bit flexure resolution per sensor, 60 Hz sampling rate) on their right hand, which was placed on their lap under the table. The data glove measured the participant's finger flexion via sewn-in optical fibre cables; i.e., each sensor returned a value from 0 to 1 corresponding to minimum and maximum flexion of the respective finger. These raw data were fed to a photorealistic virtual right hand model (cf. Limanowski et al., 2017), whose fingers were thus moveable with one degree of freedom (i.e., flexion-extension) by the participant, in real-time. The virtual reality task environment was instantiated in the open-source 3D computer graphics software Blender (http://www.blender.org) using a Python programming interface, and presented on a computer screen [...]

The participants' task was to perform repetitive right-hand grasping movements paced by the oscillatory size change of the central fixation dot, which continually decreased and increased in size sinusoidally (12 % size change) at a frequency of 0.5 Hz; i.e., this was effectively a phase matching task (Fig. 1). The participants had to follow the size changes with right-hand grasping movements; i.e., to close the hand when the dot shrank and to open the hand when the dot grew. In half of the movement trials, an incongruence between visual and proprioceptive hand information was introduced by delaying the virtual hand's movements by 500 ms with respect to the movements performed by the participant. In other words, the virtual hand and the real hand were persistently in incongruent (mismatching) postures in these conditions. The delay was clearly perceived by all participants.
Participants performed the task in trials of 32 seconds (16 movement cycles; the last movement was signalled by a brief blinking of the fixation dot), separated by 6 second fixation-only periods. The task instructions ('VIRTUAL' / 'REAL') were presented before each respective movement trial for 2 seconds. Additionally, participants were informed whether in the upcoming trial the virtual hand's movements would be synchronous ('synch.') or delayed ('delay'). The instructions and the fixation dot in each task were coloured (pink or turquoise, counterbalanced across participants), to help participants remember the current task instruction during each movement trial. Participants practised the task until they felt confident, and then completed two runs of 8 min each. Each of the four conditions 'virtual hand task under congruence' (VH cong), 'virtual hand task under incongruence' (VH incong), 'real hand task under congruence' (RH cong), and 'real hand task under incongruence' (RH incong) was presented 3 times per run, in randomized order.

To analyse the behavioural change in terms of deviation from the target (i.e., phase shift from the oscillatory size change), we averaged and normalized the movement trajectories in each condition for each participant (raw data were averaged over the four fingers, no further pre-processing was applied). We then calculated the phase shift as the average angular difference between the raw averaged movements of the virtual or real hand and the target's oscillatory pulsation phase in each condition, using a continuous wavelet transform. The resulting phase shifts for each participant and condition were [...] congruence (congruent, incongruent) to test for statistically significant group-level differences. Post-hoc t-tests (two-tailed, with Bonferroni-corrected alpha levels to account for multiple comparisons) were used to compare experimental conditions.
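As a rough illustration of what such a phase shift measures, the following Python sketch estimates the angular difference between a hand trajectory and the target at the target frequency. Note that it substitutes a single-frequency Fourier projection for the continuous wavelet transform used in the analysis, and it operates on idealised synthetic sinusoids rather than glove data; the function name `phase_shift_deg` and the synthetic signals are assumptions made for illustration.

```python
import cmath
import math


def phase_shift_deg(signal, target, freq_hz, fs_hz):
    """Phase of `signal` relative to `target` at `freq_hz`, in degrees.

    Negative values mean the signal lags the target. Uses a single-frequency
    Fourier projection (an assumption; not a continuous wavelet transform).
    """
    def component(x):
        # Complex amplitude of x at freq_hz
        return sum(v * cmath.exp(-2j * math.pi * freq_hz * n / fs_hz)
                   for n, v in enumerate(x))

    cross = component(signal) * component(target).conjugate()
    return math.degrees(cmath.phase(cross))


fs, f0, duration = 60.0, 0.5, 32.0  # 60 Hz sampling, 0.5 Hz target, 32 s trial
delay = 0.5                          # 500 ms visual delay
t = [n / fs for n in range(int(fs * duration))]
target = [math.sin(2 * math.pi * f0 * ti) for ti in t]
# Synthetic "virtual hand": a copy of the target movement delayed by 500 ms
delayed_hand = [math.sin(2 * math.pi * f0 * (ti - delay)) for ti in t]

shift = phase_shift_deg(delayed_hand, target, f0, fs)
print(f"phase shift: {shift:.1f} degrees")
```

Because a 500 ms delay is a quarter of the 2 s target cycle, a hand trajectory that matches the target perfectly in one modality lags it by about 90 degrees in the other under incongruence.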
After the experiment, participants were asked to indicate, for each of the four conditions separately, their answers to the following two questions: "How difficult did you find the task to perform in the following conditions?" (Q1, answered on a 7-point visual analogue scale from "very easy" to "very difficult") and "On which hand did you focus your attention while performing the task?" (Q2, answered on a 7-point visual analogue scale from "I focused on my real hand" to "I focused on the virtual hand"). The questionnaire ratings were evaluated for statistically significant differences using a nonparametric Friedman's test and Wilcoxon's signed-rank test (with Bonferroni-corrected alpha levels to account for multiple comparisons) due to non-normal distribution of the residuals.