Training listeners to detect auditory-visual temporal coherence enhances their ability to exploit visual information for auditory scene analysis

Listeners engaged in an auditory selective attention task are better able to report brief deviants in a target auditory stream when a task-irrelevant visual stimulus is coherently modulated with the target stream than when it is coherent with the distractor stream (Maddox et al., 2015). Here, we demonstrate that learning to better discriminate auditory-visual temporal coherence, but not simple exposure to temporally coherent AV stimuli, enhances the ability of listeners to exploit visual information in this task. After 5 short training sessions, listeners were able to benefit from auditory-visual temporal coherence both when the visual stimulus was temporally coherent with the target and when it was coherent with the distractor stream, relative to an independently modulated condition. These findings indicate that training to discriminate cross-modal temporal coherence fundamentally changes how listeners exploit visual information for auditory scene analysis.

While some auditory-visual (AV) correspondences, such as temporal and spatial relations (Spence and Deroy, 2012), seem to be innate or established very early in life, others, such as those that rely on semantic relations, are learned through experience (Navarra et al., 2010). In our previous study (Maddox et al., 2015), human participants performed an auditory selective attention task in which they were required to report brief frequency or timbre deviants in a target auditory stream, while ignoring those occurring in a simultaneous distractor. Listeners were better able to perform this task when changes in the size of a task-irrelevant visual stimulus were temporally coherent with intensity changes in the target auditory stream. Since the visual stimulus conveyed no information about whether or when auditory deviants occurred, we concluded that the only way in which listeners could benefit from AV temporal coherence was if this helped them to better segregate the competing auditory streams. We have since proposed that enhancement in a stimulus dimension (here, pitch or timbre) orthogonal to a cross-modal binding feature (here, temporally coherent changes in auditory intensity and visual size) is strong evidence for cross-modal binding (Bizley et al., 2016) and demonstrated that the integration of visual information into early auditory cortex provides a potential mechanism for these effects (Atilgan et al., 2018).

In the present study, we hypothesized 1) that training participants to better detect AV temporal coherence might facilitate an improved ability to use visual information in the selective attention task, and 2) that the ability of an observer to detect auditory-visual temporal coherence might determine their ability to utilize such information to assist with auditory scene analysis.

We recruited participants and randomly assigned them to one of three groups, each of which performed a pre-test and post-test comprising the timbre variant of the selective attention task in Maddox et al. (2015) and a measurement of their ability to detect AV temporal coherence. Between the pre- and post-test, one group trained on an AV temporal coherence task (AVTC group, n=12), one group trained on an AM rate discrimination task with temporally coherent AV stimuli (AVAM group, n=12), and a third group simply performed the pre-test and post-test (control, n=12; Fig. 1A).

We first determined whether the five brief (< 40 minutes per session) training sessions experienced by participants in the training groups were sufficient to improve performance in the trained task. Both groups of participants showed improved performance between sessions 1 and 5 (Fig. 1B, C; pairwise t-test on S1 and S5 AVTC thresholds, t(22) = 2.961, p = 0.007; AVAM thresholds: t(22) = 4.529, p < 0.001). Pairwise comparison between the AV temporal coherence thresholds measured in the pre- and post-test for the three experimental groups revealed that coherence thresholds were significantly decreased only in the AVTC group (Fig. 1D, t(22) = 3.081, p = 0.005) and not in the AVAM group (t(22) = 1.69, p = 0.104) or in the control group (t(22) = 0.234, p = 0.817). While the change in threshold for the AVTC group was correlated with the change in performance between sessions 1 and 5 (r = 0.632, p = 0.027), there was no correlation between the changes in AVAM threshold and AV temporal coherence threshold (r = 0.392, p = 0.207).

Figure 1: A Experimental design. B Training in the AVTC task was effective at driving an improvement in AV temporal coherence discrimination between session 1 (S1) and session 5 (S5). The black line shows the mean ± SEM across participants; gray lines are individual subjects. C Training in the AVAM task was effective at driving an improvement in AM rate discrimination between S1 and S5. D AV temporal coherence threshold values in the pre-test and post-test for the three groups. E Mean d' across AV coherence conditions in the pre-test and post-test for the three groups. * indicates significant differences (pairwise t-tests, p < 0.05).

Having confirmed that training was effective, we turned to performance in the selective attention task. While neither training paradigm exposed participants to the deviants they were required to detect in the selective attention task, both groups were exposed to otherwise similar AV stimuli. We predicted that perceptual learning might drive improvements in performance in both trained groups but hypothesized that only the AVTC group would show a change in their ability to exploit visual stimuli. To detect overall changes in performance, we calculated the across-condition sensitivity (d') and directly compared the pre- and post-test data across the three experimental groups. [...] group, improved their ability to detect timbre deviants in a target auditory stream.

bioRxiv preprint doi: https://doi.org/10.1101/295766; this version posted April 5, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. It is made available under a CC-BY 4.0 International license.

To further understand the impact of training on the ability of participants to benefit from temporal coherence between auditory and visual streams, we next consider each group in turn. We calculated d', bias, hit rates, false alarm rates and visual hit rates for each AV condition and conducted two-way repeated measures ANOVA on these values with factors of AV coherence (Target, Masker and Neither) and training (Pre- and Post-test; Table 1) for each experimental group.
In the AVTC group, for d', there was a significant effect of training (F(1, 71) = 9.39, p = 0.006) and of AV coherence (F(2, 71) = 9.13, p < 0.001), and a significant interaction (Fig. 2A, D; F(2, 71) = 7.26, p = 0.002). Post-hoc comparison (p < 0.05) across AV coherence conditions in the pre-test revealed that participants performed better when the visual stimulus was coherent with the target auditory stream than with the masker auditory stream (Target > Masker; p = 0.0031, Bonferroni-corrected α = 0.017). Similar results were obtained for hit rates (see Table 1). In contrast, post-hoc comparisons of the post-test d' scores revealed that, after training, performance was better when the visual stimulus was coherent with either the target or the masker stream than in the independent condition (Target > Neither, p = 0.0046; Masker > Neither, p = 0.0055, Bonferroni-corrected α = 0.017), indicating that participants were using visual information in a qualitatively different way after training.

Hit rates, false alarm and visual hit rates for three experimental groups.

Like the participants in the AVTC group, participants in the AVAM group were exposed to the target vowel sounds used in the auditory scene analysis (ASA) task, but they were not actively discriminating temporal coherence and were only exposed to temporally coherent stimuli. Training improved their performance in the ASA task (Fig. 2B, E). Both training (F(1, 71) = 5.31, p = 0.009) and AV coherence condition (F(2, 71) = 3.69, p = 0.044) influenced d', but, importantly, there was no interaction (F(2, 71) = 0.17, p = 0.844).
Post-hoc comparison (p < 0.05) across AV coherence conditions in the pre- and post-test revealed that subjects performed better when the visual stimulus was coherent with the target auditory stream than with the masker auditory stream (Target > Masker, Pre-test: p = 0.0049; Post-test: p = 0.0063). This suggests an overall improvement in performance after AVAM training, but no change in the way in which subjects were able to exploit visual cues.

Performance in the control group did not differ between pre- and post-test (Fig. 2C, G) [...]

Figure 3: [...] B Scatter plot showing the Masker-Neither d' difference for the 36 naïve listeners that completed the pre-test (negative values indicate masker performance is impaired relative to the independent condition) versus AV coherence threshold for the pre-test data. Participants who showed a benefit for the masker-coherent condition had lower AV coherence thresholds.

We explored individual differences in performance to test the hypothesis that the ability of naïve listeners to detect AV temporal coherence would predict their ability to benefit from AV temporal coherence in the ASA task. We correlated each listener's AV temporal coherence threshold with the difference between the d' scores in the Target-coherent and Masker-coherent visual conditions (Fig. 3A). Contrary to our hypothesis, there was no relationship between these values (r = 0.1543, p = 0.3688), nor was there any relationship between overall performance (across-condition d') and AV temporal coherence thresholds (r = 0.2888, p = 0.0882). Having observed that the AVTC group improved their ability to utilize masker-visual stimulus temporal coherence, we considered whether temporal coherence thresholds might be correlated with the magnitude of the impairment that the masker-coherent condition had over the independent condition.
The Masker-coherent minus Neither difference was weakly negatively correlated with AV temporal coherence thresholds (Fig. 3B; r = 0.339, p = 0.0438), suggesting a trend whereby participants with better AV temporal coherence thresholds were more able to exploit the temporal coherence between masker and visual stimulus to yield a performance benefit relative to the independent condition. This finding mirrors the effect of training, whereby improving AV coherence thresholds led to an improvement in the masker-coherent condition.

Finally, participants' change in performance between pre- and post-test was correlated with their change in ability to detect temporal coherence between auditory and visual stimuli. Participants with a larger change in their AV coherence threshold showed larger improvements in overall performance (Fig. 3C; r = 0.353, p = 0.0347).
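
The sensitivity index d' and the bias measure reported in these results can be derived from hit and false alarm counts. The following is a minimal sketch, assuming the standard equal-variance Gaussian signal detection model and a log-linear correction for extreme rates; the text does not state which correction, if any, was applied, and the function name is hypothetical.

```python
from statistics import NormalDist


def dprime_and_bias(hits, misses, false_alarms, correct_rejections):
    """Return (d', criterion c) under the equal-variance Gaussian model.

    Assumption (not from the paper): a log-linear correction, adding 0.5
    to each count, keeps the rates away from exactly 0 or 1 so that the
    inverse normal transform stays finite.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    d_prime = z(hit_rate) - z(fa_rate)
    criterion = -0.5 * (z(hit_rate) + z(fa_rate))
    return d_prime, criterion


if __name__ == "__main__":
    # Illustrative counts only; these are not data from the study.
    d, c = dprime_and_bias(hits=40, misses=10, false_alarms=10, correct_rejections=40)
    print(f"d' = {d:.2f}, c = {c:.2f}")
```

Computing the measure per AV coherence condition (Target, Masker, Neither) and per session would give the values entered into an analysis like the one described above.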

Discussion
Here we demonstrate that five short training sessions can improve both a listener's ability to detect AV temporal correspondence and their ability to exploit cross-modal temporal coherence to segregate a sound mixture. This effect is principally driven by an improvement in the ability of listeners to exploit temporal coherence in the masker-coherent condition. We have previously demonstrated that the enhancement of one sound in a mixture by a temporally coherent visual stimulus is a stimulus-driven, attention-independent, bottom-up effect supported by the early integration of auditory and visual information in auditory cortex (Atilgan et al., 2018). In keeping with our behavioral data from naïve listeners, such an enhancement seems likely to facilitate selective attention when the temporally coherent stream is a target, and impair it when that sound is a distractor. Nonetheless, if AV temporal coherence allows the representation of each of two competing sounds to be more distinct within sensory cortex, then temporal coherence between the visual stimulus and either the target or the masker should offer an advantage over an independently modulated visual stimulus. That this advantage is weakly present in some naïve listeners (Fig. 3B) but appears strongly after training (Fig. 1D) implies that it potentially arises from an interaction between the stimulus-driven effects we observe in the absence of attention and a top-down process. Substantiating such speculation requires further behavioral and neurophysiological investigation.

Previous studies have illustrated that visual cues can assist speech processing in noise (Grant et al., 1998; Schwartz et al., 2004; Helfer and Freyman, 2005).
While speech reading abilities are strongly predictive of audiovisual benefit for speech reception thresholds (MacLeod and Summerfield, 1987), lip reading can also influence auditory streaming (Devergie et al., 2011), supporting the idea that lip reading benefits in noise potentially comprise both bottom-up sensory effects that facilitate auditory scene analysis (Atilgan et al., 2018) and the conveying of phonetic information. An important question in interpreting the significance of our findings is whether the benefits in the auditory selective attention task transfer to other, more real-world tasks such as utilizing speech reading in noisy listening conditions.

Materials and methods
Subjects
Forty-two adults (age range 18-34 years; mean age 28 years; 11 males) with normal hearing and normal or corrected-to-normal vision participated in the study. Six participants were excluded after the pre-test due to poor performance (mean d' < 0.8, n=4) or low visual hit rates (< 70%, n=2). The remaining 36 participants were randomly allocated to three groups. Participants were paid for their participation and gave written informed consent to the study approved by the Ethics Committee of the University College [...]

Stimuli and testing procedure
For the auditory selective attention task, the stimuli were generated and presented as described in the timbre variant of the previous study (Maddox et al., 2015). That is, on each trial participants heard two diotically presented artificial vowels with distinct pitches and timbres whose amplitudes were independently [...] discrimination thresholds but instead used a fixed level of difficulty determined using the individual thresholds measured in Maddox [...] equally likely to be target or masker. For AVTC and AVAM training and temporal coherence testing we used the same auditory (either [u] or [e], F0 = 175 or 190 Hz, counterbalanced) and visual (a radius-modulated gray disc) stimuli. The pre- and post-test sessions each lasted 90 minutes in total and were separated by a maximum of 2 weeks. Participants in the control group did no training sessions but performed the pre- and post-test within 2 weeks (mean ± SD = 5 ± 3 days).

For the AV coherence threshold test, two artificial vowel sounds (duration 5 seconds, with identical pitch and timbre within a trial, randomly drawn across trials, amplitude modulated at < 7 Hz) were presented consecutively, each accompanied by a visual stimulus. In one interval, the radius modulation of the visual stimulus was independent of the envelope of the sound, while in the other interval, the auditory and visual stimuli maintained some degree of temporal coherence. The method of constant stimuli was used to determine the threshold, with subjects performing 20 trials at each coherence level. AV stimuli were generated from 100% coherent to 10% coherent in 10% steps by multiplying the temporally coherent envelope with an independent envelope. Participants were required to select the interval (by pressing 1 or 2 on a button box) in which the temporally coherent pair was presented. Feedback was provided on every trial.
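
Data of this kind (a fixed number of two-interval trials at each coherence level) can be reduced to a single threshold in several ways. The sketch below shows one possibility, and it rests on assumptions that are not in the text: the threshold is taken as the coherence level at which proportion correct first reaches 75% (midway between the 50% chance level of a two-interval task and perfect performance), found by linear interpolation between adjacent tested levels. The function name and the example counts are hypothetical.

```python
def threshold_from_fixed_levels(levels, n_correct, n_trials=20, criterion=0.75):
    """Estimate a coherence threshold from fixed-level two-interval data.

    levels     -- tested coherence levels in percent (e.g. 10, 20, ..., 100)
    n_correct  -- number of correct responses at each level
    n_trials   -- trials per level (20 in the procedure described above)
    criterion  -- ASSUMED proportion correct that defines threshold

    Returns the interpolated level at which proportion correct first
    reaches the criterion, or None if it is never reached.
    """
    previous = None  # (level, proportion correct) of the level just below
    for level, k in sorted(zip(levels, n_correct)):
        p = k / n_trials
        if p >= criterion:
            if previous is None:
                # Already at or above criterion at the lowest tested level.
                return float(level)
            prev_level, prev_p = previous
            frac = (criterion - prev_p) / (p - prev_p)
            return prev_level + frac * (level - prev_level)
        previous = (level, p)
    return None


if __name__ == "__main__":
    # Made-up counts for illustration; not data from the study.
    levels = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
    n_correct = [10, 10, 11, 12, 14, 16, 18, 19, 20, 20]
    print(threshold_from_fixed_levels(levels, n_correct))
```

Fitting a psychometric function (for example a cumulative Gaussian or Weibull) would be a common alternative to interpolation; the interpolation here is chosen only to keep the sketch short.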
The stimuli and procedure in the AV coherence training were identical to those used in the threshold test, but with an adaptive three-down one-up rule to determine the coherence level of the stimulus in the next trial, as the goal was to require participants to work near threshold during the training session. In the first training session, the stimuli in the first trial were 100% coherent and 100% independent. For the first six reversals coherence was decreased in 10% steps, followed by 5% steps for the following six reversals and 2.5% steps for the remainder. The procedure was terminated at 18 reversals unless a maximum of 150 trials was reached first. For the 2nd to 5th training sessions, the first "coherent" stimulus was generated with the average coherence level of the last ten reversals in the previous session. Each training session lasted less than 40 minutes. Feedback was provided on every trial.

Stimuli for AM rate discrimination task: Two temporally coherent AV stimuli (duration 5 s) were