How Does Having a Good Ear Promote Instructed Second Language Pronunciation Development? Roles of Domain-General Auditory Processing in Choral Repetition Training

Growing evidence suggests that auditory processing ability may be a crucial determinant of language learning, including adult second language (L2) speech learning. The current study tested 47 Chinese English-as-a-Foreign-Language students to examine the extent to which two types of auditory processing, i.e., perceptual acuity and audio-motor integration, related to improvements in the comprehensibility and nativelikeness of L2 speech following two weeks of choral repetition training (i.e., shadowing). All participants' pronunciation became significantly more comprehensible over time, and the degree of improvement in the nativelikeness of pronunciation was tied to the ability to remember and reproduce sounds (i.e., audio-motor integration). The findings suggest that robust audio-motor integration may play a key role in the acquisition of advanced-level L2 pronunciation proficiency (i.e., comprehensible and nativelike speech). doi: 10.1002/tesq.3120

TESOL QUARTERLY Vol. 0, No. 0, 0000. © 2022 The Authors. TESOL Quarterly published by Wiley Periodicals LLC on behalf of TESOL International Association. This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.

Individuals differ widely in terms of how they process domain-general acoustic information such as pitch, duration, and spectral patterns (Kidd, Watson, & Gygi, 2007). On a broader level, this perceptual ability, collectively termed domain-general auditory processing, comprises two different constructs: (a) the extent to which listeners can hear very subtle acoustic details of sounds (i.e., auditory acuity) and (b) how well they can convert perceived information into motor action (i.e., audio-motor integration).
An emerging paradigm in the field of second language (L2) acquisition suggests that these differences in auditory processing can modulate adult language learning trajectories: individuals with more precise auditory processing abilities may be better able to utilize every input opportunity, resulting in greater gains in the long run (i.e., the auditory precision hypothesis; Mueller, Friederici, & Männel, 2012; Kachlicka, Saito, & Tierney, 2019; for a comprehensive overview, see Saito & Tierney, forthcoming). Following the auditory account of L2 acquisition, prior cross-sectional research has shown that participants' auditory processing abilities are tied to the outcomes of successful L2 speech learning in both naturalistic (e.g., Saito, Kachlicka, Sun, & Tierney, 2020) and classroom settings (e.g., Saito, Suzukida, Tran, & Tierney, 2021). In the context of 47 Chinese English-as-a-Foreign-Language (EFL) learners, the current longitudinal study tested whether, to what degree, and how different constructs of auditory processing ability (acuity, integration) could predict learners' multi-dimensional improvement in L2 pronunciation (comprehensibility, nativelikeness) following two weeks of choral repetition training (i.e., shadowing).


BACKGROUND

Assessing and Teaching L2 Pronunciation
Over the past 50 years, much attention has been given to devising optimal methods for assessing and teaching L2 pronunciation (i.e., the accurate and fluent articulation of new vocalic and consonantal sounds with adequate and varied stress and intonation patterns). While many teachers, students, and textbook developers have emphasized the attainment of nativelike pronunciation skills as an ideal goal (e.g., Derwing, 2003 for Canadian ESL classrooms), research has convincingly shown that adult L2 pronunciation is generally foreign-accented due to the influence of fully established L1 phonetic systems (see Flege & Bohn, 2021 for a theoretical account of the complexities underlying L1-L2 interactions). Therefore, many scholars have argued that what matters most for communicative success is the degree to which an L2 user's speech is comprehensible, intelligible, and communicatively adequate (Levis, 2018).
To date, scholars have conceptualized, discussed, and operationalized L2 pronunciation proficiency from two different angles: comprehensibility and accentedness.1 Comprehensibility concerns listeners' ease of understanding, while accentedness relates to phonological nativelikeness (Derwing & Munro, 2013). Both constructs are typically measured using listeners' intuitive judgments of L2 speech. During L2 comprehensibility judgments, listeners have been shown to attend to a variety of dimensions of L2 speech, including phonological accuracy, speaking fluency (Trofimovich & Isaacs, 2012), varied prosody (Kang, Rubin, & Pickering, 2010), and the varied, sophisticated, and contextually appropriate use of lexicogrammar (Appel, Trofimovich, Saito, Isaacs, & Webb, 2019). In other words, when rating L2 speech, listeners aim to collect as much linguistic information as possible to grasp an overall picture of what the speaker intends to convey (Saito, Trofimovich, & Isaacs, 2017). When it comes to L2 accentedness judgments, however, it has been shown that listeners preferentially attend to a speaker's degree of phonetic refinement, especially at the segmental level, rather than factors related to understanding speech content (Trofimovich & Isaacs, 2012).
Research in classroom (Nagle, 2018) and naturalistic settings (Derwing & Munro, 2013) has shown that L2 learners can substantially enhance their comprehensibility if they use the target language on a regular basis. In addition, it has been shown that L2 comprehensibility is particularly amenable to improvement via explicit instruction (Saito & Plonsky, 2019). On the other hand, it has been shown that L2 accentedness is resistant to change even if learners engage in explicit instruction (Derwing, Munro, Foote, Waugh, & Fleming, 2014). This is arguably because the primary correlate of foreign accentedness (segmental refinement) requires an extensive amount of L2 immersion experience from an early age (Flege & Bohn, 2021). Furthermore, the acquisition of nativelike L2 phonology appears to be limited to certain individuals with certain perceptual-cognitive abilities (e.g., He et al., 2013 for phonemic coding).
Research on L2 speech development suggests that comprehensibility, rather than accentedness, may be a more ecologically valid goal for most adult L2 learners (Isaacs, Trofimovich, & Foote, 2018). While no learner should be discouraged from attaining nativelike pronunciation proficiency, it is important that teachers inform students that (a) reducing one's foreign accent is a relatively difficult task after puberty; (b) even foreign-accented speech can be highly comprehensible; and (c) such comprehensibility can be enhanced via practice (for a meta-analysis, see Saito, 2021). In the current paper, advanced L2 speech proficiency is defined as pronunciation that is not only comprehensible but also nativelike.2

1 "Pronunciation proficiency" is a very difficult phenomenon to define. Following Saito and Plonsky's (2019) model of instructed L2 pronunciation proficiency, listener ratings of comprehensibility and accentedness were used as one index of global L2 pronunciation proficiency in the current manuscript. To support this, Isaacs, Trofimovich, Yu, and Muñoz Chereau (2015) showed that L2 speakers' speaking proficiency scores (measured via the IELTS Speaking Scale) were significantly correlated with the comprehensibility and phonological nativelikeness of their spontaneous speech (r = .509 and .585, respectively).

Shadowing, Tracking, and Choral Repetition Training
While various teaching approaches have been applied to L2 pronunciation, growing attention has been given to the pedagogical potential of tracking and shadowing activities (for a comprehensive overview, Hamada, 2018). According to Celce-Murcia, Brinton, and Goodwin (2010), tracking is where learners listen to native speakers either face-to-face or remotely on television, radio, or audiotape while following a transcript or subtitles, and simultaneously reproduce what they hear. Shadowing is similar to tracking in that learners listen and repeat what they hear with a slight delay and can pause the recording at times if necessary.
The common element between tracking and shadowing is the choral repetition of model speech. While this concept is reminiscent of the audio-lingual training methods of years past, it remains popular in many foreign language classrooms all over the world (e.g., Saito & van Poeteren, 2012 for the results of experienced EFL teachers in Japan) and is an integral component of digital speaking training materials (e.g., Rosetta Stone; see Lord, 2015). The primary objective of the method is to help L2 learners increase their control over what they have already heard, learned, and remembered.
2 In the current investigation, the comprehensibility and accentedness aspects of L2 pronunciation proficiency were considered. Whereas such scores are based on listeners' subjective ratings, Derwing and Munro have argued that what is ultimately important for communicative success is intelligibility, i.e., interlocutors' actual understanding of the intended message. In the current investigation, we do not intend to introduce or discuss the literature on intelligibility, because the methods and interpretations of intelligibility have differed substantially among scholars. For example, a wide range of methods have been used, such as transcription, comprehension questions, scalar ratings, and reaction time instruments (for a comprehensive review of methodological fuzziness in L2 intelligibility research, see Isaacs, 2008). Thus, we decided not to include the notion of intelligibility in the literature review, research design, and future directions sections, because there is no methodological consensus on intelligibility in the field of L2 pronunciation research, and intelligibility was not measured in the current study. More research is needed to further explore the mechanisms underlying intelligibility and to develop more empirically robust methods.

There are several reasons why tracking, shadowing, and repetition activities are believed to facilitate L2 pronunciation development. First, these activities provide learners with additional input opportunities that learners otherwise lack in foreign language classrooms (Muñoz, 2014). Second, learners are encouraged to notice and fill in the gap between their current proficiency levels and targetlike performance by comparing their output to the input they hear; this alignment process is a key factor in successful L2 speech learning (Trofimovich, Isaacs, Kennedy, Saito, & Crowther, 2016).
Third, learners can focus on the imitation of nativelike pronunciation forms, which is considered an important skill for successful L2 speech learning (Flege, Munro, & MacKay, 1995). Fourth, systematic repetition activities can reduce the frequency of dysfluencies and promote the automatization of L2 speech production (Suzuki, 2021). Considerable attention has been given to examining how and whether shadowing and tracking can be used to improve various dimensions of L2 pronunciation proficiency, such as segmental accuracy (Hamada, 2018), word and sentence stress (Martinsen, Montgomery, & Willardson, 2017), intonation (Hsieh, Dong, & Wang, 2013), intelligibility (Martinsen et al., 2017), and comprehensibility (Hamada, 2018). However, very few of these studies have examined L2 pronunciation development using spontaneous speaking tasks (Martinsen et al., 2017).

Individual Differences in Instructed L2 Speech Learning
Studies conducted in the classroom setting have pointed to several different types of explicit instruction that can help improve L2 pronunciation proficiency (e.g., Sakai & Moorman, 2018 for high variability perception training; Derwing et al., 2014 for prosody-based production training; Suzuki, 2021 for fluency training). Although there is a consensus that the provision of explicit instruction (including shadowing and tracking) facilitates adult L2 pronunciation learning, the effect sizes of explicit instruction are relatively small-to-medium (e.g., d = .078; Saito & Plonsky, 2019). Here, it is important to note that the outcomes of training are subject to a great deal of individual variation. Even if two participants spend the same amount of time practicing a target language on a daily basis in the same setting, their learning patterns may differ to a great degree (Doughty, 2019).
Though few in number, some studies have begun to examine which individual difference factors can help learners gain the most from explicit pronunciation instruction. Evidence collected to date suggests that more advanced L2 learners have greater levels of phonological awareness (Venkatagiri & Levis, 2007), phonemic coding ability (He et al., 2013), and working memory capacity (Darcy, Park, & Yang, 2015). In the current investigation, we introduce domain-general auditory processing as a potential determinant of language learning that may be germane to adult L2 speech learning (Saito & Tierney, forthcoming). We then hypothesize that those with more precise and more robust auditory processing abilities will benefit more from shadowing training.
In the following section, we will provide a detailed literature review on (a) what comprises auditory processing (i.e., auditory acuity for input encoding; audio-motor integration for linking auditory input with motor output), (b) what prior research has revealed about the roles of the two different auditory abilities in L2 speech learning (i.e., acuity for the attainment of naturalistic L2 speech learning; integration for the optimization of production-based L2 practice in classroom settings), and (c) why one component in particular (i.e., integration rather than acuity) could predict which individuals benefit most from shadowing training and achieve more advanced L2 pronunciation. Following the comprehensibility-accentedness model of L2 pronunciation proficiency (Derwing & Munro, 2013), such advanced L2 pronunciation proficiency is defined as comprehensible and nativelike in the current investigation.

Domain-General Auditory Processing in L1 and L2 Acquisition
Oral input is important during every phase of language learning. When receiving linguistic input, learners must encode time and frequency patterns (i.e., auditory acuity) and subsequently remember and integrate them into action (i.e., audio-motor integration). These auditory processing skills have been hypothesized to anchor various dimensions of language learning (i.e., Auditory Precision Hypothesis; Goswami, 2015). For instance, auditory processing has been shown to be relevant to a variety of linguistic skills, including detection of different speech contrasts (Werker & Tees, 1999), prosodic patterns (De Pijper & Sanderman, 1994), lexical boundaries (Cutler & Butterfield, 1992), syntactic constructions (Marslen-Wilson, Tyler, Warren, Grenier, & Lee, 1992), and discourse structures (Wang, Li, & Yang, 2014).
To date, there is evidence among L1 acquisition studies showing a significant association between individual differences in auditory processing and a range of language acquisition phenomena, such as literacy development (White-Schwoch et al., 2015) and reading competency (Boets et al., 2011). Based on these findings, some researchers have suggested that auditory processing ability could be used as a diagnostic tool for certain types of language delay or impairment, such as dyslexia (Hornickel & Kraus, 2013).
Building on the L1 acquisition literature, some scholars have suggested that auditory processing may play a significant role in adult L2 speech learning as well (Kachlicka et al., 2019; Mueller et al., 2012; Saito & Tierney, forthcoming). Unlike L1 acquisition, which is free of the influence of prior language learning experience, adult L2 learners filter input through their L1 acoustic representations. Because of this, they could face a tremendous amount of difficulty identifying and discriminating new acoustic dimensions that they do not usually use when detecting L1 phonological contrasts (e.g., F3 discrimination for the English /r/ and /l/ contrast for Japanese learners; Saito, 2013a).
On a broad level, auditory processing can be divided into two distinctive abilities, auditory acuity and audio-motor integration. Auditory acuity concerns one's ability to hear very subtle differences in the frequency and time characteristics of sounds. This ability has been measured via psychoacoustic tasks in which participants discriminate synthesized sounds that differ in one particular acoustic domain (e.g., formant, pitch, duration). Audio-motor integration relates to one's ability to link auditory input with motor action. This ability has been measured via reproduction tasks in which participants replicate sets of melodic and rhythmic sequences (for comprehensive overviews of the Auditory Precision Hypothesis-L2, see Saito, Suzukida, et al., 2021; Saito & Tierney, forthcoming).
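Psychoacoustic discrimination tasks of this kind typically estimate each listener's threshold with an adaptive procedure. The following sketch is purely illustrative and is not the instrument used in the study (whose exact algorithm and parameters are not reported here): it implements a generic 2-down/1-up staircase, in which the acoustic difference shrinks after two consecutive correct responses and grows after each error, and estimates the threshold from the final reversal points. The `simulated_judge` function is a made-up stand-in for a real listener.

```python
import random

def staircase_threshold(judge, start_delta=50.0, step=2.0, n_reversals=8):
    """Run a 2-down/1-up adaptive staircase and return the estimated
    discrimination threshold (mean difference at the final reversals)."""
    delta = start_delta      # current acoustic difference (e.g., Hz)
    correct_streak = 0
    direction = None         # last movement: 'down' or 'up'
    reversals = []
    while len(reversals) < n_reversals:
        if judge(delta):                     # trial answered correctly
            correct_streak += 1
            if correct_streak == 2:          # two in a row: make it harder
                correct_streak = 0
                if direction == 'up':
                    reversals.append(delta)  # direction change = reversal
                direction = 'down'
                delta = max(delta / step, 0.1)
        else:                                # any error: make it easier
            correct_streak = 0
            if direction == 'down':
                reversals.append(delta)
            direction = 'up'
            delta *= step
    last = reversals[-6:]
    return sum(last) / len(last)

# Made-up listener: near-perfect above a "true" 5 Hz threshold, at chance below
def simulated_judge(delta, true_threshold=5.0):
    p_correct = 0.99 if delta >= true_threshold else 0.5
    return random.random() < p_correct

random.seed(1)
print(round(staircase_threshold(simulated_judge), 2))
```

Replacing the frequency difference with a formant, pitch, or duration difference yields thresholds for the corresponding acoustic dimension; the 2-down/1-up rule converges on roughly the 71%-correct point of the listener's psychometric function.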
Recent research suggests that these two different types of auditory processing abilities (auditory acuity and audio-motor integration) can uniquely predict successful L2 speech learning in immersion and classroom contexts. In immersion settings, advanced L2 speakers with extensive length-of-residence profiles were shown to also have precise auditory acuity (Zheng, Saito, & Tierney, 2021) and audio-motor integration abilities (Kachlicka et al., 2019; Sun, Saito, & Tierney, 2021). By comparison, research conducted in the foreign language classroom setting has shown that L2 speakers with relatively advanced L2 speech profiles are more likely to have higher audio-motor integration abilities (Saito, Suzukida, et al., 2021), but not auditory acuity (cf. Sun et al., 2021). This is arguably because (a) they have limited exposure to communicatively authentic, aural input; (b) most of the input they receive is delivered by teachers who are nonnative speakers; and (c) their L2 use is restricted to only a few hours of traditional production-based instruction per week (e.g., grammar translation, audio-lingual method) (Muñoz, 2014; Saito, Suzukida, et al., 2021).
In summary, whereas auditory processing matters for L2 speech learning overall, its two subcomponents (acuity and integration) may differentially facilitate the attainment of advanced L2 speech proficiency. Acuity is a crucial variable for high-level L2 speech proficiency in naturalistic settings, where learners can access many communicatively authentic input opportunities: more precise acuity helps learners encode very subtle details of acoustic signals and refine the quality of their own acoustic representations. By contrast, integration can serve as a key factor in successful L2 speech learning especially in classroom settings, where learners lack sufficient input but are encouraged to produce motor output based on a limited amount of input. Robust integration skills could assist with the rapid capture of broad acoustic information and its prompt conversion into motor action.
It is noteworthy that most studies carried out to date have been cross-sectional in nature. These studies have yet to explore how the relationship between auditory and L2 speech abilities changes over time. To our knowledge, very few studies have examined to what degree auditory perception skills can help L2 learners process input more effectively and efficiently over time. There is some emerging evidence that L2 learners with more precise acuity abilities benefit more from perception-based training (e.g., Qin, Zhang, & Wang, 2021) and that those with more robust integration abilities demonstrate larger gains from production-based training (Li & DeKeyser, 2017). However, no studies have examined the relative weights of acuity and integration in L2 speech learning. The current study took an exploratory approach toward examining how two different auditory processing skills, acuity vs. integration, could differentially predict L2 pronunciation learning gains in the context of traditional repetition-based pronunciation training (i.e., shadowing).

Current Study, Research Questions, and Predictions
The current study set out to provide longitudinal evidence regarding the relationship between auditory processing and L2 speech learning. Here, we followed the definition of "longitudinal data" widely used in Applied Psychology, which refers to data collected via "repeated measures of (at least twice) the same variables gathered from the same study participants over time" (Dormann & Guthier, 2019, p. 158). Some scholars have attempted to define what characterizes longitudinal research in the field of L2 acquisition, teaching, and pronunciation. Whereas a cross-sectional design focuses on L2 learners' linguistic behaviors at a single data collection point, longitudinal research focuses on how their language develops over time as a function of a range of independent variables (e.g., training and immersion experience; for a comprehensive review, see Nagle, 2021). Following the guidelines initially set by Ortega and Iberri-Shea (2005) and recently extended by Saito, Suzuki, Oyama, and Akiyama (2021), the study could be considered "longitudinal" in nature, as it meets the three crucial conditions of such a research design: (1) multiple sessions (the participants participated in 12 training sessions); (2) multiple data collection points (the participants' speech development was measured via pre- and post-tests); and (3) multiple types of analyses (the outcome measures comprised different types of elicitation tasks and listener judgments). The study was guided by two research questions:

1. The first question examined the extent to which the provision of choral repetition training (i.e., tracking and shadowing) could facilitate improvement on two different dimensions of L2 pronunciation proficiency: comprehensibility and accentedness.

2. The second question examined the extent to which auditory processing (auditory acuity vs. audio-motor integration) was associated with enhanced comprehensibility and reduced accentedness.
As for RQ1, we predicted that tracking and shadowing would lead to significant improvements in both comprehensibility and accentedness, as shown in Hamada (2018). As for RQ2, we predicted that those with greater integration (rather than acuity) ability would realize the most benefit. This is because these learners are assumed to be more capable of incorporating auditory information into motor action, a function that plays a vital role in accessing a new language more smoothly, quickly, and effortlessly. Moreover, shadowing requires listeners to rapidly encode auditory patterns extending across multiple segments and formulate the appropriate motor sequence for reproducing these patterns, a process that is directly targeted by the audio-motor integration tests (Saito, Suzukida, et al., 2021). Finally, it was predicted that the aptitude-acquisition link would be observed most clearly for the most difficult dimension of L2 speech learning, i.e., participants' accentedness reduction at the spontaneous speech level.

METHOD

Participants

L2 learners. A total of 47 Chinese high school EFL learners (22 males and 25 females) participated in the study. All of them were third-year students between 17 and 18 years of age. The high school is located in a suburban district of Chengdu, Sichuan. All the participants engaged in several hours of EFL instruction per week taught by Chinese EFL teachers. The content of the EFL lessons was based on a mixture of grammar translation, choral repetition, and meaning-oriented speaking and listening activities. An electronic flyer was disseminated at the high school, and interested participants contacted the investigator and participated in the project as an extracurricular EFL activity.
Although they had received an extensive amount of EFL instruction prior to participation in the study (e.g., 6+ years), none of them had received any formal English pronunciation training or engaged in immersion experience. The participants were divided into an Experimental group (n = 37) and a Control group (n = 10). Because the main focus of the study lay in the analysis of individual differences, the Experimental group was much larger than the Control group. All participants reported having normal hearing.

Raters.
A total of five linguistically trained Chinese coders (1 male, 4 females; ages 23-27, M = 24.8) were recruited in London to act as speech raters. All of them were graduate students in applied linguistics and had extensive knowledge of, and experience with, L2 speech analysis. All raters reported relatively high IELTS scores (M = 7.5, SD = 0.7), which classifies them as Advanced Users of English (C1, C2) according to CEFR benchmarks (Council of Europe, 2001). Following the evidence that highly advanced L2 and native-speaking listeners make similar comprehensibility and accentedness judgments (Derwing & Munro, 2013), we did not consider the rater backgrounds (linguistically trained Chinese listeners) a confounding factor, and we made no attempt to recruit native-speaking listeners with a view to comparing L2 and L1 listener ratings. Our methodological justification corresponds to the detailed analyses of L1 and L2 raters' similar comprehensibility and accentedness judgments in Saito's (2021) research synthesis. We obtained relatively high inter-rater reliability (Cronbach's α = .88 for comprehensibility; α = .85 for accentedness).
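For reference, Cronbach's α of the kind reported above can be computed from a samples-by-raters score table. The sketch below is a minimal, self-contained illustration; the rating data in it are invented, not the study's.

```python
def cronbach_alpha(ratings):
    """Cronbach's alpha for a list of rows (one per speech sample),
    each row holding one score per rater."""
    k = len(ratings[0])                      # number of raters
    def var(xs):                             # sample variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    # variance of each rater's scores across samples
    rater_vars = [var([row[j] for row in ratings]) for j in range(k)]
    # variance of the per-sample score totals
    total_var = var([sum(row) for row in ratings])
    return (k / (k - 1)) * (1 - sum(rater_vars) / total_var)

# Invented 9-point-scale comprehensibility ratings: 4 samples x 3 raters
scores = [
    [7, 8, 7],
    [4, 4, 5],
    [6, 7, 6],
    [2, 3, 3],
]
print(round(cronbach_alpha(scores), 2))  # -> 0.98
```

When all raters agree perfectly the statistic reaches 1.0; values around .85-.90, as reported above, indicate that the five raters ranked the speech samples very consistently.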
Design. Figure 1 visually summarizes the design of the study. All participants completed the pre-tests in Week 1. In Weeks 2 and 3, participants in the experimental group took the auditory processing tests (acuity, integration) and participated in 12 sessions of shadowing training, while participants in the control group participated in 12 sessions of vocabulary, grammar, and writing practice. Finally, all participants completed immediate post-tests in Week 4. Because the same materials were used for the pre- and post-tests, the inclusion of the control group was crucial to control for any test-retest effects. In this way, it was possible to identify whether the gains from L2 speech training were linked to individual differences in acuity and integration abilities.

Shadowing Training (Experimental Group)
As a part of their regular high school EFL activities, participants in the experimental group completed a total of 12 shadowing training sessions over the course of two weeks (30 minutes × 12 sessions = 6 hours). In each session, the participants practiced shadowing using a popular application called English Fun Dubbing (EFD), which was downloaded onto iPads borrowed from the school. The EFD app presents short video segments along with line-by-line transcripts of the content. The app allows users to watch the videos while reading the scripts, and then imitate and record what they hear. To avoid any confusion about the training procedure, all instructions were delivered in Chinese by the researcher. The training sessions were operationalized as follows:

1. Dubbing materials: A total of three videos were selected from the CBeebies Bedtime Story series produced by the BBC: Follow the Swallow, Mr Big, and What Friends Do Best (for all scripts used in the current study, see Supporting Information-A). The videos were chosen to conform to the nature of the pre-/post-tests (i.e., spontaneous picture description), where participants were asked to describe a four-frame picture with a consistent storyline (for more details, see the Outcome Measures section below). All three videos were practiced during each session.

2. Instructional procedure: All the participants in the experimental group did the dubbing activities together in a single classroom at the high school. At each session, their EFL teacher and the first author instructed, guided, and monitored the participants. The participants completed the assigned shadowing tasks on their iPads using headphones. Whenever the participants had questions about the procedure, the EFL teacher and the investigator provided individual support. To ensure that each participant spent an equal amount of time on the assigned task, they were allowed to engage in the shadowing activities only during the training session. The content of shadowing was selected by the investigator for each session so that all the participants worked on the same materials. To reduce the impact of peers' noise on their performance and recordings, the participants were spread across the classroom and along the corridor. We acknowledge that background noise could not be completely avoided due to space constraints; however, efforts were made to minimize it, and as a result, participants' shadowing practice was clearly recorded on their own devices. Each participant's performance was checked by the investigator. Participants were first instructed to watch the videos while reading the scripts and to repeat what they were hearing. During each video, the participants were able to press a button labeled "Start Dubbing" to begin recording their shadowing on a sentence-by-sentence basis. Participants were allowed to listen to the original audio, compare it with their shadowing performance, and re-dub the sentences as many times as they wanted.

3. Publish and share: After dubbing the entire video, participants were able to preview their recordings and return to the dubbing interface to re-dub any sentences they desired. They then saved their dubbing recordings and shared them with their classmates via the app. No feedback was provided on any of the recordings. The participants were explicitly asked to practice the shadowing activities via EFD as much as possible within each training session (30 minutes). All the participants' recordings were saved by the investigator.
This form of shadowing training is similar to the treatment provided in many of the previous studies reviewed earlier (e.g., Hamada, 2018). However, the affordances of mobile technology allowed the participants to practice the speech shadowing at their own pace; that is, they could repeat each sentence as many times as they wanted to. Although the training was tailored to individuals' proficiency levels, the participants shadowed all the sentences in each video at least three times.

Lexicogrammar Training (Control Group)
Participants in the control group spent a similar amount of time (30 minutes × 12 sessions) practicing various aspects of English (e.g., vocabulary, grammar, writing) with the researcher over a period of two weeks. Most of the materials comprised follow-up activities based on what they had practiced in their weekly EFL lessons. No speaking practice or training was provided during these sessions. The purpose of the control group was to reveal the presence or absence of any test-retest effects, as the same materials were used in the pre- and post-tests.

Speech Outcome Measures
Pre/post-test materials. Both controlled and spontaneous speaking tasks were used to elicit L2 speech performance. For the controlled speaking task, participants were asked to read, without any preparation time, a total of six sentences that were randomly taken from the tracking and shadowing training materials:
1. It was covered in white blossom (Follow a Swallow).
2. The jumpy dolphin swam and leapt and dived (Follow a Swallow).
3. He was so big that anywhere he went, all everyone saw was someone big and scary (Mr Big).
4. It looked all alone, just like him, so he bought it and took it home (Mr Big).
5. Everything looked much the same as it had done a week ago, except Winston, who looked very miserable (What Friends Do Best).
6. Some of the pieces were very tiny and difficult to find (What Friends Do Best).
For the spontaneous speaking task, participants were asked to describe a four-frame picture story under time pressure. Two different versions of the picture story (Versions A & B) were chosen from the EIKEN English Test (Pre-1 Level) (EIKEN, 2016) and were counterbalanced across group (experimental/control) and time (pre-/post-test) to avoid any confounding effects from using different topics. Version A was about an elderly couple who lived far away from the nearest supermarket, while Version B was about a girl who wanted to buy a smartphone. Students were given one minute to prepare their speech and two minutes to narrate the story. To avoid false starts, the first sentence of the narration task was provided for participants (see Supporting Information-B).
Procedure. All speech samples were recorded in a quiet office in the school using the audio recorder on an SM-G9750 mobile device, set to a 48 kHz sampling rate. To reduce rater fatigue, the first 30 seconds of the speech samples were cut and saved as MP3 files. In total, 188 speech samples (47 participants × 2 tasks × 2 testing points) were collected for rating.

Comprehensibility and Accentedness Judgments
Procedure. All speech samples were rated by five advanced L2 English users (Derwing & Munro, 2013). Due to the ongoing pandemic, the speech samples and rating guidelines were provided to the raters via an online cloud system. In addition, all rating sessions took place individually using a video-conferencing tool. First, the raters were given detailed explanations of comprehensibility (ease of understanding) and accentedness (linguistic nativelikeness) to ensure that they fully understood each construct. They then assessed the speech samples using a 9-point scale for comprehensibility (1 = difficult to understand, 9 = easy to understand) and accentedness (1 = heavily accented, 9 = no accent) on their own computers. The controlled speech samples were assessed before the spontaneous speech samples.
Reliability. Relatively high Cronbach's alpha levels were obtained for the raters' comprehensibility (α = .88) and accentedness (α = .85) scores. The raters' scores were averaged for each construct for further analysis.
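For illustration, the inter-rater consistency estimate reported above can be reproduced with a few lines of code. The sketch below is a hypothetical Python reconstruction, not the study's actual analysis script; the function name and the samples-by-raters data layout are our own assumptions.

```python
import statistics as st

def cronbach_alpha(scores):
    """Cronbach's alpha for a score matrix: rows = speech samples,
    columns = raters. Uses sample variances throughout."""
    k = len(scores[0])                           # number of raters
    columns = list(zip(*scores))                 # one tuple of scores per rater
    item_vars = [st.variance(col) for col in columns]
    total_var = st.variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)
```

With perfectly consistent raters (every column identical), the function returns exactly 1; values around .85-.90, as reported above, indicate strong but imperfect agreement.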

Auditory Processing Measures
Two types of auditory processing tests were used to assess auditory acuity and audio-motor integration: discrimination tests and reproduction tests (Moore, 2012). These were delivered via Gorilla, an online platform for psychology experiments (Anwyl-Irvine, Massonnié, Flitton, Kirkham, & Evershed, 2020). In the first week, participants in the experimental group (n = 37) were invited to take the auditory processing tests individually using a laptop and a set of earphones in a quiet room. They first received detailed instructions about the tests before completing them in the following sequence: rhythm reproduction (integration), melody reproduction (integration), formant discrimination (acuity), pitch discrimination (acuity), and duration discrimination (acuity).
• Rhythm Reproduction: Ten rhythmic patterns (3.2 seconds per sample) were created and presented to the participants. These rhythms were taken from the rhythmic patterns used in Povel and Essens (1985). Each rhythm consisted of a series of 16 200-ms segments. In segments containing a drum hit, the first 150 ms of the segment contained a conga drum hit obtained from freesound.org; in segments containing a rest, no sound was present throughout the segment. In each trial, after listening to the same stimulus three times, participants were asked to repeat the rhythmic sequence they had just heard by pressing the space bar. Their spacebar presses were quantized by changing the inter-press times to the nearest member of the interval set (200, 400, 600, 800 ms). The response accuracy ratio was then calculated in terms of whether the presence of hits or rests in each 200-ms segment matched that of the target stimulus.
• Melody Reproduction: A total of 10 melodies were prepared for the melody reproduction test, each containing a sequence of seven notes (300 ms per note). All notes were drawn from a scale of five six-harmonic complex tones (equal amplitude across harmonics) with fundamental frequencies of 220, 246.9, 277.2, 311.1, and 329.6 Hz, corresponding to the first five notes of the A major scale. The melodies were created as follows. First, they all began on the third tone (277.2 Hz) of the scale. Then, the next note was the closest note to the previous one on the scale, being either one note higher (311.1 Hz) or lower (246.9 Hz). This pattern was repeated until all seven notes had been randomly selected. As 220 and 329.6 Hz were the lower and upper limits of the melodies, respectively, once a tone reached either of these limits, the next note would be either one note closer to the center of the scale or the same as the previous one. Participants listened to the melodies (three times per melody) and were asked to reproduce them by clicking five buttons labeled from "5" to "1" (from the highest tone to the lowest), stretching in a line from the top ("5") to the bottom ("1") of the screen. The first seven button presses were recorded and compared to the original melody, and the mean accuracy ratio was calculated across all ten melodies.
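As a minimal illustration of the two reproduction tests above, the scoring rule for the rhythm test and the generation rule for the melodies might be sketched as follows. This is a hypothetical Python reconstruction under stated assumptions (the function names, the treatment of the first press as anchoring segment 0, and the default parameters are ours; the study's actual scripts are not available):

```python
import random

# --- Rhythm reproduction scoring (hypothetical reconstruction) ---

def quantize_presses(press_times_ms, grid=200, n_segments=16):
    """Snap inter-press intervals to the nearest member of the allowed
    interval set (200, 400, 600, 800 ms) and rebuild a binary hit/rest
    pattern over the 16 200-ms segments."""
    pattern = [0] * n_segments
    pattern[0] = 1                       # first press assumed to anchor segment 0
    t = 0                                # quantized onset of the current press
    for prev, cur in zip(press_times_ms, press_times_ms[1:]):
        interval = cur - prev
        quantized = min((200, 400, 600, 800), key=lambda q: abs(interval - q))
        t += quantized
        if t // grid < n_segments:
            pattern[t // grid] = 1
    return pattern

def rhythm_accuracy(target, response):
    """Proportion of segments whose hit/rest status matches the target."""
    return sum(a == b for a, b in zip(target, response)) / len(target)

# --- Melody generation (hypothetical reconstruction) ---

SCALE_HZ = [220.0, 246.9, 277.2, 311.1, 329.6]   # first five notes, A major

def make_melody(n_notes=7, rng=random):
    """Random walk over the 5-note scale: start on the middle tone
    (277.2 Hz); each subsequent note moves one scale step up or down,
    except at the scale edges, where it steps back toward the centre
    or repeats the previous note."""
    idx = 2
    melody = [idx]
    for _ in range(n_notes - 1):
        if idx == 0:                          # lower limit (220 Hz)
            idx = rng.choice([0, 1])
        elif idx == len(SCALE_HZ) - 1:        # upper limit (329.6 Hz)
            idx = rng.choice([idx - 1, idx])
        else:
            idx += rng.choice([-1, 1])
        melody.append(idx)
    return [SCALE_HZ[i] for i in melody]
```

For example, presses at 0, 210, and 590 ms quantize to onsets at 0, 200, and 600 ms, i.e., hits in segments 0, 1, and 3.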
Auditory acuity (formant, pitch, duration). Following the test battery developed and detailed in Kachlicka et al. (2019), the discrimination tests consisted of three sub-tests designed to measure the ability to perceive the spectral and temporal details of a sound: formant, pitch, and duration discrimination thresholds. Formant and pitch thresholds were used to reveal participants' spectral acuity, while duration thresholds reflected their temporal acuity. For each sub-test, one hundred continuous synthesized stimuli were generated using custom MATLAB scripts; the stimuli differed either in formant frequency (F2 = 1500-1700 Hz), pitch (F0 = 330-360 Hz), or duration (250-500 ms) (see Supporting Information-C for a detailed summary). In each test trial, three stimuli were presented with an inter-stimulus interval of 0.5 s. Participants were asked to choose the tone they thought was different from the other two by clicking the corresponding button labeled "1" or "3" on the screen.
Using the adaptive threshold procedure proposed by Levitt (1971), the difficulty of the task (i.e., the size of the sound difference) varied according to participants' ongoing performance: the difficulty increased after three consecutive correct responses and decreased after an incorrect response. The test terminated after either 70 trials or eight reversals (i.e., a difficulty decrease followed by an increase, or vice versa). The final score, ranging from 0 to 100 points, was calculated by averaging the levels of the reversals, beginning from the third one. This score indicates the smallest difference between stimuli that participants were able to discriminate, i.e., the discrimination threshold; note that lower scores indicate better performance. For example, a score of 10 out of 100 indicates that the smallest difference a participant could hear was 20 Hz for the formant test, 3 Hz for the pitch test, and 25 ms for the duration test.
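The adaptive procedure described above is a standard three-down/one-up staircase. The following is a hypothetical Python sketch, not the study's actual implementation; the starting level, step size, and the `respond` callback are illustrative assumptions:

```python
def staircase(respond, start=50, step=5, max_trials=70, max_reversals=8):
    """Three-down/one-up adaptive staircase (after Levitt, 1971).
    `respond(level)` returns True for a correct discrimination at the
    given difficulty level (0-100; lower = smaller stimulus difference).
    Returns the threshold: the mean of the reversal levels, discarding
    the first two."""
    level = start
    correct_streak = 0
    last_direction = None        # 'down' = got harder, 'up' = got easier
    reversal_levels = []
    for _ in range(max_trials):
        if respond(level):
            correct_streak += 1
            if correct_streak == 3:          # three in a row -> make it harder
                correct_streak = 0
                if last_direction == 'up':   # direction change = reversal
                    reversal_levels.append(level)
                level = max(0, level - step)
                last_direction = 'down'
        else:
            correct_streak = 0               # any error -> make it easier
            if last_direction == 'down':
                reversal_levels.append(level)
            level = min(100, level + step)
            last_direction = 'up'
        if len(reversal_levels) >= max_reversals:
            break
    usable = reversal_levels[2:]
    return sum(usable) / len(usable) if usable else level
```

With a simulated listener who answers correctly whenever the difficulty level is at or above 20, the staircase oscillates around that level and the averaged reversals land near it.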

RESULTS
The first objective of the statistical analyses was to examine to what degree the experimental group improved as a result of shadowing training, via mean-based analyses (ANOVAs). The second was to explore the extent to which participants' improvements were related to their auditory processing profiles, via variation-based analyses (partial correlations).

Effects of Shadowing Training on Comprehensibility and Accentedness
The descriptive statistics are summarized in Table 1. According to the results of normality tests (Kolmogorov-Smirnov), the rating scores did not significantly differ from a normal distribution (p > .05). The results of independent t-tests on the pre-test scores (with the alpha level set at p < .05 and adjusted to p < .025 via Bonferroni correction) showed that the experimental and control groups' performance was comparable in most contexts: comprehensibility in the spontaneous task (t(45) = −0.149, p = .882, d = .055), accentedness in the controlled task (t(45) = 1.294, p = .202, d = .499), and accentedness in the spontaneous task (t(45) = 0.303, p = .763, d = .120). However, the experimental group's comprehensibility score was higher (more comprehensible) than the control group's in the controlled task, and this group difference was marginally significant (t(45) = 2.271, p = .028, d = .920). Individual performance by group is plotted in Figure 2. Because there was some form of pre-existing difference (one out of four contexts), the following statistical analyses focused on the extent to which the experimental and control groups improved their proficiency scores throughout the project (Time), not on how the two groups differed at the time of the post-tests (Group).
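For readers who wish to reproduce this kind of group-comparability check, the test statistic and effect size can be computed as follows (a hypothetical Python sketch using only the standard library; it omits the p-value lookup, which requires the t-distribution, and the function names are our own):

```python
import statistics as st
import math

def cohens_d(a, b):
    """Pooled-SD Cohen's d for two independent samples."""
    n1, n2 = len(a), len(b)
    pooled = math.sqrt(((n1 - 1) * st.stdev(a) ** 2 + (n2 - 1) * st.stdev(b) ** 2)
                       / (n1 + n2 - 2))
    return (st.mean(a) - st.mean(b)) / pooled

def t_statistic(a, b):
    """Student's t for two independent samples (equal variances assumed),
    with df = n1 + n2 - 2, here 45 for two groups totalling 47 learners."""
    n1, n2 = len(a), len(b)
    pooled_var = (((n1 - 1) * st.stdev(a) ** 2 + (n2 - 1) * st.stdev(b) ** 2)
                  / (n1 + n2 - 2))
    return (st.mean(a) - st.mean(b)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
```

The resulting p-value would then be compared against the Bonferroni-adjusted threshold (.05 / 2 = .025) rather than the nominal .05.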

Individual Differences in Auditory Processing and Instructional Effectiveness
Next, we examined the extent to which the experimental group's instructional gains were associated with their auditory processing profiles. The participants' raw scores are presented in Table 2. Auditory acuity scores indicate the minimum difference that the participants could hear for three types of basic acoustic information (formant, pitch, and duration). Audio-motor integration scores index how accurately (%) the participants could reproduce novel melodic and rhythmic patterns.
Following the procedures of the previous literature (e.g., Kachlicka et al., 2019), a composite auditory acuity score was computed by standardizing and averaging the formant, pitch, and duration discrimination scores. Likewise, a composite integration score was computed by standardizing and averaging the rhythm and melody reproduction scores. The results of normality tests (Kolmogorov-Smirnov) suggested that the scores did not significantly deviate from a normal distribution (p > .05). A Pearson correlation analysis found that the relationship between auditory acuity and integration was not significant (r = −.251, p = .134), which suggests that the tests tapped into two different aspects of auditory processing ability. Next, partial correlation analyses were carried out to analyze the relationship between participants' gain scores (post-test scores minus pre-test scores) and their auditory processing profiles while controlling for their pre-test scores. We did not compute simple correlations between the raw gain and auditory scores because such a relationship could have been influenced by participants' pre-test performance (e.g., those with low pre-test scores are likely to show larger gains because they have more room for improvement). Table 3 summarizes the correlation coefficients between participants' gain scores and auditory processing after their pre-test scores were factored out. The strength of the correlation coefficients was interpreted with reference to Plonsky and Oswald's (2014) field-specific benchmarks (r = .60 for large; r = .40 for medium; r = .25 for small). The results showed that gain scores for the most difficult aspect of L2 speech learning (i.e., foreign accent reduction in the spontaneous speech task) were significantly related to individual differences in audio-motor integration. The strength of the correlation was medium (r = .482).
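The composite-score and partial-correlation steps described above can be expressed compactly in code. The sketch below is a hypothetical Python reconstruction (standard library only; the actual analyses were presumably run in a statistics package), shown to clarify that the partial correlation is simply the Pearson correlation between residuals after regressing each variable on the pre-test covariate:

```python
import statistics as st
import math

def pearson_r(x, y):
    mx, my = st.mean(x), st.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def zscore(x):
    m, s = st.mean(x), st.stdev(x)
    return [(v - m) / s for v in x]

def composite(*measures):
    """Standardize each measure, then average across measures per participant."""
    zs = [zscore(m) for m in measures]
    return [st.mean(vals) for vals in zip(*zs)]

def residualize(y, covariate):
    """Residuals of y after simple linear regression on the covariate."""
    slope = pearson_r(y, covariate) * st.stdev(y) / st.stdev(covariate)
    intercept = st.mean(y) - slope * st.mean(covariate)
    return [v - (intercept + slope * c) for v, c in zip(y, covariate)]

def partial_corr(x, y, covariate):
    """First-order partial correlation: Pearson r between the residuals of
    x (e.g., gain scores) and y (e.g., an auditory composite) after each
    is regressed on the covariate (e.g., pre-test scores)."""
    return pearson_r(residualize(x, covariate), residualize(y, covariate))
```

This residual-based computation is algebraically identical to the textbook first-order partial-correlation formula.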
Interestingly, although the link between acuity and accentedness did not reach statistical significance (p = .062), the strength of the correlation coefficient was small-to-medium (r = −.315). No significant correlations were found for comprehensibility; the correlation coefficients were substantially small (r = |.048-.295|). These results indicate that participants could enhance comprehensibility regardless of their ability to integrate auditory patterns and motor sequences. The relationship between participants' changes in proficiency and their auditory profiles is plotted in Figure 3. Note that these gain and auditory scores are residual scores (after participants' pre-test performance was factored out); that is, participants' raw gain scores were centered after pre-existing differences in proficiency (i.e., pre-test scores) were statistically controlled for.
FIGURE 3. Partial correlations between training gains and auditory processing residuals with pre-test scores partialled out. Note. Gain scores (−1.0 to 1.5) were centered after participants' pre-existing differences in proficiency (i.e., pre-test scores) were statistically controlled for (i.e., residual scores).

DISCUSSION
In the context of 47 Chinese EFL learners, the current study examined the extent to which mobile-assisted repetition training (shadowing) interacted with auditory processing abilities in the improvement of L2 speech proficiency. Echoing the previous literature (Hamada, 2018), the results showed that shadowing training led to notable gains in comprehensibility (medium-to-large effects) and resulted in some reduction in foreign accentedness (medium effects). The size of the gains reported here can be considered promising, given that the effect of explicit instruction is generally small-to-medium, d = 0.78 (Saito & Plonsky, 2019), and that very few studies have found such a robust impact of instruction on both comprehensibility and accentedness (Saito, 2021). This is arguably ascribable to the fact that the participants in the current study were encouraged to practice shadowing activities via iPad regularly, intensively, and autonomously (Pegrum, 2014). The findings provide additional support for the pedagogical potential of mobile-assisted language learning, especially in EFL classrooms, where language exposure is limited in quantity (several hours per week) and quality (a lack of communicatively authentic input) (Muñoz, 2014).
As for the role of individual differences in instructed L2 speech learning, the results of the partial correlation analyses indicated that the degree of speech improvement was linked to certain auditory processing abilities. Specifically, a significant correlation was found between instructional gains and audio-motor integration ability for accentedness reduction. The findings provide additional longitudinal evidence for the auditory processing hypothesis, which posits that auditory precision serves as a bottleneck for language learning because it helps learners encode and remember sound characteristics, enabling them to make the most of the pedagogical opportunities afforded by each instance of language input (Kachlicka et al., 2019). Thus, higher-aptitude learners (i.e., those with more precise auditory acuity or more robust auditory-motor integration) can attain more advanced L2 proficiency than others even if they engage in the same amount of training (Doughty, 2019).
The results specifically showed that audio-motor integration, rather than auditory acuity, was related to the rate of learning success. This echoes existing cross-sectional evidence showing that advanced L2 learners, especially those in production-focused EFL classrooms, likely have greater audio-motor integration abilities (rather than auditory acuity) (Saito, Suzukida, et al., 2021). The significant role identified for audio-motor integration over perceptual acuity could be explained by the nature of the training. During the shadowing activities, participants were prompted to focus on repeating what they had heard in the video clips. However, this activity did not require participants to encode the perceptual details of auditory stimuli. Instead, this kind of repetition more closely corresponds to the process of audio-motor integration: Participants needed to track auditory patterns across several seconds, spanning multiple speech segments, and rapidly formulate the appropriate motor sequence for reproducing these patterns. This is exactly the skill tested by the rhythm and melody reproduction tests, which present participants with several seconds of frequency/temporal patterns and ask them to reproduce them. Participants who have difficulty remembering and reproducing acoustic patterns, whether those patterns are drawn from speech or music, may not be able to fully take advantage of shadowing training, either because they struggle to remember acoustic patterns or because they have difficulty translating those patterns into motor output. These reproduction difficulties would, in turn, result in speech output that differs from the intended speech model, and the comparison between the produced output and the speech target would not facilitate pronunciation learning.
It is interesting that the current study did not find any significant predictive power for auditory acuity in any context. This could be considered additional evidence that the importance of different auditory processing abilities (acuity vs. integration) varies in accordance with the type of instruction that L2 learners engage in. This is another aspect of individual differences that future studies need to investigate: Certain learning approaches will work better for some individuals, while others will benefit more from alternative approaches. The current findings, for example, suggest that individuals with strong auditory-motor integration skills benefit more from a production-based, shadowing approach, but that individuals with weaker auditory-motor integration might be better off with alternative types of training. By contrast, research has indicated that auditory acuity may be an important predictor of success in the context of perception-based training (e.g., Perrachione, Lee, Ha, & Wong, 2011 for high-variability phonetic training). To our knowledge, no empirical studies have investigated the differential effects of auditory acuity and audio-motor integration when L2 learners are engaged in different types of instruction within a single study.
Finally, it needs to be stressed that the aptitude-acquisition link identified here concerned the most difficult instance of L2 speech learning: foreign accentedness reduction at a spontaneous level. In line with Doughty's (2019) aptitude framework, we argue that aptitude is a necessary condition for advanced L2 speech acquisition, which entails successfully learning both easy and difficult L2 speech features. In the current investigation, instructional gains were greatest when participants' speech was elicited using controlled tasks and assessed for comprehensibility. However, aptitude (audio-motor integration) appeared to serve as a crucial deciding factor in nativelike L2 pronunciation attainment when it came to the ability to produce spontaneous speech. It is interesting to note that although the predictive power of auditory acuity failed to reach statistical significance in any context (p > .05), the relationship between participants' acuity scores and nativelike pronunciation was small-to-medium (r = −.315; Plonsky & Oswald, 2014). This indicates that different types of aptitude matter when outcome measures concern the relatively difficult aspects of L2 speech learning.

FUTURE DIRECTIONS
With a view toward future replication studies, we would like to raise a number of methodological issues that need to be addressed in the future.
• First and foremost, all the auditory test batteries for acuity (formant, pitch, & duration discrimination) and integration (rhythm & melody reproduction) have been coded in HTML/JavaScript so that everyone (researchers and practitioners alike) can evaluate different types of auditory processing using their own computer. A ZIP file together with a user manual will soon be shared on our in-house website (www.sla-speech-tools.com). For researchers, auditory processing can be adopted as an additional measure of perceptual-cognitive individual differences. For practitioners, students' auditory processing profiles can be used as one suggestion regarding what type of speech training they can most benefit from (acuity for perception-based training; integration for production-based training). For more discussion on the scholarly and pedagogical implications of auditory processing, see Saito and Tierney (forthcoming).
• The sustainability of the link between auditory processing and shadowing effectiveness should be measured not only via immediate but also via delayed post-tests.
• Notably, the sentences in the pre- and post-tests were extracted from the materials provided to the experimental group. This methodological feature is a limitation, as it may have inflated the experimental group's improvement. Future studies should explore the extent to which L2 learners can generalize what they have learned from training to novel sentence contexts (but see the results of the spontaneous production tasks in the Results section).
• The findings of the current study should be replicated with a larger sample size and an equal number of participants per group.
• Although the current study found a significant role for auditory processing in instructed L2 speech learning, we acknowledge that the behavioral tasks adopted in the current study inevitably tap into various aspects of cognitive ability beyond audio-motor integration and auditory acuity, e.g., attentional control and auditory short-term memory (McArthur & Bishop, 2005). In order to tease out the effects of auditory processing in L2 speech acquisition, future studies may need to assess and factor out participants' individual differences in executive functioning (cf. Saito et al., in press).
• The current study (together with precursor research such as Saito, Suzukida, et al., 2021) has suggested that auditory acuity and audio-motor integration represent two different kinds of auditory processing abilities. Extending the discussion above, we could argue that the discrimination and reproduction tests may differ not only in the role of motor output but also in the time scales of the patterns to be detected (the precise characteristics of single sounds versus the overall pattern across multiple sounds). This could explain why audio-motor integration predicted the rate of success in shadowing training, wherein participants were encouraged to track and repeat the broad gist of the overall pattern rather than to encode the precise details of single sounds.
• To further develop a more fine-grained framework of auditory processing, future studies should test how different types of auditory processing can be uniquely tied to L2 speech acquisition when learners receive different types of auditory input. For example, whereas previous studies have mostly examined the role of auditory processing in language-focused training (e.g., high-variability phonetic training and shadowing), few scholars have delved into how auditory processing can facilitate L2 speech acquisition when participants receive more communicatively oriented instruction (e.g., recasts; Lee & Lyster, 2016; Saito, 2013b). Such studies could reveal whether, to what degree, and how auditory processing (acuity vs. audio-motor integration) can be predictive of L2 speech learning in explicit and implicit modes.
• In the cognitive psychology literature, audio-motor integration ability has been linked to music-training experience (e.g., Tierney et al., 2017). Future studies should survey participants' backgrounds in music training and their impact on auditory processing and L2 speech learning (cf. Zheng et al., 2021).
• Given evidence that focused training can help improve auditory processing ability (Hayes, Warrier, Nicol, Zecker, & Kraus, 2003, for 35-40 hours of commercial auditory processing training programs), it would be intriguing to examine whether the effectiveness of instructed L2 speech learning can be boosted if pronunciation and auditory skills are trained simultaneously. It can be hypothesized that such a combined approach may help optimize the process of L2 speech learning and lead to the acquisition of advanced-level L2 speech proficiency (see Hayashi, 2019, for the role of working memory training in L2 grammar learning).
• The current study highlights the relationship between the participants' auditory processing and L2 speech learning on a broad level. However, it remains unknown precisely how auditory processing facilitates L2 speech learning at a fine-grained level. Although auditory integration abilities were linked to L2 accentedness in the current study, it remains unclear how auditory integration (rather than acuity) could help participants reduce their degree of accentedness. To remedy this weakness, future studies should look at L2 speech acquisition at a more fine-grained level, i.e., by examining specific dimensions of L2 speech acquisition (e.g., second and third formants and duration for English [r] and [l] by Japanese learners).