Dai, David Wei; (2022) Design and Validation of an L2-Chinese Interactional Competence Test. Doctoral thesis (Ph.D), University of Melbourne.

This dissertation documents the development and validation of a standardized large-scale language test of second language (L2) interactional competence (IC). More specifically, it takes L2 Chinese as the target language for an IC test that is delivered in a computer-mediated environment to enhance the practicality of IC assessment. Throughout the process of test development and validation, the researcher defined IC as a theoretically multidimensional construct, which encompasses a speaker’s ability to manage interaction along the sequential, emotional, moral, logical and categorial dimensions. From the assessment perspective, the researcher demonstrated that IC is at once theoretically multidimensional and psychometrically unidimensional, which corroborates the assessability of IC as a test construct. Adopting an argument-based approach to test validation, the researcher designed the test following a task-based needs analysis (TBNA), eliciting the perspectives of L2-Chinese speakers, first language (L1) Chinese language teachers, and L1-Chinese interactants on what L2-Chinese speakers struggle with most in interpersonal interaction in L2 Chinese. Findings from the TBNA informed the design of a nine-item IC test that targets test-takers’ ability to manage social actions that are disaffiliative in nature, that is, actions that can threaten social harmony (e.g., making a complaint about workplace fairness to your employer). The test is delivered on a mobile-phone application, covers three language use sub-domains (everyday life, work, and study), and includes three degrees of interactiveness in terms of task methods (1st pair part voice messaging, 2nd pair part voice messaging, and live video chat). The specification of the IC test construct and the development of the rating scale were based on everyday linguistic laypersons’ criteria of IC.
The use of everyday members’ criteria to define the test construct represents an attempt to democratize research by complementing the perspective of applied linguists with that of linguistic laypersons, who use the language on a daily basis and are the final arbiters of successful interaction. Thirty-six linguistic laypersons listened to and commented on 22 pilot test-takers’ performances on the test. Thematic analysis of the linguistic laypersons’ interview transcripts and written comments returned five indigenous categories that formed the five rating categories in an indigenous IC scale. When analysing pilot test-taker discourse in terms of IC, the researcher proposed the use of Sequential-Categorial Analysis, which combines Conversation Analysis and Membership Categorization Analysis to allow for the investigation of both the sequential and categorial aspects of interaction. The analysis of test-taker discourse through Sequential-Categorial Analysis helped the researcher to theorize the a-theoretical indigenous rating scale into a theorized IC scale, which has 1) disaffiliation control, 2) affiliation promotion, 3) morality, 4) reasoning, and 5) social role management as its five rating categories. The IC construct developed in this dissertation moves beyond constructs in existing IC scholarship as it illustrates the multidimensional nature of IC, encompassing not only the sequential dimension of interaction, but also the emotional, moral, logical and categorial dimensions. A total of 105 test-takers from 26 different countries participated in the main testing study, and their performances were rated in a fully crossed design by two raters using the theorized IC rating scale. Many-Facet Rasch Analysis of the rating results showed that item performance, rater reliability and rating scale functioning were satisfactory. Rasch Principal Component Analysis demonstrated the unidimensionality of IC as a test construct.
Correlational analyses of test-takers’ IC test performance and an external measure of proficiency revealed that proficiency has limited predictive strength for IC: r (104)= .42, p<0.05. Lower-proficiency L2 speakers could outperform higher-proficiency L2 speakers and even L1 speakers. This finding challenges the longstanding native-speakerism in language teaching and testing as it shows that L1 speakers are not the gold standard of IC. It also highlights that IC does not develop automatically as L2 speakers’ proficiency increases. Therefore, IC is an ability that needs to be taught and assessed separately from proficiency for L1 and L2 speakers alike. The extrapolative power of the IC test was ascertained through a self-assessment questionnaire and a peer-assessment questionnaire. Test-takers’ self-assessment and peer-assessment were correlated with their test performance, and the results showed that the IC test can reasonably predict test-takers’ IC in non-assessment, real-life settings. Test-takers’ attitudes towards the test, which were also elicited in the questionnaire, were favourable and supportive of the use of IC items in general speaking assessment. Although the role-play IC test in this dissertation is based on L2 Chinese, its findings are potentially applicable to other test tasks and target languages because the IC construct is highly theorized and is not specific to any context, language, or task type. The theorized IC rating scale embodies a holistic IC construct that goes beyond the mechanics of interaction and ventures into the assessment of speakers’ ability to manage affect, logic, morality and categorization in and for interaction. This IC model is theoretically robust as it is consistent with other more holistic models of interpersonal interaction such as Dell Hymes’s original conceptualization of communicative competence and Aristotle’s three artistic proofs.
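As a rough illustration of why a correlation of .42 is described above as limited predictive strength, the short sketch below converts the reported coefficient into shared variance and a t statistic. It uses only the two figures given in the abstract (r = .42 with 104 degrees of freedom, taken at face value); the variable names are illustrative and the calculation is not drawn from the dissertation itself.

```python
import math

# Figures as reported in the abstract: r(104) = .42.
r = 0.42
df = 104  # degrees of freedom as reported; assumed correct here

# Coefficient of determination: the share of variance in IC scores
# that is linearly associated with proficiency.
r_squared = r ** 2

# t statistic for testing H0: rho = 0, using t = r * sqrt(df / (1 - r^2)).
t_stat = r * math.sqrt(df / (1 - r_squared))

print(f"shared variance: {r_squared:.1%}")
print(f"t({df}) = {t_stat:.2f}")
```

On these figures, proficiency shares roughly 18% of its variance with IC test performance, so the relationship is statistically detectable while most of the variation in IC is left unaccounted for by proficiency.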
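Future research can adapt the current IC test and localize the theorized IC scale to see if findings from this study still hold when applied to other target languages, task types and assessment contexts. The use of computer-mediated communication platforms for test delivery in this dissertation increases the practicality, accessibility and affordability of IC assessment, especially during times when the COVID-19 pandemic made face-to-face assessment unfeasible. The computer-delivered nature of this IC test also allowed test-takers from a wide range of backgrounds to participate, a sample that differs from the typical participants in applied linguistics research, who tend to be affluent middle-class university students. This design helped to promote greater fairness, democracy and inclusivity in applied linguistics and language testing research.
Alternate abstract: This dissertation documents the author’s development and validation of an interactional competence (IC) test. The author first defined IC at the theoretical level as multidimensional, comprising the ability to manage the sequential, emotional, moral, logical and categorial dimensions of interaction. From a testing perspective, the author also demonstrated that IC is at once theoretically multidimensional and psychometrically unidimensional, which provides a direction for measuring IC. The IC test developed in this study is a computer-assisted test designed for L2-Chinese test-takers. Following Michael Kane’s argument-based validation framework, the author first conducted a task-based needs analysis to identify what L2-Chinese speakers find most difficult in interpersonal interaction in L2 Chinese. The needs analysis gathered three perspectives: those of L2-Chinese speakers, L1-Chinese language teachers, and L1-Chinese speakers who frequently interact with L2-Chinese speakers. Based on the results of the needs analysis, the author developed an IC test of nine items. The test is administered on a mobile-phone application and mainly measures test-takers’ ability to handle a range of disaffiliative social actions, with items covering three language use domains: everyday communication, workplace communication and academic conversation. Ordered by degree of interactiveness, the tasks are of three types: 1st pair part voice messaging, 2nd pair part voice messaging, and live video chat.
Next, the author designed the test construct specification and the rating scale on the basis of everyday linguistic laypersons’ criteria of IC. Since laypersons use the language on a daily basis and are the final arbiters of language ability, using their criteria to define the test construct complements the perspective of linguistics professionals with that of laypersons and promotes the democratization of applied linguistics research. The author collected written comments from 36 linguistic laypersons on the performances of the 22 test-takers who took part in the pilot, and interviewed these laypersons. Thematic analysis of the written comments and interview transcripts yielded five indigenous criteria, from which the author designed an indigenous IC scale with five rating categories. The author then qualitatively analysed the discourse data of the 22 pilot test-takers from the perspective of IC, and in the process defined and proposed a new research method: Sequential-Categorial Analysis. Sequential-Categorial Analysis combines Conversation Analysis and Membership Categorization Analysis to analyse interpersonal interaction and interpersonal relationships from both the sequential and the categorial angle. By applying Sequential-Categorial Analysis to test-taker discourse data, the author theorized the a-theoretical indigenous IC scale. The theorized IC scale has five rating categories: 1) disaffiliation control, 2) affiliation promotion, 3) morality, 4) reasoning, and 5) social role management. Whereas definitions of IC in the previous literature have mostly stayed at the sequential level, the IC test construct in this study is based not only on sequence but also on the emotional, moral, logical and categorial dimensions of interpersonal interaction. This highlights the theoretically multidimensional nature of IC.
Next, 105 test-takers from 26 countries took the operational IC test. The author engaged two raters, each of whom scored the performances of all test-takers using the theorized IC scale. Many-Facet Rasch Analysis of the ratings showed that item characteristics (e.g., difficulty and discrimination), rater reliability and rating scale functioning were all satisfactory. Rasch Principal Component Analysis demonstrated the psychometric unidimensionality of IC. Correlational analysis of test-takers’ IC and language proficiency showed that proficiency has very limited predictive power for IC: r (104)= .42, p<0.05. The study found that lower-proficiency L2 test-takers often displayed stronger IC than higher-proficiency L2 test-takers and L1 test-takers, which shows that native speakers are not the gold standard of IC; this result challenges the long-standing native-speakerism in language teaching and language testing. The results also confirm that IC does not increase in step with L2 speakers’ language proficiency. Therefore, the teaching and assessment of IC need to be separated from language proficiency, and IC is an ability that both L1 and L2 speakers need to learn and be assessed on. Finally, the author used a self-assessment questionnaire and a peer-assessment questionnaire to verify the extrapolative power of the IC test. Test-takers’ self-assessment and peer-assessment scores were correlated with their IC test scores, and the results showed that IC test scores can, to a certain extent, be extrapolated to test-takers’ IC performance in everyday life. The self-assessment questionnaire also collected test-takers’ attitudes towards the IC test; the results showed that test-takers were fairly positive and supportive, and hoped that future tests would include more assessment of IC. Although the role-play IC test developed in this study is based on L2 Chinese, the highly theorized nature of the IC construct means that the methods and findings of this study are broadly applicable to other test task types, target languages and testing contexts. In addition, the theorized IC scale proposed by the author breaks through the limits of previous definitions of IC, extending to the measurement of how a person achieves better interaction by managing emotion, logic, morality and categorization. The new definition of the IC construct in this study has much in common with Dell Hymes’s concept of communicative competence and Aristotle’s definition of the artistic proofs, which demonstrates the theoretical robustness of this IC construct. Future research can try to adapt the IC test and localize the theorized IC scale in order to examine the generalizability of these findings to other target languages, test tasks and testing contexts. Finally, the test’s computer-assisted mode of delivery improves its practicality, accessibility and affordability, an advantage that is all the more evident at present, when the COVID-19 pandemic makes on-site testing difficult to carry out. The computer-assisted design also gave test-takers from a variety of backgrounds the opportunity to take part in this study, so the sample recruited differs from the university-student samples of other applied linguistics research, who mostly come from the affluent middle class; in this sense, the study makes a contribution to advancing fairness, democracy and inclusivity in applied linguistics.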

Title: Design and Validation of an L2-Chinese Interactional Competence Test
