Calibrating Generative AI for Second Language Writing Assessment: Combining Statistical Validation with Prompt Design

Authors

Reza Farzi, Official Languages and Bilingualism Institute, University of Ottawa, Ottawa, Canada (rfarzi@uottawa.ca)

Keywords:

generative artificial intelligence, second language writing assessment, calibration, prompt design, inter-rater reliability, statistical validation

Abstract

Generative artificial intelligence (GenAI) is emerging as a powerful tool in second language writing assessment, offering the potential for rapid, consistent, and scalable evaluation. However, its value depends on whether its scoring reflects the nuanced judgments of experienced human raters. This study introduces the concept of calibration in the context of second language writing assessment, defined as the deliberate, iterative refinement of AI prompts, guided by statistical evidence, to align AI scoring with human evaluative reasoning. Sixty essays produced by 30 upper-intermediate learners of English were evaluated independently by an experienced human rater and by ChatGPT 3.5, using the English for Academic Purposes (EAP) Writing Assessment Rubric. Statistical analyses assessed inter-rater agreement, score consistency, and systematic bias. In the initial baseline stage, ChatGPT 3.5 tended to act as a strict marker, applying the rubric literally and assigning lower scores than the human rater. Across three calibration stages, which involved clarifying rubric descriptors, refining interpretive guidance, and incorporating representative scoring examples, the AI's scoring moved closer to the human benchmark: agreement improved from a Cohen's kappa of 0.52 to 0.89, correlation rose from .76 to .94, and the mean score difference narrowed from -2.45 to -0.95, the latter no longer statistically significant. Qualitative analysis showed a shift from a narrow emphasis on surface errors to a more balanced consideration of accuracy, organization, development, and communicative effectiveness. The results suggest that calibration offers a replicable, evidence-based approach to integrating generative AI into second language writing assessment, enhancing the fairness, validity, and reliability of AI-assisted evaluation.
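The abstract names three statistical checks used to compare AI and human scoring: inter-rater agreement (Cohen's kappa), score consistency (correlation), and systematic bias (a mean score difference tested for significance). The minimal Python sketch below shows how such a comparison can be run with standard libraries. The placeholder scores, the 20-point scale, and the use of quadratic-weighted kappa are illustrative assumptions; the abstract does not report the study's raw data or the exact kappa variant.

```python
# Minimal sketch of the statistical validation described in the abstract:
# agreement (Cohen's kappa), consistency (Pearson r), and systematic bias
# (paired t-test on AI-minus-human differences) for paired essay scores.
# All numbers below are hypothetical placeholders, not the study's data.

import numpy as np
from scipy import stats
from sklearn.metrics import cohen_kappa_score

def compare_raters(human, ai):
    """Report agreement, correlation, and bias statistics for paired
    human and AI scores assigned to the same essays."""
    human, ai = np.asarray(human), np.asarray(ai)

    # Agreement on rubric bands; quadratic weights give partial credit
    # for near-misses on an ordinal scale (an assumption here).
    kappa = cohen_kappa_score(human, ai, weights="quadratic")

    # Score consistency across essays.
    r, _ = stats.pearsonr(human, ai)

    # Systematic bias: is the AI consistently stricter or more lenient?
    _, bias_p = stats.ttest_rel(ai, human)
    mean_diff = float(np.mean(ai - human))

    return {"kappa": kappa, "pearson_r": r,
            "mean_diff": mean_diff, "bias_p": bias_p}

# Hypothetical scores on a 20-point rubric for a handful of essays;
# the AI column is deliberately stricter, as at the study's baseline.
human_scores = [14, 16, 12, 15, 17, 13, 18, 11]
ai_scores    = [12, 15, 10, 14, 16, 11, 17, 10]

print(compare_raters(human_scores, ai_scores))
```

Re-running these metrics after each prompt revision (clarified descriptors, refined interpretive guidance, added scoring examples) provides the evidence-based stopping rule the abstract implies: calibration continues until agreement and correlation reach acceptable levels and the mean difference is no longer statistically significant.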


Published

2024-12-10

Submitted

2024-09-06

Revised

2024-11-18

Accepted

2024-11-24

How to Cite

Farzi, R. (2024). Calibrating Generative AI for Second Language Writing Assessment: Combining Statistical Validation with Prompt Design. Assessment and Practice in Educational Sciences, 2(4), 1-13. https://www.journalapes.com/index.php/apes/article/view/91
