Calibrating Generative AI for Second Language Writing Assessment: Combining Statistical Validation with Prompt Design
Keywords:
generative artificial intelligence, second language writing assessment, calibration, prompt design, inter-rater reliability, statistical validation

Abstract
Generative artificial intelligence (GenAI) is emerging as a powerful tool in second language writing assessment, offering the potential for rapid, consistent, and scalable evaluation. However, its value depends on whether its scoring reflects the nuanced judgments of experienced human raters. This study introduces the concept of calibration in the context of second language writing assessment, defined as the deliberate and iterative refinement of AI prompts, guided by statistical evidence, to align AI scoring with human evaluative reasoning. Sixty essays produced by 30 upper-intermediate learners of English were evaluated independently by an experienced human rater and by ChatGPT 3.5, using the English for Academic Purposes (EAP) Writing Assessment Rubric. Statistical analyses assessed inter-rater agreement, score consistency, and systematic bias. In the initial baseline stage, ChatGPT 3.5 tended to act as a strict marker, applying the rubric literally and assigning lower scores than the human rater. Across three calibration stages, which included clarifying rubric descriptors, refining interpretive guidance, and incorporating representative scoring examples, the AI's scoring moved closer to the human benchmark. Agreement improved from a Cohen's kappa of 0.52 to 0.89, the correlation from .76 to .94, and the mean score difference narrowed from -2.45 to -0.95, the latter no longer statistically significant. Qualitative analysis showed a shift from a narrow emphasis on surface errors to a more balanced consideration of accuracy, organization, development, and communicative effectiveness. The results suggest that calibration offers a replicable, evidence-based approach to integrating generative AI into second language writing assessment, enhancing the fairness, validity, and reliability of AI-assisted evaluation.
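The abstract reports three families of statistics for comparing human and AI scores: inter-rater agreement (Cohen's kappa), score consistency (correlation), and systematic bias (mean score difference and its significance). The sketch below, which is not the authors' analysis code, shows one common way such statistics could be computed in Python; the score arrays are illustrative placeholders, and the choice of quadratic weighting for kappa and a paired t-test for the mean difference are assumptions rather than details confirmed by the abstract.

```python
# Minimal sketch of the agreement statistics named in the abstract.
# Scores below are hypothetical rubric totals for the same set of essays,
# rated once by a human and once by the AI.
import numpy as np
from scipy import stats
from sklearn.metrics import cohen_kappa_score

human_scores = np.array([18, 22, 25, 20, 27, 24, 19, 23])
ai_scores = np.array([16, 21, 24, 18, 26, 23, 17, 22])

# Inter-rater agreement: quadratic-weighted kappa is a common choice
# for ordinal rubric scores (assumption; the abstract does not specify weighting).
kappa = cohen_kappa_score(human_scores, ai_scores, weights="quadratic")

# Score consistency: Pearson correlation between the two raters.
r, r_pvalue = stats.pearsonr(human_scores, ai_scores)

# Systematic bias: mean AI-minus-human difference and a paired t-test
# of whether that difference is statistically significant.
mean_diff = (ai_scores - human_scores).mean()
t_stat, t_pvalue = stats.ttest_rel(ai_scores, human_scores)

print(f"weighted kappa = {kappa:.2f}")
print(f"Pearson r = {r:.2f} (p = {r_pvalue:.3f})")
print(f"mean difference = {mean_diff:.2f}, t = {t_stat:.2f}, p = {t_pvalue:.3f}")
```

In a calibration workflow of the kind described, these statistics would be recomputed after each prompt revision, and a revision would be retained when agreement and correlation rise while the mean difference moves toward zero.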
License
Copyright (c) 2024 Reza Farzi

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.