Muhammad Ihsan Zul; Suhaila Mohd. Yasin; Ivan Chatisa; Fikri Muhaffizh Imani; Siti Syahidatul Helma; Dadang Syarif Sihabudin Sahid

Quality Evaluation of Indonesian Student-generated User Stories: Insight from Human and ChatGPT Evaluation



Author(s)

Muhammad Ihsan Zul ¹,²,*, Suhaila Mohd. Yasin ¹, Ivan Chatisa ³, Fikri Muhaffizh Imani ³, Siti Syahidatul Helma ³, Dadang Syarif Sihabudin Sahid ³

1. Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia, Parit Raja, Malaysia

2. Department of Information Technology, Politeknik Caltex Riau, Indonesia

3. Department of Information Technology, Politeknik Caltex Riau, Pekanbaru, Indonesia

* Corresponding author.

DOI: https://doi.org/10.5815/ijmecs.2026.02.05

Received: 21 Jul. 2025 / Revised: 18 Nov. 2025 / Accepted: 27 Jan. 2026 / Published: 8 Apr. 2026

Index Terms

User Story Quality, Agile Software Development, ChatGPT Evaluation, Quality User Story Framework, Software Requirement

Abstract

User stories are essential in agile software development for capturing software requirements, yet concerns over their quality persist globally. Prior studies have evaluated user story quality with practitioners and with artificial intelligence, but they focus primarily on general settings. This study addresses that gap by evaluating the quality of student-generated user stories in an educational context, specifically in Indonesia. It has two objectives: to compare evaluations by human evaluators and ChatGPT using the Quality User Story (QUS) Framework, and to assess how the quality of student-generated user stories compares with findings from global studies. A total of 951 user stories from 103 student software projects were analyzed. Evaluations were conducted by three human evaluators and ChatGPT (GPT-4o). Inter-rater agreement was measured with Percentage Agreement and Cohen’s Kappa, the McNemar test assessed the statistical significance of differences between the human and ChatGPT evaluations, and Cohen’s g quantified effect sizes. Results show generally high agreement between human and ChatGPT evaluations, but lower consistency on several criteria, such as Conceptually Sound, Independent, and Unambiguous. Only four of the thirteen criteria (Conflict-Free, Unique, Well-Formed, and Atomic) showed no significant differences. Most criteria showed small to medium effect sizes, whereas Complete exhibited a large practical difference. The most common quality issues in the student stories involved Uniform, Independent, and Complete (set criteria) and Atomic, Conceptually Sound, and Unambiguous (individual criteria), overlapping with issues reported in global studies. This study shows that ChatGPT can support user story evaluation in educational settings when guided by clear rubrics and validated by humans. It also offers practical insights for educators by identifying the criteria that require stronger emphasis in teaching, particularly in software engineering education in Indonesia.
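
The four statistics named in the abstract can be illustrated on a single QUS criterion with a short sketch. It is not the authors' analysis code: the function name `agreement_stats` and the toy verdicts are hypothetical, verdicts are assumed to be binary (1 = criterion satisfied, 0 = violated), and the exact binomial form of the McNemar test is an assumption, since the abstract does not say which variant was used.

```python
from math import comb


def agreement_stats(human, model):
    """Compare two paired lists of binary verdicts (hypothetical helper).

    1 means the criterion is satisfied, 0 means it is violated.
    """
    if len(human) != len(model) or not human:
        raise ValueError("ratings must be non-empty and paired")
    n = len(human)

    # 2x2 table: a = both 1, b = human 1 / model 0,
    #            c = human 0 / model 1, d = both 0.
    a = sum(1 for h, m in zip(human, model) if h == 1 and m == 1)
    b = sum(1 for h, m in zip(human, model) if h == 1 and m == 0)
    c = sum(1 for h, m in zip(human, model) if h == 0 and m == 1)
    d = n - a - b - c

    # Percentage agreement: share of stories with identical verdicts.
    p_o = (a + d) / n

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    kappa = (p_o - p_e) / (1 - p_e) if p_e < 1 else float("nan")

    # McNemar test (exact binomial form, an assumption): only the
    # discordant pairs b and c carry information about a systematic shift.
    k = b + c
    if k == 0:
        p_value, g = 1.0, 0.0
    else:
        tail = sum(comb(k, i) for i in range(min(b, c) + 1)) / 2**k
        p_value = min(1.0, 2 * tail)
        # Cohen's g: distance of the discordant proportion from 0.5.
        g = max(b, c) / k - 0.5

    return {
        "percent_agreement": p_o,
        "kappa": kappa,
        "mcnemar_p": p_value,
        "cohens_g": g,
    }


# Toy verdicts for ten stories on one criterion (invented for illustration).
human = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1]
model = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
stats = agreement_stats(human, model)
```

On this toy data the raters agree on 7 of 10 stories, yet chance agreement is 0.5, so kappa is only about 0.4. This gap between raw agreement and kappa is why the two measures are usually reported together. Cohen's conventional benchmarks for g (roughly 0.05 small, 0.15 medium, 0.25 large) are one common way to read the effect size.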

Cite This Paper

Muhammad Ihsan Zul, Suhaila Mohd. Yasin, Ivan Chatisa, Fikri Muhaffizh Imani, Siti Syahidatul Helma, Dadang Syarif Sihabudin Sahid, "Quality Evaluation of Indonesian Student-generated User Stories: Insight from Human and ChatGPT Evaluation", International Journal of Modern Education and Computer Science (IJMECS), Vol.18, No.2, pp. 82-99, 2026. DOI: 10.5815/ijmecs.2026.02.05


International Journal of Modern Education and Computer Science (IJMECS)