Abstract:
Large Language Models (LLMs) are increasingly applied in automated test case generation. Unlike in traditional testing, evaluating the quality of their outputs requires novel approaches due to the non-deterministic nature of LLMs and the risks of bias and hallucination. This study conducts a systematic literature review of 25 publications (2020–2025) to examine quality criteria, evaluation metrics, and heuristics in LLM-based test case generation. Findings show that correctness/accuracy and coverage metrics remain dominant, yet there is a growing focus on bias and hallucination detection. Effective heuristics include human-in-the-loop evaluation, gold-standard comparison, and multi-agent frameworks. The review highlights emerging frameworks that integrate ethical and safety criteria, underscoring the need for comprehensive evaluation to ensure reliability and fairness in automated test generation.
Published in: 2025 IEEE International Conference on Data and Software Engineering (ICoDSE)
