From GPT to BERT: Benchmarking Large Language Models for Automated Quiz Generation
This study evaluates the effectiveness of four leading large language models (LLMs), GPT-3, GPT-4, GPT-O, and BERT, in generating quiz questions for Java and Python programming courses. We aim to determine how effectively LLMs can produce educationally valuable questions that meet specific pedagogical criteria. To this end, each model was prompted to generate 200 Java and 200 Python quiz questions, yielding 1,600 unique questions in total. These questions are currently being evaluated for technical precision, relevance to course objectives, linguistic clarity, and pedagogical appropriateness, through quantitative and qualitative assessments conducted by a team of computer science educators. The preliminary findings will indicate how well each model generates contextually appropriate and educationally useful questions, offering insights into the models' potential integration into computer science curricula. This work seeks to contribute to the broader discourse on the utility of LLMs in educational settings, specifically within the scope of automated content creation to enhance teaching and assessment methodologies in computer science education. The final results are intended to guide educators in selecting the most appropriate LLMs for curriculum development and to provide recommendations for improving AI-driven question-generation processes.
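To illustrate the kind of generation step described above: the abstract does not specify the exact prompting setup, so the prompt wording, model identifier, and use of the OpenAI Python client below are assumptions, a minimal sketch rather than the study's actual pipeline.

# Minimal sketch of one quiz-generation step (assumed setup; the study's
# exact prompts, parameters, and client are not specified in the abstract).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_TEMPLATE = (
    "Write one multiple-choice quiz question for an introductory {language} "
    "programming course. Provide four answer options (A-D) and indicate the "
    "correct answer. Topic: {topic}."
)

def generate_question(language: str, topic: str, model: str = "gpt-4") -> str:
    """Ask the model for a single quiz question on the given topic."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": PROMPT_TEMPLATE.format(language=language, topic=topic),
        }],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    # Example batch; the study collected 200 questions per language per model.
    for topic in ["inheritance", "exception handling", "generics"]:
        print(generate_question("Java", topic))

Repeating such a call across topics, languages, and models would produce the pools of candidate questions that the educator panel then scores against the four evaluation criteria.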