Evaluating the Quality of GenAI Applications in Software Engineering: A Multi-case Study

Submitted to Empirical Software Engineering, 2025

Recommended citation: L Yu, E Alégroth, P Chatzipetrou, T Gorschek (2025). "Evaluating the Quality of GenAI Applications in Software Engineering: A Multi-case Study." Empirical Software Engineering.

Generative AI (GenAI) is increasingly adopted in software development for tasks such as document generation, data analysis, and code generation. However, evaluating the quality of GenAI applications is challenging, as traditional quality measurements may not be fully applicable. In this study, we explore how practitioners evaluate the quality of GenAI applications and investigate quality evaluation techniques. We conducted a multi-case study across three industrial projects at software development companies, examining four GenAI application domains: document generation, data analysis and insight generation, customer service, and code generation. Data were collected through three workshops and 23 semi-structured interviews with industrial practitioners. We identified 14 GenAI use cases and 28 metrics currently used to evaluate the quality of GenAI applications' outputs, and synthesized the metrics' usage patterns and challenges based on the collected data. This study presents practical insights into using metrics to measure the quality of GenAI-based systems in real industrial settings. Our findings indicate that practitioners use custom-built and context-specific metrics; combining these with academic metrics can strengthen the quality evaluation of GenAI systems.

Download paper here