Authors: Warren Li, Melanie Kurimchak, and John Whitmer
https://doi.org/10.51388/20.500.12265/300
Abstract: The rapid integration of Artificial Intelligence into educational technology necessitates a rigorous approach to model evaluation, as the consequences of unchecked AI systems fall heavily on students, educators, and organizations. When AI tools are deployed to individualize learning, score student work, or generate feedback without robust assessment, they risk undermining learning outcomes, particularly for underresourced populations. This field note emphasizes that effective AI evaluation must be anchored in human judgment and established education measurement standards, specifically focusing on validity, reliability, and fairness. While edtech developers face increasing pressure to demonstrate the efficacy of their tools, many lack shared standards for evaluating AI outputs. This document provides a practical framework for assessing AI quality, synthesizing various approaches including human expert review, custom automated evaluation, and standardized benchmarks. As AI evaluation emerges as a top priority across the sector, adopting these structured, evidence-based practices is essential for building trustworthy educational tools that genuinely support teaching and learning, ultimately ensuring that technological innovation advances rather than hinders student success.
Suggested Citation: Li, W., Kurimchak, M., & Whitmer, J. (2026). ATS Field Note: AI Model Evaluation. ATS Hub Field Note Series. https://doi.org/10.51388/20.500.12265/300
This material was developed under The Institute of Education Sciences Award R305N250006 from the U.S. Department of Education. Any opinions, findings, conclusions, or recommendations are those of the authors and do not necessarily reflect the views of IES.