March 31, 2026
AI benchmarks are broken. Here’s what we need instead.
One-off tests don’t measure AI’s true impact. We’re better off shifting to more human-centered, context-specific methods.

TL;DR
- Traditional AI benchmarks compare AI performance against humans on isolated tasks, an approach that is easy to standardize but doesn't reflect real-world use.
- Real-world AI deployment involves complex environments where AI interacts with multiple people within organizational workflows over extended periods.
- Current benchmarks fail to capture the systemic risks and economic/social consequences of AI, leading to a gap between benchmark and real-world performance.
- The proposed HAIC benchmarks shift evaluation along four axes: from individual, single-task performance to team and workflow performance; from one-off testing to long-term impact; from correctness and speed to organizational outcomes; and from isolated outputs to system-level effects.
- Implementing HAIC benchmarks means assessing whether AI functions as a productive participant within human teams and generates sustained collective value, tracking factors such as coordination quality and error detectability over time (see the sketch after this list).
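
To make that last point concrete, here is a minimal sketch of what logging and scoring such factors could look like. The article proposes the framework but does not publish an implementation, so every name here (`SessionRecord`, `coordination_quality`, `error_detectability`, `haic_score`), every metric definition, and all the example numbers are illustrative assumptions, not the framework's actual method.

```python
"""Hypothetical sketch of HAIC-style scoring over logged team sessions.

All schemas, metrics, and data below are illustrative assumptions.
"""
from dataclasses import dataclass
from statistics import mean


@dataclass
class SessionRecord:
    """One human-AI team work session (hypothetical schema)."""
    handoffs: int          # AI<->human handoffs during the session
    failed_handoffs: int   # handoffs that needed rework or clarification
    ai_errors: int         # AI mistakes that occurred
    errors_caught: int     # of those, how many the team detected
    task_value: float      # org-defined value delivered, on a 0-1 scale


def coordination_quality(s: SessionRecord) -> float:
    """Fraction of handoffs that succeeded without rework."""
    if s.handoffs == 0:
        return 1.0
    return 1.0 - s.failed_handoffs / s.handoffs


def error_detectability(s: SessionRecord) -> float:
    """Fraction of AI errors the team caught before they propagated."""
    if s.ai_errors == 0:
        return 1.0
    return s.errors_caught / s.ai_errors


def haic_score(sessions: list[SessionRecord]) -> dict[str, float]:
    """Aggregate over an extended deployment period, not a one-off test."""
    return {
        "coordination_quality": mean(coordination_quality(s) for s in sessions),
        "error_detectability": mean(error_detectability(s) for s in sessions),
        "sustained_value": mean(s.task_value for s in sessions),
    }


# Example: a few weeks of logged sessions (toy numbers for illustration).
log = [
    SessionRecord(handoffs=12, failed_handoffs=2, ai_errors=3, errors_caught=3, task_value=0.8),
    SessionRecord(handoffs=9, failed_handoffs=1, ai_errors=4, errors_caught=2, task_value=0.7),
    SessionRecord(handoffs=15, failed_handoffs=4, ai_errors=1, errors_caught=1, task_value=0.9),
]
print(haic_score(log))
```

The key design choice this sketch illustrates is the unit of measurement: the input is a longitudinal log of team sessions rather than a single prompt-response pair, so the score reflects sustained collaboration rather than isolated correctness.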