March 31, 2026
AI benchmarks are broken. Here’s what we need instead.
One-off tests don’t measure AI’s true impact. We’re better off shifting to more human-centered, context-specific methods.

TL;DR
- Traditional AI benchmarks compare AI performance against humans on isolated tasks, an approach that is easy to standardize but doesn't reflect real-world use.
- Real-world AI deployment involves complex environments where AI interacts with multiple people within organizational workflows over extended periods.
- Current benchmarks fail to capture the systemic risks and economic/social consequences of AI, leading to a gap between benchmark and real-world performance.
- The proposed HAIC benchmarks shift evaluation along four axes: from individual, single-task performance to team and workflow performance; from one-off testing to long-term impact; from correctness and speed to organizational outcomes; and from isolated outputs to system-level effects.
- Implementing HAIC benchmarks means assessing whether AI functions as a productive participant within human teams and generates sustained collective value, tracking factors such as coordination quality and error detectability over time (see the sketch after this list).
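
To make that last point concrete, here is a minimal sketch of what logging and scoring such factors could look like. The article proposes the framework but does not publish an implementation, so every name here (`SessionRecord`, `coordination_quality`, `error_detectability`, `haic_score`), every metric definition, and all the example numbers are illustrative assumptions, not the framework's actual method.

```python
"""Hypothetical sketch of HAIC-style scoring over logged team sessions.

All schemas, metrics, and data below are illustrative assumptions.
"""
from dataclasses import dataclass
from statistics import mean


@dataclass
class SessionRecord:
    """One human-AI team work session (hypothetical schema)."""
    handoffs: int          # AI<->human handoffs during the session
    failed_handoffs: int   # handoffs that needed rework or clarification
    ai_errors: int         # AI mistakes that occurred
    errors_caught: int     # of those, how many the team detected
    task_value: float      # org-defined value delivered, on a 0-1 scale


def coordination_quality(s: SessionRecord) -> float:
    """Fraction of handoffs that succeeded without rework."""
    if s.handoffs == 0:
        return 1.0
    return 1.0 - s.failed_handoffs / s.handoffs


def error_detectability(s: SessionRecord) -> float:
    """Fraction of AI errors the team caught before they propagated."""
    if s.ai_errors == 0:
        return 1.0
    return s.errors_caught / s.ai_errors


def haic_score(sessions: list[SessionRecord]) -> dict[str, float]:
    """Aggregate over an extended deployment period, not a one-off test."""
    return {
        "coordination_quality": mean(coordination_quality(s) for s in sessions),
        "error_detectability": mean(error_detectability(s) for s in sessions),
        "sustained_value": mean(s.task_value for s in sessions),
    }


# Example: a few weeks of logged sessions (toy numbers for illustration).
log = [
    SessionRecord(handoffs=12, failed_handoffs=2, ai_errors=3, errors_caught=3, task_value=0.8),
    SessionRecord(handoffs=9, failed_handoffs=1, ai_errors=4, errors_caught=2, task_value=0.7),
    SessionRecord(handoffs=15, failed_handoffs=4, ai_errors=1, errors_caught=1, task_value=0.9),
]
print(haic_score(log))
```

The key design choice this sketch illustrates is the unit of measurement: the input is a longitudinal log of team sessions rather than a single prompt-response pair, so the score reflects sustained collaboration rather than isolated correctness.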