Cua-Bench – a benchmark for AI agents in GUI environments

1 points by francedot


Cua-Bench ( https://github.com/trycua/cua-bench ) is an open-source framework for evaluating and training computer-use agents across different environments.

Computer-use agents show massive performance variance across different UIs: an agent with 90% success on Windows 11 might drop to 9% on Windows XP for the same task. The cause is variation in OS themes, browser versions, and UI details that existing benchmarks don't capture.

The existing benchmarks (OSWorld, Windows Agent Arena, AndroidWorld) were great but operated in silos—different harnesses, different formats, no standardized way to test the same agent across platforms. More importantly, they were evaluation-only. We needed environments that could generate training data and run RL loops, not just measure performance.

Cua-Bench takes a different approach: it's a unified framework that standardizes environments across platforms and supports the full agent development lifecycle—benchmark, train, deploy.

With Cua-Bench, you can:

All of this works on macOS, Linux, Windows, and Android, and is self-hostable.

To get started:

Install cua-bench: % pip install cua-bench

Run a basic evaluation: % cb run dataset datasets/cua-bench-basic --agent demo

Open the monitoring dashboard: % cb run watch <run_id>

For parallelized evaluations across multiple workers:

% cb run dataset datasets/cua-bench-basic --agent your-agent --max-parallel 8

Want to test across different OS variations? Just specify the environment:

% cb run task slack_message --agent your-agent --env windows_xp

% cb run task slack_message --agent your-agent --env macos_sonoma
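
To sweep one task across several environments, the per-environment commands can be generated instead of typed by hand. This is a minimal sketch that assumes only the `cb run task <task> --agent <agent> --env <env>` form shown above. `build_task_commands` is a hypothetical helper, not part of cua-bench, and it only builds the command lines without running them.

```python
import shlex
from typing import List


def build_task_commands(task: str, agent: str, envs: List[str]) -> List[List[str]]:
    """Return one `cb run task` argument list per environment name.

    Hypothetical helper: it mirrors the CLI form used in this post and
    does not call into cua-bench itself.
    """
    return [
        ["cb", "run", "task", task, "--agent", agent, "--env", env]
        for env in envs
    ]


if __name__ == "__main__":
    commands = build_task_commands(
        "slack_message", "your-agent", ["windows_xp", "macos_sonoma"]
    )
    for argv in commands:
        # Print a shell-safe version of each command; to actually run one,
        # pass the list to subprocess.run(argv, check=True).
        print(shlex.join(argv))
```

Keeping each command as an argument list rather than a single string avoids shell-quoting problems if a task or environment name ever contains spaces.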

Generate new tasks from prompts:

% cb task generate "book a flight on kayak.com"

Validate environments with oracle implementations:

% cb run dataset datasets/cua-bench-basic --oracle

The simulated environments are particularly useful for RL training—they're HTML/JS apps that render across 10+ OS themes with programmatic reward verification. No need to spin up actual VMs for training loops.
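
To make "programmatic reward verification" concrete, here is a purely illustrative sketch of what such a check could look like for a task like slack_message. It is not the Cua-Bench API: `SimulatedChatState` and `reward_message_sent` are hypothetical names, and the assumption is simply that a simulated app exposes its final state as plain data that a function can inspect.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SimulatedChatState:
    """Hypothetical final state of a simulated chat app after an episode."""

    # Maps a channel name to the messages posted in it, in order.
    messages: Dict[str, List[str]] = field(default_factory=dict)


def reward_message_sent(
    state: SimulatedChatState, channel: str, expected_text: str
) -> float:
    """Return 1.0 if the expected message appears in the channel, else 0.0.

    The check reads app state directly, so it does not depend on how the
    UI was themed or rendered during the episode.
    """
    sent = state.messages.get(channel, [])
    matched = any(message.strip() == expected_text for message in sent)
    return 1.0 if matched else 0.0


if __name__ == "__main__":
    final_state = SimulatedChatState(messages={"general": ["hello team"]})
    print(reward_message_sent(final_state, "general", "hello team"))
```

Because a check like this looks at state rather than pixels, the same verifier can score an episode regardless of which OS theme the app was rendered in.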

We're seeing teams use Cua-Bench for:

Cua-Bench is 100% open-source under the MIT license. We're actively developing it as part of Cua (https://github.com/trycua/cua), our Computer Use Agent SDK, and we'd love your feedback, bug reports, or feature ideas.

GitHub: https://github.com/trycua/cua

Docs: https://cua.ai/docs/cuabench

Technical Report: https://cuabench.ai

We'll be here to answer any technical questions and look forward to your comments!