Cua-Bench – a benchmark for AI agents in GUI environments

1 points by francedot

Cua-Bench (https://github.com/trycua/cua-bench) is an open-source framework for evaluating and training computer-use agents across different environments.

Computer-use agents show massive performance variance across different UIs: an agent with 90% success on Windows 11 might drop to 9% on Windows XP for the same task. The cause is variation in OS themes, browser versions, and other UI details that existing benchmarks don't capture.

The existing benchmarks (OSWorld, Windows Agent Arena, AndroidWorld) were great but operated in silos—different harnesses, different formats, no standardized way to test the same agent across platforms. More importantly, they were evaluation-only. We needed environments that could generate training data and run RL loops, not just measure performance.

Cua-Bench takes a different approach: it's a unified framework that standardizes environments across platforms and supports the full agent development lifecycle—benchmark, train, deploy.

With Cua-Bench, you can:

Evaluate agents across multiple benchmarks with one CLI (native tasks + OSWorld + Windows Agent Arena adapters)
Test the same agent on different OS variations (Windows 11/XP/Vista, macOS themes, Linux, Android via QEMU)
Generate new tasks from natural language prompts
Create simulated environments for RL training (shell apps like Spotify, Slack with programmatic rewards)
Run oracle validations to verify environments before agent evaluation
Monitor agent runs in real-time with traces and screenshots

All of this works on macOS, Linux, Windows, and Android, and is self-hostable.

To get started:

Install cua-bench: % pip install cua-bench

Run a basic evaluation: % cb run dataset datasets/cua-bench-basic --agent demo

Open the monitoring dashboard: % cb run watch <run_id>

For parallelized evaluations across multiple workers:

% cb run dataset datasets/cua-bench-basic --agent your-agent --max-parallel 8

Want to test across different OS variations? Just specify the environment:

% cb run task slack_message --agent your-agent --env windows_xp
% cb run task slack_message --agent your-agent --env macos_sonoma
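
To sweep one task over several environments from a script, the commands above can be generated in a loop. Here is a minimal Python sketch, assuming only the `cb run task <task> --agent <agent> --env <env>` form shown above. The helper names `build_commands` and `run_sweep` and the `dry_run` switch are illustrative and not part of Cua-Bench:

```python
import shlex
import subprocess


def build_commands(task, agent, envs):
    """Return one `cb run task` argument list per environment name."""
    return [
        ["cb", "run", "task", task, "--agent", agent, "--env", env]
        for env in envs
    ]


def run_sweep(task, agent, envs, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_commands(task, agent, envs):
        print(shlex.join(cmd))
        if not dry_run:
            # Requires cua-bench to be installed so that `cb` is on PATH.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Environment names are the two used in the commands above.
    run_sweep("slack_message", "your-agent", ["windows_xp", "macos_sonoma"])
```

By default this only prints the commands, so the sweep can be reviewed before anything runs. Pass `dry_run=False` to execute them one after another.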

Generate new tasks from prompts:

% cb task generate "book a flight on kayak.com"

Validate environments with oracle implementations:

% cb run dataset datasets/cua-bench-basic --oracle
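
Conceptually, an oracle check runs a known-good solution through a task and confirms that the task's own success check accepts it. A task that rejects its oracle, or that accepts an untouched environment, is broken before any agent sees it. The sketch below illustrates that idea with a toy task. `ToyTask`, `validate`, and the state layout are hypothetical and do not reflect Cua-Bench's actual task API:

```python
from dataclasses import dataclass
from typing import Callable, Dict

State = Dict[str, str]


@dataclass
class ToyTask:
    name: str
    initial_state: Callable[[], State]  # builds a fresh environment state
    oracle: Callable[[State], None]     # scripted known-good solution (mutates state)
    evaluate: Callable[[State], bool]   # programmatic success check


def validate(task: ToyTask) -> bool:
    """A task is valid if the oracle passes and an untouched state fails."""
    untouched = task.initial_state()
    solved = task.initial_state()
    task.oracle(solved)
    return task.evaluate(solved) and not task.evaluate(untouched)


def fill_name(state: State) -> None:
    """Oracle for the toy task: type a name into the form field."""
    state["name_field"] = "Ada"


good_task = ToyTask(
    name="fill_name",
    initial_state=lambda: {"name_field": ""},
    oracle=fill_name,
    evaluate=lambda s: s["name_field"] == "Ada",
)

# A misconfigured task: the check expects a value the oracle never produces.
broken_task = ToyTask(
    name="fill_name_broken",
    initial_state=lambda: {"name_field": ""},
    oracle=fill_name,
    evaluate=lambda s: s["name_field"] == "Grace",
)

print(validate(good_task))    # True
print(validate(broken_task))  # False
```

Checking the untouched state as well catches success checks that pass trivially, which would otherwise inflate every agent's score.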

The simulated environments are particularly useful for RL training—they're HTML/JS apps that render across 10+ OS themes with programmatic reward verification. No need to spin up actual VMs for training loops.
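
Because a shell app is ordinary HTML/JS, a reward can in principle be computed from the app's state rather than from pixels, which would make it independent of the OS theme being rendered. Here is a minimal sketch of that idea for a Slack-like task. The state layout, the function `message_sent_reward`, and the channel and message values are all assumptions made for illustration, not Cua-Bench's reward interface:

```python
from typing import Dict, List

# Hypothetical snapshot of a simulated chat app: channel name -> message texts.
AppState = Dict[str, List[str]]


def message_sent_reward(state: AppState, channel: str, expected: str) -> float:
    """Return 1.0 if `expected` was posted in `channel`, else 0.0.

    The comparison ignores case and surrounding whitespace.
    """
    target = expected.strip().lower()
    posted = (text.strip().lower() for text in state.get(channel, []))
    return 1.0 if target in posted else 0.0


before = {"general": ["welcome!"], "random": []}
after = {"general": ["welcome!", "Standup at 10am "], "random": []}

print(message_sent_reward(before, "general", "standup at 10am"))  # 0.0
print(message_sent_reward(after, "general", "standup at 10am"))   # 1.0
print(message_sent_reward(after, "random", "standup at 10am"))    # 0.0
```

A binary reward like this is the simplest choice. A real training loop might prefer partial credit, for example for reaching the right channel.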

We're seeing teams use Cua-Bench for:

Training computer-use models on mobile and desktop environments
Generating large-scale training datasets (working with labs on millions of screenshots across OS variations)
RL fine-tuning with shell app simulators
Systematic evaluation across OS themes and browser versions
Building task registries (collaborating with Snorkel AI on task design and data curation, similar to their Terminal-Bench work)

Cua-Bench is 100% open-source under the MIT license. We're actively developing it as part of Cua (https://github.com/trycua/cua), our Computer Use Agent SDK, and we'd love your feedback, bug reports, or feature ideas.

GitHub: https://github.com/trycua/cua
Docs: https://cua.ai/docs/cuabench
Technical Report: https://cuabench.ai

We'll be here to answer any technical questions and look forward to your comments!