Run full Computer Agent evals in one line (hud + ComputerAgent)

Published on August 27, 2025 by Dillon DuPont

You can now benchmark any GUI-capable agent on real computer-use tasks with a single line of Python, powered by our new integration with hud (the evaluation platform for computer-use agents).

If yesterday’s 0.4 release made it easy to compose planning and grounding models, today’s update makes it easy to measure them. Pick your model string, press go, and watch traces live in HUD.

What you get

  • One-line evals on OSWorld (and more) for OpenAI, Anthropic, Hugging Face, and composed GUI models.
  • Live traces at app.hud.so to see every click, type, and screenshot.
  • Zero glue code needed — we wrapped the interface for you.
  • With Cua’s Agent SDK, trying the latest GUI model is as simple as changing the model string.

Try it

python
from agent.integrations.hud import run_full_dataset

# You can swap "hud-evals/OSWorld-Verified-XLang" -> "hud-evals/SheetBench-V2" to test SheetBench.
await run_full_dataset(
    dataset="hud-evals/OSWorld-Verified-XLang",
    model="openai/computer-use-preview+openai/gpt-5-nano",  # any supported model string
    split="train[:3]",  # try a few tasks to start
)
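
The snippet above uses top-level await, which works in notebooks and IPython. If you're running it as a plain Python script, one way to do it is to wrap the same call in asyncio.run, roughly like this:

python
import asyncio

from agent.integrations.hud import run_full_dataset

async def main():
    # Same arguments as the snippet above.
    await run_full_dataset(
        dataset="hud-evals/OSWorld-Verified-XLang",
        model="openai/computer-use-preview+openai/gpt-5-nano",
        split="train[:3]",
    )

asyncio.run(main())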

Prefer a quick smoke test? Use run_single_task(...) with a task_id.
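
As a rough sketch of what that looks like, assuming run_single_task takes the same dataset and model arguments as run_full_dataset (the exact signature and the task id value here are placeholders, so check the integration docs):

python
from agent.integrations.hud import run_single_task

# Hypothetical example: parameter names other than task_id mirror run_full_dataset.
await run_single_task(
    dataset="hud-evals/OSWorld-Verified-XLang",
    task_id=0,  # placeholder: the single task you want to run
    model="openai/computer-use-preview+openai/gpt-5-nano",
)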

Learn more

That’s it — benchmark what you build, not your glue code.