HUD Evals
Use ComputerAgent with HUD for benchmarking and evaluation
The HUD integration lets you benchmark a ComputerAgent with the HUD framework: the agent controls a computer inside HUD, and tests are run to evaluate whether each task succeeds.
Installation
First, install the required package:
pip install "cua-agent[hud]"
# or install hud-python directly
# pip install hud-python==0.4.12
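To confirm the integration is importable before wiring up an eval, a quick sanity check (it uses only the entry points shown in the examples below):
# Should import without errors once cua-agent[hud] is installed
from agent.integrations.hud import run_full_dataset, run_single_task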
Environment Variables
Before running any evaluations, you’ll need to set up your environment variables for HUD and your model providers:
# HUD access
export HUD_API_KEY="your_hud_api_key"
# Model provider keys (at least one required)
export OPENAI_API_KEY="your_openai_key"
export ANTHROPIC_API_KEY="your_anthropic_key"
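If you script your evaluations, it helps to fail fast when a key is missing. A minimal pre-flight check, assuming the variable names above:
import os

# HUD access is required; at least one model provider key must also be set.
assert os.environ.get("HUD_API_KEY"), "HUD_API_KEY is not set"
assert os.environ.get("OPENAI_API_KEY") or os.environ.get("ANTHROPIC_API_KEY"), \
    "Set OPENAI_API_KEY or ANTHROPIC_API_KEY"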
Running a Single Task
You can run a single task from a HUD dataset for quick verification.
Example
from agent.integrations.hud import run_single_task
await run_single_task(
    dataset="hud-evals/OSWorld-Verified",  # or another HUD dataset
    model="openai/computer-use-preview+openai/gpt-5-nano",  # any supported model string
    task_id=155,  # e.g., reopen last closed tab
)
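The example above uses top-level await, which works in notebooks and IPython. If you are running from a plain Python script instead, drive it with an event loop; a minimal sketch:
import asyncio

from agent.integrations.hud import run_single_task

async def main():
    # Same call as above, wrapped so asyncio.run() can drive it from a script.
    await run_single_task(
        dataset="hud-evals/OSWorld-Verified",
        model="openai/computer-use-preview+openai/gpt-5-nano",
        task_id=155,
    )

asyncio.run(main())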
Parameters
- task_id (int): Default: 0. Index of the task to run from the dataset.
Running a Full Dataset
To benchmark your agent at scale, you can run an entire dataset (or a subset) in parallel.
Example
from agent.integrations.hud import run_full_dataset
results = await run_full_dataset(
    dataset="hud-evals/OSWorld-Verified",  # can also pass a Dataset or list[dict]
    model="openai/computer-use-preview",
    split="train[:3]",   # try a few tasks to start
    max_concurrent=20,   # tune to your infra
    max_steps=50,        # safety cap per task
)
Parameters
- job_name (str | None): Optional human-readable name for the evaluation job (shows up in the HUD UI).
- max_concurrent (int): Default: 30. Number of tasks to run in parallel. Scale this based on your infra.
- max_steps (int): Default: 50. Safety cap on steps per task to prevent infinite loops.
- split (str): Default: "train". Dataset split or subset to run. Uses the Hugging Face split format, e.g. "train[:10]" for the first 10 tasks.
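Putting these options together, a sketch of a named job over a small slice of the dataset (values are illustrative, not recommendations):
results = await run_full_dataset(
    dataset="hud-evals/OSWorld-Verified",
    model="openai/computer-use-preview",
    job_name="osworld-smoke-test",  # appears in the HUD UI
    split="train[:10]",             # first 10 tasks only
    max_concurrent=10,              # keep modest for a first run
    max_steps=50,
)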
Additional Parameters
Both single-task and full-dataset runs share a common set of configuration options. These let you fine-tune how the evaluation runs.
- dataset (str | Dataset | list[dict]): Required. HUD dataset name (e.g. "hud-evals/OSWorld-Verified"), a loaded Dataset, or a list of tasks.
- model (str): Default: "computer-use-preview". Model string, e.g. "openai/computer-use-preview+openai/gpt-5-nano". Supports composition with + (planning + grounding).
- allowed_tools (list[str]): Default: ["openai_computer"]. Restrict which tools the agent may use.
- tools (list[Any]): Extra tool configs to inject.
- custom_loop (Callable): Optional custom agent loop function. If provided, overrides automatic loop selection.
- only_n_most_recent_images (int): Default: 5 for full-dataset runs, None for single tasks. Retain only the last N screenshots in memory.
- callbacks (list[Any]): Hook functions for logging, telemetry, or side effects.
- verbosity (int): Logging level. Set to 2 to debug every call/action.
- trajectory_dir (str | dict): Save local copies of trajectories for replay/analysis.
- max_retries (int): Default: 3. Number of retries for failed model/tool calls.
- screenshot_delay (float | int): Default: 0.5. Delay (seconds) between screenshots to avoid race conditions.
- use_prompt_caching (bool): Default: False. Cache repeated prompts to reduce API calls.
- max_trajectory_budget (float | dict): Limit on trajectory size/budget (e.g., tokens, steps).
- telemetry_enabled (bool): Default: True. Whether to send telemetry/traces to HUD.
- **kwargs (Any): Any additional keyword arguments are passed through to the agent loop or model provider.
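As an illustration of these options, a single-task run with debugging and local trajectory capture enabled (values are examples only):
await run_single_task(
    dataset="hud-evals/OSWorld-Verified",
    model="openai/computer-use-preview+openai/gpt-5-nano",
    task_id=155,
    verbosity=2,                    # log every model call and tool action
    trajectory_dir="trajectories",  # save local copies for replay/analysis
    screenshot_delay=1.0,           # slow down if you hit race conditions
    max_retries=3,                  # retry failed model/tool calls
)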
Available Benchmarks
HUD provides multiple benchmark datasets for realistic evaluation.
- OSWorld-Verified – 369+ real-world desktop tasks across Chrome, LibreOffice, GIMP, VS Code, and more. Best for evaluating full computer-use agents in realistic environments; the Verified variant fixes 300+ issues from earlier versions of OSWorld for reliability.
Coming soon: SheetBench (spreadsheet automation) and other specialized HUD datasets.
See the HUD docs for more eval environments.
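Because dataset also accepts a loaded Dataset or a list of task dicts, you can pre-filter a benchmark before running it. A sketch, assuming OSWorld-Verified is hosted on the Hugging Face Hub under the name above with a train split:
from datasets import load_dataset

from agent.integrations.hud import run_full_dataset

# Load the benchmark and keep only the first 25 tasks.
osworld = load_dataset("hud-evals/OSWorld-Verified", split="train[:25]")

results = await run_full_dataset(
    dataset=osworld,  # a loaded Dataset is accepted directly
    model="openai/computer-use-preview",
    max_concurrent=20,
)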
Tips
- Debugging: set verbosity=2 to see every model call and tool action.
- Performance: lower screenshot_delay for faster runs; raise it if you see race conditions.
- Safety: always set max_steps (defaults to 50) to prevent runaway loops.
- Custom tools: pass extra tools=[...] into the agent config if you need tools beyond openai_computer.