
HUD Evals

Use ComputerAgent with HUD for benchmarking and evaluation

A corresponding Jupyter Notebook is available for this documentation.

The HUD integration allows a ComputerAgent to be benchmarked with the HUD framework: the agent controls a computer inside a HUD environment, and tests run against each task to evaluate whether it succeeded.

Installation

First, install the required package:

pip install "cua-agent[hud]"
# or install hud-python directly
# pip install hud-python==0.4.12
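
To confirm the installation, you can import the two entry points used throughout this guide (a quick sanity check using the same import path as the examples below):

# Sanity check: both helpers should import without errors
from agent.integrations.hud import run_single_task, run_full_dataset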

Environment Variables

Before running any evaluations, you’ll need to set up your environment variables for HUD and your model providers:

# HUD access
export HUD_API_KEY="your_hud_api_key"

# Model provider keys (at least one required)
export OPENAI_API_KEY="your_openai_key"
export ANTHROPIC_API_KEY="your_anthropic_key"
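
If you are working from the companion notebook rather than a shell, you can set the same variables in Python before running any evaluations (a minimal sketch; substitute your own keys for the placeholders):

import os

# HUD access
os.environ["HUD_API_KEY"] = "your_hud_api_key"

# Model provider keys (at least one required)
os.environ["OPENAI_API_KEY"] = "your_openai_key"
os.environ["ANTHROPIC_API_KEY"] = "your_anthropic_key"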

Running a Single Task

You can run a single task from a HUD dataset for quick verification.

Example

from agent.integrations.hud import run_single_task

await run_single_task(
    dataset="hud-evals/OSWorld-Verified",   # or another HUD dataset
    model="openai/computer-use-preview+openai/gpt-5-nano",  # any supported model string
    task_id=155,  # e.g., reopen last closed tab
)

Parameters

  • task_id (int): Default: 0. Index of the task to run from the dataset.
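
The example above uses top-level await, which works in a notebook. In a plain Python script, you can wrap the same call with asyncio.run (a minimal sketch reusing the arguments shown above):

import asyncio

from agent.integrations.hud import run_single_task

# Run one OSWorld-Verified task from a regular script
asyncio.run(run_single_task(
    dataset="hud-evals/OSWorld-Verified",
    model="openai/computer-use-preview+openai/gpt-5-nano",
    task_id=155,
))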

Running a Full Dataset

To benchmark your agent at scale, you can run an entire dataset (or a subset) in parallel.

Example

from agent.integrations.hud import run_full_dataset

results = await run_full_dataset(
    dataset="hud-evals/OSWorld-Verified",   # can also pass a Dataset or list[dict]
    model="openai/computer-use-preview",
    split="train[:3]",           # try a few tasks to start
    max_concurrent=20,            # tune to your infra
    max_steps=50                  # safety cap per task
)

Parameters

  • job_name (str | None): Optional human-readable name for the evaluation job (shows up in the HUD UI).
  • max_concurrent (int): Default: 30. Number of tasks to run in parallel; scale this based on your infrastructure.
  • max_steps (int): Default: 50. Safety cap on steps per task to prevent infinite loops.
  • split (str): Default: "train". Dataset split or subset to run, using the Hugging Face split format, e.g. "train[:10]" for the first 10 tasks.
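
Putting these parameters together, a named evaluation job over a larger subset might look like the sketch below (the job name and concurrency are illustrative; tune them to your infrastructure):

from agent.integrations.hud import run_full_dataset

# Named job over the first 50 tasks, throttled for a modest setup
results = await run_full_dataset(
    dataset="hud-evals/OSWorld-Verified",
    model="openai/computer-use-preview",
    job_name="osworld-preview-smoke-test",
    split="train[:50]",
    max_concurrent=10,
    max_steps=50,
)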

Additional Parameters

Both single-task and full-dataset runs share a common set of configuration options. These let you fine-tune how the evaluation runs.

  • dataset (str | Dataset | list[dict]): Required. HUD dataset name (e.g. "hud-evals/OSWorld-Verified"), a loaded Dataset, or a list of tasks.
  • model (str): Default: "computer-use-preview". Model string, e.g. "openai/computer-use-preview+openai/gpt-5-nano". Supports composition with + (planning + grounding).
  • allowed_tools (list[str]): Default: ["openai_computer"]. Restricts which tools the agent may use.
  • tools (list[Any]): Extra tool configs to inject.
  • custom_loop (Callable): Optional custom agent loop function. If provided, it overrides automatic loop selection.
  • only_n_most_recent_images (int): Default: 5 for full-dataset runs, None for single tasks. Retain only the last N screenshots in memory.
  • callbacks (list[Any]): Hook functions for logging, telemetry, or side effects.
  • verbosity (int): Logging level. Set to 2 to debug every call and action.
  • trajectory_dir (str | dict): Save local copies of trajectories for replay and analysis.
  • max_retries (int): Default: 3. Number of retries for failed model/tool calls.
  • screenshot_delay (float | int): Default: 0.5. Delay (in seconds) between screenshots to avoid race conditions.
  • use_prompt_caching (bool): Default: False. Cache repeated prompts to reduce API calls.
  • max_trajectory_budget (float | dict): Limit on trajectory size/budget (e.g., tokens or steps).
  • telemetry_enabled (bool): Default: True. Whether to send telemetry/traces to HUD.
  • **kwargs (any): Any additional keyword arguments are passed through to the agent loop or model provider.
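
As an illustration of how these options compose, the sketch below runs a single task with debugging-oriented settings (the trajectory directory is just an example path):

from agent.integrations.hud import run_single_task

# Single task with verbose logging, saved trajectories, and a smaller screenshot window
await run_single_task(
    dataset="hud-evals/OSWorld-Verified",
    model="openai/computer-use-preview",
    task_id=0,
    verbosity=2,                     # log every model call and tool action
    trajectory_dir="trajectories",   # example path; saves runs for replay/analysis
    only_n_most_recent_images=3,     # keep fewer screenshots in memory
    max_retries=3,
)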

Available Benchmarks

HUD provides multiple benchmark datasets for realistic evaluation.

  1. OSWorld-Verified – Benchmark on 369+ real-world desktop tasks across Chrome, LibreOffice, GIMP, VS Code, etc. Best for: evaluating full computer-use agents in realistic environments. Verified variant: fixes 300+ issues from earlier versions for reliability.

Coming soon: SheetBench (spreadsheet automation) and other specialized HUD datasets.

See the HUD docs for more eval environments.

Tips

  • Debugging: set verbosity=2 to see every model call and tool action.
  • Performance: lower screenshot_delay for faster runs; raise it if you see race conditions.
  • Safety: always set max_steps (defaults to 50) to prevent runaway loops.
  • Custom tools: pass extra tools=[...] into the agent config if you need tools beyond openai_computer.