OSWorld-Verified

OSWorld-Verified is a curated subset of OSWorld tasks that can be run using the HUD framework. Use ComputerAgent with HUD to benchmark on these tasks.

Setup

pip install hud-python==0.2.10

Set environment variables:

export HUD_API_KEY="your_hud_key"
export ANTHROPIC_API_KEY="your_anthropic_key"  # For Claude
export OPENAI_API_KEY="your_openai_key"        # For OpenAI

Quick Start

import asyncio
from hud import gym, load_taskset
from agent.integrations.hud import ComputerAgent

async def run_osworld():
    # Load taskset
    taskset = await load_taskset("OSWorld-Verified")
    test = taskset[144]  # Example task
    
    # Create environment (~2.5 min startup)
    env = await gym.make(test)
    
    # Create agent
    agent = ComputerAgent(
        model="anthropic/claude-3-5-sonnet-20241022", # any ComputerAgent model string
        environment="linux"
    )
    
    # Run benchmark
    obs, _ = await env.reset()
    for i in range(100):
        action, done = await agent.predict(obs)
        obs, reward, terminated, info = await env.step(action)
        if done or terminated:
            break
    
    # Evaluate results
    result = await env.evaluate()
    await env.close()
    
    return result

# Run benchmark
result = asyncio.run(run_osworld())
print(f"Success: {result.get('success', False)}")

Parallel Execution

Run all tasks in parallel using run_job:

from agent.integrations.hud import run_job
from hud import load_taskset
from hud.taskset import TaskSet
import logging

# Load taskset
taskset = await load_taskset("OSWorld-Verified")
taskset = TaskSet(tasks=taskset[:10]) # limit to 10 tasks instead of all 370

# Run benchmark job
job = await run_job(
    model="openai/computer-use-preview",
    task_or_taskset=taskset,
    job_name="test-computeragent-job",
    max_concurrent_tasks=5,
    # add any extra ComputerAgent kwargs:
    verbosity=logging.INFO,  # Enable logging
    # trajectory_dir=".."       # Save trajectories locally
)

# Get results OR view them at app.hud.so
print(await job.get_analytics())
print(f"View results at: https://app.hud.so/jobs/{job.id}")

Setup

Quick Start

Parallel Execution

On this page