Build Your Own Operator on macOS - Part 1
Published on March 31, 2025 by The Cua Team
In this first blogpost, we'll learn how to build our own Computer-Use Operator using OpenAI's computer-use-preview
model. But first, let's understand what some common terms mean:
- A Virtual Machine (VM) is like a computer within your computer - a safe, isolated environment where the AI can work without affecting your main system.
- computer-use-preview is OpenAI's specialized language model trained to understand and interact with computer interfaces through screenshots.
- A Computer-Use Agent is an AI agent that can control a computer just like a human would - clicking buttons, typing text, and interacting with applications.
Our Operator will run in an isolated macOS VM, making use of our cua-computer package and the lume virtualization CLI.
Check out what it looks like to use your own Operator from a Gradio app:
What You'll Learn
By the end of this tutorial, you'll be able to:
- Set up a macOS virtual machine for AI automation
- Connect OpenAI's computer-use model to your VM
- Create a basic loop for the AI to interact with your VM
- Handle different types of computer actions (clicking, typing, etc.)
- Implement safety checks and error handling
Prerequisites:
- macOS Sonoma (14.0) or later
- 8GB RAM minimum (16GB recommended)
- OpenAI API access (Tier 3+)
- Basic Python knowledge
- Familiarity with terminal commands
Estimated Time: 45-60 minutes
Introduction to Computer-Use Agents
Earlier this year, OpenAI released CUA (Computer-Using Agent), a fine-tuned version of GPT-4o that adds pixel-level vision capabilities and advanced reasoning trained with reinforcement learning. This fine-tuning enables the computer-use model to interpret screenshots and interact with graphical user interfaces at the pixel level - buttons, menus, and text fields - mimicking how a human operates a computer. It scores a remarkable 38.1% success rate on OSWorld, a benchmark for computer-use agents on Linux and Windows. It is the second model, after Anthropic's Claude 3.5 Sonnet, to support computer-use capabilities natively, without relying on external helpers such as Set-of-Mark (SoM) annotations or OCR passes.
Professor Ethan Mollick provides an excellent explanation of computer-use agents in this article: When you give a Claude a mouse.
ChatGPT Operator
OpenAI's computer-use model powers ChatGPT Operator, a Chromium-based interface exclusively available to ChatGPT Pro subscribers. Users leverage this functionality to automate web-based tasks such as online shopping, expense report submission, and booking reservations by interacting with websites in a human-like manner.
Benefits of Custom Operators
Why Build Your Own?
While OpenAI's Operator uses a controlled Chromium VM instance, there are scenarios where you may want to use your own VM with full desktop capabilities. Here are some examples:
- Automating native macOS apps like Finder, Xcode
- Managing files, changing settings, and running terminal commands
- Testing desktop software and applications
- Creating workflows that combine web and desktop tasks
- Automating media editing in apps like Final Cut Pro and Blender
This gives you more control and flexibility to automate tasks beyond just web browsing, with full access to interact with native applications and system-level operations. Additionally, running your own VM locally provides better privacy for sensitive user files and delivers superior performance by leveraging your own hardware instead of renting expensive Cloud VMs.
Access Requirements
Model Availability
At the time of writing, the computer-use-preview model has limited availability:
- Only accessible to OpenAI tier 3+ users
- Additional application process may be required even for eligible users
- Cannot be used in the OpenAI Playground
- Outside of ChatGPT Operator, usage is restricted to the new Responses API
Understanding the OpenAI API
Responses API Overview
Let's start with the basics. In our case, we'll use OpenAI's Responses API to communicate with their computer-use model.
Think of it like this:
- We send the model a screenshot of our VM and tell it what we want it to do
- The model looks at the screenshot and decides what actions to take
- It sends back instructions (like "click here" or "type this")
- We execute those instructions in our VM
The Responses API is OpenAI's newest way to interact with their AI models. It comes with several built-in tools:
- Web search: Let the AI search the internet
- File search: Help the AI find documents
- Computer use: Allow the AI to control a computer (what we'll be using)
At the time of writing, the computer-use model is only available through the Responses API.
Responses API Examples
Let's look at some simple examples. We'll start with the traditional way of using OpenAI's API with Chat Completions, then show the new Responses API primitive.
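Both snippets below assume an OpenAI client has already been created. A minimal setup looks like this (the `client` name simply matches the examples that follow; the key is read from the OPENAI_API_KEY environment variable):

```python
from openai import OpenAI

# Create the client used in the examples below.
# It reads the OPENAI_API_KEY environment variable by default.
client = OpenAI()
```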
Chat Completions:
```python
# The old way required managing conversation history manually
messages = [{"role": "user", "content": "Hello"}]
response = client.chat.completions.create(
    model="gpt-4",
    messages=messages  # We had to track all messages ourselves
)
messages.append(response.choices[0].message)  # Manual message tracking
```
Responses API:
```python
# Example 1: Simple web search
# The API handles all the complexity for us
response = client.responses.create(
    model="gpt-4",
    input=[{
        "role": "user",
        "content": "What's the latest news about AI?"
    }],
    tools=[{
        "type": "web_search",  # Tell the API to use web search
        "search_query": "latest AI news"
    }]
)

# Example 2: File search
# Looking for specific documents becomes easy
response = client.responses.create(
    model="gpt-4",
    input=[{
        "role": "user",
        "content": "Find documents about project X"
    }],
    tools=[{
        "type": "file_search",
        "query": "project X",
        "file_types": ["pdf", "docx"]  # Specify which file types to look for
    }]
)
```
Computer-Use Model Setup
For our operator, we'll use the computer-use model. Here's how we set it up:
```python
# Set up the computer-use model to control our VM
response = client.responses.create(
    model="computer-use-preview",  # Special model for computer control
    tools=[{
        "type": "computer_use_preview",
        "display_width": 1024,   # Size of our VM screen
        "display_height": 768,
        "environment": "mac"     # Tell it we're using macOS
    }],
    input=[
        {
            "role": "user",
            "content": [
                # What we want the AI to do
                {"type": "input_text", "text": "Open Safari and go to google.com"},
                # Current screenshot of our VM
                {"type": "input_image", "image_url": f"data:image/png;base64,{screenshot_base64}"}
            ]
        }
    ],
    truncation="auto"  # Let OpenAI handle message length
)
```
Understanding the Response
When we send a request, the API sends back a response that looks like this:
json"output": [ { "type": "reasoning", # The AI explains what it's thinking "id": "rs_67cc...", "summary": [ { "type": "summary_text", "text": "Clicking on the browser address bar." } ] }, { "type": "computer_call", # The actual action to perform "id": "cu_67cc...", "call_id": "call_zw3...", "action": { "type": "click", # What kind of action (click, type, etc.) "button": "left", # Which mouse button to use "x": 156, # Where to click (coordinates) "y": 50 }, "pending_safety_checks": [], # Any safety warnings to consider "status": "completed" # Whether the action was successful } ]
Each response contains:
- Reasoning: The AI's explanation of what it's doing
- Action: The specific computer action to perform
- Safety Checks: Any potential risks to review
- Status: Whether everything worked as planned
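If you prefer inspecting these fields in code, here is a minimal sketch, assuming response is the object returned by client.responses.create in the setup example above (the attribute names mirror the JSON fields shown):

```python
# Minimal sketch: inspect the items returned by the Responses API.
for item in response.output:
    if item.type == "reasoning":
        # The model's explanation of what it intends to do
        print("Reasoning:", item.summary[0].text if item.summary else "")
    elif item.type == "computer_call":
        action = item.action
        print(f"Action: {action.type} (status: {item.status})")
        if item.pending_safety_checks:
            print("Safety checks to review:", item.pending_safety_checks)
```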
CUA-Computer Interface
Architecture Overview
Let's break down the main components of our system and how they work together:
- The Virtual Machine (VM)
  - Think of this as a safe playground for our AI
  - It's a complete macOS system running inside your computer
  - Anything the AI does stays inside this VM, keeping your main system safe
  - We use lume to create and manage this VM
- The Computer Interface (CUI)
  - This is how we control the VM
  - It can move the mouse, type text, and take screenshots
  - Works like a remote control for the VM
  - Built using our cua-computer package
- The OpenAI Model
  - This is the brain of our operator
  - It looks at screenshots of the VM
  - Decides what actions to take
  - Sends back instructions like "click here" or "type this"
Here's how they all work together:
(Sequence diagram: the operator loop between you, the Computer Interface, the macOS VM, and the OpenAI model.)
The diagram above shows how information flows through our system:
- You start the operator
- The Computer Interface creates a macOS virtual machine
- Then it enters a loop:
  - Take a picture of the VM screen
  - Send it to OpenAI with instructions
  - Get back an action to perform
  - Execute that action in the VM
  - Repeat until the task is done
This design keeps everything organized and safe. The AI can only interact with the VM through our controlled interface, and the VM keeps the AI's actions isolated from your main system.
Implementation Guide
Prerequisites
Lume CLI Setup

To install the standalone lume binary, run the following command from a terminal, or download the latest pkg:
```bash
sudo /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/trycua/cua/main/libs/lume/scripts/install.sh)"
```
Once installed, start the lume daemon:
```bash
lume serve
```
Next, pull the pre-built macOS VM image that contains all required dependencies:
```bash
# Pull the latest macOS Sequoia image optimized for CUA
lume pull macos-sequoia-cua:latest --no-cache
```
Important Storage Notes:
- Initial download requires 80GB of free space
- After first run, space usage reduces to ~30GB due to macOS's sparse file system
- VMs are stored in ~/.lume
- Cached images are stored in ~/.lume/cache
Pro Tip: Remove --no-cache to save the image locally for reuse (requires 2x storage temporarily).

You can check your downloaded VM images anytime:
```bash
lume ls
```
Example output:
```
name                       os     cpu  memory  disk           display   status   ip             vnc
macos-sequoia-cua:latest   macOS  12   16.00G  64.5GB/80.0GB  1024x768  running  192.168.64.78  vnc://:<password>@192.168.64.78:56085
```

After checking your available images, you can run the VM to ensure everything is working correctly:
```bash
lume run macos-sequoia-cua:latest
```
Python Environment Setup

Note: The cua-computer package requires Python 3.10 or later. We recommend creating a dedicated Python environment.

Using venv:
```bash
python -m venv cua-env
source cua-env/bin/activate
```
Using conda:
```bash
conda create -n cua-env python=3.10
conda activate cua-env
```
Then install the required packages:
```bash
pip install openai
pip install cua-computer
```
Ensure you have an OpenAI API key (set as an environment variable or in your OpenAI configuration).
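For example, to set the key as an environment variable for the current shell session (OPENAI_API_KEY is the variable the openai package reads by default):

```bash
# Make the key available to the openai package for this shell session
export OPENAI_API_KEY="sk-..."
```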
Building the Operator
Importing Required Modules
With the prerequisites installed and configured, we're ready to build our first operator. The following example uses asynchronous Python (async/await). You can run it either in a VS Code Notebook or as a standalone Python script.
```python
import asyncio
import base64

import openai
from computer import Computer
```
Mapping API Actions to CUA Methods
The following helper function converts a computer_call action from the OpenAI Responses API into the corresponding commands on the CUI interface. For example, if the API requests a click action, we move the cursor and perform a left click in the lume VM sandbox. We use the computer interface to execute the actions.
```python
async def execute_action(computer, action):
    action_type = action.type

    if action_type == "click":
        x = action.x
        y = action.y
        button = action.button
        print(f"Executing click at ({x}, {y}) with button '{button}'")
        await computer.interface.move_cursor(x, y)
        if button == "right":
            await computer.interface.right_click()
        else:
            await computer.interface.left_click()

    elif action_type == "type":
        text = action.text
        print(f"Typing text: {text}")
        await computer.interface.type_text(text)

    elif action_type == "scroll":
        x = action.x
        y = action.y
        scroll_x = action.scroll_x
        scroll_y = action.scroll_y
        print(f"Scrolling at ({x}, {y}) with offsets (scroll_x={scroll_x}, scroll_y={scroll_y})")
        await computer.interface.move_cursor(x, y)
        await computer.interface.scroll(scroll_y)  # Using vertical scroll only

    elif action_type == "keypress":
        keys = action.keys
        for key in keys:
            print(f"Pressing key: {key}")
            # Map common key names to CUA equivalents
            if key.lower() == "enter":
                await computer.interface.press_key("return")
            elif key.lower() == "space":
                await computer.interface.press_key("space")
            else:
                await computer.interface.press_key(key)

    elif action_type == "wait":
        wait_time = action.time
        print(f"Waiting for {wait_time} seconds")
        await asyncio.sleep(wait_time)

    elif action_type == "screenshot":
        print("Taking screenshot")
        # This is handled automatically in the main loop, but we can take an extra one if requested
        screenshot = await computer.interface.screenshot()
        return screenshot

    else:
        print(f"Unrecognized action: {action_type}")
```
Implementing the Computer-Use Loop
This section defines a loop that:
- Initializes the cua-computer instance (connecting to a macOS sandbox).
- Captures a screenshot of the current state.
- Sends the screenshot (with a user prompt) to the OpenAI Responses API using the computer-use-preview model.
- Processes the returned computer_call action and executes it using our helper function.
- Captures an updated screenshot after the action and sends it back to the API as feedback.
- Repeats these steps until no further computer_call actions are returned.
```python
async def cua_openai_loop():
    # Initialize the lume computer instance (macOS sandbox)
    async with Computer(
        display="1024x768",
        memory="4GB",
        cpu="2",
        os="macos"
    ) as computer:
        await computer.run()  # Start the lume VM

        # Capture the initial screenshot
        screenshot = await computer.interface.screenshot()
        screenshot_base64 = base64.b64encode(screenshot).decode('utf-8')

        # Initial request to start the loop
        response = openai.responses.create(
            model="computer-use-preview",
            tools=[{
                "type": "computer_use_preview",
                "display_width": 1024,
                "display_height": 768,
                "environment": "mac"
            }],
            input=[
                {
                    "role": "user",
                    "content": [
                        {"type": "input_text", "text": "Open Safari, download and install Cursor."},
                        {"type": "input_image", "image_url": f"data:image/png;base64,{screenshot_base64}"}
                    ]
                }
            ],
            truncation="auto"
        )

        # Continue the loop until no more computer_call actions
        while True:
            # Check for computer_call actions
            computer_calls = [item for item in response.output if item and item.type == "computer_call"]
            if not computer_calls:
                print("No more computer calls. Loop complete.")
                break

            # Get the first computer call
            call = computer_calls[0]
            last_call_id = call.call_id
            action = call.action
            print("Received action from OpenAI Responses API:", action)

            # Handle any pending safety checks
            if call.pending_safety_checks:
                print("Safety checks pending:", call.pending_safety_checks)
                # In a real implementation, you would want to get user confirmation here
                acknowledged_checks = call.pending_safety_checks
            else:
                acknowledged_checks = []

            # Execute the action
            await execute_action(computer, action)
            await asyncio.sleep(1)  # Allow time for changes to take effect

            # Capture new screenshot after action
            new_screenshot = await computer.interface.screenshot()
            new_screenshot_base64 = base64.b64encode(new_screenshot).decode('utf-8')

            # Send the screenshot back as computer_call_output
            response = openai.responses.create(
                model="computer-use-preview",
                tools=[{
                    "type": "computer_use_preview",
                    "display_width": 1024,
                    "display_height": 768,
                    "environment": "mac"
                }],
                input=[{
                    "type": "computer_call_output",
                    "call_id": last_call_id,
                    "acknowledged_safety_checks": acknowledged_checks,
                    "output": {
                        "type": "input_image",
                        "image_url": f"data:image/png;base64,{new_screenshot_base64}"
                    }
                }],
                truncation="auto"
            )

        # End the session
        await computer.stop()


# Run the loop
if __name__ == "__main__":
    asyncio.run(cua_openai_loop())
```
You can find the full code in our notebook.
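If you run the example inside a notebook (such as the VS Code Notebook mentioned above), an event loop is usually already running, so call the coroutine with await instead of asyncio.run:

```python
# In a notebook cell, where an event loop is already running:
await cua_openai_loop()
```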
Request Handling Differences
The first request to the OpenAI Responses API is special in that it includes the initial screenshot and prompt. Subsequent requests are handled differently, using the computer_call_output type to provide feedback on the executed action.
Initial Request Format
- We use role: "user" with content that contains both input_text (the prompt) and input_image (the screenshot)
Subsequent Request Format
- We use type: "computer_call_output" instead of the user role
- We include the call_id to link the output to the specific previous action that was executed
- We provide any acknowledged_safety_checks that were approved
- We include the new screenshot in the output field
This structured approach allows the API to maintain context and continuity throughout the interaction session.
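To make the difference concrete, here are the two input shapes side by side. This is a schematic sketch based on the requests used earlier in this post; the screenshot values are placeholders.

```python
# Initial request: a user message carrying the prompt and the first screenshot
initial_input = [{
    "role": "user",
    "content": [
        {"type": "input_text", "text": "Open Safari and go to google.com"},
        {"type": "input_image", "image_url": "data:image/png;base64,<screenshot>"}
    ]
}]

# Subsequent requests: feedback for the action the model just asked us to perform
followup_input = [{
    "type": "computer_call_output",
    "call_id": "call_zw3...",          # links this output to the previous computer_call
    "acknowledged_safety_checks": [],  # any safety checks the user approved
    "output": {
        "type": "input_image",
        "image_url": "data:image/png;base64,<new_screenshot>"
    }
}]
```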
Note: For multi-turn conversations, you should include the previous_response_id in your initial requests when starting a new conversation with prior context. However, when using computer_call_output for action feedback, you don't need to explicitly manage the conversation history - OpenAI's API automatically tracks the context using the call_id. The previous_response_id is primarily important when the user provides additional instructions or when starting a new request that should continue from a previous session.
Conclusion
Summary
This blogpost demonstrates a complete OpenAI computer-use loop where:
- A macOS sandbox is controlled using the CUA interface.
- A screenshot and prompt are sent to the OpenAI Responses API.
- The returned action (e.g. a click or type command) is executed via the CUI interface.
In a production setting, you would extend this action-response cycle with robust error handling, explicit user confirmation of safety checks, and retries as needed.
Next Steps
In the next blogpost, we'll introduce our Agent framework which abstracts away all these tedious implementation steps. This framework provides a higher-level API that handles the interaction loop between OpenAI's computer-use model and the macOS sandbox, allowing you to focus on building sophisticated applications rather than managing the low-level details we've explored here. Can't wait? Check out the cua-agent package!