Set-of-Mark

A corresponding Python example is available for this documentation.

Overview

The SOM (Set-of-Mark) library provides visual element detection and interaction capabilities for UI images. It is based on the Set-of-Mark research paper and the OmniParser model.

API Documentation

OmniParser Class

from PIL import Image

class OmniParser:
    def __init__(self, device: str = "auto"):
        """Initialize the parser with automatic device detection"""

    def parse(
        self,
        image: Image.Image,           # Input image to analyze
        box_threshold: float = 0.3,   # Detection box threshold
        iou_threshold: float = 0.1,   # IoU (overlap) threshold
        use_ocr: bool = True,         # Whether to run OCR
        ocr_engine: str = "easyocr"   # OCR engine to use
    ) -> "ParseResult":
        """Parse UI elements from an image"""

ParseResult Object

from dataclasses import dataclass
from typing import List

from PIL import Image

@dataclass
class ParseResult:
    elements: List["UIElement"]        # Detected elements
    visualized_image: Image.Image      # Annotated image
    processing_time: float             # Time in seconds

    def to_dict(self) -> dict:
        """Convert to JSON-serializable dictionary"""

    def filter_by_type(self, elem_type: str) -> List["UIElement"]:
        """Filter elements by type ('icon' or 'text')"""

UIElement

from typing import Literal, Optional

from pydantic import BaseModel, Field

class UIElement(BaseModel):
    id: Optional[int] = Field(None)              # Element ID (1-indexed)
    type: Literal["icon", "text"]                # Element type
    bbox: BoundingBox                            # Bounding box coordinates { x1, y1, x2, y2 }
    interactivity: bool = Field(default=False)   # Whether the element is interactive
    confidence: float = Field(default=1.0)       # Detection confidence
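A common use of these fields is choosing where to click. The sketch below uses stand-in dataclasses (`Box` and `Element`, both hypothetical) that mirror the documented fields, so it runs without the library or pydantic. It makes no assumption about whether coordinates are pixels or normalized values; the center is computed in whatever units the box uses. The `center` and `click_targets` helpers and the `min_confidence` cutoff are illustrative, not part of the library.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Box:
    """Hypothetical stand-in for BoundingBox."""
    x1: float
    y1: float
    x2: float
    y2: float


@dataclass
class Element:
    """Hypothetical stand-in mirroring the documented UIElement fields."""
    id: Optional[int]
    type: str
    bbox: Box
    interactivity: bool = False
    confidence: float = 1.0


def center(box: Box) -> Tuple[float, float]:
    """Midpoint of the box, in the same units as its coordinates."""
    return ((box.x1 + box.x2) / 2, (box.y1 + box.y2) / 2)


def click_targets(
    elements: List[Element], min_confidence: float = 0.5
) -> List[Tuple[Optional[int], Tuple[float, float]]]:
    """Return (id, center) for interactive elements at or above the cutoff."""
    return [
        (e.id, center(e.bbox))
        for e in elements
        if e.interactivity and e.confidence >= min_confidence
    ]


elements = [
    Element(1, "icon", Box(10, 20, 30, 60), interactivity=True, confidence=0.9),
    Element(2, "text", Box(0, 0, 100, 20)),  # not interactive by default
    Element(3, "icon", Box(40, 40, 60, 60), interactivity=True, confidence=0.3),
]
targets = click_targets(elements)
```

Filtering on both `interactivity` and `confidence` keeps low-confidence detections from becoming click targets. The right cutoff depends on the application.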
