Update README.md
- autonomous-agent
- computer-use
spaces:
- racineai/UI-DETR-1
metrics:
- accuracy
- mean_average_precision
library_name: transformers
pipeline_tag: object-detection
model-index:
- name: UI-DETR-1 RF-DETR-M UI Detection
  results:
  - task:
      type: object-detection
      name: Click Accuracy
---

# UI-DETR-1: RF-DETR-M for Computer Use Agent

## Model Description

**UI-DETR-1** (Computer Use Agent v1) is a fine-tuned RF-DETR-M model optimized for autonomous computer interaction. It serves as the visual perception backbone for our computer use agent, enabling real-time UI element detection and multi-action task automation across diverse graphical interfaces.

[UI-DETR-1 Space](https://huggingface.co/spaces/racineai/UI-DETR-1)

**Key Features:**
- **70.8% accuracy** on the WebClick benchmark (vs 58.8% for OmniParser)
## Methodology Revision Notice

**Important:** This model card presents revised benchmark results following a methodological correction. Our initial evaluation used default YOLO detection parameters and baseline prompts, which do not reflect optimal performance conditions for either model. We subsequently re-evaluated both UI-DETR-1 and OmniParser V2 using their respective optimized detection thresholds (0.35 for UI-DETR-1, 0.05 for OmniParser V2, from the official configurations) and refined prompts for improved task instruction clarity. Both sets of results are presented for transparency; the optimized evaluation better represents real-world deployment scenarios, where parameters and prompts are tuned for specific use cases.

## Training Architecture
Evaluation was performed on **1,639 samples** across three categories using **Gemini Pro 2.5**.

#### Technical Parameters

**Detection Configuration:**
- **UI-DETR-1**:
  - Confidence threshold: 0.35
  - Model: RF-DETR-Medium
- **OmniParser**:
The LLM is prompted to analyze the image and respond with a tool call in the format:

```json
{"name": "click", "parameters": {"id": <box_id>}}
```

Note that the full UI-DETR-1 agent supports multiple actions (`click`, `type`, `scroll`, `press`, `right_click`, `double_click`, etc.), but for benchmark consistency only the `click` action is evaluated. This tests the fundamental capability of correctly identifying and selecting UI elements.

<img src="./img/annotation_bbc.jpg" width="800" alt="BBC News Annotation Example">
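As an illustration, such a tool call can be parsed in a few lines; the `parse_tool_call` helper below is hypothetical, not part of the released agent:

```python
import json

def parse_tool_call(raw: str) -> tuple[str, dict]:
    """Parse an LLM reply of the form {"name": ..., "parameters": {...}}."""
    call = json.loads(raw)
    return call["name"], call["parameters"]

name, params = parse_tool_call('{"name": "click", "parameters": {"id": 17}}')
print(name, params["id"])  # click 17
```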
#### Results

| Metric | UI-DETR-1 (RF-DETR-M) | OmniParser | Improvement |
|--------|-----------------------|------------|-------------|
| **Overall Accuracy** | **70.8%** | 58.8% | **+20%** |
| Agent Browse | 66% | 58% | +14% |
| Calendars | 64% | 46% | **+39%** |
| Human Browse | 83% | 73% | +14% |

*Table 1: Performance comparison between UI-DETR-1 and OmniParser across WebClick benchmark categories (optimized parameters); improvements are relative to the OmniParser score.*

**Methodology Note:**

Initial evaluation used default YOLO detection parameters, yielding an OmniParser accuracy of 40.7%. Following parameter optimization (confidence threshold 0.05, IOU threshold 0.1, from the official deployment configurations) and refined prompts, OmniParser improved to 58.8%. UI-DETR-1 improved from 67.5% to 70.8% solely through enhanced system prompts, maintaining its 0.35 threshold throughout both evaluations.

<img src="./img/overall_old.png" width="700" alt="Initial vs Optimized Performance">
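The Improvement column can be reproduced as the relative gain over the OmniParser score:

```python
# WebClick scores (UI-DETR-1, OmniParser) from Table 1.
results = {
    "Overall Accuracy": (70.8, 58.8),
    "Agent Browse": (66, 58),
    "Calendars": (64, 46),
    "Human Browse": (83, 73),
}
for category, (ui_detr, omniparser) in results.items():
    # Relative gain, rounded to the nearest percent.
    gain = round((ui_detr - omniparser) / omniparser * 100)
    print(f"{category}: +{gain}%")
```

For example, 70.8% vs 58.8% is a 12-point absolute difference but a +20% relative gain.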
- **Calendars**: Date selection interfaces with dense grid layouts of small, similar-looking elements
- **Human Browse**: Real-world web browsing scenarios with diverse UI patterns and complex page structures

UI-DETR-1 shows particularly strong performance on **Calendars** (+39% relative improvement), demonstrating a superior ability to distinguish between densely packed, visually similar UI elements - a critical capability for autonomous agents.

**Detection Statistics:**
- **Average elements detected per image**: 82.3 for UI-DETR-1 vs 50.6 for OmniParser
- **Processing time**: 0.82 s per image for UI-DETR-1 vs 0.77 s for OmniParser
### Visual Performance Comparison

**Examples showing UI-DETR-1 (blue boxes) vs OmniParser (orange boxes) detection capabilities:**

<img src="./img/boxes_calendar.png" width="900" alt="Detection Example 1">

*Figure 5: Calendar date selection interface with dual-month view (April-May 2026).*
*UI-DETR-1 detects 103 interactive elements, including individual calendar dates for both months, navigation arrows, date input fields, and action buttons (Reset, Cancel, Apply), while OmniParser identifies only 47 elements, missing numerous calendar dates and form controls.*

<img src="./img/boxes_musique.png" width="900" alt="Detection Example 2">

*Figure 6: Spotify music streaming platform showing search results for the artist "Gojira".*
*UI-DETR-1 identifies 98 elements, including navigation tabs (Tracks, Albums, Playlists, Artists, Episodes, Profiles), individual track rows with action buttons (play, like, more options), artist information, and media controls, compared to OmniParser's 60 detections, which miss several interactive elements and granular controls.*

**WebClick benchmark click decision examples with Gemini Pro 2.5 (green box: ground truth; blue: UI-DETR-1 selection; orange: OmniParser selection):**

<img src="./img/click_calendar.png" width="600" alt="Click Decision 1">

*Figure 7: Travel booking website with flight search and calendar date picker (April-May 2025).*
***Query:** Click task on calendar interface*
*UI-DETR-1 correctly identifies and clicks the target date element (May 27th) within the dense calendar grid, while OmniParser fails to locate the correct date element.*

<img src="./img/click_hotel.png" width="600" alt="Click Decision 2">

*Figure 8: Booking.com accommodation search with stay duration selector.*
***Query:** Select stay duration option*
*UI-DETR-1 demonstrates superior fine-grained detection by accurately identifying both the "A month" text label and its associated radio button as separate interactive elements, enabling precise selection. OmniParser fails to detect these subtle UI components, missing the granular structure of the duration selector.*

**Benchmark Context:**

The WebClick benchmark evaluates single-click accuracy on web UI tasks. While our agent achieves 70.8% accuracy (compared to 58.8% for OmniParser), it is important to note that UI-DETR-1 is designed for **multi-action sequences** beyond the single-click paradigm:

- **Sequential Actions**: Screenshot before/after each action for context awareness
- **Complex Workflows**: Navigate through multi-step processes autonomously
<video src="https://cdn-uploads.huggingface.co/production/uploads/659826211ec4d9b9a1f2ef3a/V6PTmBzligR4Fo43hZK-5.qt" width="800" controls></video>

*Video 1: Example of the UI-DETR-1 agent performing a multi-step task requiring several sequential actions to achieve the final result.*

## Agent Architecture & Capabilities

### Visual Processing Pipeline

UI-DETR-1 powers a computer use agent with multiple detection modes:
```python
# From agent_cv.py - Core processing loop
async def run_agent(user_query: str):
    # 1. Capture screenshot
    screenshot = capture_screenshot()

    # 2. Process with RF-DETR (UI-DETR-1)
    boxes, annotated, atlas = process_image(screenshot)

    # 3. Multiple methods to communicate detected elements to the LLM
    ...
```
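For readers who want to run something, the loop can be sketched end-to-end with stubbed helpers; `capture_screenshot`, `process_image`, and `ask_llm` below are illustrative stand-ins, not the released `agent_cv.py` implementations:

```python
import asyncio
import json

def capture_screenshot() -> bytes:
    # Stub: a real agent would grab the live screen here.
    return b"fake-png-bytes"

def process_image(screenshot: bytes):
    # Stub: RF-DETR inference returning (boxes, annotated image, id atlas).
    boxes = [{"id": 0, "label": "button", "bbox": (10, 20, 90, 40)}]
    return boxes, screenshot, {b["id"]: b for b in boxes}

async def ask_llm(user_query: str, boxes) -> str:
    # Stub: the LLM picks an element and answers with a tool call.
    return json.dumps({"name": "click", "parameters": {"id": boxes[0]["id"]}})

async def run_agent(user_query: str):
    screenshot = capture_screenshot()
    boxes, annotated, atlas = process_image(screenshot)
    call = json.loads(await ask_llm(user_query, boxes))
    # Resolve the chosen box id back to screen coordinates.
    target = atlas[call["parameters"]["id"]]
    return call["name"], target["bbox"]

action, bbox = asyncio.run(run_agent("Click the submit button"))
print(action, bbox)
```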
## Agent System Architecture

UI-DETR-1 serves as the visual perception foundation for an autonomous computer use agent capable of complex multi-step interactions across any graphical interface.

### Key Differentiators

**Beyond Single-Click Benchmarks:**

While WebClick evaluates single-click accuracy, UI-DETR-1 excels at complex multi-action sequences:

```python
# Example: Multi-step form submission
...
```
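The elided example can be sketched as a sequence of tool calls in the same format the agent emits; the `execute` dispatcher and the element ids below are hypothetical:

```python
# Hypothetical multi-step form submission as an ordered list of tool calls.
steps = [
    {"name": "click", "parameters": {"id": 12}},                    # focus the name field
    {"name": "type", "parameters": {"text": "Ada Lovelace"}},
    {"name": "click", "parameters": {"id": 15}},                    # focus the email field
    {"name": "type", "parameters": {"text": "ada@example.com"}},
    {"name": "click", "parameters": {"id": 21}},                    # press the submit button
]

def execute(step: dict) -> str:
    # Stub dispatcher: a real agent would move the mouse / send keystrokes
    # and take a fresh screenshot after each action.
    return step["name"]

actions = [execute(step) for step in steps]
print(actions)  # ['click', 'type', 'click', 'type', 'click']
```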
ECE, a multi-program, multi-campus, and multi-sector engineering school specializing in …

### Model Citation

```bibtex
@misc{cu1-computer-use-agent-2025,
  author       = {UI-DETR-1 Team},
  title        = {UI-DETR-1: RF-DETR-M for Computer Use Agent},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/UI-DETR-1/rf-detr-computer-use}}
}
```