paulml committed · Commit 0f0dda5 · verified · 1 parent: 81c3a65

Update README.md

Files changed (1): README.md (+29 −29)
README.md CHANGED
@@ -10,7 +10,7 @@ tags:
  - autonomous-agent
  - computer-use
  spaces:
- - racineai/CU-1
+ - racineai/UI-DETR-1
  metrics:
  - accuracy
  - mean_average_precision
@@ -21,7 +21,7 @@ metrics:
  library_name: transformers
  pipeline_tag: object-detection
  model-index:
- - name: CU-1 RF-DETR-M UI Detection
+ - name: UI-DETR-1 RF-DETR-M UI Detection
  results:
  - task:
      type: object-detection
@@ -35,13 +35,13 @@ model-index:
       name: Click Accuracy

  ---
- # CU-1: RF-DETR-M for Computer Use Agent
+ # UI-DETR-1: RF-DETR-M for Computer Use Agent

  ## Model Description

- **CU-1** (Computer Use Agent v1) is a fine-tuned implementation of RF-DETR-M specifically optimized for autonomous computer interaction. This model serves as the visual perception backbone for our computer use agent, enabling real-time UI element detection and multi-action task automation across diverse graphical interfaces.
+ **UI-DETR-1** (Computer Use Agent v1) is a fine-tuned implementation of RF-DETR-M specifically optimized for autonomous computer interaction. This model serves as the visual perception backbone for our computer use agent, enabling real-time UI element detection and multi-action task automation across diverse graphical interfaces.

- [![Open in Spaces](https://huggingface.co/datasets/huggingface/badges/resolve/main/open-in-hf-spaces-sm.svg)](https://huggingface.co/spaces/racineai/CU-1)
+ [![Open in Spaces](https://huggingface.co/datasets/huggingface/badges/resolve/main/open-in-hf-spaces-sm.svg)](https://huggingface.co/spaces/racineai/UI-DETR-1)

  **Key Features:**
  - **70.8% accuracy** on WebClick benchmark (vs 58.8% for OmniParser)
@@ -54,7 +54,7 @@ model-index:

  ## Methodology Revision Notice

- **Important:** This paper presents revised benchmark results following a methodological correction. Our initial evaluation used default YOLO detection parameters and baseline prompts, which do not reflect optimal performance conditions for either model. We subsequently re-evaluated both CU-1 and OmniParser V2 using their respective optimized detection thresholds (0.35 for CU-1, 0.05 for OmniParser V2 from official sources) and refined prompts for improved task instruction clarity. Both sets of results are presented for transparency, with the optimized evaluation better representing real-world deployment scenarios where parameters and prompts are tuned for specific use cases.
+ **Important:** This paper presents revised benchmark results following a methodological correction. Our initial evaluation used default YOLO detection parameters and baseline prompts, which do not reflect optimal performance conditions for either model. We subsequently re-evaluated both UI-DETR-1 and OmniParser V2 using their respective optimized detection thresholds (0.35 for UI-DETR-1, 0.05 for OmniParser V2 from official sources) and refined prompts for improved task instruction clarity. Both sets of results are presented for transparency, with the optimized evaluation better representing real-world deployment scenarios where parameters and prompts are tuned for specific use cases.

  ## Training Architecture

@@ -89,7 +89,7 @@ Evaluation performed on **1,639 samples** across three categories using **Gemini
  #### Technical Parameters

  **Detection Configuration:**
- - **CU-1**:
+ - **UI-DETR-1**:
    - Confidence threshold: 0.35
    - Model: RF-DETR-Medium
  - **OmniParser**:
@@ -110,7 +110,7 @@ The LLM is prompted to analyze the image and respond with a tool call in the for
  {"name": "click", "parameters": {"id": <box_id>}}
  ```

- Note that the full CU-1 agent supports multiple actions (`click`, `type`, `scroll`, `press`, `right_click`, `double_click`, etc.), but for benchmark consistency, only the `click` action is evaluated. This tests the fundamental capability of correctly identifying and selecting UI elements.
+ Note that the full UI-DETR-1 agent supports multiple actions (`click`, `type`, `scroll`, `press`, `right_click`, `double_click`, etc.), but for benchmark consistency, only the `click` action is evaluated. This tests the fundamental capability of correctly identifying and selecting UI elements.

  <img src="./img/annotation_bbc.jpg" width="800" alt="BBC News Annotation Example">

@@ -122,18 +122,18 @@ Note that the full CU-1 agent supports multiple actions (`click`, `type`, `scrol

  #### Results

- | Metric | CU-1 (RF-DETR-M) | OmniParser | Improvement |
+ | Metric | UI-DETR-1 (RF-DETR-M) | OmniParser | Improvement |
  |--------|------------------|------------|-------------|
  | **Overall Accuracy** | **70.8%** | 58.8% | **+20%** |
  | Agent Browse | 66% | 58% | +14% |
  | Calendars | 64% | 46% | **+39%** |
  | Human Browse | 83% | 73% | +14% |

- *Table 1: Performance comparison between CU-1 and OmniParser across WebClick benchmark categories (optimized parameters)*
+ *Table 1: Performance comparison between UI-DETR-1 and OmniParser across WebClick benchmark categories (optimized parameters)*

  **Methodology Note:**

- Initial evaluation used default YOLO detection parameters, yielding OmniParser accuracy of 40.7%. Following parameter optimization (confidence threshold 0.05, IOU threshold 0.1 from official deployment configurations) and refined prompts, OmniParser improved to 58.8%. CU-1 improved from 67.5% to 70.8% solely through enhanced system prompts, maintaining its threshold of 0.35 throughout both evaluations.
+ Initial evaluation used default YOLO detection parameters, yielding OmniParser accuracy of 40.7%. Following parameter optimization (confidence threshold 0.05, IOU threshold 0.1 from official deployment configurations) and refined prompts, OmniParser improved to 58.8%. UI-DETR-1 improved from 67.5% to 70.8% solely through enhanced system prompts, maintaining its threshold of 0.35 throughout both evaluations.

  <img src="./img/overall_old.png" width="700" alt="Initial vs Optimized Performance">

@@ -148,43 +148,43 @@ Initial evaluation used default YOLO detection parameters, yielding OmniParser a
  - **Calendars**: Date selection interfaces with dense grid layouts of small, similar-looking elements
  - **Human Browse**: Real-world web browsing scenarios with diverse UI patterns and complex page structures

- CU-1 shows particularly strong performance on **Calendar** tasks (+39% improvement), demonstrating superior ability to distinguish between densely packed, visually similar UI elements - a critical capability for autonomous agents.
+ UI-DETR-1 shows particularly strong performance on **Calendar** tasks (+39% improvement), demonstrating superior ability to distinguish between densely packed, visually similar UI elements - a critical capability for autonomous agents.

  **Detection Statistics:**
- - **Average elements detected per image**: CU-1 detects 82.3 elements vs OmniParser's 50.6 elements
- - **Processing time**: CU-1 averages 0.82s per image vs OmniParser's 0.77s
+ - **Average elements detected per image**: UI-DETR-1 detects 82.3 elements vs OmniParser's 50.6 elements
+ - **Processing time**: UI-DETR-1 averages 0.82s per image vs OmniParser's 0.77s

  ### Visual Performance Comparison

- ### Examples showing CU-1 (blue boxes) vs OmniParser (orange boxes) detection capabilities:
+ ### Examples showing UI-DETR-1 (blue boxes) vs OmniParser (orange boxes) detection capabilities:

  <img src="./img/boxes_calendar.png" width="900" alt="Detection Example 1">

  *Figure 5: Calendar date selection interface with dual-month view (April-May 2026).*
- *CU-1 detects 103 interactive elements including individual calendar dates for both months, navigation arrows, date input fields, and action buttons (Reset, Cancel, Apply), while OmniParser only identifies 47 elements, missing numerous calendar dates and form controls.*
+ *UI-DETR-1 detects 103 interactive elements including individual calendar dates for both months, navigation arrows, date input fields, and action buttons (Reset, Cancel, Apply), while OmniParser only identifies 47 elements, missing numerous calendar dates and form controls.*

  <img src="./img/boxes_musique.png" width="900" alt="Detection Example 2">

  *Figure 6: Spotify music streaming platform showing search results for artist "Gojira".*
- *CU-1 identifies 98 elements including navigation tabs (Tracks, Albums, Playlists, Artists, Episodes, Profiles), individual track rows with action buttons (play, like, more options), artist information, and media controls, compared to OmniParser's 60 detections that miss several interactive elements and granular controls.*
+ *UI-DETR-1 identifies 98 elements including navigation tabs (Tracks, Albums, Playlists, Artists, Episodes, Profiles), individual track rows with action buttons (play, like, more options), artist information, and media controls, compared to OmniParser's 60 detections that miss several interactive elements and granular controls.*

- ### WebClick benchmark click decision examples with Gemini Pro 2.5 (green box: ground truth, blue: CU-1 selection, orange: OmniParser selection):
+ ### WebClick benchmark click decision examples with Gemini Pro 2.5 (green box: ground truth, blue: UI-DETR-1 selection, orange: OmniParser selection):

  <img src="./img/click_calendar.png" width="600" alt="Click Decision 1">

  *Figure 7: Travel booking website with flight search and calendar date picker (April-May 2025).*
  ***Query:** Click task on calendar interface*
- *CU-1 correctly identifies and clicks the target date element (May 27th) within the dense calendar grid, while OmniParser fails to locate the correct date element.*
+ *UI-DETR-1 correctly identifies and clicks the target date element (May 27th) within the dense calendar grid, while OmniParser fails to locate the correct date element.*

  <img src="./img/click_hotel.png" width="600" alt="Click Decision 2">

  *Figure 8: Booking.com accommodation search with stay duration selector.*
  ***Query:** Select stay duration option*
- *CU-1 demonstrates superior fine-grained detection by accurately identifying both the "A month" text label and its associated radio button as separate interactive elements, enabling precise selection. OmniParser fails to detect these subtle UI components, missing the granular structure of the duration selector interface.*
+ *UI-DETR-1 demonstrates superior fine-grained detection by accurately identifying both the "A month" text label and its associated radio button as separate interactive elements, enabling precise selection. OmniParser fails to detect these subtle UI components, missing the granular structure of the duration selector interface.*

  **Benchmark Context:**

- The WebClick benchmark evaluates single-click accuracy on web UI tasks. While our agent achieves 70.8% accuracy (compared to 58.8% for OmniParser), it's important to note that CU-1 is designed for **multi-action sequences** beyond the single-click paradigm:
+ The WebClick benchmark evaluates single-click accuracy on web UI tasks. While our agent achieves 70.8% accuracy (compared to 58.8% for OmniParser), it's important to note that UI-DETR-1 is designed for **multi-action sequences** beyond the single-click paradigm:

  - **Sequential Actions**: Screenshot before/after each action for context awareness
  - **Complex Workflows**: Navigate through multi-step processes autonomously
@@ -192,13 +192,13 @@ The WebClick benchmark evaluates single-click accuracy on web UI tasks. While ou

  <video src="https://cdn-uploads.huggingface.co/production/uploads/659826211ec4d9b9a1f2ef3a/V6PTmBzligR4Fo43hZK-5.qt" width="800" controls alt="Demo Agent Execution"></video>

- *Video 1: Example of CU-1 agent performing a multi-step task requiring several sequential actions to achieve the final result.*
+ *Video 1: Example of UI-DETR-1 agent performing a multi-step task requiring several sequential actions to achieve the final result.*

  ## Agent Architecture & Capabilities

  ### Visual Processing Pipeline

- CU-1 powers a sophisticated computer use agent with multiple detection modes:
+ UI-DETR-1 powers a sophisticated computer use agent with multiple detection modes:

  ```python
  # From agent_cv.py - Core processing loop
@@ -206,7 +206,7 @@ async def run_agent(user_query: str):
      # 1. Capture screenshot
      screenshot = capture_screenshot()

-     # 2. Process with RF-DETR (CU-1)
+     # 2. Process with RF-DETR (UI-DETR-1)
      boxes, annotated, atlas = process_image(screenshot)

      # 3. Multiple methods to communicate detected elements to the LLM
@@ -218,13 +218,13 @@ async def run_agent(user_query: str):

  ## Agent System Architecture

- CU-1 serves as the visual perception foundation for an autonomous computer use agent capable of complex multi-step interactions across any graphical interface.
+ UI-DETR-1 serves as the visual perception foundation for an autonomous computer use agent capable of complex multi-step interactions across any graphical interface.

  ### Key Differentiators

  **Beyond Single-Click Benchmarks:**

- While WebClick evaluates single-click accuracy, CU-1 excels at complex multi-action sequences:
+ While WebClick evaluates single-click accuracy, UI-DETR-1 excels at complex multi-action sequences:

  ```python
  # Example: Multi-step form submission
@@ -312,12 +312,12 @@ ECE, a multi-program, multi-campus, and multi-sector engineering school speciali
  ### Model Citation
  ```bibtex
  @misc{cu1-computer-use-agent-2025,
- author = {CU-1 Team},
- title = {CU-1: RF-DETR-M for Computer Use Agent},
+ author = {UI-DETR-1 Team},
+ title = {UI-DETR-1: RF-DETR-M for Computer Use Agent},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
- howpublished = {\url{https://huggingface.co/CU-1/rf-detr-computer-use}}
+ howpublished = {\url{https://huggingface.co/UI-DETR-1/rf-detr-computer-use}}
  }
  ```
 
 
10
  - autonomous-agent
11
  - computer-use
12
  spaces:
13
+ - racineai/UI-DETR-1
14
  metrics:
15
  - accuracy
16
  - mean_average_precision
 
21
  library_name: transformers
22
  pipeline_tag: object-detection
23
  model-index:
24
+ - name: UI-DETR-1 RF-DETR-M UI Detection
25
  results:
26
  - task:
27
  type: object-detection
 
35
  name: Click Accuracy
36
 
37
  ---
38
+ # UI-DETR-1: RF-DETR-M for Computer Use Agent
39
 
40
  ## Model Description
41
 
42
+ **UI-DETR-1** (Computer Use Agent v1) is a fine-tuned implementation of RF-DETR-M specifically optimized for autonomous computer interaction. This model serves as the visual perception backbone for our computer use agent, enabling real-time UI element detection and multi-action task automation across diverse graphical interfaces.
43
 
44
+ [![Open in Spaces](https://huggingface.co/datasets/huggingface/badges/resolve/main/open-in-hf-spaces-sm.svg)](https://huggingface.co/spaces/racineai/UI-DETR-1)
45
 
46
  **Key Features:**
47
  - **70.8% accuracy** on WebClick benchmark (vs 58.8% for OmniParser)
 
54
 
55
  ## Methodology Revision Notice
56
 
57
+ **Important:** This paper presents revised benchmark results following a methodological correction. Our initial evaluation used default YOLO detection parameters and baseline prompts, which do not reflect optimal performance conditions for either model. We subsequently re-evaluated both UI-DETR-1 and OmniParser V2 using their respective optimized detection thresholds (0.35 for UI-DETR-1, 0.05 for OmniParser V2 from official sources) and refined prompts for improved task instruction clarity. Both sets of results are presented for transparency, with the optimized evaluation better representing real-world deployment scenarios where parameters and prompts are tuned for specific use cases.
58
 
59
  ## Training Architecture
60
 
 
89
  #### Technical Parameters
90
 
91
  **Detection Configuration:**
92
+ - **UI-DETR-1**:
93
  - Confidence threshold: 0.35
94
  - Model: RF-DETR-Medium
95
  - **OmniParser**:
 
110
  {"name": "click", "parameters": {"id": <box_id>}}
111
  ```
112
 
113
+ Note that the full UI-DETR-1 agent supports multiple actions (`click`, `type`, `scroll`, `press`, `right_click`, `double_click`, etc.), but for benchmark consistency, only the `click` action is evaluated. This tests the fundamental capability of correctly identifying and selecting UI elements.
114
 
115
  <img src="./img/annotation_bbc.jpg" width="800" alt="BBC News Annotation Example">
116
 
 
122
 
123
  #### Results
124
 
125
+ | Metric | UI-DETR-1 (RF-DETR-M) | OmniParser | Improvement |
126
  |--------|------------------|------------|-------------|
127
  | **Overall Accuracy** | **70.8%** | 58.8% | **+20%** |
128
  | Agent Browse | 66% | 58% | +14% |
129
  | Calendars | 64% | 46% | **+39%** |
130
  | Human Browse | 83% | 73% | +14% |
131
 
132
+ *Table 1: Performance comparison between UI-DETR-1 and OmniParser across WebClick benchmark categories (optimized parameters)*
133
 
134
  **Methodology Note:**
135
 
136
+ Initial evaluation used default YOLO detection parameters, yielding OmniParser accuracy of 40.7%. Following parameter optimization (confidence threshold 0.05, IOU threshold 0.1 from official deployment configurations) and refined prompts, OmniParser improved to 58.8%. UI-DETR-1 improved from 67.5% to 70.8% solely through enhanced system prompts, maintaining its threshold of 0.35 throughout both evaluations.
137
 
138
  <img src="./img/overall_old.png" width="700" alt="Initial vs Optimized Performance">
139
 
 
148
  - **Calendars**: Date selection interfaces with dense grid layouts of small, similar-looking elements
149
  - **Human Browse**: Real-world web browsing scenarios with diverse UI patterns and complex page structures
150
 
151
+ UI-DETR-1 shows particularly strong performance on **Calendar** tasks (+39% improvement), demonstrating superior ability to distinguish between densely packed, visually similar UI elements - a critical capability for autonomous agents.
152
 
153
  **Detection Statistics:**
154
+ - **Average elements detected per image**: UI-DETR-1 detects 82.3 elements vs OmniParser's 50.6 elements
155
+ - **Processing time**: UI-DETR-1 averages 0.82s per image vs OmniParser's 0.77s
156
 
157
  ### Visual Performance Comparison
158
 
159
+ ### Examples showing UI-DETR-1 (blue boxes) vs OmniParser (orange boxes) detection capabilities:
160
 
161
  <img src="./img/boxes_calendar.png" width="900" alt="Detection Example 1">
162
 
163
  *Figure 5: Calendar date selection interface with dual-month view (April-May 2026).*
164
+ *UI-DETR-1 detects 103 interactive elements including individual calendar dates for both months, navigation arrows, date input fields, and action buttons (Reset, Cancel, Apply), while OmniParser only identifies 47 elements, missing numerous calendar dates and form controls.*
165
 
166
  <img src="./img/boxes_musique.png" width="900" alt="Detection Example 2">
167
 
168
  *Figure 6: Spotify music streaming platform showing search results for artist "Gojira".*
169
+ *UI-DETR-1 identifies 98 elements including navigation tabs (Tracks, Albums, Playlists, Artists, Episodes, Profiles), individual track rows with action buttons (play, like, more options), artist information, and media controls, compared to OmniParser's 60 detections that miss several interactive elements and granular controls.*
170
 
171
+ ### WebClick benchmark click decision examples with Gemini Pro 2.5 (green box: ground truth, blue: UI-DETR-1 selection, orange: OmniParser selection):
172
 
173
  <img src="./img/click_calendar.png" width="600" alt="Click Decision 1">
174
 
175
  *Figure 7: Travel booking website with flight search and calendar date picker (April-May 2025).*
176
  ***Query:** Click task on calendar interface*
177
+ *UI-DETR-1 correctly identifies and clicks the target date element (May 27th) within the dense calendar grid, while OmniParser fails to locate the correct date element.*
178
 
179
  <img src="./img/click_hotel.png" width="600" alt="Click Decision 2">
180
 
181
  *Figure 8: Booking.com accommodation search with stay duration selector.*
182
  ***Query:** Select stay duration option*
183
+ *UI-DETR-1 demonstrates superior fine-grained detection by accurately identifying both the "A month" text label and its associated radio button as separate interactive elements, enabling precise selection. OmniParser fails to detect these subtle UI components, missing the granular structure of the duration selector interface.*
184
 
185
  **Benchmark Context:**
186
 
187
+ The WebClick benchmark evaluates single-click accuracy on web UI tasks. While our agent achieves 70.8% accuracy (compared to 58.8% for OmniParser), it's important to note that UI-DETR-1 is designed for **multi-action sequences** beyond the single-click paradigm:
188
 
189
  - **Sequential Actions**: Screenshot before/after each action for context awareness
190
  - **Complex Workflows**: Navigate through multi-step processes autonomously
 
192
 
193
  <video src="https://cdn-uploads.huggingface.co/production/uploads/659826211ec4d9b9a1f2ef3a/V6PTmBzligR4Fo43hZK-5.qt" width="800" controls alt="Demo Agent Execution"></video>
194
 
195
+ *Video 1: Example of UI-DETR-1 agent performing a multi-step task requiring several sequential actions to achieve the final result.*
196
 
197
  ## Agent Architecture & Capabilities
198
 
199
  ### Visual Processing Pipeline
200
 
201
+ UI-DETR-1 powers a sophisticated computer use agent with multiple detection modes:
202
 
203
  ```python
204
  # From agent_cv.py - Core processing loop
 
206
  # 1. Capture screenshot
207
  screenshot = capture_screenshot()
208
 
209
+ # 2. Process with RF-DETR (UI-DETR-1)
210
  boxes, annotated, atlas = process_image(screenshot)
211
 
212
  # 3. Multiple methods to communicate detected elements to the LLM
 
218
 
219
  ## Agent System Architecture
220
 
221
+ UI-DETR-1 serves as the visual perception foundation for an autonomous computer use agent capable of complex multi-step interactions across any graphical interface.
222
 
223
  ### Key Differentiators
224
 
225
  **Beyond Single-Click Benchmarks:**
226
 
227
+ While WebClick evaluates single-click accuracy, UI-DETR-1 excels at complex multi-action sequences:
228
 
229
  ```python
230
  # Example: Multi-step form submission
 
312
  ### Model Citation
313
  ```bibtex
314
  @misc{cu1-computer-use-agent-2025,
315
+ author = {UI-DETR-1 Team},
316
+ title = {UI-DETR-1: RF-DETR-M for Computer Use Agent},
317
  year = {2025},
318
  publisher = {Hugging Face},
319
  journal = {Hugging Face Model Hub},
320
+ howpublished = {\url{https://huggingface.co/UI-DETR-1/rf-detr-computer-use}}
321
  }
322
  ```
323
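The benchmark harness described in the README expects the LLM to reply with a single `click` tool call of the form `{"name": "click", "parameters": {"id": <box_id>}}`. A minimal sketch of how such a reply could be parsed and mapped to a click coordinate; the `boxes` mapping and both helper functions are illustrative assumptions, not code from the repository:

```python
import json

def parse_tool_call(raw: str) -> int:
    """Parse a model reply of the form
    {"name": "click", "parameters": {"id": <box_id>}}
    and return the selected box id."""
    call = json.loads(raw)
    if call.get("name") != "click":
        raise ValueError(f"unsupported action: {call.get('name')!r}")
    box_id = call["parameters"]["id"]
    if not isinstance(box_id, int):
        raise ValueError("box id must be an integer")
    return box_id

def click_center(boxes: dict[int, tuple[int, int, int, int]], raw: str) -> tuple[int, int]:
    # Map the chosen box id to the pixel the agent would click (box center).
    x1, y1, x2, y2 = boxes[parse_tool_call(raw)]
    return ((x1 + x2) // 2, (y1 + y2) // 2)

# Example: two detected boxes (x1, y1, x2, y2); the model picks id 1.
boxes = {0: (10, 10, 50, 30), 1: (100, 40, 180, 80)}
print(click_center(boxes, '{"name": "click", "parameters": {"id": 1}}'))  # (140, 60)
```

A click is then scored correct when this center point falls inside the benchmark's ground-truth region, which is how a detector that proposes more (and tighter) boxes can raise click accuracy.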