Articles in this series:
- Part 1 - Practical Applications of AI in Test Automation — Context, Demo with UI-Tars LLM & Midscene
- Part 2 - Data: UI-Tars VS GPT-4o in Midscene
- Part 3 - Stage Conclusion: UI-Tars + RAG = E2E/Exploratory AI Test Agent
Reading Part 1 first will help with the context for the data that follows.

This is a documentation series. In Part 1, we got a general understanding of what the UI-Tars LLM is and how Midscene orchestrates it. In this article, I mainly want to delve deeper into comparing UI-Tars and GPT-4o in AI-powered automation testing via Midscene, to identify the differences, pros, and cons.

By default, Midscene has supported `gpt-4o`, `qwen-2.5VL`, and `ui-tars` since Feb 2025.
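For reference, the test steps below were driven through Midscene's agent API. Here is a minimal sketch of such a harness, assuming the `@midscene/web` Puppeteer integration; the environment variable names for model selection are from the Midscene docs as of early 2025 and may change, so treat them as assumptions:

```typescript
// Minimal sketch of a Midscene harness, assuming the @midscene/web
// Puppeteer integration. Model selection (gpt-4o vs ui-tars) happens via
// environment variables such as MIDSCENE_MODEL_NAME; double-check the
// Midscene docs for the exact, current names.
import puppeteer from 'puppeteer';
import { PuppeteerAgent } from '@midscene/web/puppeteer';

async function main() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com'); // hypothetical app under test

  const agent = new PuppeteerAgent(page);
  // The two kinds of steps compared below:
  await agent.aiWaitFor('The country selection popup is visible');        // assertion
  await agent.aiAction("Select France as the country that I'm living");  // action

  await browser.close();
}

main().catch(console.error);
```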
1. Comparing GPT-4o and UI-Tars results on the same test case
We will analyze and compare different kinds of test steps.
1.1 Step 1 - AI Assertion
`await aiWaitFor('The country selection popup is visible')`
| | GPT-4o | UI-Tars:7B |
|---|---|---|
| return_format | JSON object | Plain text, but a JSON string |
| Steps | 1 | 1 |
| duration | 3.09s | 2.27s 👍 |
| cost | $0.0035775 | $0.0011 👍 |
| tokens | 1390 | 2924 |
| temperature | 0.1 | 0 |
| Message | Browser screenshot, instructions | Browser screenshot, instructions |
| result | `{"pass": true, "thought": "The screenshot clearly shows a country selection popup with a list of countries, confirming the assertion."}` | `{"pass": true, "thought": "The country selection popup is visible on the screen, as indicated by the title 'Where do you live?' and the list of countries with their respective flags."}` |
1.2 Step 2 - Simple Action Step
`Select France as the country that I'm living`
| | GPT-4o | UI-Tars:7B |
|---|---|---|
| return_format | JSON object | Plain text, formatted |
| Steps | 1 | 2 |
| duration | 3.22s 👍 | 4.54s (2.54s + 1.99s) |
| cost | $0.0109475 | $0.0023 👍 |
| tokens | 4184 | 14987 (5110 + 9877) |
| temperature | 0.1 | 0 |
| Message | Screenshot, instructions, part of HTML tree | Screenshot, instructions |
| result | `{ "actions": [ { "thought": "..", "type": "Tap", "param": null, "locate": { "id": "cngod", "prompt": "..." } } ], "finish": true, "log": "...", "error": null }` | Iterated 2 steps; each reply is plain text in the form `Thought: ... Action: click(...)` |
To click "France" in the country popup, GPT-4o only requires 1 LLM call due to GPT-4o returning the
action
andfinish
in one reply. But UI-Tars requires 2 LLM calls based on its reasoning, the first call returns the action, and then after theaction
is executed, the 2nd LLM call sends a screenshot again to *check whether the "user's instruction" is finished. *
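This two-call pattern boils down to a loop. A hypothetical sketch (these helpers stand in for Midscene's real screenshot/LLM/driver plumbing, not its actual internals):

```typescript
// Hypothetical sketch of the UI-Tars two-call pattern described above.
type UiTarsReply =
  | { kind: 'action'; action: string } // e.g. a click on the located element
  | { kind: 'finished' };

// Placeholders for the real screenshot/LLM/browser-driver plumbing.
declare function screenshot(): Promise<Uint8Array>;
declare function callUiTars(img: Uint8Array, instruction: string): Promise<UiTarsReply>;
declare function executeBrowserAction(action: string): Promise<void>;

async function runUiTarsStep(instruction: string): Promise<void> {
  for (;;) {
    // Every iteration sends a fresh screenshot, so the model verifies the
    // previous action's effect before declaring the instruction finished.
    const reply = await callUiTars(await screenshot(), instruction);
    if (reply.kind === 'finished') return; // the extra, final LLM call
    await executeBrowserAction(reply.action);
  }
}
```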
1.3 Step 3 - Complicated Step
`Click Search bar, then Search 'Chanel', and press the Enter`
| | GPT-4o | UI-Tars:7B |
|---|---|---|
| return_format | JSON object | Plain text, formatted |
| Steps | 1 | 4 |
| duration | 3.52s 👍 | 12.16s (4 + 2.77 + 2.36 + 3.03) |
| cost | $0.01268 | $0.00608 👍 |
| tokens | 4646 | 49123 |
| temperature | 0.1 | 0 |
| Message | Screenshot, instructions, part of HTML tree | Screenshot, instructions |
| result | Returns 3 actions in one reply: `{ "actions": [ {...}, {...} ], "finish": true, "log": "...", "error": null }` | Iterated 4 steps; each reply is plain text in the form `Thought: ... Action: click(...)` |
This step is a bit complicated for both models. Both can handle it, but with a big difference. There is no reasoning in GPT-4o's result: it generated 3 `actions` and marked `finish` as true before the first action even started. UI-Tars, by contrast, is relatively slow due to its reasoning: it performs a single action, reflects on it, and then decides the next action. In total it also generates 3 `actions`, plus one `check` at the end to verify that the current state meets the user's expectation for the given "user's instruction".
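For reference, here is the approximate shape of GPT-4o's single reply, reconstructed from the result cells above; the field names match the captured JSON, while the types are simplified guesses:

```typescript
// Approximate shape of GPT-4o's one-shot plan, reconstructed from the
// result cells above. Field names match the captured JSON; types are
// simplified guesses.
interface PlannedAction {
  thought: string;
  type: string;                                  // e.g. "Tap"
  param: unknown;                                // action argument, if any
  locate: { id: string; prompt: string } | null; // element reference
}

interface Gpt4oPlan {
  actions: PlannedAction[]; // all actions planned up front, in one reply
  finish: boolean;          // already true before any action has executed
  log: string;
  error: string | null;
}
```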
1.4 Step 4 - Scrolling to an Unpredictable Position
`Scroll down to the 1st product`
| | GPT-4o | UI-Tars:7B |
|---|---|---|
| return_format | JSON object | Plain text, formatted |
| Steps | 2 | 2 |
| duration | 7.58s (3.01 + 4.57) | 5.47s (3.18 + 2.29) 👍 |
| cost | $0.01268 | $0.002735 👍 |
| tokens | 12557 | 14994 |
| temperature | 0.1 | 0 |
| Message | Screenshot, instructions, part of HTML tree | Screenshot, instructions |
| result | `{ "actions": [ {...} ], "finish": true, "log": "...", "error": null }` | Iterated 2 steps; each reply is plain text in the form `Thought: ... Action: click(...)` |
Wow! Because there is no reasoning in the GPT-4o usage, scrolling to an unknown position requires 2 GPT-4o calls: the 1st call generates a `scrolling action`, and the 2nd call `checks` the new screenshot and decides that no more scrolling is needed.

For UI-Tars, this is simply its normal mode of operation: the 1st call decides the action, and the 2nd call validates the new screenshot after the browser action is executed.
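In other words, when the target's position is unknown, even a plan-everything-up-front model is forced into a plan/verify cycle. A rough sketch with hypothetical helpers (the real orchestration lives inside Midscene):

```typescript
// Rough sketch of the plan/verify cycle with hypothetical helpers.
declare function screenshot(): Promise<Uint8Array>;
declare function plan(img: Uint8Array, goal: string): Promise<{ actions: string[] }>;
declare function execute(action: string): Promise<void>;

async function scrollToTarget(goal: string): Promise<void> {
  for (;;) {
    // One LLM call per pass: the 1st pass emits a scroll action; the 2nd
    // sees the target in the fresh screenshot and emits nothing more.
    const { actions } = await plan(await screenshot(), goal);
    if (actions.length === 0) return;
    for (const a of actions) await execute(a);
  }
}
```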
1.5 "Vision" Comparison
UI-Tars and GPT-4o each have their own visual perception of the page.
[Screenshots: element-recognition results on the same page, GPT-4o vs UI-Tars:7B]
GPT-4o is unable to recognize some small or overlapping elements, whereas UI-Tars:7B achieves almost complete recognition. Unfortunately, `gpt-4o` cannot even identify the "login" button in the top banner.
2. Summary
2.1 Ability to handle complicated cases
**GPT-4o:** Because it has no reasoning (no system-2 reasoning), `gpt-4o` can handle some straightforward test steps, but for very general steps such as `I login`, it roughly cannot handle the case at all.

**UI-Tars:7B:** Because it has reasoning, it can support up to 15 steps, i.e. 14 actions + 1 final check, which is its upper limit for reasoning.
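If you drive such a loop yourself, it is worth enforcing that budget explicitly. A small sketch, assuming a hypothetical `nextStep` helper (the ~15-step limit is UI-Tars's, not a Midscene setting):

```typescript
// Guarding the reasoning loop with the ~15-step budget mentioned above.
declare function nextStep(goal: string): Promise<'acted' | 'finished'>;

const MAX_STEPS = 15; // 14 actions + 1 final check

async function runWithBudget(goal: string): Promise<void> {
  for (let step = 0; step < MAX_STEPS; step++) {
    if ((await nextStep(goal)) === 'finished') return;
  }
  throw new Error(`"${goal}" did not finish within ${MAX_STEPS} steps`);
}
```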
2.2 Speed
In most cases, you have to double the LLM calls when using `UI-Tars` compared with `GPT-4o`, because `UI-Tars` requires a final check for each "user instruction". `gpt-4o`, in most cases, doesn't validate the "user instruction" after it generates the actions. So `gpt-4o` is faster, but that can be dangerous for the test result.
2.3 Level of Trust
`UI-Tars` always checks whether the previously planned action achieved the "user instruction", so `UI-Tars` trades time for accuracy: it only marks `finish=true` after re-checking the screenshot that follows the action. `gpt-4o`, however, directly returns `finish=true` even before the generated actions are executed...
2.4 Perception of GUI Elements
From section 1.5, we can clearly see that `UI-Tars` identifies even small and overlapping elements on the page, while `gpt-4o` fails to do so.
2.5 Additional input for LLM
`UI-Tars` makes its decisions based only on the screenshot (mimicking human vision), whereas `gpt-4o` also requires a partial HTML tree to be built, which may slow `gpt-4o` down as the size of the HTML grows.
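The per-call inputs, as inferred from the "Message" rows in the tables above (not Midscene's exact prompt format), differ only in that extra DOM payload:

```typescript
// Rough shape of the per-call inputs, inferred from the "Message" rows
// in the tables above; not Midscene's exact prompt format.
interface UiTarsInput {
  screenshot: Uint8Array; // the only page context UI-Tars sees
  instruction: string;
}

interface Gpt4oInput extends UiTarsInput {
  htmlTree: string; // partial DOM serialization; grows with page complexity,
                    // inflating tokens and latency
}
```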
2.6 Costs
If we deploy `UI-Tars` on our own infrastructure, then to achieve the same result as `gpt-4o`, or an even better one, we can save roughly 50% to 75% of the cost.
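That range follows directly from the per-step costs measured above:

```typescript
// Savings implied by the measured per-step costs above (USD per step).
const costs = [
  { step: 'assertion',     gpt4o: 0.0035775, uiTars: 0.0011   },
  { step: 'simple action', gpt4o: 0.0109475, uiTars: 0.0023   },
  { step: 'complicated',   gpt4o: 0.01268,   uiTars: 0.00608  },
  { step: 'scrolling',     gpt4o: 0.01268,   uiTars: 0.002735 },
];

for (const { step, gpt4o, uiTars } of costs) {
  const saved = ((1 - uiTars / gpt4o) * 100).toFixed(0);
  console.log(`${step}: ~${saved}% cheaper`); // roughly 52% to 79%
}
```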
3. Conclusions
Overall, if we plan to apply AI in real day-to-day work, I believe `UI-Tars` can do a better job than `gpt-4o` in the context above.

However, speeding up `UI-Tars`'s reasoning will be one of the key challenges in the near future.