Data - UI-Tars VS GPT-4o in Midscene (Part 2)

Reading Part 1 first will help you understand the context of the data below.

This article is part of a series. In Part 1, we looked at what the UI-Tars LLM is and how Midscene orchestrates it. In this part, I want to dig deeper into comparing UI-Tars and GPT-4o for AI-powered automation testing via Midscene, to identify the differences, pros, and cons.

Since February 2025, Midscene supports gpt-4o, qwen-2.5VL, and ui-tars by default.

1. Comparing GPT-4o and UI-Tars results on the same test cases

We will analyze and compare different kinds of test steps.

1.1 Step 1 - AI Assertion

    await aiWaitFor('The country selection popup is visible')
| | GPT-4o | UI-Tars:7B |
| --- | --- | --- |
| Return format | JSON object | Plain text (containing a JSON string) |
| Steps | 1 | 1 |
| Duration | 3.09s | 2.27s 👍 |
| Cost | $0.0035775 | $0.0011 👍 |
| Tokens | 1390 | 2924 |
| Temperature | 0.1 | 0 |
| Message | Browser screenshot, instructions | Browser screenshot, instructions |
| Result | `{"pass": true, "thought": "The screenshot clearly shows a country selection popup with a list of countries, confirming the assertion."}` | `{"pass": true, "thought": "The country selection popup is visible on the screen, as indicated by the title 'Where do you live?' and the list of countries with their respective flags."}` |

1.2 Step 2 - Simple Action Step

    Select France as the country that I'm living
| | GPT-4o | UI-Tars:7B |
| --- | --- | --- |
| Return format | JSON object | Plain text, formatted |
| Steps | 1 | 2 |
| Duration | 3.22s 👍 | 4.54s (2.54s + 1.99s) |
| Cost | $0.0109475 | $0.0023 👍 |
| Tokens | 4184 | 14987 (5110 + 9877) |
| Temperature | 0.1 | 0 |
| Message | Screenshot, instructions, part of the HTML tree | Screenshot, instructions |
| Result | `{ "actions": [ { "thought": "..", "type": "Tap", "param": null, "locate": { "id": "cngod", "prompt": "..." } } ], "finish": true, "log": "...", "error": null }` | Iterated 2 steps; each call returns `Thought: ... Action: click(...)` |

To click "France" in the country popup, GPT-4o only needs 1 LLM call, because it returns the action and `finish` in a single reply. UI-Tars needs 2 LLM calls because of its reasoning: the first call returns the action, and after the action is executed, the second call sends a new screenshot to check whether the user's instruction has actually been completed.
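
For reference, here is a minimal sketch of how the steps from 1.1 and 1.2 are written in a Midscene + Playwright test. The fixture wiring, import path, and URL are assumptions for illustration and may differ slightly between Midscene versions; as far as I understand Midscene's setup, switching between GPT-4o and UI-Tars is a configuration change rather than a test-code change.

```typescript
// A minimal sketch of a Midscene + Playwright test, assuming Midscene's
// Playwright integration. The import path, fixture names, and URL are
// assumptions for illustration and may differ between Midscene versions.
import { test as base } from '@playwright/test';
import { PlaywrightAiFixture, type PlayWrightAiFixtureType } from '@midscene/web/playwright';

const test = base.extend<PlayWrightAiFixtureType>(PlaywrightAiFixture());

test('country selection', async ({ page, ai, aiWaitFor }) => {
  await page.goto('https://shop.example.com'); // hypothetical URL

  // Step 1.1 - AI assertion: Midscene keeps checking screenshots until the
  // model confirms the statement (or a timeout is reached)
  await aiWaitFor('The country selection popup is visible');

  // Step 1.2 - simple action step, written as a natural-language instruction
  await ai("Select France as the country that I'm living");
});
```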

1.3 Step 3 - Complicated Step

Click Search bar, then Search 'Chanel', and press the Enter
| | GPT-4o | UI-Tars:7B |
| --- | --- | --- |
| Return format | JSON object | Plain text, formatted |
| Steps | 1 | 4 |
| Duration | 3.52s 👍 | 12.16s (4 + 2.77 + 2.36 + 3.03) |
| Cost | $0.01268 | $0.00608 👍 |
| Tokens | 4646 | 49123 |
| Temperature | 0.1 | 0 |
| Message | Screenshot, instructions, part of the HTML tree | Screenshot, instructions |
| Result | Returns 3 actions in one reply: `{ "actions": [ {...}, {...} ], "finish": true, "log": "...", "error": null }` | Iterated 4 steps; each call returns `Thought: ... Action: ...` |

This step is a bit complicated for both models, and both can handle it, but in very different ways. There is no reasoning in GPT-4o's result: it generated all 3 actions and marked `finish` as `true` before any action was executed. UI-Tars, on the other hand, is relatively slow because of its reasoning: it plans a single action, executes it, reflects on the new screenshot, and then decides the next action. In total it also generates 3 actions, plus one final check to verify that the current state meets the user's expectation for the given instruction.
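
To make the difference concrete, here is a simplified sketch of the UI-Tars-style loop described above: plan one action, execute it, take a fresh screenshot, and repeat until the model reports it is finished. This is my own illustration of the behaviour shown in the tables, not Midscene's actual implementation; all names below are hypothetical.

```typescript
// Simplified illustration of the UI-Tars-style loop (NOT Midscene's real code).
// All types and helper functions below are hypothetical.
type UiTarsReply = {
  thought: string;                                  // the "Thought: ..." part of the plain-text reply
  action: { type: string; args?: unknown } | 'finished';
};

declare function callUiTars(instruction: string, screenshot: Buffer): Promise<UiTarsReply>;
declare function executeInBrowser(action: { type: string; args?: unknown }): Promise<void>;
declare function takeScreenshot(): Promise<Buffer>;

async function runInstruction(instruction: string, maxSteps = 15): Promise<void> {
  for (let step = 0; step < maxSteps; step++) {
    // Every iteration is one LLM call with a fresh screenshot.
    const reply = await callUiTars(instruction, await takeScreenshot());

    if (reply.action === 'finished') {
      return; // the extra "final check" call that the GPT-4o flow skips
    }
    await executeInBrowser(reply.action); // e.g. a click, type, or scroll
  }
  throw new Error(`"${instruction}" not finished within ${maxSteps} steps`);
}
```

GPT-4o, by contrast, returns the whole `actions` array together with `finish: true` in a single reply, which is why it needs only one call for this step.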

1.4 Step 4 - Scrolling to an unpredictable position

Scroll down to the 1st product
| | GPT-4o | UI-Tars:7B |
| --- | --- | --- |
| Return format | JSON object | Plain text, formatted |
| Steps | 2 | 2 |
| Duration | 7.58s (3.01 + 4.57) | 5.47s (3.18 + 2.29) 👍 |
| Cost | $0.01268 | $0.002735 👍 |
| Tokens | 12557 | 14994 |
| Temperature | 0.1 | 0 |
| Message | Screenshot, instructions, part of the HTML tree | Screenshot, instructions |
| Result | `{ "actions": [ {...} ], "finish": true, "log": "...", "error": null }` | Iterated 2 steps; each call returns `Thought: ... Action: ...` |

Wow! Because GPT-4o does not reason step by step, scrolling to an unknown position requires 2 GPT-4o calls: the first call generates a scrolling action, and the second call checks the new screenshot and decides that no further scrolling is needed. For UI-Tars this is just its normal flow: the first call decides the action, and the second call validates the new screenshot after the browser action is executed.
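
For completeness, this is how the scroll step reads in the test (same snippet style as the earlier steps); the two LLM calls per model described above all happen behind this single line.

```typescript
// The scroll step from 1.4, inside a test like the sketch in section 1.2.
// Behind this one line, both models ended up making two LLM calls:
// plan a scroll, then inspect the new screenshot to confirm the
// first product is in view.
await ai('Scroll down to the 1st product');
```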

1.5 "Vision" Comparison

UI-Tars and GPT-4o each have their own way of "seeing" the page.

*(The original post shows the two models' vision output side by side as screenshots: GPT-4o vs UI-Tars:7B.)*

GPT-4o is unable to recognize some small or overlapping elements, whereas UI-Tars:7B achieves almost complete recognition. Unfortunately, GPT-4o cannot even identify the "Login" button in the top banner.
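
To make that concrete, an assertion like the one below (hypothetical wording) is where the difference shows up: it only passes if the model actually perceives the small banner element.

```typescript
// Hypothetical assertion targeting a small element in the top banner.
// Based on the observation above, this is the kind of check where
// UI-Tars:7B's perception has an edge over GPT-4o.
await aiWaitFor('A "Login" button is visible in the top banner');
```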

2. Summary

2.1 Ability to handle complicated cases

**GPT-4o:** Because it has no reasoning (no system-2 reasoning), GPT-4o can handle straightforward test steps, but a very general step such as "I login" is roughly beyond it.

**UI-Tars:7B:** Because it has reasoning, it can support up to 15 steps per instruction (14 actions plus 1 final check), which is its reasoning limit.
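
In practice that means the test author compensates for GPT-4o by spelling out each concrete sub-step, while UI-Tars has a realistic chance of iterating through a broader instruction on its own. A sketch (step wording and credentials are made up for illustration):

```typescript
// Illustrative only: step wording and credentials are made up.

// With GPT-4o, a vague step like "I login" usually has to be decomposed by hand:
await ai('Type "demo-user" into the username field');
await ai('Type "demo-password" into the password field');
await ai('Click the "Login" button');

// With UI-Tars:7B, a broader instruction has a chance to work, because the model
// plans an action, checks a fresh screenshot, and continues until it reports
// finished (up to its ~15-step limit):
await ai('Log in with username "demo-user" and password "demo-password"');
```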

2.2 Speed

In most cases, you have to double the number of LLM calls when using UI-Tars compared with GPT-4o, because UI-Tars requires a final check for each user instruction.

GPT-4o, by contrast, usually does not validate the user instruction after it has generated the actions.

So GPT-4o is faster, but that can be dangerous for the reliability of the test result.

2.3 Level of Trust

UI-Tars always checks whether the previously planned action actually achieved the user instruction, so it trades time for accuracy: UI-Tars only marks `finish=true` after it has checked a fresh screenshot following the action, whereas GPT-4o returns `finish=true` even before the generated actions are executed.

2.4 Perception of GUI Element

From section 1.5, we can clearly see that UI-Tars can identify even small or overlapping elements on the page, while GPT-4o fails to do so.

2.5 Additional input for LLM

UI-Tars makes its decisions based on the screenshot alone (mimicking human vision), whereas GPT-4o also requires a partial HTML tree to be built and sent, which can slow GPT-4o down as the size of the HTML grows.
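
To picture the difference, here is a rough sketch of the per-call input for each model, based on the "Message" rows in the tables above. These types are my own illustration, not Midscene's actual internal structures.

```typescript
// Rough shape of the per-call input for each model in these runs.
// These types are an assumption derived from the "Message" rows above,
// not Midscene's real internals.
type Gpt4oCallInput = {
  screenshot: string;   // base64-encoded browser screenshot
  instruction: string;  // the natural-language test step
  htmlTree: string;     // partial HTML/element tree, grows with the page
};

type UiTarsCallInput = {
  screenshot: string;   // screenshot only, mimicking human vision
  instruction: string;
};
```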

2.6 Costs

If we deploy UI-Tars on our own infrastructure, then to achieve the same result as gpt-4o, or an even better one, you can save roughly 50%-75% of the cost. For example, the complicated step in 1.3 cost $0.00608 with UI-Tars versus $0.01268 with GPT-4o, a saving of about 52%, while the simpler steps saved around 70%-80%.

3. Conclusions

Overall, if we plan to apply AI in real day-to-day work, I believe UI-Tars can do a better job than gpt-4o in the context described above.

However, speeding up UI-Tars's reasoning will be one of the challenges in the near future.
