Data - UI-Tars VS GPT-4o in Midscene (Part 2)

Reading Part 1 first will help you understand the context of the data below.

This article is part of a series. In Part 1, we looked at what the UI-Tars LLM is and how Midscene orchestrates it. In this part, I want to dig deeper into comparing UI-Tars and GPT-4o for AI-powered automation testing via Midscene, to identify the differences, pros, and cons.

Since February 2025, Midscene supports gpt-4o, qwen-2.5VL, and ui-tars by default.

1. Comparing GPT-4o and UI-Tars results on the same test cases

We will analyze and compare different kinds of test steps.

1.1 Step 1 - AI Assertion

    await aiWaitFor('The country selection popup is visible')
| | GPT-4o | UI-Tars:7B |
| --- | --- | --- |
| Return format | JSON object | Plain text (containing a JSON string) |
| Steps | 1 | 1 |
| Duration | 3.09s | 2.27s 👍 |
| Cost | $0.0035775 | $0.0011 👍 |
| Tokens | 1390 | 2924 |
| Temperature | 0.1 | 0 |
| Message | Browser screenshot, instructions | Browser screenshot, instructions |
| Result | `{"pass": true, "thought": "The screenshot clearly shows a country selection popup with a list of countries, confirming the assertion."}` | `{"pass": true, "thought": "The country selection popup is visible on the screen, as indicated by the title 'Where do you live?' and the list of countries with their respective flags."}` |

1.2 Step 2 - Simple Action Step

    Select France as the country that I'm living
| | GPT-4o | UI-Tars:7B |
| --- | --- | --- |
| Return format | JSON object | Plain text, formatted |
| Steps | 1 | 2 |
| Duration | 3.22s 👍 | 4.54s (2.54s + 1.99s) |
| Cost | $0.0109475 | $0.0023 👍 |
| Tokens | 4184 | 14987 (5110 + 9877) |
| Temperature | 0.1 | 0 |
| Message | Screenshot, instructions, part of the HTML tree | Screenshot, instructions |
| Result | `{ "actions": [ { "thought": "..", "type": "Tap", "param": null, "locate": { "id": "cngod", "prompt": "..." } } ], "finish": true, "log": "...", "error": null }` | Iterated 2 steps; each call returns `Thought: ... Action: click(...)` |

To click "France" in the country popup, GPT-4o only needs 1 LLM call, because it returns the action and `finish` in a single reply. UI-Tars needs 2 LLM calls because of its reasoning: the first call returns the action, and after the action is executed, the second call sends a new screenshot to check whether the user's instruction has actually been completed.
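
For reference, here is a minimal sketch of how the steps from 1.1 and 1.2 are written in a Midscene + Playwright test. The fixture wiring, import path, and URL are assumptions for illustration and may differ slightly between Midscene versions; as far as I understand Midscene's setup, switching between GPT-4o and UI-Tars is a configuration change rather than a test-code change.

```typescript
// A minimal sketch of a Midscene + Playwright test, assuming Midscene's
// Playwright integration. The import path, fixture names, and URL are
// assumptions for illustration and may differ between Midscene versions.
import { test as base } from '@playwright/test';
import { PlaywrightAiFixture, type PlayWrightAiFixtureType } from '@midscene/web/playwright';

const test = base.extend<PlayWrightAiFixtureType>(PlaywrightAiFixture());

test('country selection', async ({ page, ai, aiWaitFor }) => {
  await page.goto('https://shop.example.com'); // hypothetical URL

  // Step 1.1 - AI assertion: Midscene keeps checking screenshots until the
  // model confirms the statement (or a timeout is reached)
  await aiWaitFor('The country selection popup is visible');

  // Step 1.2 - simple action step, written as a natural-language instruction
  await ai("Select France as the country that I'm living");
});
```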

1.3 Step 3 - Complicated Step

Click Search bar, then Search 'Chanel', and press the Enter
| | GPT-4o | UI-Tars:7B |
| --- | --- | --- |
| Return format | JSON object | Plain text, formatted |
| Steps | 1 | 4 |
| Duration | 3.52s 👍 | 12.16s (4 + 2.77 + 2.36 + 3.03) |
| Cost | $0.01268 | $0.00608 👍 |
| Tokens | 4646 | 49123 |
| Temperature | 0.1 | 0 |
| Message | Screenshot, instructions, part of the HTML tree | Screenshot, instructions |
| Result | Returns 3 actions in one reply: `{ "actions": [ {...}, {...} ], "finish": true, "log": "...", "error": null }` | Iterated 4 steps; each call returns `Thought: ... Action: ...` |

This step is a bit complicated for both models, and both can handle it, but in very different ways. There is no reasoning in GPT-4o's result: it generated all 3 actions and marked `finish` as `true` before any action was executed. UI-Tars, on the other hand, is relatively slow because of its reasoning: it plans a single action, executes it, reflects on the new screenshot, and then decides the next action. In total it also generates 3 actions, plus one final check to verify that the current state meets the user's expectation for the given instruction.
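
To make the difference concrete, here is a simplified sketch of the UI-Tars-style loop described above: plan one action, execute it, take a fresh screenshot, and repeat until the model reports it is finished. This is my own illustration of the behaviour shown in the tables, not Midscene's actual implementation; all names below are hypothetical.

```typescript
// Simplified illustration of the UI-Tars-style loop (NOT Midscene's real code).
// All types and helper functions below are hypothetical.
type UiTarsReply = {
  thought: string;                                  // the "Thought: ..." part of the plain-text reply
  action: { type: string; args?: unknown } | 'finished';
};

declare function callUiTars(instruction: string, screenshot: Buffer): Promise<UiTarsReply>;
declare function executeInBrowser(action: { type: string; args?: unknown }): Promise<void>;
declare function takeScreenshot(): Promise<Buffer>;

async function runInstruction(instruction: string, maxSteps = 15): Promise<void> {
  for (let step = 0; step < maxSteps; step++) {
    // Every iteration is one LLM call with a fresh screenshot.
    const reply = await callUiTars(instruction, await takeScreenshot());

    if (reply.action === 'finished') {
      return; // the extra "final check" call that the GPT-4o flow skips
    }
    await executeInBrowser(reply.action); // e.g. a click, type, or scroll
  }
  throw new Error(`"${instruction}" not finished within ${maxSteps} steps`);
}
```

GPT-4o, by contrast, returns the whole `actions` array together with `finish: true` in a single reply, which is why it needs only one call for this step.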

1.4 Step 4 - Scrolling to an unpredictable position

Scroll down to the 1st product
| | GPT-4o | UI-Tars:7B |
| --- | --- | --- |
| Return format | JSON object | Plain text, formatted |
| Steps | 2 | 2 |
| Duration | 7.58s (3.01 + 4.57) | 5.47s (3.18 + 2.29) 👍 |
| Cost | $0.01268 | $0.002735 👍 |
| Tokens | 12557 | 14994 |
| Temperature | 0.1 | 0 |
| Message | Screenshot, instructions, part of the HTML tree | Screenshot, instructions |
| Result | `{ "actions": [ {...} ], "finish": true, "log": "...", "error": null }` | Iterated 2 steps; each call returns `Thought: ... Action: ...` |

Wow! Because GPT-4o does not reason step by step, scrolling to an unknown position requires 2 GPT-4o calls: the first call generates a scrolling action, and the second call checks the new screenshot and decides that no further scrolling is needed. For UI-Tars this is just its normal flow: the first call decides the action, and the second call validates the new screenshot after the browser action is executed.
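
For completeness, this is how the scroll step reads in the test (same snippet style as the earlier steps); the two LLM calls per model described above all happen behind this single line.

```typescript
// The scroll step from 1.4, inside a test like the sketch in section 1.2.
// Behind this one line, both models ended up making two LLM calls:
// plan a scroll, then inspect the new screenshot to confirm the
// first product is in view.
await ai('Scroll down to the 1st product');
```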

1.5 "Vision" Comparison

UI-Tars and GPT-4o each have their own way of "seeing" the page.

*(The original post shows the two models' vision output side by side as screenshots: GPT-4o vs UI-Tars:7B.)*

GPT-4o is unable to recognize some small or overlapping elements, whereas UI-Tars:7B achieves almost complete recognition. Unfortunately, GPT-4o cannot even identify the "Login" button in the top banner.
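
To make that concrete, an assertion like the one below (hypothetical wording) is where the difference shows up: it only passes if the model actually perceives the small banner element.

```typescript
// Hypothetical assertion targeting a small element in the top banner.
// Based on the observation above, this is the kind of check where
// UI-Tars:7B's perception has an edge over GPT-4o.
await aiWaitFor('A "Login" button is visible in the top banner');
```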

2. Summary

2.1 Ability to handle complicated cases

**GPT-4o:** Because it has no reasoning (no system-2 reasoning), GPT-4o can handle straightforward test steps, but a very general step such as "I login" is roughly beyond it.

**UI-Tars:7B:** Because it has reasoning, it can support up to 15 steps per instruction (14 actions plus 1 final check), which is its reasoning limit.
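
In practice that means the test author compensates for GPT-4o by spelling out each concrete sub-step, while UI-Tars has a realistic chance of iterating through a broader instruction on its own. A sketch (step wording and credentials are made up for illustration):

```typescript
// Illustrative only: step wording and credentials are made up.

// With GPT-4o, a vague step like "I login" usually has to be decomposed by hand:
await ai('Type "demo-user" into the username field');
await ai('Type "demo-password" into the password field');
await ai('Click the "Login" button');

// With UI-Tars:7B, a broader instruction has a chance to work, because the model
// plans an action, checks a fresh screenshot, and continues until it reports
// finished (up to its ~15-step limit):
await ai('Log in with username "demo-user" and password "demo-password"');
```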

2.2 Speed

In most cases, you have to double the number of LLM calls when using UI-Tars compared with GPT-4o, because UI-Tars requires a final check for each user instruction.

GPT-4o, by contrast, usually does not validate the user instruction after it has generated the actions.

So GPT-4o is faster, but that can be dangerous for the reliability of the test result.

2.3 Level of Trust

UI-Tars always checks whether the previously planned action actually achieved the user instruction, so it trades time for accuracy: UI-Tars only marks `finish=true` after it has checked a fresh screenshot following the action, whereas GPT-4o returns `finish=true` even before the generated actions are executed.

2.4 Perception of GUI Element

From section 1.5, we can clearly see that UI-Tars can identify even small or overlapping elements on the page, while GPT-4o fails to do so.

2.5 Additional input for LLM

UI-Tars makes its decisions based on the screenshot alone (mimicking human vision), whereas GPT-4o also requires a partial HTML tree to be built and sent, which can slow GPT-4o down as the size of the HTML grows.
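
To picture the difference, here is a rough sketch of the per-call input for each model, based on the "Message" rows in the tables above. These types are my own illustration, not Midscene's actual internal structures.

```typescript
// Rough shape of the per-call input for each model in these runs.
// These types are an assumption derived from the "Message" rows above,
// not Midscene's real internals.
type Gpt4oCallInput = {
  screenshot: string;   // base64-encoded browser screenshot
  instruction: string;  // the natural-language test step
  htmlTree: string;     // partial HTML/element tree, grows with the page
};

type UiTarsCallInput = {
  screenshot: string;   // screenshot only, mimicking human vision
  instruction: string;
};
```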

2.6 Costs

If we deploy UI-Tars on our own infrastructure, then to achieve the same result as gpt-4o, or an even better one, you can save roughly 50%-75% of the cost. For example, the complicated step in 1.3 cost $0.00608 with UI-Tars versus $0.01268 with GPT-4o, a saving of about 52%, while the simpler steps saved around 70%-80%.

3. Conclusions

Overall, if we plan to apply AI in real day-to-day work, I believe UI-Tars can do a better job than gpt-4o in the context described above.

However, speeding up UI-Tars's reasoning will be one of the challenges in the near future.
