In this article, I will discuss the practical application of large language models (LLMs) in combination with traditional automation tools like Python/Selenium to improve test reliability.
The article consists of the following sections:
- What are self-healing tests?
- Hardware configuration
- Software configuration
- Verifying the local LLM API
- Integrating with tests
- Limitations
- Future prospects
What are self-healing tests?
Automated tests must be reliable: reliable tests increase trust in their results and allow automated testing to be integrated more deeply into development processes. Reliability can be improved by addressing the key issues that arise during test execution, all of which lead to false positives:
- Changing properties of elements in the tested application.
- Unstable infrastructure.
- The automation tool running faster than the application under test can respond.
Based on these common issues, we define self-healing tests as tests that can automatically adapt their behavior when a problem occurs. The current implementation focuses on the first issue: changing element properties.
Hardware configuration
The table below provides measurements of two model performance parameters on our configurations, along with a description of the configurations themselves.
Due to limited hardware availability and the experimental nature of this work, we use only one server in the first configuration.
Software configuration
- The lmstudio-community/Qwen2.5-7B-Instruct-MLX-4bit model from HuggingFace (convenient to download via LMStudio; see the third item below).
- The OpenAI pip package.
- LMStudio – an "IDE" that is convenient for prompt debugging, model parameter configuration, offline execution, and setting up a "model server."
Verifying the local LLM API
(venv) user@MacBook-Pro-Admin-2 web2 % cat tttt
from openai import OpenAI
LLM_URL = 'http://<Internal_IP_address>:1234/v1'
LLM_MODEL = 'qwen2.5-7b-instruct-mlx'
def call_llm(request):
    client = OpenAI(base_url=LLM_URL, api_key="lm-studio")
    completion = client.chat.completions.create(
        model=LLM_MODEL,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": request,
                    },
                ],
            },
        ],
    )
    return completion.choices[0].message.content
print(call_llm('Yes or no?'))
(venv) user@MacBook-Pro-Admin-2 web2 %
(venv) user@MacBook-Pro-Admin-2 web2 % python tttt
I'm sorry, but your question is not clear. Could you please provide more details or context?
Integrating with tests
All the logic begins with the get_object method, which is accessible to the final PageObjects through the base PageObject. The green items in the diagram are the additions to the standard scheme. A simplified sketch of this flow is shown below.
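To make the flow concrete, here is a hypothetical, simplified sketch of such a get_object fallback. The Run-time Locators Storage is represented by a plain dict, and ask_llm_for_locator is an assumed helper (not shown in the article) that wraps the LLM call from the previous section.

# Simplified, hypothetical sketch of a base PageObject with an LLM fallback.
# `rls` (Run-time Locators Storage) is a plain dict shared across the test run;
# ask_llm_for_locator is an assumed helper that sends the page HTML and the
# element description to the LLM and returns a (by, value) tuple.
import logging

log = logging.getLogger(__name__)


class BasePageObject:
    def __init__(self, driver, rls):
        self.driver = driver
        self.rls = rls  # Run-time Locators Storage, shared across the whole run

    def get_object(self, locator_method, lang=''):
        key = f'{type(self).__module__}.{type(self).__name__}.{locator_method.__name__}'

        # 1. Reuse a locator already healed earlier in this run.
        if key in self.rls:
            log.warning('Using AI-calculated locator from cache: "%s"', self.rls[key])
            return self.driver.find_element(*self.rls[key])

        # 2. Try the design-time locator first.
        locator = locator_method(lang)
        try:
            return self.driver.find_element(*locator)
        except Exception:
            log.warning('Web element with locator "%s" not found, trying AI locator', locator)

        # 3. Ask the LLM for a replacement, using the locator docstring as the description.
        description = locator_method.__doc__
        healed = ask_llm_for_locator(self.driver.page_source, description)
        self.rls[key] = healed  # cache for subsequent tests
        return self.driver.find_element(*healed)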
Thanks to RLS (Run-time Locators Storage), all subsequent tests within the test run will not fail on the problematic element. Another important aspect is the organization of the design-time locator storage. These are separate classes that are connected to functional PageObjects. They look something like this:
class LLogin:
    @staticmethod
    def L_I18N_TEXTFIELD_LOGIN(lang=''):
        """login input field"""
        return ('xpath', f'//*[@e2e-id="I intentionally broke this locator"]')
Thanks to the naming convention for locators (the name starts with the prefix L_) and the convention for working with page objects (access to them is only done through Page.get_object(L_I18N_TEXTFIELD_LOGIN)), we can extract the locator name from the stack trace and save it in the Run-time Locators Storage.
Thanks to the docstring of the locator method ("""login input field"""), we have an elegant solution for where to store the human-readable description of the locator, which is then used for LLM inference.
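As an illustration of how that extraction can work (the exact implementation is not shown in the article), a naive version might scan the call stack for an identifier with the L_ prefix and then read the description from the docstring of the locator class:

# Naive, illustrative sketch: recover the locator name from the call stack and
# the human-readable description from its docstring. It relies only on the L_
# naming prefix and the Page.get_object() calling convention described above.
import re
import traceback


def locator_name_from_stack():
    # Walk frames from the most recent call backwards and look for an L_ identifier
    # in the source line of the call site, e.g.
    # "page.get_object(LLogin.L_I18N_TEXTFIELD_LOGIN)".
    for frame in reversed(traceback.extract_stack()):
        match = re.search(r'\bL_[A-Z0-9_]+\b', frame.line or '')
        if match:
            return match.group(0)
    return None


def locator_description(locator_class, locator_name):
    # The docstring of the locator method ("login input field") becomes the LLM hint.
    return getattr(locator_class, locator_name).__doc__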
Below is a raw log demonstrating self-healing in action; see the WARNING-level entries.
2025-03-02 23:28:57 STEP WEB Client: Pick language
2025-03-02 23:28:57 STEP WEB Client: Authorize
2025-03-02 23:29:03 WARNING Web element with locator "('xpath', '//*[@e2e-id="I intentionally broke it"]')" not found within timeout, trying AI locator
2025-03-02 23:29:03 WARNING Problematic locator is L_I18N_TEXTFIELD_LOGIN for pages.login.PVersion2Login
2025-03-02 23:29:03 WARNING AI will try to find element locator using description: "login input field"
2025-03-02 23:29:36 WARNING Store AI-locator in cache for: "pages.login.PVersion2Login.L_I18N_TEXTFIELD_LOGIN" for subsequent tests
2025-03-02 23:29:36 WARNING Using AI-calculated locator "('xpath', '//input[@e2e-id="login-page.login-form.login-input"]')"
2025-03-02 23:29:36 STEP WEB Client: Enter code
2025-03-02 23:29:36 STEP Mailbox: Get confirmation code
PASSED [ 33%]
-------------------------------------------------------------------------------------------- live log teardown --------------------------------------------------------------------------------------------
2025-03-02 23:29:40 STEP ~~~~~END test_login_2fa~~~~~
tests/test_login.py::test_logout
--------------------------------------------------------------------------------------------- live log setup ----------------------------------------------------------------------------------------------
2025-03-02 23:29:40 STEP ~~~~~START test_logout~~~~~
---------------------------------------------------------------------------------------------- live log call ----------------------------------------------------------------------------------------------
2025-03-02 23:29:50 STEP WEB Client: Pick language
2025-03-02 23:29:51 STEP WEB Client: Authorize
2025-03-02 23:29:51 WARNING Using AI-calculated locator from cache: "('xpath', '//input[@e2e-id="login-page.login-form.login-input"]')"
2025-03-02 23:29:55 STEP WEB Client: Open profile
2025-03-02 23:29:55 STEP WEB Client: Logout
2025-03-02 23:29:55 STEP ASSERT: User sees auth page
For now, we do not automatically commit replacements of old locators with the new ones generated by the LLM. Instead, we surface them in the Allure report, as sketched below.
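One possible way to surface a healed locator in the report is via allure-pytest's attach API; the function name and formatting below are illustrative, not the exact implementation.

# Illustrative example of attaching a healed locator to the Allure report
# instead of committing it automatically (uses allure-pytest's allure.attach).
import allure


def report_healed_locator(locator_name, old_locator, new_locator):
    allure.attach(
        f'{locator_name}\nold: {old_locator}\nnew: {new_locator}',
        name='AI-healed locator',
        attachment_type=allure.attachment_type.TEXT,
    )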
Limitations
Simultaneous requests to the LLM for inference increase the TTFT (Time To First Token) for subsequent requests. For example, if you make three requests at once on our initial configuration (TTFT = 35 seconds), the response to the first request arrives in 35 seconds (OK), to the second in ~70 seconds, and to the third in ~105 seconds. In effect, the requests form a queue.
If you use this approach with a configuration that has a long TTFT, especially when many tests are running or almost all locators fail, everything will queue up and your GitLab-like CI system will terminate the test pipeline on timeout.
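As a rough back-of-the-envelope check, assuming requests are served strictly one at a time as observed above, you can estimate whether queued inference would exceed your CI job timeout (the function names and defaults are illustrative):

# Rough estimate, assuming strictly sequential inference (TTFT behaves like a queue).
def worst_case_wait(pending_requests, ttft_seconds=35):
    # With 3 simultaneous requests and TTFT = 35 s: 35, 70 and 105 seconds respectively.
    return pending_requests * ttft_seconds


def fits_in_pipeline(pending_requests, job_timeout_seconds, ttft_seconds=35):
    return worst_case_wait(pending_requests, ttft_seconds) < job_timeout_seconds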
Under limited hardware resources, you cannot simply connect all locators to the LLM — doing so could potentially create a queue due to multi-threaded test execution, even if your product is stable and developers allocate special properties for automated tests (like e2e-id in our case). To minimize this issue, a simple strategy can be employed: automatically count how many times each locator is used in a test run and connect only the most frequently used ones. For instance, when almost all your tests start from the login page, you will get the highest usage count for locators related to the login input field, password input field, and the "Login" button. By connecting these, you can prevent the failure of all tests at the very beginning.
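A minimal sketch of that counting strategy might look as follows (the names are illustrative): track how often each locator is requested during a run and wire only the most-used ones to the LLM fallback.

# Illustrative sketch: count locator usage per test run and pick the most-used ones.
from collections import Counter

locator_usage = Counter()


def track_locator(locator_name):
    locator_usage[locator_name] += 1


def locators_worth_healing(top_n=10):
    # When almost every test starts from the login page, the login field,
    # password field and "Login" button will dominate this list.
    return [name for name, _ in locator_usage.most_common(top_n)]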
Context size refers to the number of tokens in the prompt. The larger the context size, the higher the memory consumption. The smaller the context size, the more you will need to optimize to provide the LLM with a suitable portion of the HTML for inference. In our case, we preprocess the data—we remove unnecessary parts from the entire page_source provided by Selenium, such as style and script tags along with their content. We will likely further reduce the size for pages with a large number of elements. Our context size is 10,000 tokens. According to Qwen's documentation, 1 token is approximately 3-4 characters of English text.
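The preprocessing described above might look roughly like this; BeautifulSoup is one possible tool, as the article does not specify the library used.

# Rough sketch of shrinking Selenium's page_source before LLM inference:
# drop <script> and <style> tags together with their content.
from bs4 import BeautifulSoup


def shrink_html(page_source):
    soup = BeautifulSoup(page_source, 'html.parser')
    for tag in soup(['script', 'style']):
        tag.decompose()  # removes the tag and everything inside it
    return str(soup)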
Future prospects
Thanks to the self-healing approach, products under relatively active development can now have UI automated tests.
Automated testing of relatively stable products becomes more reliable thanks to this adaptability to changes in the application.
Automated testing of relatively stable products can also cover more scenarios (for example, on desktop platforms, where access to system windows is sometimes required but limited by automation tools) through the integration of multimodal LLMs or a combination of several simpler models.