The Future of E2E Testing: How to Overcome Flakiness with Natural Language + LLMs

#webdev #ai #testing #frontend

To start, Playwright is an amazing tool. Major props to Microsoft for starting it and all the open source maintainers who have kept it up since. OSS is truly the backbone of the internet.

But it seems obvious to us that LLMs, and in particular multimodal LLMs with vision, are going to majorly shake up the e2e testing paradigm. And most likely web scraping too, but I’ll be focusing on e2e testing in this post.

But what’s not obvious to me is what form that future is going to take.

To extend or to replace

The biggest unknown to me is whether Playwright will continue to be used and, if so, in what capacity. I could see this transformation playing out in three distinct ways:

1. LLMs become heavily used to generate Playwright code, but tests are still built and executed using traditional selectors under the hood
2. Tests are built in natural language (read: no selectors) and are executed via browser agents that leverage multimodal LLMs with vision
3. A hybrid of both, where Playwright becomes augmented with LLMs and you have the ability to execute tests via traditional means or kick out to vision-based browser agents
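
To make the hybrid option concrete, here is a minimal sketch of what "kick out to a vision-based agent" could look like. Everything in it is hypothetical: `SelectorStep`, `StepExecutor`, and `runHybrid` are names I made up for illustration, and the two stub executors stand in for a real browser driver and a real vision agent. None of this is Playwright's or Magnitude's actual API.

```typescript
// Hypothetical types for illustration only; not a Playwright or Magnitude API.
type SelectorStep = { description: string; selector?: string };
type StepResult = { step: string; via: "selector" | "vision" };

interface StepExecutor {
  // Returns true if the step was carried out successfully.
  run(step: SelectorStep): boolean;
}

// Hybrid strategy: try the selector path first, fall back to the vision agent.
function runHybrid(
  steps: SelectorStep[],
  selectorExec: StepExecutor,
  visionExec: StepExecutor,
): StepResult[] {
  const results: StepResult[] = [];
  for (const step of steps) {
    if (step.selector !== undefined && selectorExec.run(step)) {
      results.push({ step: step.description, via: "selector" });
    } else if (visionExec.run(step)) {
      results.push({ step: step.description, via: "vision" });
    } else {
      throw new Error(`Step failed on both paths: ${step.description}`);
    }
  }
  return results;
}

// Stub executors: the selector stub only "finds" selectors that still exist on the page,
// and the vision stub always succeeds. A real agent could of course fail too.
const liveSelectors = new Set(["#login"]);
const selectorStub: StepExecutor = {
  run: (s) => s.selector !== undefined && liveSelectors.has(s.selector),
};
const visionStub: StepExecutor = { run: () => true };

const hybridDemo = runHybrid(
  [
    { description: "Click the login button", selector: "#login" },
    { description: "Click the checkout button", selector: "#checkout-old" }, // stale selector
    { description: "Verify the cart shows two items" }, // no selector at all
  ],
  selectorStub,
  visionStub,
);
console.log(hybridDemo.map((r) => `${r.step}: ${r.via}`));
```

In this sketch the stale selector does not fail the test. The step simply falls through to the vision path, which is the maintenance win the hybrid option is after.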

At Magnitude, we are betting the future goes in the direction of number 2: full replacement. To us, it seems obvious that as LLMs continue to improve, the efficiency gains they afford will become undeniable, both in drastically reduced test maintenance and in increased ability to find bugs.

However, a part that is still unknown to us is the question of determinism in test cases in this new paradigm.

The tricky question of determinism

Flakiness has long been the biggest problem in e2e test suites. But solving it by letting an LLM agent decide what to do at run time introduces a new problem: nondeterminism. A deterministic test case is one that, once established, runs the same way and takes the same actions every single time.
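
One way to pin down "takes the same actions" is to record the actions from each run and compare the traces. The sketch below assumes a made-up `TraceAction` shape and a helper I'm calling `firstDivergence`; it is only meant to illustrate the definition, not how any particular tool measures it.

```typescript
// Hypothetical record of one browser action taken during a test run.
type TraceAction = { kind: "click" | "type" | "navigate"; target: string; value?: string };

// Returns the index of the first action where two runs differ, or -1 if the traces match.
function firstDivergence(a: TraceAction[], b: TraceAction[]): number {
  const n = Math.max(a.length, b.length);
  for (let i = 0; i < n; i++) {
    const x = a[i];
    const y = b[i];
    // A missing action on either side (different lengths) also counts as a divergence.
    if (!x || !y || x.kind !== y.kind || x.target !== y.target || x.value !== y.value) {
      return i;
    }
  }
  return -1;
}

const runA: TraceAction[] = [
  { kind: "click", target: "Login button" },
  { kind: "type", target: "Email field", value: "a@b.co" },
];
const runB: TraceAction[] = [
  { kind: "click", target: "Login button" },
  { kind: "type", target: "Email field", value: "a@b.co" },
];
const runC: TraceAction[] = [
  { kind: "click", target: "Login button" },
  { kind: "click", target: "Sign in link" }, // the agent chose a different path here
];

console.log(firstDivergence(runA, runB)); // -1: same actions, so the two runs were deterministic
console.log(firstDivergence(runA, runC)); // 1: the runs diverged at the second action
```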

With natural language test cases and no underlying selectors, this becomes an interesting challenge. How do you achieve determinism while still leveraging the adaptability of vision-based browser agents? How do you take advantage of the efficiency gains in the new paradigm while still satisfying what people have come to expect in the old paradigm?

At Magnitude, we’ve decided to approach this problem in the following way:

1. Instruct users on how to best write natural language test cases in a way that we know leads to more deterministic behavior while retaining flexibility
2. Allow the first run of a test case to be the “path-finding” mission, where our testing agent chooses the best actions based on the natural language steps in the test case
3. Once the path has been established, heavily guide the testing agent to use those same actions in future test runs, while still using its vision abilities to better find bugs
4. If the testing agent runs into a problem while using the same actions, allow it to reason through the problem and decide whether it is a legitimate bug, in which case it stops and reports it, or whether to become dynamic again and try a new path through the test case
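
The control flow described above can be sketched roughly as follows. To be clear, this is a simplified illustration under my own assumptions, not the Magnitude SDK's actual implementation: `TestingAgent`, `runTestCase`, and the path cache keyed by step text are all hypothetical, and the stub agent fakes a page with a single button.

```typescript
// Hypothetical shapes for illustration; not the Magnitude SDK's real API.
type PlannedAction = { kind: "click" | "type"; target: string; value?: string };
type Verdict = "report_bug" | "replan";

interface TestingAgent {
  plan(step: string): PlannedAction; // path-finding: choose an action for a step
  perform(action: PlannedAction): "ok" | "failed"; // execute the action in the browser
  diagnose(step: string, action: PlannedAction): Verdict; // reason about a failure
}

type RunReport = { path: PlannedAction[]; bugs: string[]; replanned: string[] };

function runTestCase(
  steps: string[],
  agent: TestingAgent,
  cache: Map<string, PlannedAction>,
): RunReport {
  const report: RunReport = { path: [], bugs: [], replanned: [] };
  for (const step of steps) {
    // Reuse the established action if one exists; otherwise this is a path-finding run.
    let action = cache.get(step) ?? agent.plan(step);
    if (agent.perform(action) === "failed") {
      if (agent.diagnose(step, action) === "report_bug") {
        report.bugs.push(step);
        break; // legitimate bug: stop and report
      }
      // Not a bug (for example, the UI changed): become dynamic again and find a new path.
      action = agent.plan(step);
      if (agent.perform(action) === "failed") {
        report.bugs.push(step);
        break;
      }
      report.replanned.push(step);
    }
    cache.set(step, action); // remember the action for future runs
    report.path.push(action);
  }
  return report;
}

// Stub agent that pretends the page has exactly one button with the given label.
function makeStubAgent(buttonLabel: string): TestingAgent {
  return {
    plan: () => ({ kind: "click", target: buttonLabel }),
    perform: (action) => (action.target === buttonLabel ? "ok" : "failed"),
    diagnose: () => "replan",
  };
}

const pathCache = new Map<string, PlannedAction>();
const orderSteps = ["Submit the order"];
const firstRun = runTestCase(orderSteps, makeStubAgent("Buy now"), pathCache); // path-finding
const secondRun = runTestCase(orderSteps, makeStubAgent("Buy now"), pathCache); // replays cached path
const thirdRun = runTestCase(orderSteps, makeStubAgent("Place order"), pathCache); // label changed
console.log(firstRun.path, secondRun.replanned, thirdRun.replanned);
```

In this toy version the second run reuses the cached action as-is, and the third run only re-plans because the cached action no longer works. A real agent would also be looking at the page on every run, which is where the extra bug-finding ability comes from.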

We find this to be the best approach to maintaining determinism while still allowing easy natural language test creation and increased bug finding ability.

This approach could evolve in the future with more experimentation, and I’d love to hear if others have thought through this problem and how to solve it.

What will the adoption curve look like

The final question we grapple with is what adoption of this new technology will look like. It seems likely that solutions that generate Playwright code will be adopted more quickly, since they fit more cleanly into the existing paradigm and what teams have come to expect. But eventually, adoption of these tools will slow in favor of the more AI-native, vision-based approach once its benefits become evident.

But this transition could take time. More forward-thinking organizations will likely take risks on adopting the new paradigm first, and will be rewarded in the long term.

I have more thoughts on end users, price, use cases, and interface, but I’ll save those for a part 2. If you want to play around with what this new paradigm might look like, feel free to install the Magnitude SDK and get a free API key here
