AI and test automation: pitfalls when generating end-to-end tests

Georg Dörgeloh May 20, 2025

As a business expert like a business analyst or product owner, you’ve probably been here before: you describe what seems like a straightforward test case and think, “This can’t be that difficult… just click here, enter something there, and check the result.” Then comes sprint planning, where you discover that test automation will take far more development effort than expected, and the implementation gets cut or drastically simplified because of time constraints.

Then ChatGPT burst onto the scene. Suddenly everyone thought, “AI will fix everything!” Finally, we wouldn’t have to wrestle with flaky tests, CSS selectors, and endless debugging loops.

But as with most things that sound too good to be true, the devil is in the details. In this article, I’ll show you why AI can indeed be a game-changer for test automation, but it’s not a magic wand - and what you can do to harness its real benefits without falling into common traps.

Why We All Dream of AI-Driven Tests

Before diving into the technical details, let’s consider why we want to use AI for testing in the first place. The reason is simple: end-to-end tests are incredibly valuable yet extremely time-consuming.

The real dilemma is what I call the “expert-programmer gap.” As a business analyst or product owner, you know exactly how the application should work. You understand the business processes, requirements, and user scenarios better than any developer. You could define perfect test cases - if only you could program them yourself! As this article on test automation shows, the biggest challenge isn’t figuring out “what” to test, but “how” to translate your acceptance criteria into automated test scripts.

To bridge this gap, we first need to understand what constitutes a high-quality E2E test from a technical perspective. Ideally, it should:

  - reflect the business process exactly as the expert described it
  - run reliably and repeatably, without flaking
  - survive minor UI changes without breaking
  - be cheap and fast enough to run on every build

Sounds simple? Unfortunately, it’s anything but.

Three Ways AI Might (Maybe) Save Your Tests

1. The Code Generator Approach: “Write Me a Test!”

The most obvious approach is asking an LLM to generate test code. This could be through general-purpose assistants like ChatGPT or Claude, or directly in your development environment with tools like Cursor, GitHub Copilot, or similar AI-powered code editors. You describe what you want to test in plain language, and the AI generates Playwright, Cypress, or Selenium code.

// AI-generated E2E test
test('User can login and update profile', async ({ page }) => {
  await page.goto('https://example.com');
  await page.fill('input[name="username"]', 'testuser');
  await page.fill('input[name="password"]', 'password123');
  // Trigger the click and wait for the resulting navigation together,
  // otherwise the navigation can finish before the wait is registered
  await Promise.all([
    page.waitForNavigation(),
    page.click('button[type="submit"]'),
  ]);
  // and so on...
});

Does it work? Partially. Modern AI-powered editors like Cursor can even access and understand your application’s source code. For developers already familiar with the codebase, this approach can produce workable results.

The problems: For business analysts and product owners, this approach remains largely inaccessible - you still have to read, run, and debug the generated code. Even for developers, significant limitations exist:

  - The AI works from static source code and descriptions, not the running application, so generated selectors and timing assumptions are often wrong on the first run
  - Generated tests still have to be executed, debugged, and repaired by hand
  - Each generated test reflects a snapshot in time and must be regenerated or patched whenever the UI changes

It’s like trying to guide someone through assembling complex IKEA furniture over the phone - even with instructions, they miss the spatial awareness you’d have in person.

2. The Full-Time AI Tester: “Let AI Drive the Browser!”

The next logical step: let the AI do the testing itself! In this approach, an LLM takes control of a browser, receives a test case in natural language and executes it independently.

The AI sees the screen via screenshots, analyzes the DOM and decides what to do next at each step. Every test run repeats this entire process from scratch.
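
To make this concrete, here’s a minimal sketch of that observe-decide-act loop, assuming Playwright for browser control. The decideNextAction call is a hypothetical stand-in for a request to GPT-4 with Vision, Claude, or similar - the real prompt and response handling would be far more involved:

import { Page } from 'playwright';

// The possible actions the model can return (assumed format)
type AiAction =
  | { type: 'click'; selector: string }
  | { type: 'fill'; selector: string; value: string }
  | { type: 'done' };

// Hypothetical LLM call - not a real library API
declare function decideNextAction(input: {
  goal: string;
  screenshot: Buffer;
  dom: string;
}): Promise<AiAction>;

// One iteration: the model sees the current state and picks the next
// action. Every test run repeats this from scratch, one LLM call per step.
async function runStep(page: Page, goal: string): Promise<boolean> {
  const screenshot = await page.screenshot(); // what the model "sees"
  const dom = await page.content();           // what the model "reads"
  const action = await decideNextAction({ goal, screenshot, dom });

  if (action.type === 'click') await page.click(action.selector);
  if (action.type === 'fill') await page.fill(action.selector, action.value);
  return action.type !== 'done'; // keep looping until the goal is reached
}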

Does it work? Surprisingly, yes—and often quite well! Modern LLMs like GPT-4 with Vision or Claude can navigate complex websites with impressive accuracy. This approach particularly appeals to business analysts and product owners since you can describe tests in natural language without writing a single line of code.

The problems: As impressive as this approach is, it comes with serious practical drawbacks:

  - Speed: every single step waits on one or more LLM calls, so a run that takes seconds as plain test code takes minutes
  - Cost: each run sends fresh screenshots and DOM snapshots to the model, so you pay full token costs on every execution
  - Repeatability: the model may take a slightly different path on each run and never learns from previous executions

It’s like hiring a brilliant but painfully slow and expensive QA person who starts from scratch every time and never learns from previous experience.

3. The Hybrid Approach: “Let AI Record Tests, Then We’ll Run Them!”

The third approach combines AI’s strengths with conventional test automation’s efficiency:

  1. AI executes the test once and records all actions
  2. These actions get translated into standard test code (sketched below)
  3. Future executions use this code without AI involvement
  4. When UI changes, AI can regenerate the test if needed
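
Step 2 is where the savings come from. Here’s a minimal sketch, assuming the AI’s recorded actions arrive in a simple structured format - the RecordedAction type below is an illustration, not any real tool’s schema:

// A recorded action from the AI's one-time exploration (assumed format)
type RecordedAction =
  | { kind: 'goto'; url: string }
  | { kind: 'click'; selector: string }
  | { kind: 'fill'; selector: string; value: string };

// Translate the recording into a plain Playwright test: from here on it
// runs without any AI involvement and without per-run token costs.
function toPlaywrightTest(name: string, actions: RecordedAction[]): string {
  const body = actions.map((a): string => {
    switch (a.kind) {
      case 'goto':  return `  await page.goto('${a.url}');`;
      case 'click': return `  await page.click('${a.selector}');`;
      case 'fill':  return `  await page.fill('${a.selector}', '${a.value}');`;
    }
  });
  return [`test('${name}', async ({ page }) => {`, ...body, `});`].join('\n');
}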

Does it work? This approach offers a compelling compromise, especially for business analysts and software testers: You describe tests in natural language, AI handles the technical implementation, and the result can run repeatedly without additional AI costs. When the application changes, you can have the AI update tests without programming knowledge.

The challenges: Translating AI actions into robust, maintainable test code isn’t trivial and requires technical expertise. A good tool for this approach should give product owners and business analysts an easy way to review, edit, and manage tests without touching code directly.

The “Did We Overestimate This?” Phase: Three Surprising Problems with AI Test Automation

As a product owner or business analyst, you might have enthusiastically announced, “With AI, we can triple our test coverage and reduce developer dependencies!” But anyone who’s experimented with AI-driven tests knows the sobering feeling when initial excitement fades. Here are three common problems that don’t require deep technical knowledge to understand:

1. The Context Trap: When the AI’s Memory Fails

LLMs have limited context windows. For a complex website, the DOM alone can be several megabytes - at roughly four characters per token, a 2 MB DOM is already on the order of half a million tokens, several times the 128k-token window of even large models. Add screenshots and test steps, and you’ll quickly exceed any reasonable context limit.

AI: "Sorry, I can't remember what we were testing. This DOM is too massive for my context..."

It’s like trying to play complex chess while only remembering the last three moves.

2. The Perception Gap: “I See Something You Don’t See”

Here’s a fascinating problem: The AI sees a search button with a magnifying glass icon, but the DOM contains nothing about “search” or “magnifying glass” - just an SVG with no semantic meaning. How should the AI know which DOM element corresponds to which visual element?

3. The Identification Challenge: “What Do We Call This Thing?”

Once the AI correctly identifies an element, it needs to create a reliable selector for it. This selector must be unique and stable not just for this test run but for all future runs; otherwise the recorded script breaks with the next minor UI change.

Advanced Solution Strategies

How do we overcome these hurdles? Here’s where things get interesting:

1. Multi-Agent Architecture to Bridge the Perception Gap

Instead of burdening a single AI agent with everything, we can deploy specialized agents:

  - A vision agent that interprets screenshots and locates elements the way a user sees them
  - A DOM agent that maps those visual elements to concrete markup and candidate selectors
  - A planner agent that keeps the overall test goal in view and decides the next step

The challenge lies in orchestrating communication between these agents efficiently. Like any team, more specialization means more coordination overhead. A good AI test automation system handles this orchestration elegantly.
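
As a rough sketch of what that orchestration might look like - the agent interfaces below are illustrative assumptions, not an established framework:

// Illustrative agent interfaces - names and responsibilities are assumptions
interface VisionAgent {
  // Finds a visual element ("the magnifying-glass button") on a screenshot
  locate(screenshot: Buffer, description: string): Promise<{ x: number; y: number }>;
}
interface DomAgent {
  // Maps screen coordinates back to a DOM element and a candidate selector
  selectorAt(dom: string, point: { x: number; y: number }): Promise<string>;
}
interface PlannerAgent {
  // Keeps the test goal in view and decides the next natural-language step
  nextStep(goal: string, history: string[]): Promise<string | null>;
}

// The orchestrator is where the coordination overhead lives: each step
// fans out to the specialists and merges their answers into one action.
async function orchestrateStep(
  goal: string,
  history: string[],
  screenshot: Buffer,
  dom: string,
  agents: { planner: PlannerAgent; vision: VisionAgent; dom: DomAgent },
): Promise<string | null> {
  const step = await agents.planner.nextStep(goal, history);
  if (!step) return null;                     // test goal reached
  const point = await agents.vision.locate(screenshot, step);
  return agents.dom.selectorAt(dom, point);   // selector to act on
}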

2. Intelligent Context Optimization

To tackle the context problem, we need effective compression without losing critical information. Instead of processing the entire DOM, we can strip unimportant attributes and elements and pass only the relevant parts - a ‘semantic DOM’ that contains just the elements that matter for the test.
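
A minimal sketch of such DOM pruning - the tag and attribute allowlists below are assumptions and would need tuning per application:

// Allowlists of what to keep - tune these per application (assumptions)
const KEEP_ATTRS = ['id', 'name', 'type', 'role', 'aria-label', 'placeholder', 'data-testid'];
const KEEP_TAGS = new Set(['a', 'button', 'input', 'select', 'textarea', 'form', 'label']);

// Reduce a full DOM to a compact "semantic DOM": only interactive
// elements, only the attributes that help identify them.
function semanticDom(root: Element): string {
  const lines: string[] = [];
  for (const el of Array.from(root.querySelectorAll('*'))) {
    if (!KEEP_TAGS.has(el.tagName.toLowerCase())) continue;
    const attrs = KEEP_ATTRS
      .filter((a) => el.hasAttribute(a))
      .map((a) => `${a}="${el.getAttribute(a)}"`)
      .join(' ');
    const text = el.textContent?.trim().slice(0, 40) ?? '';
    lines.push(`<${el.tagName.toLowerCase()} ${attrs}> ${text}`);
  }
  return lines.join('\n'); // kilobytes instead of megabytes
}

With Playwright, this could run inside page.evaluate, so the model receives kilobytes instead of megabytes.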

Additionally, the LLM doesn’t need the entire conversation history for every request, as typical AI chats would do. Previous steps, reasoning, and results can be summarized concisely.

3. Robust Selector Strategies

To generate reliable selectors, we need an algorithm that deterministically picks the best available selector according to a defined priority - like this (a code sketch follows the list):

  1. First look for data-testid attributes (ideal when available)
  2. Use semantic attributes like labels and ARIA roles
  3. Try relative positioning to known elements
  4. Only as a last resort: use absolute positions or complex XPath selectors
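
A sketch of that cascade - the helper below is illustrative, and a production version would also have to verify that each candidate matches exactly one element on the page:

// Pick the most stable selector available, following the priority above.
function bestSelector(el: Element): string {
  const testId = el.getAttribute('data-testid');
  if (testId) return `[data-testid="${testId}"]`;          // 1. ideal when available

  const ariaLabel = el.getAttribute('aria-label');
  if (ariaLabel) return `[aria-label="${ariaLabel}"]`;     // 2. semantic attributes

  const role = el.getAttribute('role');
  const name = el.textContent?.trim();
  if (role && name) return `role=${role}[name="${name}"]`; // 2. Playwright role selector

  // 3. relative positioning / 4. absolute paths - brittle, last resort only
  return cssPath(el);
}

// Hypothetical helper: builds an absolute CSS path (nth-child chain)
declare function cssPath(el: Element): string;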

The Future: Where Do We Go From Here?

For those who master these challenges, exciting possibilities await - particularly for business analysts and product owners:

Imagine being able to automatically generate and run E2E tests right after creating a user story, without waiting for developers or QA resources!

Conclusion: AI as a Bridge Across the Expert-Programmer Gap

AI test automation isn’t a magical solution that creates perfect tests at the push of a button. Rather, it’s a powerful tool that can bridge the “expert-programmer gap” - if we understand and work with its limitations:

  - Use AI where it excels: translating natural-language requirements into concrete test actions
  - Keep repeated execution deterministic and AI-free to control cost, speed, and flakiness
  - Invest in context compression and robust selector strategies so recorded tests survive UI changes

As a business analyst or product owner, you know the requirements better than anyone. With the right AI tools, you can now translate that expertise directly into automated tests - without writing code yourself. The future of test automation isn’t about replacing human expertise, but augmenting it with intelligent AI systems.